Search and Highlight of Required Substrings in Printed Documents using OCR

  • Krishna A Athul
  • Bharath A Kartha
Keywords: dynamic optical character recognition, substring search in printed documents

Abstract

This paper explains the implementation of a software application that searches for and highlights desired text in a printed document. An image feed of the hard copy to be searched is given as input to the software, along with the desired substring whose location within the document is to be identified. The program converts the image to a text document using the Optical Character Recognition (OCR) engine Tesseract, searches through it, highlights the desired substring in the image, and displays the result, making it easy to locate the substring in the actual hard copy.
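The pipeline described in the abstract (OCR the image, search the recognized text, highlight the matches in the image) might be sketched as follows. This is a minimal illustration rather than the authors' program: it assumes the pytesseract wrapper and Pillow are installed alongside the Tesseract engine, the function names `find_matches` and `highlight_in_image` are hypothetical, and matching is simplified to a case-insensitive substring test against each recognized word, so a query that spans several words is not handled.

```python
def find_matches(ocr_data, query):
    """Return (left, top, right, bottom) boxes of OCR words containing `query`.

    `ocr_data` is assumed to be a dict shaped like the result of
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT),
    i.e. parallel lists under "text", "left", "top", "width", "height".
    """
    needle = query.strip().lower()
    if not needle:
        return []
    boxes = []
    for i, word in enumerate(ocr_data["text"]):
        # Simplification: match within a single recognized word only.
        if needle in word.lower():
            left, top = ocr_data["left"][i], ocr_data["top"][i]
            boxes.append((left, top,
                          left + ocr_data["width"][i],
                          top + ocr_data["height"][i]))
    return boxes


def highlight_in_image(image_path, query):
    """OCR the image, then outline every word that contains `query`."""
    # Imported here so find_matches stays usable without an OCR install.
    import pytesseract
    from PIL import Image, ImageDraw

    image = Image.open(image_path).convert("RGB")
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    draw = ImageDraw.Draw(image)
    for box in find_matches(data, query):
        draw.rectangle(box, outline="red", width=3)
    return image
```

With a scanned page saved under a hypothetical name such as `page.png`, calling `highlight_in_image("page.png", "tesseract").show()` would open the annotated image in the default viewer. Keeping the search step separate from the OCR call lets the matching rule be changed (for example, to whole-word or multi-word matching) without touching the image handling.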

Published
2018-11-02
How to Cite
Athul, K. A., & Kartha, B. A. (2018, November 2). Search and Highlight of Required Substrings in Printed Documents using OCR. ASIAN JOURNAL FOR CONVERGENCE IN TECHNOLOGY (AJCT ) -UGC LISTED, 4(II). Retrieved from http://asianssr.org/index.php/ajct/article/view/562