Search and Highlight of Required Substrings in Printed Documents using OCR

  • Vishnu Nair
  • Krishna A Athul
  • Bharath A Kartha
Keywords: dynamic optical character recognition, substring search in printed documents

Abstract

This paper describes the implementation of a software application that searches for and highlights desired text in a printed document. The software takes as input an image feed of the hard copy to be searched, together with the substring whose location in the document is to be identified. The program converts the image to a text document using the Optical Character Recognition (OCR) engine Tesseract, searches the text, then highlights the substring in the image and displays it, making it easy to locate the substring in the actual hard copy.
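The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' program: it assumes the pytesseract wrapper and Pillow are installed, uses word-level boxes from `pytesseract.image_to_data`, and marks every OCR word that contains the query (case-insensitive). The function names `find_matches` and `highlight` and the file names in the usage note are hypothetical.

```python
def find_matches(ocr_data, query):
    """Return (left, top, right, bottom) boxes of OCR words containing query.

    ocr_data is a dict shaped like the output of pytesseract.image_to_data
    with output_type=Output.DICT (parallel lists keyed by column name).
    """
    needle = query.casefold()
    boxes = []
    for i, word in enumerate(ocr_data["text"]):
        # Skip empty queries; match the substring anywhere inside the word.
        if needle and needle in word.casefold():
            left, top = ocr_data["left"][i], ocr_data["top"][i]
            right = left + ocr_data["width"][i]
            bottom = top + ocr_data["height"][i]
            boxes.append((left, top, right, bottom))
    return boxes


def highlight(image_path, query, out_path):
    """OCR the image, outline each matching word in red, save the result."""
    # Third-party imports are local so find_matches works without them.
    from PIL import Image, ImageDraw
    import pytesseract

    image = Image.open(image_path).convert("RGB")
    data = pytesseract.image_to_data(
        image, output_type=pytesseract.Output.DICT
    )
    boxes = find_matches(data, query)
    draw = ImageDraw.Draw(image)
    for box in boxes:
        draw.rectangle(box, outline="red")
    image.save(out_path)
    return boxes
```

A call such as `highlight("scan.png", "substring", "scan_highlighted.png")` would write a copy of the scan with each matching word outlined. Because matching is done per word, a query that spans several words is not found by this sketch; handling that case would require joining the words of a line before searching.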

Published
2018-11-05
How to Cite
Nair, V., Athul, K. A., & Kartha, B. A. (2018, November 5). Search and Highlight of Required Substrings in Printed Documents using OCR. ASIAN JOURNAL FOR CONVERGENCE IN TECHNOLOGY (AJCT) - UGC LISTED, 4(II). https://doi.org/10.33130/asian%20journals.v4iII.616
Section
Article