We have used tesseract
as the OCR engine.
Further more we have divided the images to
dark colored | light colored |
---|
Which allows us to tweak us the OCR algorithm accordingly and help it perform better
The processed images are stored in tesseract_menu_data
Used selenium
to automate the interaction with http://free-ocr.com
It Has been giving better results than the tesseract
Note:
Requirements for that: $ pip install selenium
- Implemented in
free_ocr_selenium.py
processed_files.sh
: shows the ratio of menu images and the processed files in dir. (To keep track of things!)
Processed images stored in : menu_text (A total of 101 hotel menus were processed with each hotel having at least 4 menu images in them).
rmgarbage Implements the various rules presented in the paper Automatic Removal of “Garbage Strings” in OCR Text: An Implementation which helps us decide whether a string is a valid one or garbage.