GitHub - foodoh/ocrd_menus: OCR'd text files for all the hotels in Bangalore. The Tesseract OCR engine was used for this purpose.

OCR'd menus

We used Tesseract as the OCR engine.

Furthermore, we have divided the images into two groups:

dark colored
light colored

This allows us to tweak the OCR algorithm for each group and helps it perform better.
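As a rough illustration of the dark/light split, the sketch below labels an image by its mean grayscale value and inverts the dark ones before OCR. The repository does not say how the split was made or which tweaks were applied, so the cutoff `DARK_THRESHOLD`, the function names, and the inversion step are all assumptions here. Inversion is chosen only because Tesseract tends to do better with dark text on a light background.

```python
from statistics import mean

# Assumed cutoff on a 0-255 grayscale scale; not taken from the repository.
DARK_THRESHOLD = 128


def classify_brightness(gray_pixels, threshold=DARK_THRESHOLD):
    """Label an image 'dark' or 'light' from a list of grayscale values."""
    return "dark" if mean(gray_pixels) < threshold else "light"


def invert(gray_pixels):
    """Flip pixel values so a dark background becomes a light one."""
    return [255 - p for p in gray_pixels]


def prepare_for_ocr(gray_pixels):
    """Return (label, pixels), inverting only images labelled 'dark'."""
    label = classify_brightness(gray_pixels)
    pixels = invert(gray_pixels) if label == "dark" else list(gray_pixels)
    return label, pixels


# With Pillow installed, the grayscale values of a real image could be
# obtained with: list(Image.open(path).convert("L").tobytes())
```

For example, `prepare_for_ocr([0, 10, 20])` labels the pixels as dark and returns them inverted, while a mostly bright image is passed through unchanged.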

The processed images are stored in tesseract_menu_data.

We also used Selenium to automate the interaction with http://free-ocr.com. It has been giving better results than Tesseract.

Note: Selenium is required for this step:

$ pip install selenium
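The Selenium automation might look roughly like the sketch below. This is not the code in free_ocr_selenium.py. The page locators (`input[type='file']`, `input[type='submit']`, `textarea`), the choice of Firefox, the function names, and the `menu_images/<hotel>/<image>` layout are hypothetical and would need to be checked against the real site and repository.

```python
from pathlib import Path

FREE_OCR_URL = "http://free-ocr.com"


def text_path_for(image_path, out_dir="menu_text"):
    """Map <hotel>/<name>.jpg to <out_dir>/<hotel>/<name>.txt (assumed layout)."""
    image_path = Path(image_path)
    return Path(out_dir) / image_path.parent.name / (image_path.stem + ".txt")


def ocr_via_web(image_path, timeout=60):
    """Upload one image to the OCR site and return the recognised text."""
    # Imported lazily so text_path_for works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(FREE_OCR_URL)
        # HYPOTHETICAL locators: inspect the page to find the real ones.
        upload = driver.find_element(By.CSS_SELECTOR, "input[type='file']")
        upload.send_keys(str(Path(image_path).resolve()))
        driver.find_element(By.CSS_SELECTOR, "input[type='submit']").click()
        # Wait until a result box appears, then read its contents.
        result = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "textarea"))
        )
        return result.get_attribute("value")
    finally:
        driver.quit()
```

A driver loop could call `ocr_via_web` for each image and write the returned text to `text_path_for(image)`.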

processed_files.sh: shows the ratio of menu images to processed files in the directory (to keep track of things!).
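The same bookkeeping can be sketched in Python. This is an illustrative equivalent, not a translation of processed_files.sh: the image extensions in `IMAGE_SUFFIXES`, the recursive directory walk, and the function names are assumptions.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever menu_images contains.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_files(directory, suffixes=None):
    """Count files under directory, optionally filtered by extension."""
    return sum(
        1
        for path in Path(directory).rglob("*")
        if path.is_file()
        and (suffixes is None or path.suffix.lower() in suffixes)
    )


def progress(image_dir, text_dir):
    """Return (image count, processed count, processed-to-image ratio)."""
    images = count_files(image_dir, IMAGE_SUFFIXES)
    processed = count_files(text_dir)
    ratio = processed / images if images else 0.0
    return images, processed, ratio
```

Calling `progress("menu_images", "menu_text")` from the repository root would then report how far the processing has got.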

Processed images are stored in menu_text (a total of 101 hotel menus were processed, each hotel having at least 4 menu images).

rmgarbage implements the various rules presented in the paper Automatic Removal of “Garbage Strings” in OCR Text: An Implementation, which help us decide whether a string is valid or garbage.
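To give a flavour of this kind of rule-based filtering, here is a minimal sketch with three checks: token length, share of alphanumeric characters, and runs of a repeated character. It covers only a subset of such rules, and the thresholds (`MAX_LEN`, `MIN_ALNUM_RATIO`, `REPEAT_RUN`) and function names are illustrative assumptions rather than the values used by rmgarbage or the paper.

```python
import re

# Illustrative thresholds; assumptions, not values confirmed from the paper.
MAX_LEN = 40            # tokens longer than this are treated as garbage
MIN_ALNUM_RATIO = 0.5   # minimum share of alphanumeric characters
REPEAT_RUN = 4          # this many identical characters in a row is garbage

# Matches any character followed by REPEAT_RUN - 1 or more copies of itself.
_REPEAT_RE = re.compile(r"(.)\1{%d,}" % (REPEAT_RUN - 1))


def is_garbage(token):
    """Return True if a whitespace-delimited token looks like OCR noise."""
    if not token:
        return False
    if len(token) > MAX_LEN:
        return True
    alnum = sum(ch.isalnum() for ch in token)
    if alnum / len(token) < MIN_ALNUM_RATIO:
        return True
    return _REPEAT_RE.search(token) is not None


def clean_line(line):
    """Drop garbage tokens from one line of OCR output."""
    return " ".join(tok for tok in line.split() if not is_garbage(tok))
```

For instance, `clean_line("Paneer ~~~~ Tikka")` keeps the two words and drops the run of tildes.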

Repository contents:

menu_images
menu_text
references
tesseract_menu_data
.gitignore
README.md
clean.py
free_ocr_selenium.py
processed_files.sh
rename_to_json.py