reading order: WORD-based vs. top-level based #24
Comments
Also, the 1st example makes it obvious that it would be better if the non-textual top-level blocks (image regions etc.) were also part of the extracted reading order (so that, for example, image captions can be post-processed differently).
So regarding the open question how to deal with textract2page/textract2page/convert_aws.py Lines 925 to 927 in 55fe416
Thus, apparently, for every
This is really interesting! Thank you for the great illustrations! (Were they produced by hand, or using a utility? If the latter, is it available?)
The example images were produced via the built-in screenshot facility of the native PAGE-XML view of OCR-D Browser, a Gtk (i.e. Linux) GUI. Currently, the best version of it is probably hnesk/browse-ocrd#64. There is also a Docker version, which runs the Gtk app in the browser (also best built locally via […]).
For the sake of completeness, here are the results as of the current state of #23 (much better than shown above):
Plus, as a new example, we stumbled upon cases of LINE within FIGURE (which must become TextRegions in ImageRegions) and PARAGRAPH within LIST (which must become TextRegion in TextRegion):
Note: OCR-D Browser expects a METS-XML representation in OCR-D conventions. If you just need a direct viewer, consider using PRImA PageViewer, which is written in Java and thus platform-independent.
The current implementation extracts the ReadingOrder from the top-level parents of all `WORD` blocks (in the order of these word blocks). This seems to be necessary for cases with `TABLE` results.
However, for `LAYOUT_*` blocks, the results look much better if the top-level blocks are directly taken as the order, as implemented in #23.
For example, here is how both implementations compare:
The first page is a typical newspaper page (added to the tests in #23) and shows how #23 is better.
The 2nd and 3rd examples are taken from the test suite. The 2nd shows that the current implementation is better, because #23 places the table after all the other regions.
The 3rd example (`nd1969` test case) has AWS results which obviously look bad either way, with 2 false tables and many highly overlapping column regions.
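To make the difference between the two strategies concrete, here is a minimal sketch on toy data. It is not the code from `convert_aws.py` or from #23. The block dictionaries only mimic the `Id` / `BlockType` / `Relationships` shape of a Textract JSON response, the function names are hypothetical, and the sketch assumes that each block has at most one non-`PAGE` parent (which does not hold in general, e.g. for `TABLE` cells).

```python
def build_parent_map(blocks):
    """Map each child block Id to the Id of its (first seen) non-PAGE parent."""
    parents = {}
    for block in blocks:
        if block["BlockType"] == "PAGE":
            continue  # PAGE is the root, not a region of its own
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    parents.setdefault(child_id, block["Id"])
    return parents


def top_level_ancestor(block_id, parents):
    """Walk up the parent map until a block without a parent is reached."""
    while block_id in parents:
        block_id = parents[block_id]
    return block_id


def word_based_order(blocks):
    """Order of top-level parents as first reached from WORD blocks."""
    parents = build_parent_map(blocks)
    order = []
    for block in blocks:
        if block["BlockType"] == "WORD":
            top = top_level_ancestor(block["Id"], parents)
            if top not in order:
                order.append(top)
    return order


def top_level_order(blocks):
    """Order of top-level blocks exactly as they appear in the block list."""
    parents = build_parent_map(blocks)
    return [
        b["Id"] for b in blocks
        if b["BlockType"] != "PAGE" and b["Id"] not in parents
    ]


# Toy data (made up): the WORD blocks are listed in a different order
# than their top-level LAYOUT_* parents, and the figure has no words.
BLOCKS = [
    {"Id": "page", "BlockType": "PAGE"},
    {"Id": "layout-1", "BlockType": "LAYOUT_TEXT",
     "Relationships": [{"Type": "CHILD", "Ids": ["line-1"]}]},
    {"Id": "figure-1", "BlockType": "LAYOUT_FIGURE"},
    {"Id": "layout-2", "BlockType": "LAYOUT_TEXT",
     "Relationships": [{"Type": "CHILD", "Ids": ["line-2"]}]},
    {"Id": "line-1", "BlockType": "LINE",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "line-2", "BlockType": "LINE",
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "w2", "BlockType": "WORD"},
    {"Id": "w1", "BlockType": "WORD"},
]

print(word_based_order(BLOCKS))
print(top_level_order(BLOCKS))
```

On this toy input, the word-based variant follows the word sequence and never reaches the figure, whereas the top-level variant keeps the layout order and includes the non-textual block.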