Poor PDF parsing results, especially in multi-column documents #430

tsaltena · 2021-12-06T15:58:37Z

Let's take as an example: https://www.stateninformatie.provincie-utrecht.nl/api/v1/meetings/8992/documents/23264

The PDF uses a multi-column layout.

Current Output

The output may look fine at first glance, but the parser places text from both columns on the same line. The reading order of the sentence fragments is lost, which makes it difficult to reconstruct the actual paragraphs.

Ieder jaar laat Provincie Utrecht circa 10% van het    grens met Provincie Noord-Holland. Weer naar het zui-
landelijk gebied onderzoeken op flora en fauna. In        den vormt eerst de Angstel en later de A2 de grens tot

Desired Output

Ieder jaar laat Provincie Utrecht circa 10% van het
landelijk gebied onderzoeken op flora en fauna. In
...
grens met Provincie Noord-Holland. Weer naar het zui-
den vormt eerst de Angstel en later de A2 de grens tot
aan de bebouwde kom van Maarssen. Naar het oosten
toe vormt de bebouwde kom van Utrecht de grens tot
aan de Utrechtse Heuvelrug bij De Bilt en Bilthoven.
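To make the reordering concrete, here is a minimal sketch that turns the two sample lines from Current Output into the column-by-column order shown above. It is a text-only heuristic, not what the parser does: it assumes exactly two columns separated by a run of at least three spaces, and it will misplace lines where the left column is empty. The name `reorder_two_columns` is made up for this illustration.

```python
import re

# Assumption: a run of three or more spaces marks the gap between the columns.
COLUMN_GAP = re.compile(r" {3,}")


def reorder_two_columns(layout_text):
    """Return all left-column fragments followed by all right-column fragments."""
    left, right = [], []
    for line in layout_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = COLUMN_GAP.split(line.strip(), maxsplit=1)
        left.append(parts[0])
        if len(parts) == 2:
            right.append(parts[1])
    return "\n".join(left + right)


sample = (
    "Ieder jaar laat Provincie Utrecht circa 10% van het    "
    "grens met Provincie Noord-Holland. Weer naar het zui-\n"
    "landelijk gebied onderzoeken op flora en fauna. In        "
    "den vormt eerst de Angstel en later de A2 de grens tot\n"
)
print(reorder_two_columns(sample))
```

A heuristic like this only works on the layout-mode text after the fact; getting the correct order from the PDF itself is more robust.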

Suggested Solution

The pdftotext module accepts a 'raw' parameter that switches parsing from the default mode to 'raw' mode, which outputs the text in the order in which it appears in the PDF content stream.

open-raadsinformatie/ocd_backend/utils/file_parsing.py

Line 15 in 01a2859

for i, page in enumerate(pdftotext.PDF(f), start=1):

Proposed change:

for i, page in enumerate(pdftotext.PDF(f, raw=True), start=1):
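As a self-contained sketch of how the flag could be exposed, the helper below wraps the same call and makes 'raw' switchable, so existing documents can be compared in both modes. The function name `extract_pages` and its return shape are assumptions for illustration and are not taken from file_parsing.py.

```python
def extract_pages(fileobj, raw=True):
    """Return (page_number, text) pairs for an open binary PDF file object.

    Hypothetical helper: the name and return shape are not from the repository.
    """
    # Third-party dependency, imported lazily so this module loads without it.
    import pdftotext

    return [
        (number, text)
        for number, text in enumerate(pdftotext.PDF(fileobj, raw=raw), start=1)
    ]


# Example usage (needs pdftotext and a real PDF on disk):
# with open("document.pdf", "rb") as f:
#     for number, text in extract_pages(f, raw=True):
#         print(number, text[:80])
```

Keeping 'raw' as a parameter rather than hard-coding it makes it easier to check whether the change affects documents that currently parse well.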

It could be even better to parse all PDFs to XML, as that retains more structure and, for instance, allows headings, tables of contents and tables to be identified.
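A rough sketch of what the XML route could look like, assuming output in the style produced by poppler's `pdftohtml -xml` (`page`, `fontspec` and `text` elements with `top`, `left` and `font` attributes). The sample XML is hand-written for illustration, including the made-up heading, and the rules used here (split columns at the page midpoint, treat anything larger than the most common font size as a heading) are assumptions, not a tested design.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written sample in the assumed pdftohtml -xml style; coordinates are made up.
SAMPLE_XML = """<pdf2xml>
<page number="1" height="1188" width="918">
<fontspec id="0" size="18" family="Times" color="#000000"/>
<fontspec id="1" size="10" family="Times" color="#000000"/>
<text top="100" left="60" width="300" height="20" font="0">Sample heading</text>
<text top="140" left="60" width="380" height="12" font="1">Ieder jaar laat Provincie Utrecht circa 10% van het</text>
<text top="140" left="480" width="380" height="12" font="1">grens met Provincie Noord-Holland. Weer naar het zui-</text>
<text top="155" left="60" width="380" height="12" font="1">landelijk gebied onderzoeken op flora en fauna. In</text>
<text top="155" left="480" width="380" height="12" font="1">den vormt eerst de Angstel en later de A2 de grens tot</text>
</page>
</pdf2xml>"""


def read_columns(xml_text):
    """Return (kind, text) pairs per page, left column first, then right column."""
    root = ET.fromstring(xml_text)
    # Font specs are collected document-wide, since a page may reuse earlier ones.
    sizes = {spec.get("id"): int(spec.get("size")) for spec in root.iter("fontspec")}
    ordered = []
    for page in root.iter("page"):
        texts = list(page.iter("text"))
        if not texts:
            continue
        # Assumption: the most common font size on the page is body text.
        body_size = Counter(sizes[t.get("font")] for t in texts).most_common(1)[0][0]
        middle = int(page.get("width")) / 2
        # Sort by column (left of the midpoint first), then top to bottom.
        texts.sort(key=lambda t: (int(t.get("left")) >= middle, int(t.get("top"))))
        for t in texts:
            kind = "heading" if sizes[t.get("font")] > body_size else "text"
            ordered.append((kind, "".join(t.itertext())))
    return ordered


for kind, text in read_columns(SAMPLE_XML):
    print(kind, text)
```

Because the position and font of every fragment are available, the column order no longer depends on how the text happens to be stored in the stream.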


joepio · 2021-12-06T16:40:38Z

Thanks for pointing this out, and taking the time to write up a proposal, @tsaltena! Much appreciated.

Seems like a simple fix. I do need to check that this doesn't break existing pages.

joepio · 2024-11-07T14:46:56Z

@robvandijk maybe this is a good issue to take a look at once you can import. This would be a relevant piece of functionality to improve!

joepio assigned robvandijk Nov 7, 2024
