Poor PDF parsing results, especially in multi-column documents #430

Open
tsaltena opened this issue Dec 6, 2021 · 2 comments

tsaltena commented Dec 6, 2021

Let's take as an example: https://www.stateninformatie.provincie-utrecht.nl/api/v1/meetings/8992/documents/23264

In the PDF, we find a multi-column style like this:

[image: screenshot of the multi-column page layout]

Current Output

While the output may look fine at first glance, the parser places lines from both columns on the same output line. The reading order of the sentence fragments is lost, which makes reconstructing the actual paragraphs difficult.

Ieder jaar laat Provincie Utrecht circa 10% van het    grens met Provincie Noord-Holland. Weer naar het zui-
landelijk gebied onderzoeken op flora en fauna. In        den vormt eerst de Angstel en later de A2 de grens tot

Desired Output

Ieder jaar laat Provincie Utrecht circa 10% van het
landelijk gebied onderzoeken op flora en fauna. In
...
grens met Provincie Noord-Holland. Weer naar het zui-
den vormt eerst de Angstel en later de A2 de grens tot
aan de bebouwde kom van Maarssen. Naar het oosten
toe vormt de bebouwde kom van Utrecht de grens tot
aan de Utrechtse Heuvelrug bij De Bilt en Bilthoven.

Suggested Solution

The pdftotext module accepts a 'raw' parameter that switches parsing from the default mode to 'raw' mode, which emits text in the order in which it appears in the PDF content stream.

Current:

    for i, page in enumerate(pdftotext.PDF(f), start=1):

Proposed:

    for i, page in enumerate(pdftotext.PDF(f, raw=True), start=1):
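To judge what the switch changes before rolling it out, the two modes could be compared page by page. The following is a minimal sketch, assuming the pdftotext Python package is installed; the helper names extract_pages and changed_pages and the pdf_factory parameter are hypothetical and not part of this project. Pages are compared on their word sequence rather than on exact text, since whitespace is expected to differ between modes anyway.

```python
def extract_pages(fileobj, raw=False, pdf_factory=None):
    """Return the text of every page as a list of strings."""
    if pdf_factory is None:
        # Imported lazily so the helper can also be exercised with a stub.
        import pdftotext
        pdf_factory = pdftotext.PDF
    # Rewind, because the file object is read again for each mode.
    fileobj.seek(0)
    return list(pdf_factory(fileobj, raw=raw))


def changed_pages(fileobj, pdf_factory=None):
    """Return 1-based numbers of pages whose word order differs between modes."""
    default = extract_pages(fileobj, raw=False, pdf_factory=pdf_factory)
    raw = extract_pages(fileobj, raw=True, pdf_factory=pdf_factory)
    return [
        number
        for number, (d, r) in enumerate(zip(default, raw), start=1)
        # Compare word sequences so pure whitespace changes are ignored.
        if d.split() != r.split()
    ]


# Possible usage (the file name is a placeholder):
#
#     with open("document.pdf", "rb") as f:
#         print(changed_pages(f))
```

Pages reported by changed_pages are the ones worth inspecting by hand; single-column pages would ideally not show up at all.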

It could be even better to convert all PDFs to XML, as that retains more structure and makes it possible, for instance, to identify headings, tables of contents and tables.
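As a rough illustration of what an XML route could offer, the sketch below reads positioned text lines and uses font size and horizontal position to separate headings from body text and to restore column order. It assumes XML shaped like the output of Poppler's pdftohtml -xml (page, fontspec and text elements with top, left and font attributes); that shape should be verified against real output. The SAMPLE document is hand-written, and the read_layout name, the heading_size threshold and the split at half the page width are all hypothetical choices.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for converter output; not taken from a real PDF.
SAMPLE = """<pdf2xml>
<page number="1" height="1262" width="892">
<fontspec id="0" size="18" family="Times" color="#000000"/>
<fontspec id="1" size="10" family="Times" color="#000000"/>
<text top="90" left="80" width="200" height="20" font="0"><b>Heading</b></text>
<text top="130" left="80" width="330" height="12" font="1">left 1</text>
<text top="130" left="460" width="330" height="12" font="1">right 1</text>
<text top="145" left="80" width="330" height="12" font="1">left 2</text>
<text top="145" left="460" width="330" height="12" font="1">right 2</text>
</page>
</pdf2xml>"""


def read_layout(xml_text, heading_size=14):
    """Return (headings, body_lines) with body lines in column reading order."""
    root = ET.fromstring(xml_text)
    # Map each font id to its size; ids are collected across the whole file.
    font_sizes = {f.get("id"): int(f.get("size")) for f in root.iter("fontspec")}
    headings, body = [], []
    for page in root.iter("page"):
        middle = int(page.get("width")) / 2  # naive two-column split
        lines = []
        for node in page.iter("text"):
            # itertext() also picks up text nested in <b> or <i> children.
            text = "".join(node.itertext()).strip()
            if not text:
                continue
            if font_sizes.get(node.get("font"), 0) >= heading_size:
                headings.append(text)
                continue
            column = 0 if int(node.get("left")) < middle else 1
            lines.append((column, int(node.get("top")), text))
        # Left column top to bottom, then right column top to bottom.
        body.extend(text for _, _, text in sorted(lines))
    return headings, body


print(read_layout(SAMPLE))
```

A fixed split at the page middle is only a starting point; documents that mix single-column and multi-column pages would need the column boundaries to be detected per page.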

Contributor

joepio commented Dec 6, 2021

Thanks for pointing this out, and taking the time to write up a proposal, @tsaltena! Much appreciated.

Seems like a simple fix. I do need to check that this doesn't break existing pages.

Contributor

joepio commented Nov 7, 2024

@robvandijk, this might be a good issue to look at once you can import; it would be a relevant piece of functionality to improve!
