In the PDF, we find a multi-column style like this:
Current Output
Although the output looks plausible at first glance, each extracted line joins a line from the left column with a line from the right column. The parser has lost the order of these sentence fragments, so reconstructing the actual paragraphs becomes difficult.
Ieder jaar laat Provincie Utrecht circa 10% van het grens met Provincie Noord-Holland. Weer naar het zui-
landelijk gebied onderzoeken op flora en fauna. In den vormt eerst de Angstel en later de A2 de grens tot
Desired Output
Ieder jaar laat Provincie Utrecht circa 10% van het
landelijk gebied onderzoeken op flora en fauna. In
...
grens met Provincie Noord-Holland. Weer naar het zui-
den vormt eerst de Angstel en later de A2 de grens tot
aan de bebouwde kom van Maarssen. Naar het oosten
toe vormt de bebouwde kom van Utrecht de grens tot
aan de Utrechtse Heuvelrug bij De Bilt en Bilthoven.
Suggested Solution
The pdftotext module accepts a 'raw' parameter that switches from the default parsing mode to 'raw' mode, which keeps the text in the order in which it appears in the PDF content stream.
for i, page in enumerate(pdftotext.PDF(f, raw=True), start=1):
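As a minimal sketch of how the loop above could be wrapped up, the function below opens a PDF and returns its pages in raw (content stream) order. The function name extract_pages and the pdf_class argument are hypothetical and not part of the existing code base; pdf_class exists only so that the sketch can be exercised without the third-party pdftotext package installed.

```python
def extract_pages(path, raw=True, pdf_class=None):
    """Return a list of (page_number, text) tuples for the PDF at `path`.

    `raw=True` asks pdftotext to keep the text in content stream order
    instead of its default layout-based order.
    `pdf_class` is a hypothetical hook for testing; it defaults to
    pdftotext.PDF.
    """
    if pdf_class is None:
        import pdftotext  # third-party package, imported lazily
        pdf_class = pdftotext.PDF

    with open(path, "rb") as f:
        pdf = pdf_class(f, raw=raw)
        # Page numbers start at 1, as in the original loop.
        return [(i, page) for i, page in enumerate(pdf, start=1)]
```

Whether raw order actually matches the reading order depends on how the PDF was produced, so the result is worth checking against a few multi-column documents before switching the default.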
It might be even better to convert all PDFs to XML, as that retains more of the document structure and, for instance, makes it possible to identify headings, tables of contents and tables.
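To illustrate why positional information helps, here is a rough sketch that restores column order from an XML representation. It assumes an XML layout loosely modelled on what a converter such as Poppler's pdftohtml emits in XML mode: 'page' elements with a 'width' attribute and 'text' elements with 'top' and 'left' attributes. The sample XML is hand-written with made-up coordinates, the function name read_columns is hypothetical, and splitting at half the page width is a simplification that only suits two-column pages.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; coordinates are invented for illustration only.
SAMPLE_XML = """<pdf2xml>
<page number="1" height="842" width="595">
<text top="100" left="50">Ieder jaar laat Provincie Utrecht circa 10% van het</text>
<text top="100" left="310">grens met Provincie Noord-Holland. Weer naar het zui-</text>
<text top="114" left="50">landelijk gebied onderzoeken op flora en fauna. In</text>
<text top="114" left="310">den vormt eerst de Angstel en later de A2 de grens tot</text>
</page>
</pdf2xml>"""


def read_columns(xml_text):
    """Return, per page, the text lines ordered left column first."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        # Assumption: two columns, split at the middle of the page.
        split = float(page.get("width")) / 2
        left, right = [], []
        for el in page.iter("text"):
            text = "".join(el.itertext()).strip()
            entry = (float(el.get("top")), text)
            if float(el.get("left")) < split:
                left.append(entry)
            else:
                right.append(entry)
        # Within each column, order the lines from top to bottom.
        left.sort(key=lambda e: e[0])
        right.sort(key=lambda e: e[0])
        pages.append([t for _, t in left] + [t for _, t in right])
    return pages


for line in read_columns(SAMPLE_XML)[0]:
    print(line)
```

A real implementation would need to detect the number of columns and their boundaries rather than assume a fixed split.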
Example document: https://www.stateninformatie.provincie-utrecht.nl/api/v1/meetings/8992/documents/23264
Relevant code: open-raadsinformatie/ocd_backend/utils/file_parsing.py, line 15 at commit 01a2859