Below are the findings from testing various formats with the library:
Supported Formats
Image:
✅ Working as expected.
Word Documents (e.g., .docx):
✅ Working as expected.
Excel Sheets (e.g., .xlsx):
✅ Working as expected.
HTML:
✅ Working as expected.
XML:
✅ Working as expected.
ZIP Files:
✅ Working as expected.
Partially Supported Formats
PDF: ⚠️ Content is being extracted but not converted into proper Markdown format.
PowerPoint Presentations (e.g., .pptx): ⚠️ Content is being extracted but not converted into proper Markdown format.
Unsupported Formats or Errors
Audio (e.g., .wav):
❌ Processing .wav files fails with: NameError: name 'IS_AUDIO_TRANSCRIPTION_CAPABLE' is not defined
JSON:
❌ UnsupportedFormatException: Could not convert 'data.json' to Markdown. The formats ['.json', '.json'] are not supported.
CSV:
❌ UnsupportedFormatException: Could not convert 'data.csv' to Markdown. The formats ['.csv'] are not supported.
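Until CSV and JSON are handled by the library, a small pre-conversion step can cover both cases. The following is a minimal sketch using only the Python standard library. The function names `csv_to_markdown` and `json_to_markdown` are hypothetical and are not part of the library's API. The sketch assumes the first CSV row is a header and that JSON is best shown as a pretty-printed fenced block.

```python
import csv
import io
import json


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a pipe table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes so cells cannot break the table, then pad short rows.
        cells = [cell.replace("|", "\\|").strip() for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(header), separator, *(fmt(row) for row in body)])


def json_to_markdown(text: str) -> str:
    """Validate JSON and wrap a pretty-printed copy in a fenced code block."""
    data = json.loads(text)  # raises json.JSONDecodeError on invalid input
    fence = "`" * 3  # built at runtime to avoid a literal fence in this example
    pretty = json.dumps(data, indent=2, ensure_ascii=False)
    return f"{fence}json\n{pretty}\n{fence}"


if __name__ == "__main__":
    print(csv_to_markdown("name,qty\napple,3\n"))
    print(json_to_markdown('{"a": 1}'))
```

The resulting string can be used directly wherever the library's Markdown output would otherwise have been used for these two file types.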
Suggestions for Improvement
Provide Metadata Information:
Include additional metadata, such as page numbers, in the extracted Markdown content. This can be useful for tracking and reference purposes.
Handle Embedded Images in PDFs and Documents:
Use an LLM, such as GPT-4 or similar, to extract and describe images embedded in PDFs and other documents. Many real-world documents include critical visual information interspersed with text.
Improve PDF Text Extraction:
The library appears to use pdfminer.high_level.extract_text, which returns plain text only. Consider a richer extraction approach that preserves layout-aware text and embedded elements (e.g., tables, images).
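To illustrate the page-number and layout-aware suggestions above, here is a rough sketch built on pdfminer.six's `extract_pages`, which yields one layout tree per page instead of a single text blob. The helper names `page_to_markdown` and `pdf_to_markdown`, the HTML-comment page marker, and the 1.2x font-size threshold for headings are all assumptions made for illustration. This is not how the library currently works, and tables and images are still skipped.

```python
from statistics import median


def page_to_markdown(page_number, lines):
    """Render one page as Markdown.

    `lines` is a list of (text, font_size) pairs. Lines set noticeably larger
    than the page's median font size are treated as headings (a heuristic).
    """
    out = [f"<!-- page {page_number} -->"]  # assumed page-marker convention
    sizes = [size for text, size in lines if text.strip()]
    if not sizes:
        return out[0]
    body_size = median(sizes)
    for text, size in lines:
        text = " ".join(text.split())  # collapse whitespace and newlines
        if not text:
            continue
        # The 1.2 factor is an arbitrary threshold chosen for this sketch.
        out.append(f"## {text}" if size >= 1.2 * body_size else text)
    return "\n\n".join(out)


def pdf_to_markdown(path):
    """Convert a PDF page by page using pdfminer.six layout analysis."""
    # Imported lazily so the pure helper above works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    pages = []
    for number, layout in enumerate(extract_pages(path), start=1):
        lines = []
        for element in layout:
            if not isinstance(element, LTTextContainer):
                continue  # figures, images, and curves are skipped here
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                if sizes:
                    lines.append((line.get_text(), max(sizes)))
        pages.append(page_to_markdown(number, lines))
    return "\n\n".join(pages)
```

Keeping the rendering step separate from the pdfminer traversal means the heading heuristic can be tuned or replaced without touching the extraction code.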
Library version:
python version 3.11.9