Bug, Suggestion: Improve Markdown Conversion, Format Support, and Rich Content Extraction #216

saiprathapreddychinta · 2024-12-25T17:50:59Z

Library version:

Python version: 3.11.9

Below are the findings from testing various formats with the library:

PDF:
⚠️ Content is being extracted but not converted into proper Markdown format.
PowerPoint Presentations (e.g., .pptx):
⚠️ Content is being extracted but not converted into proper Markdown format.
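
To illustrate what I mean by "proper Markdown" for slides, here is a minimal sketch. It assumes the slide text has already been extracted; the `Slide` container and `slides_to_markdown` function are hypothetical names for illustration, not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class Slide:
    """Hypothetical container for text already extracted from one slide."""
    title: str
    bullets: list[str] = field(default_factory=list)


def slides_to_markdown(slides: list[Slide]) -> str:
    """Render each slide as a level-2 heading followed by a bullet list."""
    parts = []
    for number, slide in enumerate(slides, start=1):
        # Fall back to a numbered heading when a slide has no title.
        heading = slide.title.strip() or f"Slide {number}"
        lines = [f"## {heading}"]
        # Skip bullets that are empty after trimming whitespace.
        lines.extend(f"- {b.strip()}" for b in slide.bullets if b.strip())
        parts.append("\n".join(lines))
    return "\n\n".join(parts) + "\n"
```

Something similar could apply to PDFs if headings and list items can be detected from the layout.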

Audio (e.g., .wav):
❌ Processing .wav files fails with: NameError: name 'IS_AUDIO_TRANSCRIPTION_CAPABLE' is not defined
JSON:
❌ UnsupportedFormatException: Could not convert 'data.json' to Markdown. The formats ['.json', '.json'] are not supported.
CSV:
❌ UnsupportedFormatException: Could not convert 'data.csv' to Markdown. The formats ['.csv'] are not supported.
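
For JSON and CSV, a rough sketch of the output I would expect, using only the standard library. The function names `csv_to_markdown` and `json_to_markdown` are my own placeholders, not existing library APIs, and treating the first CSV row as the header is an assumption.

```python
import csv
import io
import json


def _escape_cell(value: str) -> str:
    # Pipes would otherwise be read as column separators in a Markdown table.
    return value.replace("|", "\\|").replace("\n", " ")


def csv_to_markdown(text: str) -> str:
    """Convert CSV text to a Markdown table, treating the first row as the header."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of cells.
    rows = [row + [""] * (width - len(row)) for row in rows]
    header, *body = rows
    lines = ["| " + " | ".join(_escape_cell(c) for c in header) + " |"]
    lines.append("| " + " | ".join("---" for _ in range(width)) + " |")
    for row in body:
        lines.append("| " + " | ".join(_escape_cell(c) for c in row) + " |")
    return "\n".join(lines) + "\n"


def json_to_markdown(text: str) -> str:
    """Pretty-print JSON inside a fenced code block; raises ValueError on invalid JSON."""
    data = json.loads(text)
    pretty = json.dumps(data, indent=2, ensure_ascii=False)
    fence = "`" * 3  # built programmatically to avoid a literal fence in this snippet
    return f"{fence}json\n{pretty}\n{fence}\n"
```

A table seems like the natural target for CSV, while a fenced block keeps arbitrary JSON nesting intact; other mappings are certainly possible.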

Provide Metadata Information:
- Include additional metadata, such as page numbers, in the extracted Markdown content. This can be useful for tracking and reference purposes.
Handle Embedded Images in PDFs and Documents:
- Use an LLM, such as GPT-4, to extract and describe images embedded in PDFs and other documents. Many real-world documents carry critical visual information interspersed with the text.
Improve PDF Text Extraction:
- The library appears to use pdfminer.high_level.extract_text, which returns plain text only. Consider a layout-aware approach that also captures richer data, such as embedded elements (e.g., tables, images).
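
As a starting point for the page-number suggestion, here is a minimal sketch that assumes pdfminer.six is installed. It uses `extract_pages` from `pdfminer.high_level` to walk the PDF one page at a time; the helper names and the HTML-comment page marker are my own choices for illustration, not anything the library currently does.

```python
def extract_page_texts(pdf_path: str) -> list[str]:
    """Return one text string per page using pdfminer.six (must be installed)."""
    # Imported lazily so the formatting helper below works without pdfminer.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    pages = []
    for page_layout in extract_pages(pdf_path):
        # Keep only layout elements that carry text; figures and lines are skipped.
        chunks = [el.get_text() for el in page_layout if isinstance(el, LTTextContainer)]
        pages.append("".join(chunks))
    return pages


def pages_to_markdown(pages: list[str]) -> str:
    """Join page texts, preceding each with an HTML comment recording its page number."""
    parts = []
    for number, text in enumerate(pages, start=1):
        parts.append(f"<!-- page {number} -->\n\n{text.strip()}")
    return "\n\n".join(parts) + "\n"
```

Usage would be something like `pages_to_markdown(extract_page_texts("report.pdf"))`, where the file name is a placeholder. An HTML comment stays invisible when the Markdown is rendered but is still available to downstream tools; a visible heading per page would work too.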


l-lumin · 2024-12-26T03:19:58Z

related to
