Bug, Suggestion: Improve Markdown Conversion, Format Support, and Rich Content Extraction #216

Open
saiprathapreddychinta opened this issue Dec 25, 2024 · 1 comment


saiprathapreddychinta commented Dec 25, 2024

Library version:

  • markitdown==0.0.1a3

Python version: 3.11.9

Summary of Observations

Below are the findings from testing various formats with the library:

Supported Formats

  1. Image:
    ✅ Working as expected.

  2. Word Documents (e.g., .docx):
    ✅ Working as expected.

  3. Excel Sheets (e.g., .xlsx):
    ✅ Working as expected.

  4. HTML:
    ✅ Working as expected.

  5. XML:
    ✅ Working as expected.

  6. ZIP Files:
    ✅ Working as expected.


Partially Supported Formats

  1. PDF:
    ⚠️ Content is being extracted but not converted into proper Markdown format.

  2. PowerPoint Presentations (e.g., .pptx):
    ⚠️ Content is being extracted but not converted into proper Markdown format.


Unsupported Formats or Errors

  1. Audio (e.g., .wav):
    ❌ Processing .wav files raises an error: NameError: name 'IS_AUDIO_TRANSCRIPTION_CAPABLE' is not defined

  2. JSON:
    ❌ Conversion fails: UnsupportedFormatException: Could not convert 'data.json' to Markdown. The formats ['.json', '.json'] are not supported.

  3. CSV:
    ❌ Conversion fails: UnsupportedFormatException: Could not convert 'data.csv' to Markdown. The formats ['.csv'] are not supported.
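
Until CSV and JSON are supported, a small pre-processing step could produce usable Markdown. The following is only a sketch using the Python standard library, not markitdown's API: the function names `csv_to_markdown` and `json_to_markdown` are hypothetical, and it assumes the first CSV row is a header.

```python
import csv
import io
import json


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown pipe table (first row treated as header)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes and pad short rows so every line has the same column count.
        cells = [cell.replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in rows[1:]]
    return "\n".join(lines)


def json_to_markdown(text: str) -> str:
    """Pretty-print JSON inside a fenced code block."""
    fence = "`" * 3  # built at runtime to avoid a literal fence in this snippet
    body = json.dumps(json.loads(text), indent=2, ensure_ascii=False)
    return f"{fence}json\n{body}\n{fence}"
```

For example, `csv_to_markdown("a,b\n1,2\n")` yields a two-column table with header `a | b` and one data row.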


Suggestions for Improvement

  1. Provide Metadata Information:

    • Include additional metadata, such as page numbers, in the extracted Markdown content. This can be useful for tracking and reference purposes.
  2. Handle Embedded Images in PDFs and Documents:

    • Use an LLM, such as GPT-4, to extract and describe images embedded in PDFs and other documents. Many real-world documents include critical visual information interspersed with text.
  3. Improve PDF Text Extraction:

    • The library currently uses pdfminer.high_level.extract_text, which extracts plain text only. Consider a richer approach that preserves layout and captures embedded elements (e.g., tables, images).
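
To illustrate the page-number suggestion, here is a rough sketch of how per-page text could be joined into Markdown with a page marker before each page. It assumes the text has already been split per page (for instance by iterating pdfminer's `extract_pages` instead of calling `extract_text`); the function name `pages_to_markdown` and the HTML-comment marker format are hypothetical choices, not anything the library provides.

```python
def pages_to_markdown(pages: list[str], source: str = "") -> str:
    """Join per-page text, prefixing each page with an HTML comment marker."""
    parts = []
    for number, text in enumerate(pages, start=1):
        # The marker is invisible when rendered but remains searchable in the raw Markdown.
        label = f"{source} page {number}".strip()
        parts.append(f"<!-- {label} -->\n\n{text.strip()}")
    return "\n\n".join(parts)
```

HTML comments are used here so the rendered output is unchanged while downstream tools can still map a passage back to its page; a visible heading per page would work equally well.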

l-lumin commented Dec 26, 2024
