Quivr made an Open-source Document Parsing Tool

MegaParse, our innovative open-source parsing framework, is on a mission to simplify and enhance this process at the fastest pace possible.

Author
Stan Girard
Published on
December 3, 2024

Parsing documents, particularly unstructured ones, is a challenging task that requires addressing multiple obstacles to ensure accurate and efficient results—an essential step for effective Retrieval-Augmented Generation (RAG) document ingestion.

In this article, we’ll discuss the challenges we’ve encountered, our strategies for improvement, and our vision for the future of document parsing.

Challenge 1: Parsing Tables Accurately

Tables in a document are invaluable time-savers for readers, so why are they such a challenge for modern parsing solutions? Whether extracted from a DOCX file, a native PDF, or an image-based PDF, tables often suffer from poor rendering due to their complex layouts. Modern parsing tools, like Unstructured or docTR, employ layout models that generally succeed in segmenting tables. However, these models alone fall short of reconstructing structured tables accurately, especially when relying on OCR for content extraction.

Our Approach

While layout models achieve around 90% accuracy in table segmentation, the real challenge lies in accurately rendering the table's content. MegaParse addresses this nuanced issue through two primary strategies:

  • Large Language Models (LLMs): Leverage the capabilities of LLMs to reconstruct tables based on the initial "draft" extracted during parsing.
  • General Large Vision Models (LVMs): Extract table information by isolating the cropped table from the page and incorporating it into the parsing output.
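As a rough illustration of the "isolate the cropped table" step, here is a minimal sketch that turns a detected table box into a padded crop region. The `BBox` type, the `padded_crop` helper, and the 10-pixel margin are all assumptions made for this example; they are not part of MegaParse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box in pixels, from (left, top) to (right, bottom)."""
    left: int
    top: int
    right: int
    bottom: int


def padded_crop(table: BBox, page_width: int, page_height: int, margin: int = 10) -> BBox:
    """Grow the detected table box by `margin` pixels, clamped to the page."""
    return BBox(
        left=max(0, table.left - margin),
        top=max(0, table.top - margin),
        right=min(page_width, table.right + margin),
        bottom=min(page_height, table.bottom + margin),
    )


# Hypothetical detection on a 1000x1400 px page render.
crop = padded_crop(BBox(5, 300, 995, 620), page_width=1000, page_height=1400)
print(crop)
```

A small margin keeps table borders and edge cells inside the crop, which tends to matter when the image is handed to a vision model. The resulting box can be passed to an image library's crop call, for example Pillow's `Image.crop((left, top, right, bottom))`.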

Future Direction: To further enhance performance, we plan to develop or integrate a specialized, smaller vision model specifically designed for table parsing.

Challenge 2: Identifying Document Structure

Understanding the hierarchy of a document (titles, subtitles, sections, and parts) is critical for accurate chunking and structured parsing. Although current layout detection provides some insights, it falls short of the granularity and accuracy needed to give LLMs the best possible context.

Our Approach

We are developing a structured parsing system that will act as a post-parsing module. This system will organize the ingested document based on a Pydantic BaseModel, leveraging open-source tools such as Outlines to streamline the process. I mean, isn't structured information the best?
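To make the idea of a document hierarchy concrete, here is a minimal sketch that nests a flat list of parsed elements into a tree of sections. It uses standard-library dataclasses to stay dependency-free, so it is a stand-in for the Pydantic BaseModel mentioned above rather than the actual schema. The `Section` fields, the `build_tree` helper, and the convention that level 0 means body text are all assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    level: int  # 1 = title, 2 = subtitle, and so on; 0 is the synthetic root
    text: list[str] = field(default_factory=list)
    children: list[Section] = field(default_factory=list)


def build_tree(elements: list[tuple[int, str]]) -> Section:
    """Nest flat (level, content) pairs; level 0 marks body text."""
    root = Section(title="<document>", level=0)
    stack = [root]  # path from the root down to the current section
    for level, content in elements:
        if level == 0:
            stack[-1].text.append(content)
            continue
        # Pop until the top of the stack is a strict ancestor of this heading.
        while stack[-1].level >= level:
            stack.pop()
        node = Section(title=content, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


elements = [
    (1, "Report"),
    (0, "Intro text."),
    (2, "Methods"),
    (0, "We measured."),
    (2, "Results"),
    (1, "Appendix"),
]
doc = build_tree(elements)
print([child.title for child in doc.children])
```

The same shape maps directly onto a Pydantic model with a self-referencing `children` field, which is what would let a constrained-generation tool fill it in.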

Challenge 3: Focusing on Relevant Content

Documents often contain non-informative elements like logos, headers, and footers, which can disrupt the flow of parsing and reduce the overall information density. Effectively handling these elements is essential for producing clean and meaningful outputs.

Our Approach

By comparing pages within a document, we identify headers and footers based on their consistent appearance at the top or bottom of multiple pages. Once detected, we remove these repetitive elements and seamlessly reassemble content that spans across pages to maintain continuity.
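The page-comparison idea can be sketched in a few lines. This is a simplified illustration, not MegaParse's implementation: it assumes each page is already a list of text lines, only looks at the first and last line of each page, masks digits so that page numbers match, and uses a hypothetical `min_share` threshold of 60% of pages.

```python
from __future__ import annotations

import re
from collections import Counter


def _normalize(line: str) -> str:
    """Mask digits so 'Page 3' and 'Page 4' count as the same line."""
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_edges(pages: list[list[str]], min_share: float = 0.6) -> list[list[str]]:
    """Drop first/last lines that recur on at least `min_share` of pages."""
    tops = Counter(_normalize(p[0]) for p in pages if p)
    bottoms = Counter(_normalize(p[-1]) for p in pages if p)
    needed = min_share * len(pages)
    cleaned = []
    for page in pages:
        lines = list(page)
        if lines and tops[_normalize(lines[0])] >= needed:
            lines = lines[1:]
        if lines and bottoms[_normalize(lines[-1])] >= needed:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned


def join_pages(pages: list[list[str]]) -> str:
    """Merge pages, gluing a sentence that was cut by a page break."""
    out: list[str] = []
    for page in pages:
        for i, line in enumerate(page):
            cut_mid_sentence = (
                i == 0
                and bool(out)
                and not out[-1].rstrip().endswith((".", "!", "?", ":"))
            )
            if cut_mid_sentence:
                out[-1] = out[-1].rstrip() + " " + line.lstrip()
            else:
                out.append(line)
    return "\n".join(out)


pages = [
    ["ACME Annual Report", "Revenue grew in every", "Page 1"],
    ["ACME Annual Report", "region during the year.", "Costs stayed flat.", "Page 2"],
    ["ACME Annual Report", "Outlook is stable.", "Page 3"],
]
text = join_pages(strip_repeated_edges(pages))
print(text)
```

Checking sentence-ending punctuation is a crude continuity test; a real system would likely also use position and font information from the layout model.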

Challenge 4: OCR or PDF Reader?

At Quivr, optimizing processing time across every step of our RAG pipeline is a top priority. This means MegaParse must strike a careful balance between performance and speed. A critical decision point is determining when to use OCR versus a PDF reader.

Our Approach

Currently, we employ a straightforward heuristic: for each page, we calculate the proportion of the area covered by images. If more than 50% of the page consists of images, OCR is used, as bypassing it would compromise accuracy despite its higher processing time. Otherwise, we rely on efficient libraries like pdfminer.six to extract PDF content directly.
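The heuristic above is easy to express in code. The sketch below assumes the page size and the image bounding boxes have already been read from the PDF; the `Rect` type and the `image_share` and `choose_extractor` helpers are hypothetical names. It also sums image areas without subtracting overlaps, which is a simplification.

```python
from __future__ import annotations

from typing import NamedTuple


class Rect(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def image_share(page: Rect, images: list[Rect]) -> float:
    """Fraction of the page covered by images (overlaps ignored, capped at 1)."""
    if page.area == 0:
        return 0.0
    return min(1.0, sum(img.area for img in images) / page.area)


def choose_extractor(page: Rect, images: list[Rect], threshold: float = 0.5) -> str:
    """Return 'ocr' when images cover more than `threshold` of the page."""
    return "ocr" if image_share(page, images) > threshold else "pdf_reader"


letter = Rect(0, 0, 612, 792)                # US Letter page, in PDF points
scanned = [Rect(0, 0, 612, 792)]             # one full-page scan
with_logo = [Rect(36, 700, 136, 760)]        # a small logo only

print(choose_extractor(letter, scanned))     # full-page image
print(choose_extractor(letter, with_logo))   # mostly text
```

Because the decision is made per page, a mixed document can route its scanned pages to OCR while its native pages go through the faster text extraction path.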

Modular Parsing with MegaParse

At MegaParse, we see parsing as the first foundational step in any document processing pipeline. Our framework is built with modularity in mind, allowing users to integrate post-processing blocks tailored to their unique requirements. This flexibility not only provides greater control but also enables precise handling of specific parsing challenges—many of which, such as those highlighted in the first three challenges, are addressed through these post-processing modules.
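One way to picture this modularity is a parser followed by a list of post-processing blocks, each taking text in and returning text out. The sketch below is a generic illustration of that pattern; `run_pipeline`, the two example blocks, and the stand-in parser are hypothetical and do not reflect MegaParse's actual interfaces.

```python
from __future__ import annotations

import re
from typing import Callable

PostProcessor = Callable[[str], str]


def strip_trailing_spaces(text: str) -> str:
    """Remove whitespace at the end of every line."""
    return "\n".join(line.rstrip() for line in text.split("\n"))


def collapse_blank_lines(text: str) -> str:
    """Reduce runs of three or more newlines to a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", text)


def run_pipeline(parse: Callable[[bytes], str], blocks: list[PostProcessor], raw: bytes) -> str:
    """Parse once, then apply each post-processing block in order."""
    text = parse(raw)
    for block in blocks:
        text = block(text)
    return text


def fake_parse(raw: bytes) -> str:
    """Stand-in parser for the example: just decode the bytes."""
    return raw.decode("utf-8")


result = run_pipeline(
    fake_parse,
    [strip_trailing_spaces, collapse_blank_lines],
    b"Title  \n\n\n\nBody text \n",
)
print(repr(result))
```

Keeping every block to the same signature means a table-repair step, a header-removal step, or a structured-output step can be added or reordered without touching the parser itself.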

For example, in the context of food product labels, we employ structured output generation to extract key details like product names, ingredients, and nutritional information. While we currently rely on Anthropic's structured output capabilities for this task due to their exceptional performance, this solution is costly. To mitigate this, we plan to develop a streamlined, cost-effective structured parsing pipeline using smaller models and leveraging tools like Outlines, as outlined in Challenge 2.

Join the Open Source MegaParse Community

We’re committed to making MegaParse accessible to everyone. Check out our GitHub repository, explore pre-built parsing or post-processing modules, or create your own. By contributing, you can help shape the future of document parsing and make MegaParse even more powerful.

Click here to join—we can't wait to see you PARSE

Conclusion

Parsing unstructured documents is full of challenges, from handling poorly rendered tables to managing document hierarchy and headers. With MegaParse, we’re addressing these difficulties through innovative modularity and collaboration. Our goal is to create a robust, modular parsing framework that adapts to any use case and empowers users to process documents with ease, just as they need it.

Interested in knowing more? Contact us at [email protected]. We are always cooking. 🐾
