LLM Document Extraction: How to use AI to get structured data from legacy documents


We at Pondhouse AI rely heavily on extracting data from documents: Word, PDF, Excel, you name it. All of these formats have one thing in common: they give the creator almost unlimited freedom in how to lay out the content. And that's the problem. While this freedom is great when creating documents, it makes extracting data from them a nightmare, especially when the content is technical and uses complicated layouts like multi-column PDFs or tables.

Traditional methods of data extraction, such as rule-based systems or even conventional machine learning approaches, often struggle with the complexity and variability of document layouts. This is where Large Language Models (LLMs) come into play. With their ability to understand context, interpret natural language, and adapt to various formats, LLMs offer a potentially promising solution to the challenges of getting structured data from documents.

In this blog post, we'll explore how LLMs can be used to extract data from complex documents, compare their performance with traditional methods, and discuss the potential benefits and limitations of this approach.

TL;DR: If you are just interested in the code for using an open-source LLM to retrieve data from documents, jump straight to the Summary at the end of this post.

The challenge of document data extraction

One of the primary challenges in document data extraction is the sheer diversity of formats and layouts. Documents come in numerous file types, including:

  • PDFs (Portable Document Format)
  • Microsoft Word documents (.doc, .docx)
  • Excel spreadsheets (.xls, .xlsx)
  • Scanned images (.jpg, .png, .tiff)
  • HTML files
  • Plain text files (.txt)

Each of these formats has its own structure and peculiarities. Moreover, within each format, there's an almost infinite variety of possible layouts. Documents can include:

  • Single or multiple columns
  • Tables and charts
  • Headers and footers
  • Sidebars and text boxes
  • Images and diagrams
  • Varying fonts, sizes, and styles

This variability makes it extremely difficult to create a one-size-fits-all solution. A method that works perfectly for one document may fail entirely when applied to another, even if they contain similar information.

Complexities in Technical Documents and Multi-Column PDFs

Technical documents and multi-column PDFs present their own set of challenges:

  • Technical Jargon: These documents often contain specialized terminology and acronyms that require domain-specific knowledge to interpret correctly.

  • Complex Tables: Technical documents frequently include tables with merged cells, nested headers, or footnotes.

  • Equations and Formulas: Mathematical or chemical equations can be difficult to parse and interpret programmatically.

  • Multi-Column Layouts: PDF documents with multiple columns can confuse text extraction algorithms, as they need to determine the correct reading order across columns.

  • Footnotes and References: These elements can interrupt the main text flow, complicating the extraction process.

  • Diagrams and Flowcharts: Technical diagrams often contain crucial information but are challenging to interpret automatically.

Limitations of Conventional Extraction Methods

Traditional methods of data extraction have several limitations when dealing with complex documents:

  • Rule-Based Systems: These systems rely on predefined rules to extract data. While they can be effective for standardized documents, they struggle with variability and require constant updating as document formats change.

  • Regular Expressions: While powerful for pattern matching, regex struggles with context understanding and can be brittle when document layouts change.

  • Optical Character Recognition (OCR): While OCR has improved significantly, it can still produce errors, especially with poor-quality scans or complex layouts.

  • Template Matching: This method works well for documents with a fixed structure but fails when layouts deviate from the expected template.

  • Lack of Context Understanding: Conventional methods often extract data based on position or pattern, without understanding the meaning or context of the information.

  • Difficulty with Unstructured Data: Many traditional methods struggle with free-form text or documents without a clear structure.

  • Limited Adaptability: Conventional systems often require extensive retraining or reprogramming to handle new document types or formats.

These limitations highlight the need for more advanced, flexible, and intelligent approaches to document data extraction. As we'll explore in the following sections, Large Language Models (LLMs) offer promising solutions to many of these challenges, bringing a new level of understanding and adaptability to the task of extracting information from complex documents.
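To make the brittleness of pattern-based extraction concrete, here is a minimal sketch. The two invoice snippets are invented for illustration, and the rule is a deliberately simple one written against the first layout only.

```python
import re

# Hypothetical invoice snippets, invented for illustration only.
layout_a = "Invoice No: 12345\nTotal: 99.50 EUR"
layout_b = "Invoice\nNumber\n12345\nAmount due (EUR)\n99.50"

# A rule written against layout A.
invoice_pattern = re.compile(r"Invoice No:\s*(\d+)")


def extract_invoice_number(text):
    """Return the invoice number, or None if the rule does not match."""
    match = invoice_pattern.search(text)
    return match.group(1) if match else None


result_a = extract_invoice_number(layout_a)  # label matches the rule
result_b = extract_invoice_number(layout_b)  # same data, different layout
print(result_a, result_b)
```

Both snippets carry the same invoice number, but the rule only finds it in the layout it was written for. Every new layout needs another rule.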

Using LLMs for Structured Data Document Extraction

Now that we know the challenges, let's think of a solution. The perfect solution would be a system with the following capabilities:

  • Can understand the context of the document

  • Is able to interpret natural language

  • Adapts to various formats and layouts

  • Handles technical jargon and complex tables

  • Handles various languages and writing styles

  • Preserves the structure of the document (e.g., tables, columns)

  • Delivers high accuracy and reliability

If we look at these requirements, Large Language Models seem worth a try. LLMs have strong language understanding, they are not bound to specific formats, you can instruct them to format the output as you require, and they can be trained on technical jargon.

The two remaining questions are: How can we even provide our documents to the LLMs? And do they perform well enough to be used in such cases? Let's find out.

Disclaimer: This post shows a hands-on example of how to use LLMs for document extraction. We'll also comment on the results of our test; however, we're not providing a full benchmark here. You will get the information needed to create your own, though.

Hands-on: Using MiniCPM-Llama3-V2.5 - an Open Source LLM for Structured Data Extraction

For this test, we'll use the MiniCPM-Llama3-V2.5 model, an open-source Large Language Model based on Llama3 with impressive vision capabilities. With only 8B parameters, this model surpasses widely used proprietary models like GPT-4V-1106, Gemini Pro, Claude 3 and Qwen-VL-Max on both vision and language tasks, according to benchmarks provided by its creators.

Also interesting for us: the model seems to be incredibly good at OCR tasks. According to the team behind the model, MiniCPM-Llama3-V 2.5 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving a score of 700+ on OCRBench and surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro.

How to provide documents to the Large Language Model?

Ok, that all sounds great, but how can one "send" a document to the LLM? LLMs don't have the capability to open and read documents. While that's true, modern multi-modal LLMs can "read" and "understand" images. You see where this is going? We can convert the documents to images and provide these images to the model. The model will then use its vision capabilities to understand the document and extract the text accordingly.

At first, we were a little hesitant to use this method for data extraction, because it seems quite brute-force. But if you consider how humans read and understand documents, there are similarities. Our eyes simply create images of the documents and send these images to the brain, where the understanding (and extraction, if you will) happens. That is not so different from what we are proposing here.

The system we are about to build therefore looks as follows:

[Figure: Document Extraction System]

The code to extract structured data from documents

Enough of the theory, let's get our hands dirty.

Note: This tutorial assumes you have a CUDA-enabled GPU with at least 16 GB of VRAM.

First, we need to install the required libraries:

pip install Pillow torch torchvision transformers sentencepiece pymupdf

Note: We use PyMuPDF to render the PDF pages as images. It has a copyleft license, so make sure to check whether it fits your use case. You can also use any other library that is able to render PDF pages to images.

Secondly, let's import the required libraries and load the model and tokenizer:

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
import fitz  # PyMuPDF

# Load the model in half precision and move it to the GPU
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5", trust_remote_code=True, torch_dtype=torch.float16
)
model = model.to(device="cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5", trust_remote_code=True
)
# Switch to inference mode
model.eval()

This might take a while, so grab a coffee while you wait.

Next, we want to create images from our document. For this tutorial, we'll use a PDF file and create one image per page:

pdf_path = "mypdf.pdf"
pdf_document = fitz.open(pdf_path)

images = []

# Loop through each page in the PDF
for page_number in range(len(pdf_document)):
    page = pdf_document.load_page(page_number)
    # Render the page to a pixmap and convert it to a PIL image
    pix = page.get_pixmap()
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

    images.append(img)

pdf_document.close()

No further explanation needed here. We simply loop through all the pages of the document and create an image from each page. These images are collected in the images list.
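One detail worth knowing: called without arguments, get_pixmap() renders one pixel per PDF point (72 dpi), which is fairly coarse for small print. The following is a minimal sketch of how a zoom factor could be derived from the 1.8 million pixel limit quoted above. The helper name zoom_for_pixel_budget is our own, and whether a higher resolution actually improves results for your documents is something to test.

```python
import math


def zoom_for_pixel_budget(width_pt, height_pt, max_pixels=1_800_000):
    """Return a uniform zoom factor so the rendered page stays within max_pixels.

    At zoom 1.0, PyMuPDF renders one pixel per PDF point (72 dpi).
    """
    if width_pt <= 0 or height_pt <= 0:
        raise ValueError("page dimensions must be positive")
    return math.sqrt(max_pixels / (width_pt * height_pt))


# An A4 page is roughly 595 x 842 points.
zoom = zoom_for_pixel_budget(595, 842)
print(round(zoom, 2))

# Inside the page loop, the factor could be applied like this
# (page.rect holds the page size in points):
#   zoom = zoom_for_pixel_budget(page.rect.width, page.rect.height)
#   pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
```

Because the pixmap dimensions are rounded to whole pixels, the rendered image lands approximately at the budget, not exactly on it.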

And finally, we can send the images to the model and extract the text. (We'll provide the code for a single image, but you can easily loop over all images in the list.)

question = """Extract all the text in this image.
If there is a header or a footer, just ignore it.
Extract tables as markdown tables.
Don't use the subtitles for the list items, just return the list as text.
"""
msgs = [{"role": "user", "content": question}]

res = model.chat(
    image=images[0],
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7,
    # system_prompt="" # pass system_prompt if needed
)
print(res)

Also pretty straightforward. We provide the image to the model, along with a question, and the model will return the extracted text.
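To process the whole document instead of a single page, the same call can be wrapped in a loop. This is a minimal sketch: the function name extract_document is our own, and it assumes that model.chat returns the extracted text as a string, as in the call above.

```python
def extract_document(model, tokenizer, images, question):
    """Run the same extraction prompt on every page image, in page order."""
    page_texts = []
    for image in images:
        # A fresh message list per page keeps the pages independent.
        msgs = [{"role": "user", "content": question}]
        text = model.chat(
            image=image,
            msgs=msgs,
            tokenizer=tokenizer,
            sampling=True,
            temperature=0.7,
        )
        page_texts.append(text)
    return page_texts
```

With the objects defined earlier, calling extract_document(model, tokenizer, images, question) returns one text per page, which can then be joined with "\n\n".join(...) into a single document.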

As you can see from the question variable, we can simply prompt the model for how we want the text to be extracted. Our experience shows that the model's understanding of complex requests is a little limited, as it's just based on the Llama3 8B model. Nevertheless, it works very well for a surprising number of requests, like removing headers and footers from the actual output.

Note: Make sure to use a concurrent and batched version of this sample when dealing with larger amounts of data.
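What such a concurrent and batched version looks like depends heavily on your setup. The sketch below shows one possible shape using only the standard library. It assumes a hypothetical extract_page callable (page in, text out), for example a wrapper around a request to a model served behind an API. With a single local GPU, threads alone will most likely not speed things up.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def extract_concurrently(extract_page, pages, batch_size=8, max_workers=4):
    """Apply extract_page to all pages, batch by batch, preserving page order.

    extract_page is a hypothetical callable (page -> text) that must be safe
    to call from several threads at once.
    """
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batched(pages, batch_size):
            # pool.map returns results in input order, not completion order.
            results.extend(pool.map(extract_page, batch))
    return results
```

Working batch by batch keeps the number of in-flight pages bounded, which helps when the pages are large images held in memory.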

Considering that we just use an off-the-shelf, rather small LLM, the results are phenomenal. The model was able to extract both the text and the tables in our examples in a very reasonable format.
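Since the prompt asks for tables as markdown, the output can be turned into structured records with a few lines of code. The following is a minimal sketch that handles simple pipe-delimited tables only (no escaped pipes, no multi-line cells). The sample table is invented for illustration and is not actual model output.

```python
def parse_markdown_table(markdown):
    """Parse a simple pipe-delimited markdown table into a list of dicts."""
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the separator row, e.g. |---|:---:|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Invented sample, not actual model output.
sample = """
| Part | Voltage | Current |
|------|---------|---------|
| A1   | 5 V     | 0.2 A   |
| B7   | 12 V    | 1.5 A   |
"""
records = parse_markdown_table(sample)
print(records)
```

Each table row becomes a dictionary keyed by the header cells, which is a convenient shape for writing to JSON or a database.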

One thing we recognized during testing: The model always tries to output the text with prefixes, like:

**title**: "The actual title"
**text**: "The actual text"
**subtitle**: "The subtitle"
**text**: "The text of this paragraph"
...

This has some advantages as well as disadvantages. The advantage is that you have the semantics of the text right in the output. Especially useful if you want to chunk the text later on. However, if you just want the text, this might be a little annoying.
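To make use of these semantics, the prefixed output can be split into labeled segments. This is a minimal sketch that assumes the output follows the **label**: "text" shape shown above, one segment per line. Real output may deviate, so unlabeled lines are kept as plain text.

```python
import re

# Matches lines such as: **title**: "The actual title"
LABELED_LINE = re.compile(r'^\*\*(?P<label>\w+)\*\*:\s*"?(?P<text>.*?)"?\s*$')


def parse_labeled_output(output):
    """Turn prefixed model output into a list of (label, text) tuples."""
    segments = []
    for line in output.splitlines():
        stripped = line.strip()
        match = LABELED_LINE.match(stripped)
        if match:
            segments.append((match.group("label"), match.group("text")))
        elif stripped:
            # Keep unlabeled lines instead of silently dropping them.
            segments.append(("text", stripped))
    return segments
```

A chunking step could then start a new chunk at every title or subtitle segment and append the text segments that follow.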

We tried to remove these prefixes using prompting techniques; however, the model seems to be very stubborn in this regard. It seems that this kind of output is deeply ingrained in the model's training. So the best strategy is to simply remove these prefixes in a post-processing step. In our tests, the prefixes were very consistent, so a simple regex might do the trick.
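A minimal sketch of such a regex-based cleanup is shown below. It assumes the prefixes always have the shape **label**: at the start of a line and that the text is wrapped in double quotes, as in the sample above. Adjust the pattern if your output looks different.

```python
import re

# Assumed prefix shape: **label**: at the start of a line.
PREFIX = re.compile(r"^[ \t]*\*\*\w+\*\*:[ \t]*", flags=re.MULTILINE)


def strip_prefixes(output):
    """Remove **label**: prefixes and surrounding double quotes per line."""
    cleaned = []
    for line in PREFIX.sub("", output).splitlines():
        line = line.strip()
        if len(line) >= 2 and line[0] == '"' and line[-1] == '"':
            line = line[1:-1]
        cleaned.append(line)
    return "\n".join(cleaned)
```

Lines without a prefix pass through unchanged, so the function is safe to apply to output where the model occasionally skips the labels.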

Conclusion

In this blog post, we explored the challenges of extracting data from complex documents and discussed the limitations of traditional methods. We then introduced Large Language Models (LLMs) as a potential solution to these challenges and demonstrated how to use an open-source LLM for document data extraction.

Our hands-on example showed that LLMs can be highly effective in extracting text from documents, even in cases with complex layouts and technical content. While the model we used was relatively small, it performed well and provided accurate results.

We are very excited about these results. They were not only good in terms of the extracted text, but they also give us a completely new level of flexibility for our data extraction pipelines, since we can instruct the LLM on exactly how to extract the data.

Furthermore, it was also impressive to see how well the model handled the extraction of tables (and various layouts we did not describe here but tested the model on). This is a significant improvement over anything we had in the past.

As a next step, we plan to compare this comparatively small and cheap model with the "king" of vision models at the moment: GPT-4o. Let's see whether MiniCPM-Llama3-V2.5 can hold up and whether a much bigger model is actually necessary for document extraction. Stay tuned.

Summary: The code for the complete example

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
import fitz  # PyMuPDF

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5", trust_remote_code=True, torch_dtype=torch.float16
)
model = model.to(device="cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5", trust_remote_code=True
)
model.eval()

pdf_path = "mypdf.pdf"
pdf_document = fitz.open(pdf_path)

images = []

for page_number in range(len(pdf_document)):
    page = pdf_document.load_page(page_number)
    pix = page.get_pixmap()
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

    images.append(img)

pdf_document.close()

question = """Extract all the text in this image.
If there is a header or a footer, just ignore it.
Extract tables as markdown tables.
Don't use the subtitles for the list items, just return the list as text.
"""
msgs = [{"role": "user", "content": question}]

res = model.chat(
    image=images[0],  # loop through all images in images list here
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7,
    # system_prompt='' # pass system_prompt if needed
)
print(res)
