robertofarias
Published on

Improving Document Processing with Docling: A Practical Example

Docling is an interesting new tool for converting documents to Markdown/JSON.

From the README:

Reads popular document formats (PDF, DOCX, PPTX, Images, HTML, AsciiDoc, Markdown) and exports to Markdown and JSON.

I tried the CLI (there is also a Python API), and it worked well!
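For reference, here is a minimal sketch of the Python route. I only tested the CLI, so treat this as an assumption based on the Docling README: it relies on `DocumentConverter` and `export_to_markdown()` as documented there, and the exact names may differ between Docling versions. The helper name `convert_to_markdown` is my own.

```python
def convert_to_markdown(source: str) -> str:
    """Convert a document (local path or URL) to a Markdown string.

    Sketch only: assumes the API shown in the Docling README.
    """
    # Imported lazily so that this module loads even where Docling
    # is not installed; the import fails only when the function is called.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()
```

Calling `print(convert_to_markdown("table.pdf"))` should then print roughly the same Markdown that the CLI writes to disk.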

Why do I think it's interesting?

The quality of document ingestion largely determines the quality of the answers you get from LLMs and RAG systems, and this tool is a big step forward in that area.

Demo

Installation

Note: Installation may take some time.

pipx install docling

Usage

The CLI usage is simple. Just pass the file you want to convert as an argument:

docling table.pdf

This generates an .md file.

For example, using this sample PDF:

GitHub Traffic

Resulting Markdown:

## Example table

## This is an example of a data table.

| Disability | Participants | Ballots   | Ballots                | Results Time to | Results Time to |
| ---------- | ------------ | --------- | ---------------------- | --------------- | --------------- |
| Category   |              | Completed | Incomplete/ Terminated | Accuracy        | complete        |
| Blind      | 5            | 1         | 4                      | 34.5%, n=1      | 1199 sec, n=1   |
| Low Vision | 5            | 2         | 3                      | 98.3% n=2       | 1716 sec, n=3   |
| Dexterity  | 5            | 4         | 1                      | 98.3%, n=4      | 1672.1 sec, n=4 |
| Mobility   | 3            | 3         | 0                      | 95.4%, n=3      | 1416 sec, n=3   |

It's not perfect: the two-row header ends up split between the header row and the first data row. Still, it's a significant improvement over what I used to get with pdftotext:

pdftotext table.pdf

Which resulted in:

Example table

This is an example of a data table.
Results

Participants

Ballots
Completed

Ballots
Incomplete/
Terminated

Blind

5

1

4

34.5%, n=1

1199 sec, n=1

Low Vision

5

2

3

98.3% n=2

1716 sec, n=3

(97.7%, n=3)

(1934 sec, n=2)

Disability
Category

Accuracy

Time to
complete

Dexterity

5

4

1

98.3%, n=4

1672.1 sec, n=4

Mobility

3

3

0

95.4%, n=3

1416 sec, n=3
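To make the difference concrete, here is a small sketch of why the Markdown output is easier to work with downstream: the table can be turned back into records with a few lines of plain Python, which would not be practical with the pdftotext output above. The function name `parse_markdown_table` and the idea of merging the first two rows into column names are my own choices for illustration, not something Docling provides.

```python
def parse_markdown_table(markdown: str) -> list[list[str]]:
    """Return the cells of each pipe-table row, skipping the separator row."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table row
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # The separator row contains only dashes (and optional colons).
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


# First rows of the table produced by Docling, copied from above.
sample = """\
| Disability | Participants | Ballots   | Ballots                | Results Time to | Results Time to |
| ---------- | ------------ | --------- | ---------------------- | --------------- | --------------- |
| Category   |              | Completed | Incomplete/ Terminated | Accuracy        | complete        |
| Blind      | 5            | 1         | 4                      | 34.5%, n=1      | 1199 sec, n=1   |
"""

rows = parse_markdown_table(sample)
header, subheader, *data = rows

# Heuristic: the PDF has a two-row header, so join both rows per column.
columns = [f"{top} {bottom}".strip() for top, bottom in zip(header, subheader)]
records = [dict(zip(columns, row)) for row in data]

for record in records:
    print(record)
```

The merged column names are still a bit off (for example "Results Time to Accuracy"), which reflects the header problem in the converted table, but every value lands in the right column.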