Usage

Installation

First, install PyTorch and transformers for your desired GPU/CPU setup.

Then, gmft can be installed with pip:

(.venv) $ pip install gmft

Quickstart

The quickstart notebook is a good place to start.

To extract many tables, see the bulk extract notebook.

For example,

from gmft.auto import CroppedTable, AutoTableDetector, AutoTableFormatter
from gmft.pdf_bindings import PyPDFium2Document

detector = AutoTableDetector()
formatter = AutoTableFormatter()

def ingest_pdf(pdf_path): # returns (list[CroppedTable], document)
    doc = PyPDFium2Document(pdf_path)
    tables = []
    for page in doc:
        tables += detector.extract(page)
    return tables, doc

tables, doc = ingest_pdf("path/to/pdf.pdf")
doc.close() # once you're done with the document
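
If detection raises partway through, the document in the example above is never closed. One way to guard against that is a try/finally wrapper. This is a minimal sketch rather than part of gmft: extract_and_close and handle_table are hypothetical names, and the code assumes only what the example above shows (a document that can be iterated page by page, detector.extract(page) returning a list, and doc.close()).

```python
def extract_and_close(doc, detector, handle_table):
    """Detect tables on every page, process each one, then close the document.

    `handle_table` is a hypothetical callback. It runs while the document is
    still open, so it can still read from the underlying page.
    """
    results = []
    try:
        for page in doc:
            for table in detector.extract(page):
                results.append(handle_table(table))
    finally:
        # Runs whether the loop finished or raised.
        doc.close()
    return results
```

Doing the per-table work inside the callback keeps everything that needs the open document ahead of the close() call.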

Overview

Documents are represented by a BasePDFDocument object. The default implementation is PyPDFium2Document, which uses the PyPDFium2 library. Within a document, the BasePage is implemented by default with PyPDFium2Page.

The AutoTableDetector is the recommended table detection tool; it currently uses Microsoft’s Table Transformer. Detectors produce CroppedTable objects, from which CroppedTable.image() permits image export.

The AutoTableFormatter is the recommended table formatting tool. All TableFormatters produce FormattedTable objects, which contain the original CroppedTable and the formatted dataframe; FormattedTable.df() permits dataframe export.
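
The flow described above (page, then CroppedTable, then FormattedTable, then dataframe) can be sketched as a single helper. This is an illustration under assumptions: tables_to_dataframes is a hypothetical name, and the call formatter.extract(cropped) is assumed to mirror the detector's extract method, since the text above only says that formatters produce FormattedTable objects.

```python
def tables_to_dataframes(pages, detector, formatter):
    """Run detection, then formatting, over an iterable of pages.

    Returns one dataframe per detected table, in page order.
    """
    dataframes = []
    for page in pages:
        # Detection step: each page yields zero or more CroppedTable objects.
        for cropped in detector.extract(page):
            # Formatting step (assumed method name): CroppedTable -> FormattedTable.
            formatted = formatter.extract(cropped)
            # Export step: FormattedTable.df() gives the dataframe.
            dataframes.append(formatted.df())
    return dataframes
```

With gmft installed, this would presumably be called as tables_to_dataframes(doc, AutoTableDetector(), AutoTableFormatter()) before the document is closed.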

PyMuPDF

I recommend PyMuPDF as the PDF parser because of its better performance and accuracy and its very powerful line break detection feature.

However, PyMuPDF requires compliance with the AGPL-3.0 license, so it is not included in gmft by default. To use PyMuPDF, refer to the gmft_pymupdf repository. Once installed, PyMuPDFDocument can be used in place of PyPDFium2Document.

Line detection. Some functionality (like the img2table detector and rich table formatting) depends on quality line detection. While line detection has been ported to PyPDFium2 via BasePage._get_positions_and_text_and_breaks(), it is only an imperfect approximation, so extraction won’t be as accurate as with PyMuPDF.

(.venv) $ pip install git+https://github.com/conjuncts/gmft_pymupdf.git

from gmft_pymupdf import PyMuPDFDocument

# the rest of gmft remains unchanged
from gmft.auto import AutoTableDetector

detector = AutoTableDetector()

doc = PyMuPDFDocument("path/to/pdf")

# detection runs page by page, exactly as with PyPDFium2Document
tables = []
for page in doc:
    tables += detector.extract(page)
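
Since PyMuPDFDocument is a drop-in replacement, a script can prefer it when gmft_pymupdf is installed and otherwise fall back to PyPDFium2Document. The helper below is a sketch of that idea, not something gmft provides: pick_document_class and BACKENDS are hypothetical names, and the module and class names are the ones used in the examples above.

```python
import importlib

# (module, class) pairs in order of preference; names taken from the examples above.
BACKENDS = [
    ("gmft_pymupdf", "PyMuPDFDocument"),
    ("gmft.pdf_bindings", "PyPDFium2Document"),
]

def pick_document_class(candidates=BACKENDS):
    """Return the first document class whose module can be imported."""
    for module_name, class_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # backend not installed; try the next one
        return getattr(module, class_name)
    raise ImportError("no PDF backend from the candidate list is installed")
```

Usage would then look like DocumentClass = pick_document_class() followed by doc = DocumentClass("path/to/pdf"). Keep in mind that choosing PyMuPDF this way still brings its AGPL-3.0 obligations.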