gmft.detectors.img2table module

The Img2Table detector is another option for table extraction. The detector directly produces FormattedTable instances, so dataframes are available immediately (no need to pass through a formatter). The primary purpose is to provide an alternative to the default TATRDetector. The module relies on the fantastic img2table library.

The img2table library natively relies on PyMuPDF and its line detection feature. Line detection has been ported to PyPDFium2, but the ported version is only an approximation, so PyMuPDF is recommended if you are able to meet the AGPL-3.0 license requirements. See the PyMuPDF section for more information.

from gmft_pymupdf import PyMuPDFDocument

from gmft.detectors.img2table import Img2TableDetector

doc = PyMuPDFDocument("path/to/pdf") # PyMuPDF is preferred
# PyPDFium2 also works, but line breaks and img2table results may be less accurate
# doc = PyPDFium2Document("path/to/pdf")

detector = Img2TableDetector()
fts = [] # type: list[FormattedTable]
for page in doc:
    # assumption: extract() takes a page and returns the tables found on it
    fts += detector.extract(page)
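
Because the detector yields FormattedTable instances directly, collecting dataframes can be a single pass over the document. The helper below is a minimal sketch, not part of gmft: `collect_dataframes` is a hypothetical name, and it assumes that `detector.extract(page)` returns a list of tables for one page and that each table exposes a `df()` method. It relies only on duck typing, so it does not import gmft itself.

```python
from typing import Any, Iterable, List


def collect_dataframes(detector: Any, pages: Iterable[Any]) -> List[Any]:
    """Run the detector on every page and return one dataframe per table.

    Hypothetical helper. Assumes detector.extract(page) returns a list of
    FormattedTable-like objects, each exposing a df() method.
    """
    dataframes = []
    for page in pages:
        # One page may contain zero, one, or several tables.
        for table in detector.extract(page):
            dataframes.append(table.df())
    return dataframes
```

With the objects from the snippet above, this would be called as `dfs = collect_dataframes(detector, doc)`, assuming the document can be iterated page by page.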