gmft.detectors.img2table module

The Img2Table detector is an alternative to the default TATRDetector for table extraction. It directly produces FormattedTable instances, so dataframes are available immediately, with no need to pass through a formatter. The module relies on the fantastic img2table library.

The img2table library natively relies on PyMuPDF and its line detection feature. Line detection has been ported to PyPDFium2, but the ported version is only an approximation, so PyMuPDF is recommended if you are able to meet the AGPL-3.0 license requirements. See the PyMuPDF section for more information.

from gmft_pymupdf import PyMuPDFDocument

from gmft.detectors.img2table import Img2TableDetector
# from gmft.pdf_bindings import PyPDFium2Document  # only for the PyPDFium2 alternative below

doc = PyMuPDFDocument("path/to/pdf")  # PyMuPDF is preferred
# PyPDFium2 is possible, but line breaks and img2table performance may be less accurate
# doc = PyPDFium2Document("path/to/pdf")

detector = Img2TableDetector()
# extract() takes a page and returns a list of tables, so flatten across all pages
fts = [table for page in doc for table in detector.extract(page)]  # type: list[FormattedTable]

See also: the corresponding tests.

class gmft.detectors.img2table.Img2TableDetector(config: Img2TableDetectorConfig = None)

Bases: BaseDetector[Img2TableDetectorConfig]

extract(page: BasePage, config_overrides: Img2TableDetectorConfig = None) list[gmft.detectors.img2table.Img2TableTable]

Extract tables from a page.

Parameters:
  • page – BasePage

  • config_overrides – override the config for this call only

Returns:

list of Img2TableTable objects (each one is a FormattedTable, so df() is available immediately)
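
As a rough sketch of how extract might be applied to a whole document, the helper below calls it once per page and collects the resulting dataframes. The name collect_dataframes is hypothetical and not part of gmft; the helper relies only on the extract(page) and df() calls documented in this module, and assumes the pages are supplied as any iterable.

```python
from typing import Any, Iterable


def collect_dataframes(detector: Any, pages: Iterable[Any]) -> list[tuple[int, Any]]:
    """Run detector.extract on each page; return (page_index, dataframe) pairs.

    Hypothetical helper: `detector` is expected to behave like
    Img2TableDetector and each returned table like Img2TableTable.
    """
    results = []
    for page_index, page in enumerate(pages):
        # extract() returns a list of tables for one page (possibly empty)
        for table in detector.extract(page):
            results.append((page_index, table.df()))
    return results
```

With real objects this might be called as collect_dataframes(Img2TableDetector(), doc), assuming the document can be iterated page by page.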

class gmft.detectors.img2table.Img2TableDetectorConfig(implicit_rows: bool = False, implicit_columns: bool = False, borderless_tables: bool = False, min_confidence: int = 50)

Bases: object

borderless_tables: bool = False
implicit_columns: bool = False
implicit_rows: bool = False
min_confidence: int = 50
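
A minimal sketch of building a non-default config follows. The field names match the parameters of img2table's own table extraction, so the comments assume they carry the same meaning here; that is an assumption, not something this page states. The try/except fallback defines a local stand-in with the documented signature, so the snippet also runs where gmft is not installed.

```python
from dataclasses import dataclass

try:
    from gmft.detectors.img2table import Img2TableDetectorConfig
except ImportError:
    # Stand-in (not part of gmft) mirroring the documented signature and defaults
    @dataclass
    class Img2TableDetectorConfig:
        implicit_rows: bool = False
        implicit_columns: bool = False
        borderless_tables: bool = False
        min_confidence: int = 50

# Assumed meanings, borrowed from img2table:
#   borderless_tables: also look for tables without ruled borders
#   min_confidence: minimum OCR confidence for text to be kept
config = Img2TableDetectorConfig(borderless_tables=True, min_confidence=60)

print(config.borderless_tables, config.min_confidence, config.implicit_rows)
```

Per the signatures documented here, the resulting object can be passed to the constructor, Img2TableDetector(config), or to a single call via extract(page, config_overrides=config).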

class gmft.detectors.img2table.Img2TablePDFDocument(*args: Any, **kwargs: Any)

Bases: BasePDFDocument, PDF

Wraps a BasePDFDocument in the img2table format.

property bytes
close()
get_filename() str
get_page(n: int) BasePage

Get the page at 0-based index n.

get_table_content(tables: dict[int, list[img2table.tables.objects.table.Table]], ocr: img2table.ocr.base.OCRInstance, min_confidence: int) dict[int, list[img2table.tables.objects.extraction.ExtractedTable]]
property images: list[numpy.ndarray]
ocr_df = None
property src

class gmft.detectors.img2table.Img2TablePage(*args: Any, **kwargs: Any)

Bases: PDF

Wraps a BasePage as a singleton document in the img2table format, because detectors work on a page level.

property bytes
get_table_content(tables: dict[int, list[img2table.tables.objects.table.Table]], ocr: img2table.ocr.base.OCRInstance, min_confidence: int) dict[int, list[img2table.tables.objects.extraction.ExtractedTable]]
property images: list[numpy.ndarray]
ocr_df = None
property src

class gmft.detectors.img2table.Img2TableTable(table: img2table.tables.objects.extraction.ExtractedTable, page: BasePage)

Bases: FormattedTable

Construct a CroppedTable object.

Parameters:
  • page – BasePage

  • bbox – tuple of (xmin, ymin, xmax, ymax) or Rect object

  • confidence_score – confidence score of the table detection

  • label – label of the table detection: 0 means table, 1 means rotated table

df(recalculate=False, config_overrides=None) pandas.DataFrame

Return the table as a pandas dataframe.

static from_dict(d: dict, page: BasePage)

Deserialize from dict

to_dict()

Serialize self into dict
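
The to_dict / from_dict pair suggests a simple round trip through JSON, sketched below. The helper names dump_tables and load_tables are hypothetical, and the sketch assumes that the dict produced by to_dict is JSON-serializable and that all tables belong to the single page passed back in.

```python
import json
from typing import Any


def dump_tables(tables: list[Any]) -> str:
    """Serialize tables to a JSON string using each table's to_dict()."""
    return json.dumps([table.to_dict() for table in tables])


def load_tables(text: str, page: Any, table_cls: Any) -> list[Any]:
    """Rebuild tables from a JSON string using table_cls.from_dict(d, page)."""
    return [table_cls.from_dict(d, page) for d in json.loads(text)]
```

With gmft installed, table_cls would be Img2TableTable and page the BasePage the tables were extracted from.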

visualize()

Visualize the table.