gmft.detectors.img2table module
The Img2Table detector is another option for table extraction. The detector directly produces FormattedTable
instances, so dataframes are available immediately (no need to pass through a formatter). The primary purpose is to
provide an alternative to the default TATRDetector.
The module relies on the fantastic img2table library.
The img2table library natively relies on PyMuPDF and its line detection feature. Line detection has been ported to PyPDFium2, but the ported version is only an approximation, so PyMuPDF is recommended if you can meet its AGPL-3.0 license requirements. See the PyMuPDF section for more information.
from gmft_pymupdf import PyMuPDFDocument
from gmft.detectors.img2table import Img2TableDetector
doc = PyMuPDFDocument("path/to/pdf") # PyMuPDF is preferred
# PyPdfium2 is possible, but line breaks and img2table performance may be less accurate
# doc = PyPDFium2Document("path/to/pdf")
detector = Img2TableDetector()
# extract() works on one page at a time and returns a list of tables for that page
fts = [] # type: list[FormattedTable]
for page in doc:
    fts += detector.extract(page)
See also: the corresponding tests.
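The usage pattern above can be wrapped in a small helper. The following is a minimal sketch, not part of gmft: `extract_all_tables` is a hypothetical name, and it assumes only that the document is iterable over its pages and that `detector.extract(page)` returns a list of tables.

```python
from typing import Any, Iterable, List


def extract_all_tables(doc: Iterable[Any], detector: Any) -> List[Any]:
    """Run detector.extract on every page of doc and flatten the results.

    Hypothetical helper: `doc` is assumed to be iterable over pages
    (for example a PyMuPDFDocument) and `detector` is assumed to expose
    extract(page) returning a list of tables.
    """
    tables: List[Any] = []
    for page in doc:
        # extract() works at the page level and returns one list per page
        tables.extend(detector.extract(page))
    return tables
```

With gmft installed, `extract_all_tables(PyMuPDFDocument("path/to/pdf"), Img2TableDetector())` should return the tables found on every page, and calling `df()` on each table gives its dataframe.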
- class gmft.detectors.img2table.Img2TableDetector(config: Img2TableDetectorConfig = None)
Bases:
BaseDetector[Img2TableDetectorConfig]
- extract(page: BasePage, config_overrides: Img2TableDetectorConfig = None) list[gmft.detectors.img2table.Img2TableTable]
Extract tables from a page.
- Parameters:
page – BasePage
config_overrides – override the config for this call only
- Returns:
list of Img2TableTable objects, one per detected table (each is a FormattedTable, so df() is available directly)
- class gmft.detectors.img2table.Img2TableDetectorConfig(implicit_rows: bool = False, implicit_columns: bool = False, borderless_tables: bool = False, min_confidence: int = 50)
Bases:
object
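As a sketch of how `config_overrides` might be used: try the detector's own settings first, and retry the same page with a more permissive config only when nothing is found. `extract_with_fallback` is a hypothetical helper, not a gmft function; it assumes only the `extract(page, config_overrides=...)` signature documented for Img2TableDetector.

```python
from typing import Any, List, Optional


def extract_with_fallback(
    page: Any, detector: Any, fallback_config: Optional[Any] = None
) -> List[Any]:
    """Extract tables from a page, retrying once with fallback_config.

    Hypothetical helper: the first call uses the detector's own config.
    If it finds no tables and a fallback config was given, the page is
    processed again with that config applied to this call only.
    """
    tables = detector.extract(page)
    if tables or fallback_config is None:
        return tables
    # Second attempt: override the config for this call only
    return detector.extract(page, config_overrides=fallback_config)
```

A fallback could be built from the fields in the class signature, for example `Img2TableDetectorConfig(borderless_tables=True, min_confidence=40)`. The value 40 is illustrative only, and whether borderless detection helps depends on the document.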
- class gmft.detectors.img2table.Img2TablePDFDocument(*args: Any, **kwargs: Any)
Bases:
BasePDFDocument, PDF
Wraps a BasePDFDocument in the img2table format.
- property bytes
- close()
- get_table_content(tables: dict[int, list[img2table.tables.objects.table.Table]], ocr: img2table.ocr.base.OCRInstance, min_confidence: int) dict[int, list[img2table.tables.objects.extraction.ExtractedTable]]
- ocr_df = None
- property src
- class gmft.detectors.img2table.Img2TablePage(*args: Any, **kwargs: Any)
Bases:
PDF
Wraps a BasePage as a singleton document in the img2table format, because detectors work at the page level.
- property bytes
- get_table_content(tables: dict[int, list[img2table.tables.objects.table.Table]], ocr: img2table.ocr.base.OCRInstance, min_confidence: int) dict[int, list[img2table.tables.objects.extraction.ExtractedTable]]
- ocr_df = None
- property src
- class gmft.detectors.img2table.Img2TableTable(table: img2table.tables.objects.extraction.ExtractedTable, page: BasePage)
Bases:
FormattedTable
Construct a CroppedTable object.
- Parameters:
page – BasePage
bbox – tuple of (xmin, ymin, xmax, ymax) or Rect object
confidence_score – confidence score of the table detection
label – label of the table detection: 0 means table, 1 means rotated table
- df(recalculate=False, config_overrides=None) pandas.DataFrame
Return the table as a pandas dataframe.
- to_dict()
Serialize self into dict
- visualize()
Visualize the table.
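Because each table exposes `df()`, results can be converted to plain Python records for downstream use. This is a minimal sketch, assuming `df()` returns a pandas DataFrame as documented. `tables_to_records` is a hypothetical helper, not part of gmft, and relies on the pandas `DataFrame.to_dict(orient="records")` call.

```python
from typing import Any, Dict, List


def tables_to_records(tables: List[Any]) -> List[List[Dict[str, Any]]]:
    """Convert each table into a list of row dicts via its dataframe.

    Hypothetical helper: every item in `tables` is assumed to expose
    df() returning a pandas DataFrame.
    """
    # One list of row dicts per table, in the same order as the input
    return [table.df().to_dict(orient="records") for table in tables]
```

The result is one list per table, each holding one dict per row keyed by column name, which is convenient for JSON serialization.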