gmft.formatters.tatr module

class gmft.formatters.tatr.TATRFormatConfig(warn_uninitialized_weights: bool = False, image_processor_path: str = 'microsoft/table-transformer-detection', formatter_path: str = 'microsoft/table-transformer-structure-recognition', no_timm: bool = True, torch_device: Literal['cpu', 'cuda', 'auto'] | str = 'auto', verbosity: int = 1, formatter_base_threshold: float = 0.3, cell_required_confidence: dict = <factory>, remove_null_rows: bool = True, enable_multi_header: bool = False, semantic_spanning_cells: bool = False, semantic_hierarchical_left_fill: Literal['algorithm', 'deep', None] = 'algorithm', large_table_if_n_rows_removed: int = 8, large_table_threshold: int = 10, large_table_row_overlap_threshold: float = 0.2, large_table_maximum_rows: int = 1000, force_large_table_assumption: bool | None = None, total_overlap_reject_threshold: float = 0.9, total_overlap_warn_threshold: float = 0.1, nms_warn_threshold: int = 5, iob_reject_threshold: float = 0.05, iob_warn_threshold: float = 0.5, _nms_overlap_threshold: float = 0.1, _large_table_merge_distance: float = 0.6, _smallest_supported_text_height: float = 0.1)

Bases: LegacyRemovedConfig

Configuration for TATRTableFormatter.

cell_required_confidence: dict

Minimum confidence (>=) required for a row/column feature to be considered good. See TATRFormattedTable.id2label.

Note that a low required confidence may work better than one that is too high (see formatter_base_threshold).

enable_multi_header: bool = False: Enable multi-indices in the dataframe. If False, multiple header rows are merged vertically.

force_large_table_assumption: bool | None = None

True: force large table assumption to be applied to all tables.

False: force large table assumption to not be applied to any tables.

None: apply the large table assumption heuristically, according to the threshold and overlap settings.

formatter_base_threshold: float = 0.3: Base threshold for the confidence demanded of a table feature (row/column). A low threshold generally works better: with too many rows, the numbers usually stay aligned and the result merely contains extra empty rows, whereas with fewer rows than expected, cells get merged, which is worse.

formatter_path: str = 'microsoft/table-transformer-structure-recognition'

image_processor_path: str = 'microsoft/table-transformer-detection'

iob_reject_threshold: float = 0.05: Reject if iob between textbox and cell is < 5%.

iob_warn_threshold: float = 0.5: Warn if iob between textbox and cell is < 50%.
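As a rough illustration of how the two iob thresholds partition the outcomes, here is a minimal sketch. It assumes that "iob" means the intersection area divided by the area of the text box; the helper names `iob` and `classify_iob` are hypothetical and are not part of gmft.

```python
def iob(box, cell):
    """Intersection area divided by the area of `box`.

    Assumption: this is what "iob" denotes. Boxes are (xmin, ymin, xmax, ymax).
    """
    inter_w = max(0.0, min(box[2], cell[2]) - max(box[0], cell[0]))
    inter_h = max(0.0, min(box[3], cell[3]) - max(box[1], cell[1]))
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    return (inter_w * inter_h) / box_area if box_area > 0 else 0.0


def classify_iob(value, reject_threshold=0.05, warn_threshold=0.5):
    """Hypothetical helper: map an iob value to 'reject', 'warn', or 'ok'."""
    if value < reject_threshold:
        return "reject"
    if value < warn_threshold:
        return "warn"
    return "ok"


textbox = (0, 0, 10, 10)
print(classify_iob(iob(textbox, (8, 0, 20, 10))))    # small overlap with the cell
print(classify_iob(iob(textbox, (20, 20, 30, 30))))  # no overlap at all
```

The defaults mirror iob_reject_threshold and iob_warn_threshold above; how gmft acts on a rejection or warning is not shown here.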

large_table_if_n_rows_removed: int = 8: If >= n rows are removed due to non-maxima suppression (NMS), then this table is classified as a large table.

large_table_maximum_rows: int = 1000: If the table is predicted to have a very large number of rows, refuse to proceed. This prevents memory issues with very small text.

large_table_row_overlap_threshold: float = 0.2

With large tables, the Table Transformer tends to place too many overlapping rows. Fortunately, more rows give more information about the usual size of the text, which can be used to estimate a row height such that no rows are merged or overlapping.

The large table assumption is applied only when (# of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold). Set 9999 to disable; set 0 to force the large table assumption to run every time.

large_table_threshold: int = 10

With large tables, the Table Transformer tends to place too many overlapping rows. Fortunately, more rows give more information about the usual size of the text, which can be used to estimate a row height such that no rows are merged or overlapping.

The large table assumption is applied only when (# of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold). Set 9999 to disable; set 0 to force the large table assumption to run every time.
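The decision rule described above, together with force_large_table_assumption, can be sketched as follows. The function name `use_large_table_assumption` is hypothetical, and the sketch deliberately leaves out large_table_if_n_rows_removed, since this section does not state how that setting combines with the other two.

```python
def use_large_table_assumption(
    n_rows,
    total_overlap,
    force=None,
    large_table_threshold=10,
    large_table_row_overlap_threshold=0.2,
):
    """Hypothetical helper mirroring the documented rule, not gmft's own code."""
    # An explicit True/False wins over the heuristic.
    if force is not None:
        return force
    # Heuristic: both conditions must hold.
    return (
        n_rows > large_table_threshold
        and total_overlap > large_table_row_overlap_threshold
    )


print(use_large_table_assumption(n_rows=12, total_overlap=0.3))
print(use_large_table_assumption(n_rows=12, total_overlap=0.1))
print(use_large_table_assumption(n_rows=3, total_overlap=0.0, force=True))
```

Because both conditions are joined with AND, raising either threshold to a very large value is enough to switch the heuristic off.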

nms_warn_threshold: int = 5: Warn if non-maxima suppression (NMS) removes > 5 rows.
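For readers unfamiliar with NMS, the following is a minimal one-dimensional sketch of suppressing overlapping row predictions and counting how many were removed. The overlap measure (intersection divided by the smaller row height) and the function name `nms_rows` are assumptions made for illustration; gmft's actual procedure may differ.

```python
def nms_rows(rows, overlap_threshold=0.1):
    """Greedy NMS over rows given as (ymin, ymax, confidence).

    Returns (kept_rows_sorted_top_to_bottom, number_of_rows_removed).
    """
    kept = []
    # Visit the most confident rows first so they suppress weaker overlaps.
    for row in sorted(rows, key=lambda r: r[2], reverse=True):
        ymin, ymax, _ = row
        suppressed = False
        for kmin, kmax, _ in kept:
            inter = max(0.0, min(ymax, kmax) - max(ymin, kmin))
            smaller = min(ymax - ymin, kmax - kmin)
            if smaller > 0 and inter / smaller > overlap_threshold:
                suppressed = True
                break
        if not suppressed:
            kept.append(row)
    kept.sort(key=lambda r: r[0])
    return kept, len(rows) - len(kept)


rows = [(0, 10, 0.9), (1, 11, 0.5), (10, 20, 0.8), (20, 30, 0.7)]
kept, n_removed = nms_rows(rows)
warn = n_removed > 5               # compare with nms_warn_threshold
treat_as_large = n_removed >= 8    # compare with large_table_if_n_rows_removed
print(len(kept), n_removed, warn, treat_as_large)
```

The removed-row count is the quantity that nms_warn_threshold and large_table_if_n_rows_removed are compared against.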

no_timm: bool = True

remove_null_rows: bool = True: Remove rows with no text.

semantic_hierarchical_left_fill: Literal['algorithm', 'deep', None] = 'algorithm'

[Experimental] When semantic_spanning_cells is enabled and a left header is detected that might represent a group of rows, its value is duplicated for each row in the group. Possible values: ‘algorithm’, ‘deep’, None.

‘algorithm’: assumes that the higher-level header is always the first row followed by several empty rows.

‘deep’: merges headers according to the spanning cells detected by the Table Transformer.

None: headers are not duplicated.
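The ‘algorithm’ mode amounts to a forward fill of the left header column. A minimal sketch under that reading is shown below; `left_fill` is a hypothetical name, and treating both None and the empty string as "empty" is an assumption.

```python
def left_fill(column):
    """Copy each non-empty left-header value down into the empty cells below it."""
    filled = []
    current = None
    for value in column:
        if value not in (None, ""):
            current = value  # a new higher-level header starts here
        filled.append(current)
    return filled


left_column = ["Fruit", None, None, "Vegetables", ""]
print(left_fill(left_column))
```

Cells that appear before the first header stay empty, since there is nothing to copy down yet.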

semantic_spanning_cells: bool = False: [Experimental] Enable semantic spanning cells, which often encode hierarchical multi-level indices.

torch_device: Literal['cpu', 'cuda', 'auto'] | str = 'auto'

total_overlap_reject_threshold: float = 0.9: Reject if total overlap is > 90% of table area.

total_overlap_warn_threshold: float = 0.1: Warn if total overlap is > 10% of table area.

verbosity: int = 1

-1: no logging

0: errors only

1: print warnings

2: print warnings and info

3: print warnings, info, and debug
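One way to read the verbosity scale is as a mapping onto the standard `logging` levels. The sketch below is only an analogy: whether gmft uses the `logging` module internally is not stated here, and `level_for` is a hypothetical helper.

```python
import logging

# Assumed correspondence between verbosity values and logging levels.
VERBOSITY_TO_LEVEL = {
    -1: logging.CRITICAL + 1,  # above every standard level: nothing is emitted
    0: logging.ERROR,
    1: logging.WARNING,
    2: logging.INFO,
    3: logging.DEBUG,
}


def level_for(verbosity):
    """Clamp verbosity into [-1, 3] and return the matching logging level."""
    clamped = max(-1, min(3, verbosity))
    return VERBOSITY_TO_LEVEL[clamped]


logger = logging.getLogger("tatr_demo")
logger.setLevel(level_for(1))
print(logger.isEnabledFor(logging.WARNING), logger.isEnabledFor(logging.INFO))
```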

warn_uninitialized_weights: bool = False

class gmft.formatters.tatr.TATRFormattedTable(cropped_table: CroppedTable, fctn_results: dict, config: TATRFormatConfig = None)

Bases: FormattedTable, LegacyFctnResults

FormattedTable, as seen by a Table Transformer (TATR). See TATRTableFormatter.

Construct a CroppedTable object.

Parameters:

page – BasePage
bbox – tuple of (xmin, ymin, xmax, ymax) or Rect object
confidence_score – confidence score of the table detection
label – label of the table detection. 0 means table; 1 means rotated table.

config: TATRFormatConfig

df(recalculate=False, config_overrides: TATRFormatConfig = None): Return the table as a pandas dataframe.

Parameters:

recalculate – by default, the dataframe is cached
config_overrides – override the config settings for this call only
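The interplay of caching and per-call overrides can be modelled with a small stand-in. `DemoConfig` and `DemoTable` are hypothetical classes that return lists instead of dataframes; they are not gmft's implementation. In particular, whether an overridden call also refreshes the cache is not specified here, and this sketch assumes it does not.

```python
from dataclasses import dataclass


@dataclass
class DemoConfig:  # hypothetical stand-in for TATRFormatConfig
    remove_null_rows: bool = True


class DemoTable:  # hypothetical stand-in for TATRFormattedTable
    def __init__(self, rows, config=None):
        self.rows = rows
        self.config = config or DemoConfig()
        self._cached = None
        self.n_computations = 0

    def _compute(self, config):
        self.n_computations += 1
        if config.remove_null_rows:
            return [row for row in self.rows if any(row)]
        return list(self.rows)

    def df(self, recalculate=False, config_overrides=None):
        if config_overrides is not None:
            # Per-call override: self.config and the cache stay untouched (assumption).
            return self._compute(config_overrides)
        if recalculate or self._cached is None:
            self._cached = self._compute(self.config)
        return self._cached


table = DemoTable([["a", "b"], ["", ""]])
table.df()
table.df()  # served from the cache, no second computation
print(table.n_computations)
print(table.df(config_overrides=DemoConfig(remove_null_rows=False)))
```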

static from_dict(d: dict, page: BasePage): Deserialize from a dict. A page is required partly for memory management, since having this method open a page itself may cause memory issues.

id2label = {0: 'table', 1: 'table column', 2: 'table row', 3: 'table column header', 4: 'table projected row header', 5: 'table spanning cell', 6: 'no object'}

label2id = {'no object': 6, 'table': 0, 'table column': 1, 'table column header': 3, 'table projected row header': 4, 'table row': 2, 'table spanning cell': 5}
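The two mappings are inverses of each other, which the following short check illustrates using the values listed above. The variable name `rebuilt_label2id` is only for this example.

```python
id2label = {
    0: 'table',
    1: 'table column',
    2: 'table row',
    3: 'table column header',
    4: 'table projected row header',
    5: 'table spanning cell',
    6: 'no object',
}

# Inverting id2label reproduces label2id.
rebuilt_label2id = {label: idx for idx, label in id2label.items()}

print(rebuilt_label2id['table row'])
print(id2label[rebuilt_label2id['table spanning cell']])
```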

outliers: dict[str, bool]

recompute(config: TATRFormatConfig): Recompute the internal dataframe.

to_dict(): Serialize self into dict

visualize(filter=None, dpi=None, padding=None, margin=(10, 10, 10, 10), effective=False, return_img=True, **kwargs)

Visualize the table.

Parameters:

filter – filter the labels to visualize. See TATRFormattedTable.id2label
dpi – Sets the dpi. If none, then the dpi of the cached image is used.
padding – padding around the table. If None, then the padding of the cached image is used.
margin – margin around the table. If None, then the margin of the cached image is used.
effective – if True, visualize the effective rows and columns, which may differ from the table transformer’s output.
return_img – if True, return the image. If False, the matplotlib figure is plotted.

class gmft.formatters.tatr.TATRFormatter(config: TATRFormatConfig = None)

Bases: TableFormatter

Uses a TableTransformerForObjectDetection for small/medium tables, and a custom algorithm for large tables.

Using extract(), a FormattedTable is produced, which can be exported to csv, df, etc.

extract(table: CroppedTable, dpi=144, padding='auto', margin=None, config_overrides=None) → TATRFormattedTable: Extract the data from the table.
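A possible end-to-end use of the formatter is sketched below. `TATRFormatConfig`, `TATRFormatter`, `extract()`, and `df()` come from this page; `AutoTableDetector` (from `gmft.auto`) and `PyPDFium2Document` (from `gmft.pdf_bindings`) are assumptions about the rest of the gmft package and are not documented in this section. The function name `extract_dataframes` is hypothetical.

```python
def extract_dataframes(pdf_path):
    """Detect tables in a PDF and format each one into a pandas dataframe."""
    # Imports are local so the sketch can be read and loaded without gmft installed.
    from gmft.auto import AutoTableDetector          # assumption: not covered on this page
    from gmft.pdf_bindings import PyPDFium2Document  # assumption: not covered on this page
    from gmft.formatters.tatr import TATRFormatConfig, TATRFormatter

    config = TATRFormatConfig(verbosity=2, remove_null_rows=True)
    formatter = TATRFormatter(config)
    detector = AutoTableDetector()

    doc = PyPDFium2Document(pdf_path)
    try:
        dataframes = []
        for page in doc:
            # Assumed: the detector yields CroppedTable objects for each page.
            for table in detector.extract(page):
                formatted = formatter.extract(table)  # TATRFormattedTable
                dataframes.append(formatted.df())
        return dataframes
    finally:
        doc.close()
```

The dataframes are built before the document is closed, in line with the memory-management note on from_dict.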