gmft.formatters.tatr module
- class gmft.formatters.tatr.TATRFormatConfig(warn_uninitialized_weights: bool = False, image_processor_path: str = 'microsoft/table-transformer-detection', formatter_path: str = 'microsoft/table-transformer-structure-recognition', no_timm: bool = True, torch_device: ~typing.Literal['cpu', 'cuda', 'auto'] | str = 'auto', verbosity: int = 1, formatter_base_threshold: float = 0.3, cell_required_confidence: dict = <factory>, remove_null_rows: bool = True, enable_multi_header: bool = False, semantic_spanning_cells: bool = False, semantic_hierarchical_left_fill: ~typing.Literal['algorithm', 'deep', None] = 'algorithm', large_table_if_n_rows_removed: int = 8, large_table_threshold: int = 10, large_table_row_overlap_threshold: float = 0.2, large_table_maximum_rows: int = 1000, force_large_table_assumption: bool | None = None, total_overlap_reject_threshold: float = 0.9, total_overlap_warn_threshold: float = 0.1, nms_warn_threshold: int = 5, iob_reject_threshold: float = 0.05, iob_warn_threshold: float = 0.5, _nms_overlap_threshold: float = 0.1, _large_table_merge_distance: float = 0.6, _smallest_supported_text_height: float = 0.1)
Bases: LegacyRemovedConfig
Configuration for TATRTableFormatter.
- cell_required_confidence: dict
Confidence required (>=) for a row/column feature to be considered good. See TATRFormattedTable.id2label.
Note that low confidences may work better than overly high ones (see formatter_base_threshold).
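As an illustration of a per-label confidence cutoff, the sketch below keeps only predictions whose score meets the required confidence for their label. The threshold values, the `filter_predictions` helper, and the `(label_id, score)` tuple layout are assumptions made for this example; they are not gmft's defaults or internals.

```python
# Hypothetical per-label thresholds keyed by label id (see id2label).
# These numbers are illustrative only, not gmft's defaults.
required_confidence = {1: 0.3, 2: 0.3, 3: 0.3}


def filter_predictions(predictions, required, default=0.99):
    """Keep (label_id, score) pairs whose score is >= the label's threshold.

    Labels missing from `required` fall back to `default`.
    """
    return [
        (label_id, score)
        for label_id, score in predictions
        if score >= required.get(label_id, default)
    ]


predictions = [(1, 0.95), (2, 0.25), (2, 0.30), (5, 0.50)]
kept = filter_predictions(predictions, required_confidence)
```

Note that the comparison is inclusive (>=), matching the wording of the description above.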
- enable_multi_header: bool = False
Enable multi-indices in the dataframe. If false, then multiple headers will be merged vertically.
- force_large_table_assumption: bool | None = None
True: force large table assumption to be applied to all tables.
False: force large table assumption to not be applied to any tables.
None: heuristically apply large table assumption according to threshold and overlap
- formatter_base_threshold: float = 0.3
Base threshold for the confidence demanded of a table feature (row/column). Note that a low threshold is actually better: with too many rows, numbers generally remain aligned and the result simply contains many empty rows, whereas too few rows causes cells to be merged, which is worse.
- large_table_if_n_rows_removed: int = 8
If >= n rows are removed due to non-maximum suppression (NMS), then this table is classified as a large table.
- large_table_maximum_rows: int = 1000
If the table predicts a large number of rows, refuse to proceed. This prevents memory issues with very small text.
- large_table_row_overlap_threshold: float = 0.2
With large tables, table transformer struggles with placing too many overlapping rows. Luckily, with more rows, we have more info on the usual size of text, which we can use to make a guess on the height such that no rows are merged or overlapping.
Large table assumption is only applied when (# of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold). Set 9999 to disable; set 0 to force large table assumption to run every time.
- large_table_threshold: int = 10
With large tables, table transformer struggles with placing too many overlapping rows. Luckily, with more rows, we have more info on the usual size of text, which we can use to make a guess on the height such that no rows are merged or overlapping.
Large table assumption is only applied when (# of rows > large_table_threshold) AND (total overlap > large_table_row_overlap_threshold). Set 9999 to disable; set 0 to force large table assumption to run every time.
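The decision described by force_large_table_assumption, large_table_threshold, large_table_row_overlap_threshold and large_table_if_n_rows_removed can be sketched as a single predicate. This is a reading of the descriptions above, not gmft's actual code: the function name and its arguments are hypothetical, and treating the NMS-removal rule as sufficient on its own is an assumption.

```python
from typing import Optional


def use_large_table_assumption(
    n_rows: int,
    total_overlap: float,
    n_rows_removed_by_nms: int,
    force: Optional[bool] = None,
    large_table_threshold: int = 10,
    large_table_row_overlap_threshold: float = 0.2,
    large_table_if_n_rows_removed: int = 8,
) -> bool:
    """Decide whether to apply the large table assumption (illustrative)."""
    # An explicit True/False overrides the heuristic entirely.
    if force is not None:
        return force
    # Assumption: many rows dropped by NMS marks the table as large by itself.
    if n_rows_removed_by_nms >= large_table_if_n_rows_removed:
        return True
    # Documented rule: both the row count and the overlap must exceed
    # their thresholds.
    return (
        n_rows > large_table_threshold
        and total_overlap > large_table_row_overlap_threshold
    )
```

Under this reading, a threshold of 9999 is never exceeded by a realistic table, which disables the rule, while a threshold of 0 is exceeded by any table with at least one row or any positive overlap.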
- semantic_hierarchical_left_fill: Literal['algorithm', 'deep', None] = 'algorithm'
[Experimental] When semantic spanning cells is enabled and a left header is detected that might represent a group of rows, that same value is duplicated for each row. Possible values: ‘algorithm’, ‘deep’, None.
‘algorithm’: assumes that the higher-level header is always the first row followed by several empty rows.
‘deep’: merges headers according to the spanning cells detected by the Table Transformer.
None: headers are not duplicated.
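The assumption behind ‘algorithm’ (a group header in the first row, followed by rows whose left cell is empty) amounts to a forward fill of the left column. The following is a minimal sketch of that idea on plain lists; the `left_fill` helper and the sample rows are hypothetical and do not reproduce gmft's implementation.

```python
def left_fill(rows):
    """Forward-fill an empty first cell from the nearest non-empty cell above."""
    filled = []
    current = None
    for row in rows:
        first = row[0]
        # A non-empty left cell starts a new group.
        if first not in (None, ""):
            current = first
        filled.append([current] + list(row[1:]))
    return filled


rows = [
    ["Fruit", "apple", 3],
    [None, "pear", 5],
    ["", "plum", 2],
    ["Dairy", "milk", 1],
    [None, "cheese", 4],
]
result = left_fill(rows)
```

After the fill, every row carries its group label ("Fruit" or "Dairy") in the first column, so the rows can be sorted or filtered without losing the hierarchy.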
- semantic_spanning_cells: bool = False
[Experimental] Enable semantic spanning cells, which often encode hierarchical multi-level indices.
- class gmft.formatters.tatr.TATRFormattedTable(cropped_table: CroppedTable, fctn_results: dict, config: TATRFormatConfig = None)
Bases: FormattedTable, LegacyFctnResults
FormattedTable, as seen by a Table Transformer (TATR). See TATRTableFormatter.
Construct a CroppedTable object.
- Parameters:
page – BasePage
bbox – tuple of (xmin, ymin, xmax, ymax) or Rect object
confidence_score – confidence score of the table detection
label – label of the table detection. 0 means table; 1 means rotated table
- config: TATRFormatConfig
- df(recalculate=False, config_overrides: TATRFormatConfig = None)
Return the table as a pandas dataframe.
- Parameters:
recalculate – by default, the dataframe is cached
config_overrides – override the config settings for this call only
- static from_dict(d: dict, page: BasePage)
Deserialize from dict. A page is required partly because of memory management, since having this method open a page itself may cause memory issues.
- id2label = {0: 'table', 1: 'table column', 2: 'table row', 3: 'table column header', 4: 'table projected row header', 5: 'table spanning cell', 6: 'no object'}
- label2id = {'no object': 6, 'table': 0, 'table column': 1, 'table column header': 3, 'table projected row header': 4, 'table row': 2, 'table spanning cell': 5}
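The two mappings are inverses of each other. A minimal sketch, copying the dictionary from the listing above rather than importing gmft, shows how one can be derived from the other and how label names can be turned into ids. The list `wanted` is a hypothetical example.

```python
# Copied from the id2label listing above.
id2label = {
    0: 'table',
    1: 'table column',
    2: 'table row',
    3: 'table column header',
    4: 'table projected row header',
    5: 'table spanning cell',
    6: 'no object',
}

# Inverting id2label yields the same content as label2id.
label2id = {label: idx for idx, label in id2label.items()}

# Hypothetical selection of labels by name, converted to ids.
wanted = ['table row', 'table column']
wanted_ids = sorted(label2id[name] for name in wanted)
```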
- recompute(config: TATRFormatConfig)
Recompute the internal dataframe.
- to_dict()
Serialize self into dict
- visualize(filter=None, dpi=None, padding=None, margin=(10, 10, 10, 10), effective=False, return_img=True, **kwargs)
Visualize the table.
- Parameters:
filter – filter the labels to visualize. See TATRFormattedTable.id2label
dpi – Sets the dpi. If none, then the dpi of the cached image is used.
padding – padding around the table. If None, then the padding of the cached image is used.
margin – margin around the table. If None, then the margin of the cached image is used.
effective – if True, visualize the effective rows and columns, which may differ from the table transformer’s output.
return_img – if True, return the image. If False, the matplotlib figure is plotted.
- class gmft.formatters.tatr.TATRFormatter(config: TATRFormatConfig = None)
Bases: TableFormatter
Uses a TableTransformerForObjectDetection for small/medium tables, and a custom algorithm for large tables.
Using extract(), a FormattedTable is produced, which can be exported to csv, df, etc.
- extract(table: CroppedTable, dpi=144, padding='auto', margin=None, config_overrides=None) → TATRFormattedTable
Extract the data from the table.