Config Guide

The AutoTableDetector and AutoTableFormatter have separate configurations. This guide focuses on the formatter side.

Basics

The AutoFormatConfig object can be passed into either the AutoTableFormatter constructor or the df() method.

For example:

from gmft.auto import AutoFormatConfig, AutoTableFormatter

# ... code here

config = AutoFormatConfig(verbosity=3)
formatter = AutoTableFormatter(config=config)

ft = formatter.format(table)
df = ft.df()  # tables produced by the formatter automatically use the formatter's config

config_overrides = AutoFormatConfig(enable_multi_header=True)
df = ft.df(config_overrides=config_overrides)  # if provided, config_overrides replaces config entirely, so verbosity reverts to its default

df = ft.df(config_overrides={"enable_multi_header": True})  # pass a dict to keep the verbosity setting

New behavior in v0.3: If config_overrides is provided as an AutoFormatConfig object, it completely replaces config. Any value that is set in config but left unassigned in config_overrides reverts to its default.

In versions <0.3, values assigned in config_overrides were merged into config, so in the example above the resulting configuration would have kept verbosity=3 from config. To retain this merging behavior, pass a dict instead.
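The difference between replacing and merging can be sketched without gmft at all. The snippet below is only an illustration: DemoConfig, its default values, and apply_overrides are hypothetical stand-ins, not the library's classes or its actual implementation.

```python
from dataclasses import dataclass, replace


@dataclass
class DemoConfig:
    # Hypothetical stand-in for a formatter config; the defaults are assumed.
    verbosity: int = 1
    enable_multi_header: bool = False


def apply_overrides(config, overrides):
    """Return the effective config for a single call (illustrative only)."""
    if overrides is None:
        return config
    if isinstance(overrides, dict):
        # Dict: copy the existing config and change only the listed keys.
        return replace(config, **overrides)
    # Config object: use it as-is, so unassigned fields carry their defaults.
    return overrides


base = DemoConfig(verbosity=3)
replaced = apply_overrides(base, DemoConfig(enable_multi_header=True))
merged = apply_overrides(base, {"enable_multi_header": True})

print(replaced)  # verbosity is back at the assumed default
print(merged)    # verbosity from base is kept
```

In this sketch the object override loses verbosity=3, while the dict override keeps it, which mirrors the two df() calls in the example above.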

Semantic Spanning

The semantic spanning cells setting supports headers with multiple rows or columns.

Spanning cells are supported in either the top header or the left header of the table.

Fig 1. Spanning Hierarchical Left Header

Fig 2. Spanning Hierarchical Top Header

Table 1. `semantic_spanning_cells=True`
	Dataset	Total Tables \nInvestigated†	Total Tables \nwith a PRH∗	Tables with an oversegmented PRH \nTotal	Tables with an oversegmented PRH \n% (of total with a PRH)	Tables with an oversegmented PRH \n% (of total investigated)
0	SciTSR	10,431	342	54	15.79%	0.52%
1	PubTabNet	422,491	100,159	58,747	58.65%	13.90%
2	FinTabNet	70,028	25,637	25,348	98.87%	36.20%
3	PubTables-1M (ours)	761,262	153,705	0	0%	0%

Enable Multi Header

The name is a slight misnomer: enable_multi_header only controls whether the pandas DataFrame keeps multiple header rows.

This setting does not need to be enabled for semantic spanning cells (i.e. hierarchical top or left headers) to be processed.

If this setting is False, all header rows are condensed into a single header. Multi-line (and hence hierarchical) information is preserved through \n characters.

Table 2. `semantic_spanning_cells=True, enable_multi_header=True`
Header 2	NaN	NaN	NaN	Tables with an oversegmented PRH	Tables with an oversegmented PRH	Tables with an oversegmented PRH
Header 1	Dataset	Total Tables \nInvestigated†	Total Tables \nwith a PRH∗	Total	% (of total with a PRH)	% (of total investigated)
0	SciTSR	10,431	342	54	15.79%	0.52%
1	PubTabNet	422,491	100,159	58,747	58.65%	13.90%
2	FinTabNet	70,028	25,637	25,348	98.87%	36.20%
3	PubTables-1M (ours)	761,262	153,705	0	0%	0%
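The relationship between the two header rows of Table 2 and the single header of Table 1 can be sketched as follows. This is an assumption-laden illustration, not gmft's code: condense_headers is a hypothetical helper, empty cells are represented as None rather than NaN, and the " \n" separator is inferred from how the headers appear in Table 1.

```python
def condense_headers(header_rows):
    """Join stacked header rows column by column, skipping empty cells."""
    condensed = []
    for column in zip(*header_rows):
        # Keep only the levels that actually have text for this column.
        parts = [cell for cell in column if cell is not None]
        condensed.append(" \n".join(parts))
    return condensed


# Three columns taken from Table 2: top level first, then the level below it.
top = [None, None, "Tables with an oversegmented PRH"]
bottom = ["Dataset", "Total Tables \nInvestigated†", "Total"]

single = condense_headers([top, bottom])
print(single)
```

Note that a \n inside one cell (as in "Total Tables \nInvestigated†") is carried through unchanged, so in the condensed form a line break alone does not distinguish a multi-line header from a hierarchical one.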

Large Table Assumption

The large table assumption is a mechanism that improves performance on large tables. When it is active, rows are generated algorithmically instead of being predicted by the deep learning model.

By default, the large table assumption activates when at least one of these conditions holds:

1. More than large_table_if_n_rows_removed rows are removed (default: >= 8)
2. OR all of the following are true:
   a. Measured overlap of rows exceeds large_table_row_overlap_threshold (default: 20%)
   b. AND the number of rows is greater than large_table_threshold (default: >= 10)

Large table assumption can be directly turned on/off with config.large_table_assumption = True/False.
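The activation rule can be written out as a small predicate. This is a minimal sketch under stated assumptions, not the library's implementation: the function name and its arguments are hypothetical, the forced argument stands in for config.large_table_assumption, and because the text above is ambiguous about boundary cases, the sketch uses >= wherever the default is written with >= and a strict > for the overlap.

```python
from typing import Optional


def use_large_table_assumption(
    n_rows_removed: int,
    row_overlap: float,
    n_rows: int,
    forced: Optional[bool] = None,
    large_table_if_n_rows_removed: int = 8,
    large_table_row_overlap_threshold: float = 0.2,
    large_table_threshold: int = 10,
) -> bool:
    """Decide whether to apply the large table assumption (illustrative)."""
    # An explicit True/False setting takes priority over the heuristics.
    if forced is not None:
        return forced
    # Condition 1: many rows were removed (boundary choice is an assumption).
    many_removed = n_rows_removed >= large_table_if_n_rows_removed
    # Condition 2: rows overlap heavily AND the table has many rows.
    crowded = (
        row_overlap > large_table_row_overlap_threshold
        and n_rows >= large_table_threshold
    )
    return many_removed or crowded


print(use_large_table_assumption(n_rows_removed=9, row_overlap=0.0, n_rows=5))
print(use_large_table_assumption(n_rows_removed=0, row_overlap=0.5, n_rows=30))
print(use_large_table_assumption(n_rows_removed=0, row_overlap=0.5, n_rows=5))
```

The first two calls satisfy condition 1 and condition 2 respectively; the third has high overlap but too few rows, so neither condition holds.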

Fig 3. Deep bboxes

Fig 4. Large Table Assumption on

Fig. 3 and 4 Credits: © C. Dougherty 2001, 2002 (c.dougherty@lse.ac.uk). These tables have been computed to accompany the text C. Dougherty Introduction to Econometrics (second edition 2002, Oxford University Press, Oxford). They may be reproduced freely provided that this attribution is retained.