Config Guide
============
The AutoTableDetector and AutoTableFormatter have separate configurations. This guide focuses on the **formatter** side.
Basics
-------
The :class:`~gmft.auto.AutoFormatConfig` object can be passed into either the :class:`~gmft.auto.AutoTableFormatter` constructor or the :meth:`~gmft.formatters.tatr.TATRFormatter.df` method.
For example:

.. code-block:: python

    from gmft.auto import AutoFormatConfig, AutoTableFormatter

    # ... code here
    config = AutoFormatConfig(verbosity=3)
    formatter = AutoTableFormatter(config=config)
    ft = formatter.format(table)
    df = ft.df()  # tables from this formatter automatically use the settings in config

    config_overrides = AutoFormatConfig(enable_multi_header=True)
    # if provided, config_overrides replaces config, so verbosity is reverted
    df = ft.df(config_overrides=config_overrides)
    # pass a dict to keep the verbosity setting
    df = ft.df(config_overrides={"enable_multi_header": True})

New behavior in v0.3:

If ``config_overrides`` is provided as a config object, it completely replaces everything in ``config``. For instance, if a value is
set in ``config`` but left unassigned in ``config_overrides``, the resulting object **reverts** to
the default value.

In versions <0.3, assigned values in ``config_overrides`` were merged into ``config``.
In the example above, the resulting object would previously have kept ``verbosity=3`` from ``config``.
To retain the old merging behavior, pass a dict instead.

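The difference between replacing and merging can be sketched without gmft. The ``DemoConfig`` class, its default values, and the ``resolve`` helper below are hypothetical stand-ins used only to illustrate the rules described above; they are not gmft's implementation.

.. code-block:: python

    from dataclasses import dataclass, replace
    from typing import Union


    @dataclass
    class DemoConfig:
        """Hypothetical stand-in for a formatter config (not gmft's class)."""
        verbosity: int = 1                 # illustrative default only
        enable_multi_header: bool = False  # illustrative default only


    def resolve(config: DemoConfig, overrides: Union[DemoConfig, dict, None]) -> DemoConfig:
        """Sketch of the v0.3 rules: objects replace, dicts merge."""
        if overrides is None:
            return config                        # nothing to override
        if isinstance(overrides, dict):
            return replace(config, **overrides)  # dict: merged onto config
        return overrides                         # config object: full replacement


    base = DemoConfig(verbosity=3)
    replaced = resolve(base, DemoConfig(enable_multi_header=True))
    merged = resolve(base, {"enable_multi_header": True})

    print(replaced)  # verbosity falls back to the class default
    print(merged)    # verbosity=3 from base is kept

In this sketch, ``replaced`` loses the ``verbosity=3`` setting while ``merged`` keeps it, which mirrors the two ``ft.df(...)`` calls in the example above.
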
.. _semantic_spanning_cells:
Semantic Spanning
------------------
The **semantic spanning cells** setting supports headers with multiple rows or columns.
Spanning cells are supported in either the top header or the left header of the table.

.. figure:: /images/spanning_hier_left.png
   :alt: spanning hierarchical left

   Fig 1. Spanning Hierarchical Left Header

.. figure:: /images/spanning_hier_top.png
   :alt: spanning hierarchical top

   Fig 2. Spanning Hierarchical Top Header

.. list-table:: Table 1. semantic_spanning_cells=True
   :header-rows: 1
   :stub-columns: 1

   * -
     - Dataset
     - ``Total Tables \nInvestigated†``
     - ``Total Tables \nwith a PRH∗``
     - ``Tables with an oversegmented PRH \nTotal``
     - ``Tables with an oversegmented PRH \n% (of total with a PRH)``
     - ``Tables with an oversegmented PRH \n% (of total investigated)``
   * - 0
     - SciTSR
     - 10,431
     - 342
     - 54
     - 15.79%
     - 0.52%
   * - 1
     - PubTabNet
     - 422,491
     - 100,159
     - 58,747
     - 58.65%
     - 13.90%
   * - 2
     - FinTabNet
     - 70,028
     - 25,637
     - 25,348
     - 98.87%
     - 36.20%
   * - 3
     - PubTables-1M (ours)
     - 761,262
     - 153,705
     - 0
     - 0%
     - 0%

Enable Multi Header
--------------------
The name **enable multi header** is a slight misnomer: the setting only determines whether the pandas DataFrame has multiple header rows.
This setting does not need to be enabled for semantic spanning cells (i.e. hierarchical top or left headers) to be processed.
If this setting is false, all header rows are condensed into a single header.
Multi-line (and hence hierarchical) information is then preserved through ``\n`` characters.

.. list-table:: Table 2. semantic_spanning_cells=True, enable_multi_header=True
   :header-rows: 2
   :stub-columns: 1

   * - Header 2
     - NaN
     - NaN
     - NaN
     - Tables with an oversegmented PRH
     - Tables with an oversegmented PRH
     - Tables with an oversegmented PRH
   * - Header 1
     - Dataset
     - ``Total Tables \nInvestigated†``
     - ``Total Tables \nwith a PRH∗``
     - Total
     - % (of total with a PRH)
     - % (of total investigated)
   * - 0
     - SciTSR
     - 10,431
     - 342
     - 54
     - 15.79%
     - 0.52%
   * - 1
     - PubTabNet
     - 422,491
     - 100,159
     - 58,747
     - 58.65%
     - 13.90%
   * - 2
     - FinTabNet
     - 70,028
     - 25,637
     - 25,348
     - 98.87%
     - 36.20%
   * - 3
     - PubTables-1M (ours)
     - 761,262
     - 153,705
     - 0
     - 0%
     - 0%

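The relationship between the two header layouts can be sketched in plain Python. The helper ``condense_headers`` is hypothetical and is not part of gmft. ``None`` stands in for the ``NaN`` entries of Table 2, and the ``" \n"`` separator is an assumption chosen to mirror the labels shown in Table 1.

.. code-block:: python

    from typing import List, Optional


    def condense_headers(header_rows: List[List[Optional[str]]], sep: str = " \n") -> List[str]:
        """Join each column's header levels top-down, skipping missing levels."""
        n_cols = len(header_rows[0])
        condensed = []
        for col in range(n_cols):
            # Collect the labels for this column from the top row to the bottom row.
            parts = [row[col] for row in header_rows if row[col] is not None]
            condensed.append(sep.join(parts))
        return condensed


    # Three columns taken from Table 2: the top level is missing for "Dataset".
    top = [None, "Tables with an oversegmented PRH", "Tables with an oversegmented PRH"]
    bottom = ["Dataset", "Total", "% (of total with a PRH)"]

    single = condense_headers([top, bottom])
    for label in single:
        print(repr(label))

Columns without a spanning parent keep their own label, while spanned columns carry the parent text in front of the newline, so the hierarchy can still be recovered from a single header row.
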
.. _large_table_assumption:
Large Table Assumption
-----------------------
The **large table assumption** is a mechanism that improves performance on large tables.
When it is active, algorithmically generated rows are used instead of the rows predicted by deep learning.
By default, the large table assumption activates when at least one of these conditions holds:

1. More than ``large_table_if_n_rows_removed`` rows are removed (default: >= 8)
2. OR all of the following are true:

   * Measured overlap of rows exceeds ``large_table_row_overlap_threshold`` (default: 20%)
   * AND the number of rows is greater than ``large_table_threshold`` (default: >= 10)

The large table assumption can be turned on or off directly with ``config.large_table_assumption = True/False``.

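The activation rule above can be written as a small predicate. This is a sketch only: the function ``use_large_table_assumption`` and its inputs ``n_rows``, ``n_rows_removed`` and ``row_overlap`` are hypothetical, the treatment of ``None`` as "decide automatically" is an assumption, and the exact comparison operators at the threshold boundaries are also assumptions, not gmft's code.

.. code-block:: python

    from typing import Optional


    def use_large_table_assumption(
        n_rows: int,
        n_rows_removed: int,
        row_overlap: float,
        large_table_assumption: Optional[bool] = None,
        large_table_if_n_rows_removed: int = 8,
        large_table_row_overlap_threshold: float = 0.2,
        large_table_threshold: int = 10,
    ) -> bool:
        """Decide whether algorithmically generated rows should be used."""
        # An explicit True/False setting wins over the automatic rule.
        if large_table_assumption is not None:
            return large_table_assumption
        # Condition 1: many rows were removed (boundary handling is assumed).
        many_removed = n_rows_removed >= large_table_if_n_rows_removed
        # Condition 2: rows overlap heavily AND the table has many rows.
        crowded = (
            row_overlap > large_table_row_overlap_threshold
            and n_rows > large_table_threshold
        )
        return many_removed or crowded


    print(use_large_table_assumption(n_rows=30, n_rows_removed=0, row_overlap=0.5))
    print(use_large_table_assumption(n_rows=5, n_rows_removed=0, row_overlap=0.5))

The first call satisfies the second condition (a large table with heavy overlap), while the second call fails it because the table has too few rows.
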
.. list-table::

   * - .. figure:: /images/lta_off.png

          Fig 3. Deep bboxes

     - .. figure:: /images/lta_on.png

          Fig 4. Large Table Assumption on

Fig. 3 and 4 Credits: © C. Dougherty 2001, 2002 (c.dougherty@lse.ac.uk). These tables have been computed to accompany the text C. Dougherty Introduction to Econometrics (second edition 2002, Oxford University Press, Oxford). They may be reproduced freely provided that this attribution is retained.