gmft.detectors.base module

A collection of common objects used by detectors.

Type Hierarchy:

CroppedTable
- RotatedCroppedTable
  - FormattedTable
BaseDetector
- TATRDetector
- Img2TableDetector

BaseDetector is the base class for all detectors.

This module contains methods for detecting tables in whole PDF pages.

Example:

>>> from gmft.auto import AutoTableDetector

class gmft.detectors.base.BaseDetector

Bases: ABC, Generic[ConfigT]

Abstract base class for table detectors.

detect(page: BasePage, config_overrides: ConfigT = None, **kwargs) → list[gmft.detectors.base.CroppedTable]: Alias for extract().

abstract extract(page: BasePage, config_overrides: ConfigT = None) → list[gmft.detectors.base.CroppedTable]

Extract tables from a page.

Parameters:

page – BasePage
config_overrides – override the config for this call only

Returns:

list of CroppedTable objects
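The relationship between detect() and extract() can be sketched with stand-in classes. This is a minimal illustration that does not import gmft: SketchDetector, FixedBoxDetector, and the dict-based config are hypothetical names, the returned tuples stand in for CroppedTable objects, and the handling of extra keyword arguments is an assumption.

```python
from abc import ABC, abstractmethod
from typing import Generic, Optional, TypeVar

ConfigT = TypeVar("ConfigT")


class SketchDetector(ABC, Generic[ConfigT]):
    """Hypothetical stand-in that mirrors the documented interface shape."""

    def detect(self, page, config_overrides: Optional[ConfigT] = None, **kwargs):
        # Documented as an alias for extract(); this sketch ignores **kwargs.
        return self.extract(page, config_overrides)

    @abstractmethod
    def extract(self, page, config_overrides: Optional[ConfigT] = None) -> list:
        """Return the tables found on a page."""


class FixedBoxDetector(SketchDetector[dict]):
    """Hypothetical detector that always reports one bounding box."""

    def extract(self, page, config_overrides=None):
        config = {"box": (0, 0, 100, 50)}
        if config_overrides:
            # Overrides apply to this call only; defaults are rebuilt each time.
            config.update(config_overrides)
        return [(page, config["box"])]


detector = FixedBoxDetector()
default_result = detector.detect("page-1")
override_result = detector.extract("page-1", {"box": (10, 10, 20, 20)})
```

Because extract() is abstract, a concrete detector only needs to implement that one method; detect() then works without further code.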

class gmft.detectors.base.CroppedTable(page: BasePage, bbox: tuple[int, int, int, int] | Rect, confidence_score: float = 1.0, label=0, *, angle: Literal[0, 90, 180, 270] = 0)

Bases: object

A pdf selection, cropped to include just a table. Created by BaseDetector.

Construct a CroppedTable object.

Parameters:

page – BasePage
bbox – tuple of (xmin, ymin, xmax, ymax) or Rect object
confidence_score – confidence score of the table detection
label – label of the table detection. 0 means table; 1 means rotated table.
angle – rotation of the table in degrees; one of 0, 90, 180, or 270 (default 0)

property bbox

captions(margin=None, line_spacing=2.5, **kwargs) → tuple[str, str]

Look for a caption in the table.

Since this method is somewhat slow, the result is cached if captions() is called with default arguments.

Parameters:

margin – margin around the table to search for captions. Positive margin = expands the table.
line_spacing – minimum line spacing to consider two lines as separate.

Returns:

tuple[str, str]: (caption_above, caption_below)

static from_dict(d: dict, page: BasePage) → CroppedTable | RotatedCroppedTable

Deserialize a CroppedTable object from dict.

Because file locations may change, the caller must provide the original page. For helper methods, see PyPDFium2Utils.load_page_from_dict and PyPDFium2Utils.reload.

These are required entries of the dict:

- filename (str)
- page_no (int)
- bbox (list of x0, y0, x1, y1)

These entries were formerly required:

- confidence_score (float)
- label (int)

These entries are optional:

- angle (one of 0, 90, 180, 270)

Parameters:

d – dict
page – BasePage

Returns:

CroppedTable object
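The dict layout described above can be checked before deserialization. The following is a minimal sketch that does not call gmft: missing_required_keys is a hypothetical helper, and the filename and coordinates are made-up sample values.

```python
# Keys documented as required by from_dict().
REQUIRED_KEYS = ("filename", "page_no", "bbox")


def missing_required_keys(d: dict) -> list:
    """Return the required keys absent from d, in documented order."""
    return [key for key in REQUIRED_KEYS if key not in d]


# Sample payload; every value here is invented for illustration.
table_dict = {
    "filename": "example.pdf",
    "page_no": 0,
    "bbox": [72.0, 100.0, 540.0, 320.0],
    "angle": 0,  # optional entry
}

complete = missing_required_keys(table_dict)
incomplete = missing_required_keys({"filename": "example.pdf"})
```

Here complete is an empty list, while incomplete names the two entries that still need to be supplied.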

static from_image_only(img: Image) → CroppedTable

Create a CroppedTable object from an image only.

Parameters:

img – PIL image

Returns:

CroppedTable object

property height

image(dpi: int = None, padding: tuple[int, int, int, int] | Literal['auto', None] = None, margin: tuple[int, int, int, int] | Literal['auto', None] = None) → Image

Return the image of the cropped table.

Following pypdfium2, scaling_factor = dpi / 72. Therefore, dpi=72 is the default (no zoom), and dpi=144 is a 2x zoom.

Parameters:

dpi – dots per inch. If not None, the scaling_factor parameter is ignored.
padding – padding (blank pixels) to add to the image, as a tuple of (left, top, right, bottom). Padding is added after the crop and rotation. It is important for subsequent row/column detection; see https://github.com/microsoft/table-transformer/issues/68 for discussion. If padding = 'auto', the padding is automatically set to 10% of the larger of {width, height}. Default is no padding.
margin – add content (in pdf units) from the original pdf beyond the detected table bbox boundary.

Returns:

image of the cropped table
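The arithmetic behind dpi and 'auto' padding can be sketched in a few lines. This is an illustration only: scaling_factor and auto_padding are hypothetical helpers, and applying the same unrounded padding to all four sides is an assumption rather than confirmed library behavior.

```python
PDF_POINTS_PER_INCH = 72


def scaling_factor(dpi: float) -> float:
    """Convert dots per inch to a zoom factor relative to 72 dpi."""
    return dpi / PDF_POINTS_PER_INCH


def auto_padding(width: float, height: float) -> tuple:
    """Return (left, top, right, bottom) padding of 10% of the larger side.

    Assumption: the same value is used on every side, with no rounding.
    """
    pad = max(width, height) / 10
    return (pad, pad, pad, pad)


zoom_default = scaling_factor(72)
zoom_double = scaling_factor(144)
padding = auto_padding(200, 100)
```

For a 200 by 100 table, the larger side is 200, so each side of the padding comes out to 20.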

predicted_word_height(smallest_supported_text_height=0.1): Get the predicted height of standard text in the table. If there are no words, np.nan is returned.

text()

Return the text of the cropped table.

Any words that intersect the table are captured, even if they are not fully contained.

Returns:

text of the cropped table
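The "intersects, even if not fully contained" rule corresponds to a standard rectangle overlap test. The sketch below is not taken from gmft: intersects is a hypothetical helper, and treating boxes that merely touch at an edge as non-intersecting is an assumption.

```python
def intersects(word_box, table_box) -> bool:
    """True if two (x0, y0, x1, y1) rectangles overlap with positive area."""
    wx0, wy0, wx1, wy1 = word_box
    tx0, ty0, tx1, ty1 = table_box
    return wx0 < tx1 and wx1 > tx0 and wy0 < ty1 and wy1 > ty0


table = (100, 100, 300, 200)
straddling_word = (90, 110, 120, 122)  # crosses the table's left edge
outside_word = (10, 10, 40, 22)        # entirely above and left of the table

captured = intersects(straddling_word, table)
skipped = intersects(outside_word, table)
```

Under this test the straddling word is captured even though part of it lies outside the table, which matches the behavior described for text().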

text_positions(remove_table_offset: bool = False, outside: bool = False) → Generator[tuple[int, int, int, int, str], None, None]

Return the text positions of the cropped table.

Any words that intersect the table are captured, even if they are not fully contained.

Parameters:

remove_table_offset – if True, the coordinates are transformed (rotated and translated) so that the top-left corner of the table is (0, 0) and the bottom-right corner is (width, height). If False, transforms (including rotation) are ignored and original coordinates are returned.
outside – if True, returns the complement of the table: all the text positions outside the table. (default: False)

Returns:

generator of text positions, each a tuple (x0, y0, x1, y1, "string")
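For an unrotated table, the effect of remove_table_offset=True amounts to a translation by the table's top-left corner. The following sketch covers only that angle-0 case and does not call gmft; translate_to_table_origin is a hypothetical helper and the word boxes are sample data.

```python
def translate_to_table_origin(words, bbox):
    """Yield word boxes shifted so the table's top-left corner is (0, 0).

    Only the translation is shown; rotation handling is omitted here.
    """
    xmin, ymin, _xmax, _ymax = bbox
    for x0, y0, x1, y1, text in words:
        yield (x0 - xmin, y0 - ymin, x1 - xmin, y1 - ymin, text)


bbox = (100, 200, 400, 350)
page_words = [
    (110, 210, 150, 222, "Year"),
    (160, 210, 210, 222, "Total"),
]
shifted = list(translate_to_table_origin(page_words, bbox))
```

After the shift, a word that sat 10 units inside the table's top-left corner on the page has coordinates starting at (10, 10).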

to_dict()

visualize(show_text=False, **kwargs): Visualize the cropped table.

property width

class gmft.detectors.base.RotatedCroppedTable(page: BasePage, bbox: tuple[int, int, int, int], confidence_score: float, angle: float, label=0)

Bases: CroppedTable

Table that has been rotated.

Note: self.bbox and self.rect are in the coordinates of the original pdf, but text_positions() may return transformed coordinates.

Currently, only 0, 90, 180, and 270 degree rotations are supported. An angle of 90 means that a 90 degree counterclockwise rotation has been applied to a level image.

In practice, most rotated tables are rotated by 90 degrees.

Note: after v0.5, this class is nearly identical to CroppedTable. angle is now directly available in CroppedTable.

Construct a CroppedTable object.

Parameters:

page – BasePage
bbox – tuple of (xmin, ymin, xmax, ymax) or Rect object
confidence_score – confidence score of the table detection
label – label of the table detection. 0 means table; 1 means rotated table.

static from_dict(d: dict, page: BasePage) → CroppedTable | RotatedCroppedTable: Create a RotatedCroppedTable object from dict.

gmft.detectors.base.position_words(words: Generator[tuple[int, int, int, int, str], None, None], y_gap=3): Helper function to convert a list of words with positions to a string.
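One plausible way to turn positioned words into a string is to start a new line whenever the vertical position jumps by more than y_gap. The sketch below illustrates that idea only and is not the library's implementation: join_words_by_line is a hypothetical name, and both the reading-order input and the top-edge comparison are assumptions.

```python
def join_words_by_line(words, y_gap=3):
    """Join (x0, y0, x1, y1, text) tuples into newline-separated lines.

    Assumption: words arrive in reading order, and a new line starts when
    a word's top edge differs from the previous word's by more than y_gap.
    """
    lines = []
    current = []
    last_y = None
    for _x0, y0, _x1, _y1, text in words:
        if last_y is not None and abs(y0 - last_y) > y_gap:
            # Vertical jump exceeds the gap: close the current line.
            lines.append(" ".join(current))
            current = []
        current.append(text)
        last_y = y0
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)


words = [
    (10, 10, 40, 22, "Year"),
    (50, 11, 90, 23, "Total"),
    (10, 30, 40, 42, "2023"),
    (50, 30, 90, 42, "118"),
]
text = join_words_by_line(words)
```

With the default gap of 3, the one-unit wobble between "Year" and "Total" keeps them on the same line, while the jump to y=30 starts a second line.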