gmft.detectors.base module

A collection of common objects used by detectors.

Type Hierarchy:

  • CroppedTable
    • RotatedCroppedTable
      • FormattedTable

  • BaseDetector
    • TATRDetector

    • Img2TableDetector

BaseDetector is the base class for all detectors.

Module containing methods of detecting tables from whole pdf pages.

Example:
>>> from gmft.auto import AutoTableDetector

class gmft.detectors.base.BaseDetector

Bases: ABC, Generic[ConfigT]

Abstract base class for table detectors.

detect(page: BasePage, config_overrides: ConfigT = None, **kwargs) → list[gmft.detectors.base.CroppedTable]

Alias for extract().

abstract extract(page: BasePage, config_overrides: ConfigT = None) → list[gmft.detectors.base.CroppedTable]

Extract tables from a page.

Parameters:
  • page – BasePage

  • config_overrides – override the config for this call only

Returns:

list of CroppedTable objects
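The relationship between detect(), extract(), and config_overrides can be sketched with a toy class hierarchy. This is a minimal illustration, not gmft's implementation: ToyBaseDetector, ToyDetector, ToyConfig, and the list-of-dicts "page" are all hypothetical stand-ins, and the filtering logic is assumed for demonstration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

ConfigT = TypeVar("ConfigT")


@dataclass
class ToyConfig:
    # Hypothetical config with a single threshold.
    min_confidence: float = 0.5


class ToyBaseDetector(ABC, Generic[ConfigT]):
    """Stand-in for the shape of BaseDetector; not gmft's code."""

    def detect(self, page, config_overrides: Optional[ConfigT] = None, **kwargs):
        # Alias: forward to extract(). Extra keyword arguments are
        # accepted and ignored in this sketch.
        return self.extract(page, config_overrides=config_overrides)

    @abstractmethod
    def extract(self, page, config_overrides: Optional[ConfigT] = None) -> list:
        ...


class ToyDetector(ToyBaseDetector[ToyConfig]):
    def __init__(self, config: Optional[ToyConfig] = None):
        self.config = config or ToyConfig()

    def extract(self, page, config_overrides=None):
        # The override applies to this call only; self.config is untouched.
        config = config_overrides or self.config
        return [c for c in page if c["confidence"] >= config.min_confidence]


# A hypothetical "page": a list of candidate detections.
page = [{"confidence": 0.9}, {"confidence": 0.3}]
detector = ToyDetector()
default_result = detector.detect(page)
override_result = detector.extract(
    page, config_overrides=ToyConfig(min_confidence=0.1)
)
```

Keeping detect() as a thin alias means subclasses only need to implement extract().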

class gmft.detectors.base.CroppedTable(page: BasePage, bbox: tuple[int, int, int, int] | Rect, confidence_score: float = 1.0, label=0, *, angle: Literal[0, 90, 180, 270] = 0)

Bases: object

A pdf selection, cropped to include just a table. Created by BaseDetector.

Construct a CroppedTable object.

Parameters:
  • page – BasePage

  • bbox – tuple of (xmin, ymin, xmax, ymax) or Rect object

  • confidence_score – confidence score of the table detection

  • label – label of the table detection. 0 means table; 1 means rotated table.

property bbox

captions(margin=None, line_spacing=2.5, **kwargs) → tuple[str, str]

Look for a caption in the table.

Since this method is somewhat slow, the result is cached if captions() is called with default arguments.

Parameters:
  • margin – margin around the table to search for captions. Positive margin = expands the table.

  • line_spacing – minimum line spacing to consider two lines as separate.

Returns:

tuple[str, str]: [caption_above, caption_below]

static from_dict(d: dict, page: BasePage) → CroppedTable | RotatedCroppedTable

Deserialize a CroppedTable object from dict.

Because file locations may change, the caller must provide the original page. The helper methods PyPDFium2Utils.load_page_from_dict and PyPDFium2Utils.reload can be used to obtain it.

These are required entries of the dict:
  • filename (str)

  • page_no (int)

  • bbox (list of x0, y0, x1, y1)

These entries were formerly required:
  • confidence_score (float)

  • label (int)

These entries are optional:
  • angle (one of 0, 90, 180, 270)

Parameters:
  • d – dict

  • page – BasePage

Returns:

CroppedTable object
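The dict shape described above can be checked before deserializing. The following is a rough sketch under stated assumptions: check_table_dict is a hypothetical helper (not part of gmft), and the example values, including the file name, are made up.

```python
REQUIRED_KEYS = ("filename", "page_no", "bbox")
ALLOWED_ANGLES = (0, 90, 180, 270)


def check_table_dict(d: dict) -> list:
    """Hypothetical helper: list the problems found in a serialized table dict."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in d:
            problems.append(f"missing required entry: {key}")
    if "bbox" in d and len(d["bbox"]) != 4:
        problems.append("bbox must have four values: x0, y0, x1, y1")
    # angle is optional; this sketch treats a missing angle as 0.
    if d.get("angle", 0) not in ALLOWED_ANGLES:
        problems.append("angle must be one of 0, 90, 180, 270")
    return problems


# Illustrative values only.
example = {
    "filename": "report.pdf",
    "page_no": 0,
    "bbox": [72.0, 100.0, 540.0, 320.0],
    "angle": 90,
}
problems = check_table_dict(example)
```

A dict that passes this check still needs the matching page object supplied to from_dict.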

static from_image_only(img: Image) → CroppedTable

Create a CroppedTable object from an image only.

Parameters:

img – PIL image

Returns:

CroppedTable object

property height

image(dpi: int = None, padding: tuple[int, int, int, int] | Literal['auto', None] = None, margin: tuple[int, int, int, int] | Literal['auto', None] = None) → Image

Return the image of the cropped table.

Following pypdfium2, scaling_factor = dpi / 72. Therefore, dpi=72 corresponds to a scaling factor of 1 (the default), and dpi=144 is 2x zoom.

Parameters:
  • dpi – dots per inch. If not None, the scaling_factor parameter is ignored.

  • padding – padding (blank pixels) to add to the image, as a tuple of (left, top, right, bottom). Padding is added after the crop and rotation. It is important for subsequent row/column detection; see https://github.com/microsoft/table-transformer/issues/68 for discussion. If padding = 'auto', the padding is automatically set to 10% of the larger of {width, height}. Default is no padding.

  • margin – add content (in pdf units) from the original pdf beyond the detected table bbox boundary.

Returns:

image of the cropped table
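The dpi and 'auto' padding arithmetic described above can be worked through in a few lines. This is a sketch of the stated formulas only, not gmft's code: scaling_factor and auto_padding are hypothetical helper names, and the choices to round to whole pixels and to apply the same padding on all four sides are assumptions.

```python
def scaling_factor(dpi: float) -> float:
    # PDF units are 1/72 inch, so 72 dpi maps one PDF unit to one pixel.
    return dpi / 72


def auto_padding(width: float, height: float) -> tuple:
    # 10% of the larger dimension, rounded to whole pixels (assumption),
    # applied equally as (left, top, right, bottom) (assumption).
    pad = int(round(0.10 * max(width, height)))
    return (pad, pad, pad, pad)


zoom = scaling_factor(144)
padding = auto_padding(200, 100)
```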

predicted_word_height(smallest_supported_text_height=0.1)

Get the predicted height of standard text in the table. If there are no words, np.nan is returned.

text()

Return the text of the cropped table.

Any words that intersect the table are captured, even if they are not fully contained.

Returns:

text of the cropped table

text_positions(remove_table_offset: bool = False, outside: bool = False) → Generator[tuple[int, int, int, int, str], None, None]

Return the text positions of the cropped table.

Any words that intersect the table are captured, even if they are not fully contained.

Parameters:
  • remove_table_offset – if True, the coordinates are transformed (rotated and translated) so that the top-left corner of the table is (0, 0) and the bottom-right corner is (width, height). If False, transforms (including rotation) are ignored and original coordinates are returned.

  • outside – if True, returns the complement of the table: all the text positions outside the table. (default: False)

Returns:

generator of text positions, each of which is a tuple (x0, y0, x1, y1, "string")
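For an unrotated table (angle 0), the effect of remove_table_offset=True amounts to a translation by the table's top-left corner. The sketch below illustrates only that case; to_table_coords is a hypothetical helper rather than gmft's implementation, and rotated tables would additionally need a rotation, which is not shown.

```python
def to_table_coords(word: tuple, table_bbox: tuple) -> tuple:
    """Translate one (x0, y0, x1, y1, text) tuple so the table's
    top-left corner becomes (0, 0). Angle 0 only (assumption)."""
    x0, y0, x1, y1, text = word
    tx0, ty0, _, _ = table_bbox
    return (x0 - tx0, y0 - ty0, x1 - tx0, y1 - ty0, text)


# Illustrative values: a table at (100, 200)-(400, 500) in page coordinates.
table_bbox = (100, 200, 400, 500)
word_on_page = (110, 220, 150, 232, "Total")
word_in_table = to_table_coords(word_on_page, table_bbox)
```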

to_dict()

visualize(show_text=False, **kwargs)

Visualize the cropped table.

property width

class gmft.detectors.base.RotatedCroppedTable(page: BasePage, bbox: tuple[int, int, int, int], confidence_score: float, angle: float, label=0)

Bases: CroppedTable

Table that has been rotated.

Note: self.bbox and self.rect are in coordinates of the original pdf. But text_positions() can possibly give transformed coordinates.

Currently, only 0, 90, 180, and 270 degree rotations are supported. An angle of 90 means that a 90 degree counterclockwise (cc) rotation has been applied to a level image.

In practice, most rotated tables are rotated by 90 degrees.

Note: after v0.5, this class is nearly identical to CroppedTable. angle is now directly available in CroppedTable.

Construct a CroppedTable object.

Parameters:
  • page – BasePage

  • bbox – tuple of (xmin, ymin, xmax, ymax) or Rect object

  • confidence_score – confidence score of the table detection

  • label – label of the table detection. 0 means table; 1 means rotated table.

static from_dict(d: dict, page: BasePage) → CroppedTable | RotatedCroppedTable

Create a RotatedCroppedTable object from dict.

gmft.detectors.base.position_words(words: Generator[tuple[int, int, int, int, str], None, None], y_gap=3)

Helper function to convert a list of words with positions to a string.
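One plausible way such a helper could work is to start a new line whenever the vertical position jumps by more than y_gap. The sketch below is an independent illustration, not gmft's position_words: join_words_by_line is a hypothetical name, and it assumes the words arrive in reading order and that comparing top (y0) coordinates is sufficient.

```python
from typing import Iterable


def join_words_by_line(words: Iterable[tuple], y_gap: float = 3) -> str:
    """Hypothetical re-creation: group (x0, y0, x1, y1, text) tuples into
    lines, breaking when y0 moves by more than y_gap."""
    lines = []
    last_y = None
    for x0, y0, x1, y1, text in words:
        if last_y is None or abs(y0 - last_y) > y_gap:
            lines.append([])  # start a new line
        lines[-1].append(text)
        last_y = y0
    return "\n".join(" ".join(line) for line in lines)


# Illustrative input: two words on one line, one word on the next.
words = [
    (0, 0, 10, 5, "Name"),
    (12, 0, 20, 5, "Qty"),
    (0, 10, 10, 15, "Apples"),
]
text = join_words_by_line(words)
```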