gmft.pdf_bindings.base

gmft defines a common interface so that document backends are interchangeable. A conforming document must meet three requirements:

  1. The document must be composed of one or more pages.

  2. The document must expose its text together with the corresponding locations (bounding boxes, or bboxes).

  3. The document must expose a way to obtain images for each page.

As a consequence of requirement 2, OCR can be implemented as a layer that augments a PDF with re-detected text.
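The OCR-as-a-layer idea can be sketched without gmft itself. The classes below (SketchPage, ScannedPage, OcrLayer) and the fake_ocr callable are hypothetical stand-ins written for illustration; they are not part of gmft, and a real binding would subclass BasePage and return a real image object.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterator, List, Tuple

Word = Tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text)


class SketchPage(ABC):
    """Hypothetical stand-in for a page: exposes an image and located text."""

    @abstractmethod
    def get_image(self, dpi: int = 72) -> object: ...

    @abstractmethod
    def get_positions_and_text(self) -> Iterator[Word]: ...


class ScannedPage(SketchPage):
    """A page that has pixels but no embedded text, such as a scan."""

    def __init__(self, pixels: object):
        self.pixels = pixels

    def get_image(self, dpi: int = 72) -> object:
        return self.pixels

    def get_positions_and_text(self) -> Iterator[Word]:
        return iter(())  # nothing to extract


class OcrLayer(SketchPage):
    """Wraps another page and serves OCR output as that page's text."""

    def __init__(self, inner: SketchPage, ocr: Callable[[object], List[Word]]):
        self.inner = inner
        self.ocr = ocr

    def get_image(self, dpi: int = 72) -> object:
        return self.inner.get_image(dpi)

    def get_positions_and_text(self) -> Iterator[Word]:
        # Assumption: the OCR callable returns boxes already in PDF units.
        yield from self.ocr(self.inner.get_image(72))


def fake_ocr(image: object) -> List[Word]:
    """Placeholder for a real OCR engine; returns one fixed word."""
    return [(10.0, 20.0, 50.0, 32.0, "Hello")]


page = OcrLayer(ScannedPage(pixels=b""), fake_ocr)
words = list(page.get_positions_and_text())
```

Because OcrLayer satisfies the same interface as the page it wraps, downstream code that only calls get_image and get_positions_and_text does not need to know whether the text came from the PDF or from OCR.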

class gmft.pdf_bindings.base.BasePDFDocument

Bases: ABC

close()
abstract get_filename() str
abstract get_page(n: int) BasePage

Get the page at index n (0-indexed).

class gmft.pdf_bindings.base.BasePage(page_number: int)

Bases: ABC

abstract get_filename() str
abstract get_image(dpi: int = None, rect: Rect = None) Image

Get an image of the page. If rect is given as (x0, y0, x1, y1), the image is constrained to that region.

abstract get_positions_and_text() Generator[tuple[float, float, float, float, str], None, None]

A generator of text and positions. Each yielded tuple is (x0, y0, x1, y1, “string”).
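As an illustration of consuming these tuples, the sketch below selects the words whose boxes lie entirely inside a region. The helper words_in_rect and the sample data are hypothetical and not part of gmft; in practice the input would come from a page's get_positions_and_text().

```python
from typing import Iterable, List, Tuple

Word = Tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text)
Box = Tuple[float, float, float, float]        # (x0, y0, x1, y1)


def words_in_rect(words: Iterable[Word], rect: Box) -> List[str]:
    """Return the text of every word whose bbox is fully contained in rect."""
    rx0, ry0, rx1, ry1 = rect
    return [
        text
        for x0, y0, x1, y1, text in words
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1
    ]


# Made-up sample in the (x0, y0, x1, y1, "string") shape described above.
sample = [
    (72.0, 72.0, 110.0, 84.0, "Table"),
    (115.0, 72.0, 125.0, 84.0, "1"),
    (72.0, 400.0, 130.0, 412.0, "Footer"),
]
selected = words_in_rect(sample, (50.0, 50.0, 300.0, 100.0))
```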

height: float
property page_no
width: float

class gmft.pdf_bindings.base.ImageOnlyPage(img: Image, *, words: list[tuple[float, float, float, float, str]] = None, dpi: int = None)

Bases: BasePage

This is a dummy page that only contains an image.

Parameters:
  • words – Assumes the words provided are in PDF units (dpi=72), not image units.

  • dpi – If provided, will assume the image is an upscaled version taken from the PDF.

close()
classmethod from_page(page: BasePage, dpi: int) ImageOnlyPage

Build an ImageOnlyPage from an existing page. The dpi is needed for upscaling.

get_filename() str
get_image(dpi: int = None, rect: Rect = None) Image

Get the page image, downscaling as needed.

Parameters:
  • dpi – resolution; defaults to 72. If dpi is not provided, dpi=72 is assumed even if the intrinsic image has a higher resolution.

  • rect – if provided, crops to the given rect (in PDF units).
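The relationship between PDF units (dpi=72) and image pixels can be sketched as a simple scale factor of dpi / 72. The helper pdf_rect_to_pixels below is hypothetical and only illustrates the arithmetic; how gmft rounds or crops internally is not shown in this reference and may differ.

```python
from typing import Tuple

PDF_DPI = 72  # PDF units are defined at 72 per inch


def pdf_rect_to_pixels(
    rect: Tuple[float, float, float, float], dpi: int = PDF_DPI
) -> Tuple[int, int, int, int]:
    """Scale a rect given in PDF units to pixel coordinates at the given dpi."""
    scale = dpi / PDF_DPI
    x0, y0, x1, y1 = rect
    # Rounding to whole pixels is an assumption made for this sketch.
    return (round(x0 * scale), round(y0 * scale),
            round(x1 * scale), round(y1 * scale))


# A rect in PDF units, mapped onto an image rendered at 144 dpi (2x upscaling).
pixel_rect = pdf_rect_to_pixels((36.0, 72.0, 144.0, 216.0), dpi=144)
```

At dpi=72 the scale factor is 1, so PDF units and pixel coordinates coincide, which matches the default described above.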

get_positions_and_text() Generator[tuple[float, float, float, float, str], None, None]

This ImageOnlyPage has no text to extract.