gmft.pdf_bindings.base

gmft defines a common interface so that documents from different PDF bindings can be used interchangeably. The key requirements are as follows:

1. The document must be composed of multiple pages.
2. The document must expose text with its corresponding locations (bboxes).
3. The document must expose a way to obtain an image of each page.

As a consequence of the second requirement, OCR can be implemented as a layer that augments a PDF with re-detected text.
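The OCR-as-a-layer idea can be sketched as a wrapper page that delegates images to the underlying page and supplies text from an OCR function. This is a minimal, self-contained illustration, not gmft's implementation: `PageLike`, `OCRLayerPage`, `FakeScannedPage` and `fake_ocr` are hypothetical names, the "image" is a plain string stand-in, and the OCR callable is assumed to return words already converted to PDF units.

```python
from abc import ABC, abstractmethod
from typing import Callable, Generator, Iterable, Tuple

Word = Tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text)


class PageLike(ABC):
    """Hypothetical stand-in for the page interface described above."""

    @abstractmethod
    def get_image(self, dpi=None, rect=None): ...

    @abstractmethod
    def get_positions_and_text(self) -> Generator[Word, None, None]: ...


class OCRLayerPage(PageLike):
    """Hypothetical wrapper: images come from the inner page, text from OCR."""

    def __init__(self, inner: PageLike, ocr: Callable[[object], Iterable[Word]]):
        self.inner = inner
        self.ocr = ocr  # assumed to return words in PDF units

    def get_image(self, dpi=None, rect=None):
        # Image access is delegated unchanged.
        return self.inner.get_image(dpi=dpi, rect=rect)

    def get_positions_and_text(self):
        # Text is re-detected from the page image instead of read from the PDF.
        yield from self.ocr(self.inner.get_image())


class FakeScannedPage(PageLike):
    """Illustrative scanned page: it has an image but no embedded text."""

    def get_image(self, dpi=None, rect=None):
        return f"image@{dpi or 72}"  # string stand-in for a real image

    def get_positions_and_text(self):
        yield from ()


def fake_ocr(image) -> Iterable[Word]:
    # Stand-in for a real OCR engine.
    return [(72.0, 72.0, 120.0, 84.0, "Invoice")]


page = OCRLayerPage(FakeScannedPage(), fake_ocr)
words = list(page.get_positions_and_text())
```

Because the wrapper exposes the same two methods as the page it wraps, code that consumes pages would not need to know whether the text came from the PDF or from OCR.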

class gmft.pdf_bindings.base.BasePDFDocument

Bases: ABC

class gmft.pdf_bindings.base.BasePage(page_number: int)

Bases: ABC

abstract get_image(dpi: int = None, rect: Rect = None) → Image: Get an image of the page. If rect (x0, y0, x1, y1) is given, the image is constrained to lie within it.

abstract get_positions_and_text() → Generator[tuple[float, float, float, float, str], None, None]: A generator of positions and text. Each tuple is (x0, y0, x1, y1, "string").
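As an illustration of consuming tuples in the (x0, y0, x1, y1, "string") shape, the sketch below keeps only the words that lie fully inside a rectangle. The helper `words_in_rect` is hypothetical, and the rectangle is a plain tuple here rather than gmft's `Rect` type.

```python
from typing import Iterable, Iterator, Tuple

Word = Tuple[float, float, float, float, str]   # (x0, y0, x1, y1, text)
Box = Tuple[float, float, float, float]         # (x0, y0, x1, y1)


def words_in_rect(words: Iterable[Word], rect: Box) -> Iterator[Word]:
    """Yield the words whose bounding box is fully contained in rect."""
    rx0, ry0, rx1, ry1 = rect
    for x0, y0, x1, y1, text in words:
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1:
            yield (x0, y0, x1, y1, text)


# Sample data in the same shape that get_positions_and_text() yields.
sample = [
    (10.0, 10.0, 40.0, 20.0, "Name"),
    (200.0, 300.0, 240.0, 310.0, "Total"),
]
inside = list(words_in_rect(sample, (0.0, 0.0, 100.0, 100.0)))
```

Since the function accepts any iterable, the generator returned by a page could be passed in directly without first building a list.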

class gmft.pdf_bindings.base.ImageOnlyPage(img: Image, *, words: list[tuple[float, float, float, float, str]] = None, dpi: int = None)

This is a dummy page that only contains an image.

Parameters:

words – Words with their positions. The coordinates are assumed to be in PDF units (dpi=72), not image units.
dpi – If provided, the image is assumed to be an upscaled version taken from the PDF.
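Since words are expected in PDF units (dpi=72) while an OCR engine typically reports pixel coordinates, a conversion step may be needed before passing them in. The helper below is a hypothetical sketch of that arithmetic, assuming the image was rendered from the page at a known dpi with the same origin.

```python
from typing import Tuple

Word = Tuple[float, float, float, float, str]  # (x0, y0, x1, y1, text)

PDF_DPI = 72  # PDF units correspond to 72 dpi


def pixels_to_pdf_units(word: Word, image_dpi: float) -> Word:
    """Scale a word's pixel bbox down to PDF units (hypothetical helper)."""
    scale = PDF_DPI / image_dpi
    x0, y0, x1, y1, text = word
    return (x0 * scale, y0 * scale, x1 * scale, y1 * scale, text)


# A word detected on an image rendered at 144 dpi (twice the PDF resolution).
converted = pixels_to_pdf_units((100.0, 200.0, 300.0, 240.0, "Total"), image_dpi=144)
# converted == (50.0, 100.0, 150.0, 120.0, "Total")
```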

classmethod from_page(page: BasePage, dpi: int) → ImageOnlyPage: Create an ImageOnlyPage from an existing page. dpi is required for upscaling.

get_image(dpi: int = None, rect: Rect = None) → Image

Gets the page image, downscaling as needed.

Parameters:

dpi – Resolution of the returned image. Defaults to 72: if dpi is not provided, dpi=72 is assumed even if the intrinsic image has a higher resolution.
rect – if provided, crops to the given rect (in PDF units).
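The relationship between the requested dpi, the intrinsic image resolution and a rect in PDF units can be worked through with a little arithmetic. The sketch below is not gmft's implementation. It is a hypothetical helper showing which pixel region of the stored image a PDF-unit rect would correspond to, and how large the result would be at the requested dpi, with rect given as a plain (x0, y0, x1, y1) tuple.

```python
from typing import Optional, Tuple

PDF_DPI = 72  # PDF units correspond to 72 dpi

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF units


def crop_and_output_size(
    rect: Box, image_dpi: float, dpi: Optional[float] = None
) -> Tuple[Tuple[int, int, int, int], Tuple[int, int]]:
    """Return (crop box in intrinsic pixels, output size in pixels)."""
    target_dpi = PDF_DPI if dpi is None else dpi  # 72 when dpi is omitted
    to_pixels = image_dpi / PDF_DPI               # PDF units -> intrinsic pixels
    x0, y0, x1, y1 = rect
    crop_box = (
        round(x0 * to_pixels),
        round(y0 * to_pixels),
        round(x1 * to_pixels),
        round(y1 * to_pixels),
    )
    out_scale = target_dpi / PDF_DPI              # PDF units -> output pixels
    out_size = (round((x1 - x0) * out_scale), round((y1 - y0) * out_scale))
    return crop_box, out_size


# Stored image is 144 dpi; no dpi requested, so the output is at 72 dpi.
crop_box, out_size = crop_and_output_size((36, 36, 108, 72), image_dpi=144)
# crop_box == (72, 72, 216, 144), out_size == (72, 36)
```

In this example the cropped region is 144 by 72 pixels in the stored image and is halved to 72 by 36 pixels, which is the downscaling case described above.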

get_positions_and_text() → Generator[tuple[float, float, float, float, str], None, None]: This ImageOnlyPage has no text to extract.