gmft.pdf_bindings

These classes are aliased through this module: BasePage, BasePDFDocument, ImageOnlyPage, PyPDFium2Page, PyPDFium2Document.

class gmft.pdf_bindings.BasePDFDocument

Bases: ABC

close()
abstract get_filename() str
abstract get_page(n: int) BasePage

Return the page at index n (0-indexed).

class gmft.pdf_bindings.BasePage(page_number: int)

Bases: ABC

abstract get_filename() str
abstract get_image(dpi: int = None, rect: Rect = None) Image

Get an image of the page, cropped to the given rect, specified as (x0, y0, x1, y1).

abstract get_positions_and_text() Generator[tuple[float, float, float, float, str], None, None]

A generator of text and positions. The tuple is (x0, y0, x1, y1, “string”)
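As an illustration of consuming these tuples, the sketch below groups words into text lines by their top coordinate. It runs on hand-written sample tuples rather than a real page, and it assumes a top-left origin so that y grows downward. The helper name group_into_lines and the tolerance value are hypothetical and are not part of gmft.

```python
from typing import Iterable

# Same shape as the tuples described above: (x0, y0, x1, y1, text)
Word = tuple[float, float, float, float, str]


def group_into_lines(words: Iterable[Word], tolerance: float = 2.0) -> list[str]:
    """Join words whose top edge (y0) is within `tolerance` of the line's first word."""
    lines: list[tuple[float, list[Word]]] = []
    # Sort top-to-bottom, then left-to-right (assumes y grows downward).
    for word in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(word[1] - lines[-1][0]) <= tolerance:
            lines[-1][1].append(word)
        else:
            lines.append((word[1], [word]))
    # Within each line, order words by x0 and join their text.
    return [" ".join(w[4] for w in sorted(ws, key=lambda w: w[0])) for _, ws in lines]


# Sample data standing in for the output of get_positions_and_text()
sample = [
    (115.0, 100.5, 160.0, 112.5, "world"),
    (72.0, 100.0, 110.0, 112.0, "Hello"),
    (72.0, 120.0, 100.0, 132.0, "Next"),
]
lines = group_into_lines(sample)
```

With a real page, the same helper could presumably be fed page.get_positions_and_text() directly, since it only relies on the tuple layout.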

height: float
property page_no
width: float

class gmft.pdf_bindings.ImageOnlyPage(img: Image, *, words: list[tuple[float, float, float, float, str]] = None, dpi: int = None)

Bases: BasePage

A placeholder page that contains only an image.

Parameters:
  • words – Word tuples for the page. Their coordinates are assumed to be in PDF units (dpi=72), not image units.

  • dpi – If provided, the image is assumed to be an upscaled version taken from the PDF.

close()
classmethod from_page(page: BasePage, dpi: int) ImageOnlyPage

Build an ImageOnlyPage from an existing page. The dpi argument is required because it is needed for upscaling.

get_filename() str
get_image(dpi: int = None, rect: Rect = None) Image

Get the image, downscaling as needed.

Parameters:
  • dpi – Resolution; defaults to 72. If dpi is not provided, dpi=72 is assumed even if the intrinsic image has a higher resolution.

  • rect – if provided, crops to the given rect (in PDF units).
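To make the relationship between PDF units and image pixels concrete, here is a minimal sketch of the unit conversion. It relies only on the fact that PDF units are 1/72 inch, so one PDF unit maps to dpi / 72 pixels. The function pdf_rect_to_pixels is a hypothetical helper, not a gmft function, and how gmft rounds internally is not stated here.

```python
def pdf_rect_to_pixels(rect, dpi=72):
    """Scale an (x0, y0, x1, y1) rect from PDF units (72 per inch) to pixels at `dpi`."""
    scale = dpi / 72  # pixels per PDF unit
    x0, y0, x1, y1 = rect
    return (round(x0 * scale), round(y0 * scale), round(x1 * scale), round(y1 * scale))


# A rect given in PDF units, mapped onto an image rendered at 144 dpi
box = pdf_rect_to_pixels((36, 72, 306, 396), dpi=144)

# At the default 72 dpi, PDF units and pixels coincide
same = pdf_rect_to_pixels((36, 72, 306, 396))
```

This is why a rect passed in PDF units can address the same region regardless of the resolution of the underlying image.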

get_positions_and_text() Generator[tuple[float, float, float, float, str], None, None]

This ImageOnlyPage has no text to extract.

class gmft.pdf_bindings.PyPDFium2Document(filename: str)

Bases: BasePDFDocument

Wraps a pdfium.PdfDocument object. Note that you (the user) are responsible for calling doc.close() once you are done; otherwise, the document will remain open and consume resources.
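One way to make sure close() is always called is the standard library's contextlib.closing, which works with any object that has a close() method. The sketch below uses a hypothetical stand-in class, FakeDocument, so that it runs without a PDF file; it is not part of gmft.

```python
from contextlib import closing


class FakeDocument:
    """Hypothetical stand-in that mimics a document with get_page() and close()."""

    def __init__(self):
        self.closed = False

    def get_page(self, n):
        if n != 0:
            raise IndexError(n)
        return f"page {n}"

    def close(self):
        self.closed = True


doc = FakeDocument()
try:
    # closing() calls doc.close() on exit, even if the body raises.
    with closing(doc) as d:
        d.get_page(5)  # raises IndexError in this stand-in
except IndexError:
    pass
```

Because closing() only requires a close() method, the same pattern should apply to a real document, for example `with closing(PyPDFium2Document("file.pdf")) as doc:`, where "file.pdf" is a placeholder path.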

close()

Close the document

get_filename() str
get_page(n: int) BasePage

Return the page at index n (0-indexed).

class gmft.pdf_bindings.PyPDFium2Page(page: pypdfium2.PdfPage, filename: str, page_no: int, *, parent: PyPDFium2Document = None)

Bases: BasePage

Note: This follows PIL's convention, in which (0, 0) is the top-left corner. As a result, y0 and y1 are flipped relative to PyPDFium2's convention.
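The flip described in this note can be sketched as a small conversion between a bottom-left origin (the native PDF convention) and a top-left origin. The helper flip_y is hypothetical and not part of gmft; it assumes the page origin is at (0, 0) with no crop-box offset or rotation.

```python
def flip_y(rect, page_height):
    """Convert (x0, y0, x1, y1) between bottom-left and top-left origins.

    The old y1 becomes the new y0 (and vice versa) so that y0 <= y1 still holds.
    """
    x0, y0, x1, y1 = rect
    return (x0, page_height - y1, x1, page_height - y0)


# A box near the top of a 792-unit-tall page, in bottom-left coordinates
bottom_left = (72.0, 700.0, 200.0, 720.0)
top_left = flip_y(bottom_left, page_height=792.0)

# The conversion is its own inverse
round_trip = flip_y(top_left, page_height=792.0)
```

The x coordinates are untouched; only the vertical pair changes, which is why coordinates from this class should not be mixed directly with raw PyPDFium2 coordinates.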

close()

Not recommended: use close_document instead.

close_document()
get_filename() str
get_image(dpi: int = None, rect: Rect = None) Image

Get an image of the page, cropped to the given rect, specified as (x0, y0, x1, y1).

get_positions_and_text() Generator[Tuple[float, float, float, float, str], None, None]

A generator of text and positions. The tuple is (x0, y0, x1, y1, “string”)

Warning: PyPDFium2Page caches the results of this method.