PyPDFium2

Note that this module was formerly called bindings_pdfium. Importing from bindings_pdfium still works, but the logic now lives in gmft.pdf_bindings.pdfium.

class gmft.pdf_bindings.pdfium.PyPDFium2Document(filename: str)

Bases: BasePDFDocument

Wraps a pdfium.PdfDocument object. You (the user) are responsible for calling doc.close() once you are done; otherwise, the document remains open and continues to consume resources.
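One way to make sure close() always runs is the standard library's contextlib.closing, which calls close() on exit even if the body raises. The sketch below uses a hypothetical StandInDocument class (not part of gmft) so that it runs without a PDF on disk; the assumption is that a PyPDFium2Document could be passed to closing() the same way, since it exposes close().

```python
from contextlib import closing


class StandInDocument:
    """Hypothetical stand-in that mimics the close()/get_filename() methods
    documented for PyPDFium2Document. It is NOT the real class."""

    def __init__(self, filename: str):
        self.filename = filename
        self.closed = False

    def get_filename(self) -> str:
        return self.filename

    def close(self) -> None:
        # The real class would release the underlying PDF handle here.
        self.closed = True


doc = StandInDocument("example.pdf")

# closing() guarantees doc.close() is called when the block exits,
# whether normally or through an exception.
with closing(doc) as d:
    name = d.get_filename()
```

A plain try/finally block with doc.close() in the finally clause achieves the same effect.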

close()

Close the document

get_filename() str
get_page(n: int) BasePage

Get the page at index n (0-indexed).

class gmft.pdf_bindings.pdfium.PyPDFium2Page(page: pypdfium2.PdfPage, filename: str, page_no: int, *, parent: PyPDFium2Document = None)

Bases: BasePage

Note: Coordinates follow PIL’s convention, with (0, 0) at the top left. Beware that y0 and y1 are therefore flipped relative to PyPDFium2’s convention.
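To illustrate the flip, here is a minimal sketch of converting a box from a bottom-left origin (the usual PDF convention) to a top-left origin. The function name to_top_left and the explicit page_height argument are assumptions made for this example; they are not part of gmft.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def to_top_left(bbox: BBox, page_height: float) -> BBox:
    """Convert (left, bottom, right, top) measured from the bottom-left
    corner into (x0, y0, x1, y1) measured from the top-left corner.

    Hypothetical helper for illustration only.
    """
    left, bottom, right, top = bbox
    # The upper edge of the box becomes the smaller y value (y0),
    # and the lower edge becomes the larger y value (y1).
    return (left, page_height - top, right, page_height - bottom)


# A box 20..70 units above the bottom edge of an 800-unit-tall page.
flipped = to_top_left((10.0, 20.0, 110.0, 70.0), page_height=800.0)
```

After the conversion y0 is less than y1, which is the ordering PIL expects.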

close()

Not recommended: use close_document instead.

close_document()
get_filename() str
get_image(dpi: int = None, rect: Rect = None) Image

Get an image of the page, constrained to the given rect (x0, y0, x1, y1).
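As a rough guide to how dpi and rect interact, the sketch below estimates the pixel size of a rendered region. It assumes that page coordinates are in PDF points (1/72 inch) and that the image scales linearly with dpi; the helper name estimate_pixel_size is hypothetical, and the exact size returned by get_image may differ because of rounding.

```python
from typing import Tuple

POINTS_PER_INCH = 72.0  # PDF user space default: 1 point = 1/72 inch


def estimate_pixel_size(
    rect: Tuple[float, float, float, float], dpi: float
) -> Tuple[float, float]:
    """Estimate (width, height) in pixels for a rect given in points.

    Hypothetical helper; an actual renderer may round differently.
    """
    x0, y0, x1, y1 = rect
    scale = dpi / POINTS_PER_INCH
    return ((x1 - x0) * scale, (y1 - y0) * scale)


# A US Letter page (612 x 792 points) at two resolutions.
size_72 = estimate_pixel_size((0.0, 0.0, 612.0, 792.0), dpi=72)
size_144 = estimate_pixel_size((0.0, 0.0, 612.0, 792.0), dpi=144)
```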

get_positions_and_text() Generator[Tuple[float, float, float, float, str], None, None]

Generates text together with its position. Each tuple is (x0, y0, x1, y1, “string”).

Warning: PyPDFium2Page caches the results of this method.
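The tuples can be consumed with ordinary unpacking. The sketch below filters words that fall entirely inside a region, using hand-written sample tuples in the same (x0, y0, x1, y1, text) shape in place of a real page; the helper words_in_rect and the sample values are made up for illustration.

```python
from typing import Iterable, List, Tuple

Word = Tuple[float, float, float, float, str]
BBox = Tuple[float, float, float, float]


def words_in_rect(words: Iterable[Word], rect: BBox) -> List[str]:
    """Return the text of every word whose box lies fully inside rect.

    Hypothetical helper; coordinates use a top-left origin.
    """
    rx0, ry0, rx1, ry1 = rect
    return [
        text
        for x0, y0, x1, y1, text in words
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1
    ]


# Sample data standing in for the output of get_positions_and_text().
sample_words = [
    (72.0, 100.0, 110.0, 112.0, "Total"),
    (120.0, 100.0, 150.0, 112.0, "42"),
    (72.0, 400.0, 130.0, 412.0, "Footer"),
]

selected = words_in_rect(sample_words, (0.0, 90.0, 200.0, 120.0))
```

With a real page, the first argument would be the generator returned by page.get_positions_and_text().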

class gmft.pdf_bindings.pdfium.PyPDFium2Utils

Bases: object

Helper class for pypdfium2

static load_page_from_dict(d: dict) BasePage

Helper method to load a BasePage from a serialized CroppedTable or TATRFormattedTable. This method reads a PDF from disk, so you must close it manually through page.close_document().

static reload(ct: CroppedTable, doc=None) Tuple['CroppedTable', 'PyPDFium2Document']

Reloads the CroppedTable from disk. This is useful for a CroppedTable whose document has been closed.

Parameters: