Usage
=====
.. _installation:
Installation
------------
First, install `pytorch `_ and transformers with
the desired GPU/CPU setup.
Then, gmft can be installed with pip:

.. code-block:: console

    (.venv) $ pip install gmft

Quickstart
----------
The `quickstart `_ notebook is a good place to start.
To extract many tables at once, see the `bulk extract `_ notebook.
For example,

.. code-block:: python

    from gmft.auto import CroppedTable, AutoTableDetector, AutoTableFormatter
    from gmft.pdf_bindings import PyPDFium2Document

    detector = AutoTableDetector()
    formatter = AutoTableFormatter()

    def ingest_pdf(pdf_path):  # returns (list[CroppedTable], open document)
        doc = PyPDFium2Document(pdf_path)
        tables = []
        for page in doc:
            tables += detector.extract(page)
        return tables, doc

    tables, doc = ingest_pdf("path/to/pdf.pdf")
    doc.close()  # once you're done with the document

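
When many PDFs are processed, each document should still be closed once its tables have been handled. The following is a minimal sketch of that pattern, not part of gmft: ``extract_from_many``, ``open_document`` and ``handle_table`` are hypothetical names, and the only behaviour assumed of gmft is what the example above shows (documents are iterable over pages, expose ``close()``, and ``detector.extract(page)`` returns a list of tables).

.. code-block:: python

    from contextlib import closing


    def extract_from_many(pdf_paths, open_document, detector, handle_table):
        """Detect tables in several PDFs, closing each document when done.

        open_document: callable that opens a path (e.g. PyPDFium2Document).
        detector:      object with an extract(page) method returning tables.
        handle_table:  callable applied to each table while its document is
                       still open (hypothetical hook, e.g. export or format).
        """
        results = {}
        for path in pdf_paths:
            # closing() calls doc.close() even if detection raises.
            with closing(open_document(path)) as doc:
                per_document = []
                for page in doc:
                    for table in detector.extract(page):
                        per_document.append(handle_table(table))
            results[path] = per_document
        return results

With gmft, this could presumably be called as ``extract_from_many(paths, PyPDFium2Document, detector, handle_table)``, where ``handle_table`` does whatever work needs the open document before it is closed.
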
Overview
--------
Documents are represented by a :class:`.BasePDFDocument` object. The default implementation is :class:`.PyPDFium2Document`, which uses the `PyPDFium2 `_ library.
Each page of a document is a :class:`.BasePage`, implemented by default by :class:`.PyPDFium2Page`.
The :class:`.AutoTableDetector` is the recommended table detection tool; it currently uses Microsoft's `Table Transformer `_. Detectors produce :class:`.CroppedTable` objects, which can be exported as images with :meth:`.CroppedTable.image`.
The :class:`.AutoTableFormatter` is the recommended table formatting tool. All TableFormatters produce :class:`.FormattedTable` objects, which contain the original CroppedTable and the formatted dataframe; :meth:`.FormattedTable.df` exports the dataframe.
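
The detector-to-formatter flow described above can be sketched as a small helper. This is an illustration under assumptions, not gmft's own code: ``tables_to_dataframes`` is a hypothetical name, and ``formatter.extract(table)`` is assumed to mirror ``detector.extract(page)`` and return a :class:`.FormattedTable`, whose ``df()`` method is the dataframe export mentioned above.

.. code-block:: python

    def tables_to_dataframes(tables, formatter):
        """Format each cropped table and collect the resulting dataframes.

        Returns (dataframes, failures), where failures is a list of
        (table_index, exception) pairs for tables that could not be formatted.
        """
        dataframes = []
        failures = []
        for index, table in enumerate(tables):
            try:
                # Assumption: the formatter exposes extract(), like the detector.
                formatted = formatter.extract(table)
                dataframes.append(formatted.df())
            except Exception as exc:
                # Keep going so one bad table does not abort the whole batch.
                failures.append((index, exc))
        return dataframes, failures

Collecting failures instead of raising is a design choice for batch runs; for a single table, letting the exception propagate is likely simpler. The helper should be called before the document is closed.
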
.. _mupdf:
PyMuPDF
--------
I recommend `PyMuPDF `_ as the PDF parser of choice due to its better performance, accuracy,
and very powerful line break detection feature.
However, PyMuPDF requires compliance with the AGPL-3.0 license, so it is not included in gmft by default.
To use PyMuPDF, refer to the `gmft_pymupdf `_ repository.
Once installed, PyMuPDFDocument can be used in place of PyPDFium2Document.
**Line detection**. Some functionality (like the img2table detector and :ref:`rich table formatting `) depends on high-quality line detection.
While line detection has been ported to PyPDFium2 via :meth:`.BasePage._get_positions_and_text_and_breaks`,
it is only an imperfect approximation, so extraction will be less accurate than with PyMuPDF.

.. code-block:: bash

    pip install git+https://github.com/conjuncts/gmft_pymupdf.git


.. code-block:: python

    from gmft_pymupdf import PyMuPDFDocument

    doc = PyMuPDFDocument("path/to/pdf")

    # gmft remains unchanged
    from gmft.auto import AutoTableDetector

    detector = AutoTableDetector()

    # extract() works page by page, exactly as with PyPDFium2Document
    tables = []
    for page in doc:
        tables += detector.extract(page)

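
Because ``gmft_pymupdf`` is an optional install, code that should run with or without it can pick the document class at import time. The following is a minimal sketch using only the standard library; ``pick_document_class`` and ``CANDIDATES`` are hypothetical names, and the module and class names are the ones used in the examples in this document.

.. code-block:: python

    import importlib

    # Preferred backend first; the PyPDFium2 binding is the fallback.
    CANDIDATES = [
        ("gmft_pymupdf", "PyMuPDFDocument"),
        ("gmft.pdf_bindings", "PyPDFium2Document"),
    ]


    def pick_document_class(candidates=CANDIDATES):
        """Return the first document class whose module can be imported."""
        for module_name, class_name in candidates:
            try:
                module = importlib.import_module(module_name)
            except ImportError:
                continue  # backend not installed, try the next one
            document_class = getattr(module, class_name, None)
            if document_class is not None:
                return document_class
        names = ", ".join(name for name, _ in candidates)
        raise ImportError(f"none of these backends could be imported: {names}")

The returned class can then be used wherever ``PyPDFium2Document`` or ``PyMuPDFDocument`` appears above, for example ``doc = pick_document_class()("path/to/pdf")``.
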