Passing into LLMs
Because processing tables takes a nontrivial amount of time, I recommend converting all PDF documents into text files ahead of time, well before embedding them or passing them into LLMs.
Any text format can be used, but for the best performance, I recommend Markdown or LaTeX. For your convenience, an experimental method, embed_tables, automatically converts a PDF document into a format suitable for LLMs. It produces the PDF content interspersed with tables in a markdown-like format.
Note that the presence of line breaks depends on a heuristic which works best with PyMuPDF. While line breaks have been ported to PyPDFium2Document, the approximation is imperfect. See the PyMuPDF section for more information.
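from gmft.formatters.page.embed import embed_tables
from gmft_pymupdf import PyMuPDFDocument  # provided by the separate gmft_pymupdf package

doc = PyMuPDFDocument("data/pdfs/7.pdf")  # PyMuPDF is preferred
# PyPdfium2 is possible, but line breaks may be less accurate
# from gmft.pdf_bindings import PyPDFium2Document
# doc = PyPDFium2Document("path/to/pdf")

# pdf7_tables: the tables already extracted from this document (defined elsewhere in the test)
rich_pages = embed_tables(doc=doc, tables=pdf7_tables)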
assert rich_pages[3].get_text().startswith("""Infectious of Alestig al. BMC Diseases Page 2011, 11:124 4 7 et
http://www.biomedcentral.com/1471-2334/11/124
Table and viral baseline with and without in patients Host 2 parameters treatment response
| | | SVR n = 29 | non-SVR n = 21 | Univariate p value |
|---:|:------------------------------------|:---------------------|:--------------------|:---------------------|
| 0 | Age (mean) | 45.2 | 48.8 | 0.09a |
| 1 | Number of patients < 45 / > 45 yrs | 11 / 18 | 4 / 17 | 0.21b |
| 2 | Gender (m/f) | 17 / 12 | 13 / 8 | 1.0b |
| 3 | Baseline HCV RNA (mean log IU/mL) | 6.37 | 6.59 | 0.56a |
| 4 | Number with < 5.6 / > 5.6 log IU/mL | 8 / 21 | 0 / 21 | 0.01b |
| 5 | Genotype 1a/1b | 21 / 8 | 16 / 5 | 1.0b |
| 6 | Fibrosis (F0/F1/F2/F3/F4)c | 0 / 10 / 13 / 4 / 0 | 2 / 4 / 4 / 7 / 2 | 0.19d |
| 7 | Core aa 70 | 28 R / 1 Q | 15 R / 5 Q & 1 P | 0.03b |
| 8 | Core aa 91 | 21 C / 6 M / 2 L | 16 C / 3 M / 2 L | 0.82e |
| 9 | rs12979860 | 16 CC / 13 CT / 0 TT | 2 CC / 11 CT / 8 TT | 0.0001e |
a Mann-Whitney U test.
b Fisher’s exact test.
c for Fibrosis scored according Ludwig and and available Batts, patients. 34 to was was
d Logistic regression.
e Chi test. square
the correlation of responders Subgenotypes, and mutations strains, 8 7 treatment strong: core response was
substitu\ufffetions The virologic associated with (R70) had and non-responders had glutamine 5 arginine not response was
residue (Q70) (p 0.005). residue the However, 91. In in 37 70 at at contrast, asso\ufffeciated poor response was a =
with substitutions of residue of the with infection, all the with One SVR 70: patients patients 1a 21 core
(14%) (six with substitutions residue carried with while of the patients HCV non-SVR 70 7 R70, 15 16 at car\uffferied
subtype 1b with and subtype with Q70 strains strain 1a R70. strains one
P70) with""")
Example taken from the corresponding internal test.
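To follow the recommendation of converting documents to text ahead of time, the pages returned by embed_tables can be written to a file once and reused later. The following is a minimal sketch, not part of gmft: save_pages_as_text and its separator parameter are hypothetical names, and the only thing assumed about each page is the get_text() method shown in the example above.

```python
from pathlib import Path
from typing import Iterable, Protocol


class HasText(Protocol):
    """Anything exposing get_text(), like the pages in the example above."""

    def get_text(self) -> str: ...


def save_pages_as_text(
    pages: Iterable[HasText], out_path: str, separator: str = "\n\n"
) -> str:
    """Join the text of every page and write it to out_path (hypothetical helper).

    Returns the combined text so it can be inspected or embedded directly.
    """
    text = separator.join(page.get_text() for page in pages)
    Path(out_path).write_text(text, encoding="utf-8")
    return text


# Intended usage, assuming rich_pages comes from embed_tables as above:
# save_pages_as_text(rich_pages, "7.txt")
```

Caching the text this way means the table detection and formatting cost is paid once per document rather than on every LLM call.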
Which format is best?
For the simple task of matching a cell to its header, the formats rank as follows, from best to worst (~ means roughly equal):
markdown ~ latex ~ json > html >> csv_plus* >> csv ~ tsv
(*Only OpenAI models were tested. csv_plus is CSV with an extra space after each comma. Its improvement over plain CSV might be attributable to better tokenization.)
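To make the csv_plus variant concrete, here is a minimal sketch that renders rows as CSV with a space after each comma. The function names are hypothetical, and the quoting rule (wrap fields containing commas, quotes, or line breaks in double quotes) is my assumption based on common CSV conventions; the description above only specifies the extra space.

```python
from typing import Iterable, Sequence


def quote_field(value: object) -> str:
    """Quote a field only when it contains a comma, a double quote, or a line break."""
    text = str(value)
    if any(ch in text for ch in ',"\n\r'):
        return '"' + text.replace('"', '""') + '"'
    return text


def to_csv_plus(rows: Iterable[Sequence[object]]) -> str:
    """Render rows as CSV, but join fields with ", " instead of ","."""
    return "\n".join(", ".join(quote_field(v) for v in row) for row in rows)


# A few rows from the table above; "Parameter" is a placeholder label for the
# first column, which has no header in the original table.
rows = [
    ["Parameter", "SVR n = 29", "non-SVR n = 21"],
    ["Age (mean)", "45.2", "48.8"],
    ["Gender (m/f)", "17 / 12", "13 / 8"],
]
print(to_csv_plus(rows))
```

The separator is applied per field rather than by replacing commas in finished CSV text, so commas inside quoted fields are left untouched.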