Histogram based
================

HistogramFormatter takes the bounding boxes of words, interprets them as intervals, and populates a histogram-like structure. Locations where there is almost no overlap can then be detected, and the table structure is deduced from these separating lines.

.. image:: ../images/histogram_expl.png
    :alt: histogram based
    :align: center

When to use?
-------------

HistogramFormatter is extremely fast, so it is recommended as a starting point.

With very large tables, TATRFormatter and DITRFormatter struggle, so HistogramFormatter will often give the best output. To tell when this is the case, simply count the separator lines directly. In this case (a very large table), the fast histogram-based algorithm means that the slow structure recognition step can be skipped entirely.

- DETR is hard capped at 100 detected objects, so it does worse as a table approaches that limit.

HistogramFormatter works less well for tables with multi-line content or with subscripts and superscripts. In these cases, DITRFormatter is able to read the table semantically.

Recommendation:

1. Run HistogramFormatter.
2. If the number of row separators plus the number of column separators is below some threshold (e.g. 60), run DITRFormatter.

.. code-block:: python

    from gmft.formatters.histogram import HistogramFormatter

    # `tables` is assumed to be a list of tables already cropped by a detector
    formatter = HistogramFormatter()
    formatted_tables = [formatter.format(table) for table in tables]

Mix and Match
--------------

HistogramFormatter produces separating lines, and so does DITRFormatter. You can therefore mix and match the separating lines from both when one method is more accurate than the other along a given axis. For instance, if HistogramFormatter works well for rows but struggles with columns, you can combine HistogramFormatter's row separators with DITRFormatter's column separators.
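
As a rough illustration of the ideas on this page, the following plain-Python sketch treats word extents as intervals, looks for positions where the interval coverage drops to (almost) zero, applies the separator-count heuristic, and combines row and column separators from two sources. It is not gmft's implementation or API: ``find_separators``, ``should_try_ditr``, ``mix_separators``, the ``max_overlap`` parameter, and the dict layout are all hypothetical names chosen for the example, and 60 is only the example threshold mentioned in the recommendation.

.. code-block:: python

    from typing import Dict, List, Tuple

    Interval = Tuple[float, float]


    def find_separators(intervals: List[Interval], max_overlap: int = 0) -> List[float]:
        """Return midpoints of interior gaps covered by at most `max_overlap` intervals."""
        # Turn each interval into a +1 event at its start and a -1 event at its end.
        events = []
        for lo, hi in intervals:
            events.append((lo, 1))
            events.append((hi, -1))
        # At equal positions, ends sort before starts, so touching intervals
        # produce a zero-width gap, which is skipped below.
        events.sort()

        separators = []
        coverage = 0
        gap_start = None  # where the current low-coverage stretch began
        for pos, delta in events:
            before = coverage
            coverage += delta
            if before > max_overlap and coverage <= max_overlap:
                gap_start = pos
            elif before <= max_overlap < coverage and gap_start is not None:
                if pos > gap_start:
                    separators.append((gap_start + pos) / 2)
                gap_start = None
        return separators


    def should_try_ditr(row_seps: List[float], col_seps: List[float], threshold: int = 60) -> bool:
        """Heuristic from the recommendation: few separators means a small table."""
        return len(row_seps) + len(col_seps) < threshold


    def mix_separators(row_source: Dict[str, List[float]],
                       col_source: Dict[str, List[float]]) -> Dict[str, List[float]]:
        """Take row separators from one result and column separators from another."""
        return {"rows": row_source["rows"], "cols": col_source["cols"]}


    # Horizontal extents of words in three columns
    x_intervals = [(0, 10), (2, 9), (15, 25), (16, 24), (30, 40)]
    print(find_separators(x_intervals))  # -> [12.5, 27.5]

Setting ``max_overlap=1`` in this sketch tolerates a single word that straddles a gap, which is one way to read "almost no overlap".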