There are definitely some complex things to consider if FME wants to tackle this problem. For instance, imagine we have a table like this:
A1  B1  C1  D1  E1
A2  B2  C2  D2  E2
...
AN  BN  CN  DN  EN
It should be fairly easy to figure out the spatial relationships here, but what if this table is split across multiple pages? It could be that the extra rows are on the next page with no header row. Or maybe the next page repeats the header row. Or maybe the page only fits columns A, B, and C, so columns D and E end up on the next page.
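To make the first two cases concrete, here is a minimal Python sketch of stitching rows back together when a table continues onto later pages, with or without a repeated header. It is only an illustration, not anything FME does today: the function name `stitch_pages` and the idea that each page has already been reduced to a list of rows are assumptions, and it does not handle the case where columns D and E spill onto another page.

```python
from typing import List, Optional

Row = List[str]


def stitch_pages(pages: List[List[Row]], has_header: bool = True) -> List[Row]:
    """Concatenate per-page row lists into a single table.

    Hypothetical helper: assumes each page was already parsed into rows.
    If has_header is True, the first row seen is treated as the header,
    and an identical first row on a later page is dropped as a repeat.
    """
    table: List[Row] = []
    header: Optional[Row] = None
    for page_rows in pages:
        if not page_rows:
            continue  # nothing extracted from this page
        rows = page_rows
        if has_header:
            if header is None:
                # First non-empty page: remember and keep the header.
                header = rows[0]
                table.append(header)
                rows = rows[1:]
            elif rows[0] == header:
                # Continuation page that repeats the header: skip it.
                rows = rows[1:]
        table.extend(rows)
    return table


pages = [
    [["A", "B"], ["A1", "B1"]],   # page 1: header plus one row
    [["A", "B"], ["A2", "B2"]],   # page 2: header repeated
    [["A3", "B3"]],               # page 3: no header at all
]
stitched = stitch_pages(pages)
```

A data row that happens to be identical to the header would be dropped by mistake, which is one reason a reader would probably need an explicit "header repeated on each page" parameter rather than guessing.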
FME would probably need something similar to the Excel reader settings box, with a variety of parameters for determining questions such as:
- Is there a header row?
- Does it span multiple pages?
- Is it repeated on each page?
- Is there more than one table?
- What is the bounding box of each table?
- Is it the same for each page?
- What is the page range that contains the table(s)?
- What should FME do if the table cells contain vector linework and/or images instead of text? (the linework may even look like text).
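As a rough way to picture that settings box, the questions above could map onto a set of parameters like the Python sketch below. None of these names are real FME reader parameters; `PdfTableSettings` and its fields are hypothetical, and the allowed values for non-text cells are just examples of the kind of choice a user might be given.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in page units; an assumed convention.
BBox = Tuple[float, float, float, float]

NON_TEXT_CHOICES = ("skip", "flag", "ocr")  # illustrative options only


@dataclass
class PdfTableSettings:
    """Hypothetical parameter set mirroring the questions in the list."""
    has_header_row: bool = True
    spans_multiple_pages: bool = False
    header_repeated_on_each_page: bool = False
    multiple_tables: bool = False
    bounding_boxes: List[BBox] = field(default_factory=list)  # one per table
    same_bbox_on_every_page: bool = True
    page_range: Optional[Tuple[int, int]] = None  # None means all pages
    non_text_cells: str = "skip"  # what to do with linework or images

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if none)."""
        problems: List[str] = []
        if self.header_repeated_on_each_page and not self.has_header_row:
            problems.append("header cannot repeat if there is no header row")
        if self.page_range is not None:
            first, last = self.page_range
            if first < 1 or last < first:
                problems.append("page_range must be (first, last) with 1 <= first <= last")
        if self.non_text_cells not in NON_TEXT_CHOICES:
            problems.append("non_text_cells must be one of: " + ", ".join(NON_TEXT_CHOICES))
        return problems


defaults_ok = PdfTableSettings().validate()
bad_range = PdfTableSettings(page_range=(5, 2)).validate()
```

Writing the parameters down this way also shows which ones depend on each other, for example that a repeated header only makes sense when there is a header row in the first place.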
It would probably be helpful for FME users to think about what kind of settings they would like in order to read the tables that they work with in their own dataflows. Hopefully they comment here and make some suggestions!
Has there been any development on this topic? Reading untagged PDF tables using Excel is relatively easy. Could something similar be developed to read large volumes of untagged PDF tables using FME?