Open

PDF Reader: Improve reading of non-tagged tables

Related products: Integrations

FME does not detect tables in a PDF unless they have been tagged as tables by appropriate software. However, most PDF tables are not tagged correctly, so the PDF reader does not read them.

In the discussion here: https://knowledge.safe.com/questions/90534/reading-pdf-table.html?childToView=90894#comment-90894

@jakemolnar and @krisvewsp had some ideas to improve this reader. Maybe something to look at?

Would be very useful!


Thanks,

Claire


2 replies

jakemolnar

There are definitely some complex things to consider if FME wants to tackle this problem. For instance, imagine we have a table like this:

 

A1   B1   C1   D1   E1
A2   B2   C2   D2   E2
...  ...  ...  ...  ...
AN   BN   CN   DN   EN

 

The spatial relationships in a table like this should hopefully be fairly easy to figure out, but what if the table is split across multiple pages? The extra rows could be on the next page with no header row. Or maybe the next page repeats the header row. Or maybe a page only fits columns A, B, and C, so columns D and E end up on the next page.
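To make the "spatial relationships" part concrete, here is a rough Python sketch of the single-page case. It is not how FME or any PDF library works; it simply assumes the text has already been extracted as hypothetical (x, y, text) fragments, with y measured from the top of the page, and the function name and `y_tolerance` parameter are made up for illustration.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into table rows.

    Assumption: fragments whose y values are within y_tolerance of the
    first fragment in a row belong to that row. Cells are ordered by x.
    """
    rows = []  # each entry is (row_y, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))   # same row as the previous fragment
        else:
            rows.append((y, [(x, text)]))   # start a new row
    # Sort each row left to right and keep only the text
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical fragments from a 2 x 2 table with slightly uneven baselines
fragments = [
    (10, 100.0, "A1"), (50, 100.5, "B1"),
    (10, 120.0, "A2"), (50, 119.8, "B2"),
]
table = group_into_rows(fragments)
```

A fixed tolerance like this breaks down quickly on wrapped cell text or merged cells, which is part of why the multi-page cases are even harder.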

FME would probably need something similar to the Excel reader settings box, with a variety of parameters for determining questions such as:

 

  • Is there a header row?
    • Does it span multiple pages?
    • Is it repeated on each page?
  • Is there more than one table?
    • What is the bounding box of each table?
    • Is it the same for each page?
  • What is the page range that contains the table(s)?
  • What should FME do if the table cells contain vector linework and/or images instead of text? (the linework may even look like text).
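As a way of picturing those questions as reader parameters, here is a small Python sketch. Every name in it is hypothetical; none of these are existing FME PDF reader parameters. It also shows how two of the settings could drive the stitching of per-page rows into one table.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TableReadSettings:
    """Hypothetical settings mirroring the questions in the list above."""
    has_header_row: bool = True
    header_repeats_on_each_page: bool = False
    # (min_x, min_y, max_x, max_y) of the table; None means the whole page
    bounding_box: Optional[Tuple[float, float, float, float]] = None
    same_bounding_box_on_every_page: bool = True
    first_page: int = 1
    last_page: Optional[int] = None      # None means read to the last page
    non_text_cells: str = "skip"         # what to do with linework or images


def stitch_pages(pages, settings):
    """Concatenate per-page row lists, dropping repeated header rows."""
    table = []
    for index, rows in enumerate(pages):
        repeated_header = (
            index > 0
            and settings.has_header_row
            and settings.header_repeats_on_each_page
        )
        table.extend(rows[1:] if repeated_header else rows)
    return table


pages = [
    [["id", "name"], ["1", "a"]],
    [["id", "name"], ["2", "b"]],
]
stitched = stitch_pages(pages, TableReadSettings(header_repeats_on_each_page=True))
```

The case where columns D and E spill onto the next page is not handled here; that would need a separate setting and a column-wise merge.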

 

It would probably be helpful for FME users to think about what kind of settings they'd like in order to read the tables that they work with in their own dataflows. Hopefully they comment here and make some suggestions!


grattop
Contributor
  • May 17, 2024

Any development on this topic? Reading untagged PDF tables using Excel is relatively easy. Could something similar be developed to read large volumes of untagged PDF tables using FME?

