Open

PDF Reader: Improve reading of non-tagged tables

Related products: Integrations

FME does not detect PDF tables unless they have been tagged as tables by appropriate software. However, most PDF tables are not tagged correctly, so the PDF reader does not read them.

In the discussion here: https://knowledge.safe.com/questions/90534/reading-pdf-table.html?childToView=90894#comment-90894

@jakemolnar and @krisvewsp had some ideas for improving this reader. Maybe something to look at?

Would be very useful!


Thanks,

Claire


2 replies

jakemolnar

There are definitely some complex things to consider if FME wants to tackle this problem. For instance, imagine we have a table like this:

 

A1   B1   C1   D1   E1
A2   B2   C2   D2   E2
...  ...  ...  ...  ...
AN   BN   CN   DN   EN

 

The spatial relationships should hopefully be fairly easy to figure out, but what if this table is split across multiple pages? The extra rows could be on the next page with no header row, or the next page could repeat the header row. Or maybe a page only fits columns A, B, and C, so columns D and E end up on the next page.
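To make the multi-page cases concrete, here is a minimal Python sketch of how rows might be stitched back together once each page's cells have already been extracted. It is not how FME works; `stitch_pages` and `join_column_split` are hypothetical helper names, and the sketch assumes the first row of the first page is the header.

```python
from typing import List

Row = List[str]


def stitch_pages(pages: List[List[Row]], header_repeated: bool) -> List[Row]:
    """Merge per-page row lists into one table (hypothetical helper).

    Assumes the first row of the first page is the header. When
    header_repeated is True, a first row on a later page that is
    identical to the header is dropped; otherwise every row is kept.
    """
    if not pages or not pages[0]:
        return []
    header = pages[0][0]
    table = [header] + list(pages[0][1:])
    for page_rows in pages[1:]:
        rows = list(page_rows)
        if header_repeated and rows and rows[0] == header:
            rows = rows[1:]  # skip the repeated header row
        table.extend(rows)
    return table


def join_column_split(left: List[Row], right: List[Row]) -> List[Row]:
    """Join two pages that hold different columns of the same rows."""
    if len(left) != len(right):
        raise ValueError("column-split pages must have the same row count")
    return [l + r for l, r in zip(left, right)]
```

For example, `join_column_split` would cover the case where columns A to C sit on one page and D and E on the next, while `stitch_pages` covers extra rows on later pages with or without a repeated header.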

FME would probably need something similar to the Excel reader settings box, with a variety of parameters for determining questions such as:

 

  • Is there a header row?
    • Does it span multiple pages?
    • Is it repeated on each page?
  • Is there more than one table?
    • What is the bounding box of each table?
    • Is it the same for each page?
  • What is the page range that contains the table(s)?
  • What should FME do if the table cells contain vector linework and/or images instead of text? (the linework may even look like text).
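As a rough illustration of the questions above, the settings could be modelled as a small Python structure. This is only a sketch: `TableReadSettings` and all of its field names are made up for this example and are not actual FME PDF reader parameters.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# (x_min, y_min, x_max, y_max) in page units; the unit is an assumption
BBox = Tuple[float, float, float, float]


@dataclass
class TableReadSettings:
    """Hypothetical settings for one table; not real FME parameters."""
    has_header: bool = True
    header_repeated_on_each_page: bool = False
    first_page: int = 1
    last_page: Optional[int] = None       # None means "to the last page"
    default_bbox: Optional[BBox] = None   # None means "whole page"
    bbox_overrides: Dict[int, BBox] = field(default_factory=dict)
    non_text_cells: str = "skip"          # e.g. "skip", "empty", or "error"

    def includes_page(self, page: int) -> bool:
        """Return True if the page falls inside the configured page range."""
        if page < self.first_page:
            return False
        return self.last_page is None or page <= self.last_page

    def bbox_for_page(self, page: int) -> Optional[BBox]:
        """Return the per-page bounding box, falling back to the default."""
        return self.bbox_overrides.get(page, self.default_bbox)
```

A document with more than one table could then be described by a list of these objects, one per table, each with its own page range and bounding boxes.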

 

It would probably be helpful for FME users to think about what kind of settings they'd like in order to read the tables that they work with in their own dataflows. Hopefully they comment here and make some suggestions!


grattop
Contributor
  • May 17, 2024

Any development on this topic? Reading untagged PDF tables using Excel is relatively easy. Could something similar be developed to read large volumes of untagged PDF tables using FME?

