Question

Reading PDF Table


Badge

Hello,

 

I am struggling to read a very simple table from a PDF. I just want to get something in the output.

I have turned on only the "Read Tagged Tables" option in the reader's parameters, but nothing comes out of it. The reader is not being created in my window. I created a dedicated table in Excel and exported it to PDF to be sure that the table was fine.

It looks simple in the FME presentation... https://www.safe.com/convert/geospatial-pdf/excel/

 

Could someone give me a hint about what I am doing wrong?

 

Thanks!

 

My simple PDF with its readable table:

 

My Reading parameters:


15 replies

Badge

Can you attach the PDF, or a PDF you made by using the same process?

Maybe the version of Excel you are using is not creating a "tagged" table. For instance, Excel for MacOS doesn't create tagged tables.

Your screenshots suggest a Windows environment, so it's probably not the same problem. Still, I could probably figure out what is going on by taking a look at the PDF: I'm very familiar with the underlying structure of PDF files.

Badge

Hello, thanks for the reply.

Yes, no problem. Here is the PDF I am trying to read, as an example. I created it in a Windows 7 environment with the latest version of Excel. testxls.pdf

Badge

A quick visual scan of the inner contents shows that the structures describing a table are there, so it will take a little more investigation to figure out why FME isn't recognizing it.

 

It's possible that the Excel team changed how they lay out the table metadata, or perhaps some other issue (for instance, maybe the characters are drawn instead of printed, which can make it hard for FME to read).

Badge

EDIT: There seems to be a problem (at least sometimes) with the tool FME uses to read the table structure when the PDF is highly compressed. You can work around this by using a tool that decompresses PDF streams, such as MuPDF, or an online service such as this.
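
As a rough illustration of the decompression workaround, here is a minimal Python sketch that shells out to MuPDF's command-line tool. It assumes `mutool` is installed and on the PATH, and it relies on `mutool clean -d`, which rewrites a PDF with its streams decompressed. The function names and the `.decompressed.pdf` naming are only illustrative choices, not anything FME requires.

```python
import shutil
import subprocess
from pathlib import Path


def default_output_path(src):
    """Illustrative naming choice: testxls.pdf -> testxls.decompressed.pdf."""
    src = Path(src)
    return src.with_name(src.stem + ".decompressed.pdf")


def build_decompress_command(src, dst, tool="mutool"):
    """Return the argument list for `mutool clean -d SRC DST`."""
    # The -d option asks `mutool clean` to decompress all streams.
    return [tool, "clean", "-d", str(src), str(dst)]


def decompress_pdf(src, dst=None):
    """Write a decompressed copy of `src` and return the path of the copy."""
    dst = Path(dst) if dst else default_output_path(src)
    if shutil.which("mutool") is None:
        raise RuntimeError("mutool was not found on PATH")
    # check=True raises CalledProcessError if mutool exits with an error.
    subprocess.run(build_decompress_command(src, dst), check=True)
    return dst
```

Calling `decompress_pdf("testxls.pdf")` should leave the original untouched and write the copy next to it; the copy is the file to point the PDF reader at.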

26559-testxls.ffs.zip

decompressed version of PDF: 26559-testxls.decompressed.pdf

I was able to read the table when I tried a decompressed version of your PDF. I've attached my output, as well as some pictures of my workspace (just reading from PDF and writing to FFS) and reader settings.

Workspace:

Settings:

 

Badge

Hi @jakemolnar,

I'm intrigued: in which version are you able to open the PDF uploaded by @claire.medici and successfully read the table as a Feature Type? I've tried in 2018.1.1.1, 2019.0, 2019.0.0.1 and 2019.1 Beta (using your settings each time) and it just won't put the pdf_table feature type on the canvas!

We have a customer who is experiencing the exact same issue, so it would be good to get to the bottom of this!

Simon

Badge

Hm, well that's interesting! I realize now that I may have slightly modified the PDF when I decompressed it to read it in a text editor. Also, maybe this is a platform issue: I was using FME 2019.1 Beta on MacOS.

Will give it another try with the untouched PDF and on Windows.

Badge

Ah, definitely seems to be something to do with the stream decompression: the untouched PDF doesn't work. Investigating.

Badge

Thanks @jakemolnar

I managed to decompress it, and it's working with my sample PDF! Thanks!

However, I have a "real" PDF that I need to convert, and even after decompressing it no tables come out.

I believe the tables weren't built the correct way, but that seems odd given that FME advertises that this works even with scanned PDFs. I was wondering how?

"Have a scanned PDF table to convert? No worries, FME can work with that too using OCR and digitizing transformers directly within your workspace! "

I can't share the PDF I need to convert, as it contains private data, but I was wondering if you had any idea how to get around this? The table in my PDF is readable text, so no OCR should be needed.

Badge

If you're curious about how FME can work with OCR, here's an article that demonstrates a workflow: https://www.safe.com/blog/2016/10/ocr-for-fme-now-i-know-my-abc/

 

If your PDF is already text and has a regular layout, you can use text feature bounding boxes to figure out which table cells each corresponds to, but I admit this is a very laborious process and tends to differ from PDF to PDF.
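
To make the bounding-box idea a bit more concrete, here is a small Python sketch of one way it could work. It is not FME functionality: it assumes you have already exported each text feature as a hypothetical `(text, xmin, ymin, xmax, ymax)` tuple, that cells in a column are left-aligned, and that a fixed gap tolerance separates rows and columns. Real PDFs will usually need per-document tuning.

```python
def cluster(values, tolerance):
    """Group 1-D positions; a gap larger than `tolerance` starts a new group.

    Returns the mean position of each group, in ascending order.
    """
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] <= tolerance:
            groups[-1].append(value)
        else:
            groups.append([value])
    return [sum(group) / len(group) for group in groups]


def nearest(centers, value):
    """Index of the center closest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


def boxes_to_grid(boxes, tolerance=5.0):
    """Arrange (text, xmin, ymin, xmax, ymax) tuples into rows and columns."""
    lefts = [box[1] for box in boxes]                # left edge decides the column
    middles = [(box[2] + box[4]) / 2 for box in boxes]  # vertical center decides the row
    col_centers = cluster(lefts, tolerance)
    # PDF coordinates grow upward, so the top row has the largest y value.
    row_centers = sorted(cluster(middles, tolerance), reverse=True)
    grid = [["" for _ in col_centers] for _ in row_centers]
    for box, left, middle in zip(boxes, lefts, middles):
        row = nearest(row_centers, middle)
        col = nearest(col_centers, left)
        # Join fragments that land in the same cell.
        grid[row][col] = (grid[row][col] + " " + box[0]).strip()
    return grid
```

For example, four boxes forming a 2x2 layout come back as two rows of two cells each. Merged cells, right-aligned numbers and multi-line cells all break these assumptions, which is part of why the process differs from PDF to PDF.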

 

It's possible for FME's PDF Reader to try to automate this in the future: if you create an Idea post for a PDF Reader enhancement (or maybe something like a "TableRelater" transformer), then I'm sure the developers will respond.

Badge

Ok @jakemolnar Thanks for your answer, and all your help on this topic!

Badge

No problem :)

Badge +4

I have also been struggling to extract tables in an automatic and stable manner. My PDF files come from Excel, but also from Word and other text-processing software, and their tables are often not picked up by the current PDF reader in FME. The reason is that not all tables are tagged as tables (as they are when printing correctly from Excel). I tried a few other tools, such as Tabulapdf, but settled on Camelot, which is quite a new player but works for most of my PDF files. It's a Python library which can be found here:

https://github.com/socialcopsdev/camelot

 

I made a little Python script that locates the tables and writes them to CSV files. I compiled the script into a command-line executable (using PyInstaller), which I can call from FME.
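
For anyone wanting to try the same approach, a script along these lines might look like the sketch below. This is not the script described above, only a minimal guess at its shape: it assumes Camelot is installed, uses `camelot.read_pdf` with the `pages` and `flavor` arguments, and writes each table's DataFrame to CSV. The function names and the output file naming are made up for illustration.

```python
from pathlib import Path


def csv_path(pdf_path, index, out_dir):
    """Illustrative naming choice: report.pdf -> <out_dir>/report_table_1.csv."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_table_{index}.csv"


def extract_tables(pdf_path, out_dir, flavor="lattice"):
    """Find tables in a PDF with Camelot and write one CSV file per table."""
    # Third-party dependency, imported here so the helper above works without it.
    import camelot

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # "lattice" looks for ruling lines between cells; "stream" relies on whitespace.
    tables = camelot.read_pdf(str(pdf_path), pages="all", flavor=flavor)
    written = []
    for index, table in enumerate(tables, start=1):
        target = csv_path(pdf_path, index, out_dir)
        # table.df is a pandas DataFrame holding the extracted cells.
        table.df.to_csv(target, index=False, header=False)
        written.append(target)
    return written
```

The resulting CSV files can then be read back into a workspace with an ordinary CSV reader. Which flavor works better depends on whether the tables have visible cell borders.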

Maybe we can vote for this tool to be built into FME at some point. A more powerful table extraction method would be very helpful, since it's a world full of PDF files we are all living in. :)

Badge

I'd be quite happy to vote/comment on an Idea post like that too; maybe you should make one and link it here!

Badge

I have posted an idea here: https://knowledge.safe.com/idea/90905/read-non-tagged-tables-in-pdfs.html?

Do not hesitate to add comments and explanations to it.

Badge

Thanks Claire!
