Question

HOW TO EXTRACT INFORMATION FROM PDF

  • 21 May 2019
  • 4 replies
  • 101 views

I am a new FME user and am trying to extract some text from a PDF. After reading this forum, I came up with this:

Using an AttributeFilter, I selected the page number I need to extract the information from, and the inspector showed me this:

The yellow text is the information I need to extract. How do I extract it to an Excel or CSV file?


4 replies

Hi @grazielatm

 

Could you share the workspace template (.fmwt) or your PDF with us?

 

Thanks,

Danilo

@danilo_fme, thank you very much. I am going to check on that. Since I am a new user, I'm still struggling with FME. :)

 

Badge

One option that you could try would be to use the `Non-Spatial > Read Non-Spatial Text` mode.

This produces all of the text that can be found for each page, and it may be easier to extract the information you're looking for from that output.

 

In your case I would expect the feature text to contain lines like:

"X error (cm) Y error (cm) Z error (cm) XY error (cm) Total error (cm)"

"0.275764 0.699132 4.04833 0.751553 4.11799"

 

You could use an AttributeSplitter to split the page text into lines, and a second one to split each line on whitespace.
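To make the two-step split concrete, here is a minimal Python sketch of the same idea outside FME. It assumes the page text has already been read into a string and looks like the two sample lines above; the function names `parse_error_table` and `to_csv` are made up for illustration and are not part of FME. Note that the header labels themselves contain spaces, so the sketch splits the header on the "(cm)" suffix and only the values line on whitespace.

```python
import csv
import io


def parse_error_table(page_text):
    """Return (headers, values) from text shaped like the sample lines.

    Assumption: the header line starts with "X error" and the values
    are on the next non-empty line.
    """
    # First split: break the page text into non-empty lines.
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    for i, line in enumerate(lines[:-1]):
        if line.startswith("X error"):
            # Header labels contain spaces, so split on the unit suffix.
            headers = [h.strip() + " (cm)" for h in line.split("(cm)") if h.strip()]
            # Second split: the values line splits cleanly on whitespace.
            values = lines[i + 1].split()
            return headers, values
    raise ValueError("header line not found")


def to_csv(headers, values):
    """Write one header row and one value row as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(headers)
    writer.writerow(values)
    return buf.getvalue()


sample = (
    "X error (cm) Y error (cm) Z error (cm) XY error (cm) Total error (cm)\n"
    "0.275764 0.699132 4.04833 0.751553 4.11799\n"
)
headers, values = parse_error_table(sample)
csv_text = to_csv(headers, values)
```

The values are kept as strings so they reach the CSV exactly as they appear in the PDF text.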

Hi @grazielatm

The PDF reader has a parameter (under Non-Spatial > Read Tagged Tables) which controls reading tagged tables as a feature type. If a tagged table is present in your PDF, features will be output from the pdf_table feature type.

You may want to confirm whether your input dataset contains tagged tables, as using this parameter would be the easiest way of extracting the information. You may also want to try decompressing your PDF file before reading, as suggested here, since this allows the PDF reader to read tagged tables from certain datasets.

As an aside, if you would like the PDF reader to support reading non-tagged tables, please feel free to vote on this Idea.

If none of the above suggestions work for you, one workaround is to relate the insertion point of each text feature to a table cell and extract the text strings that fall within each cell's area. I have attached an example workspace demonstrating this workflow here: gettextfrompdftable.fmwt
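The idea behind that workaround can be sketched in plain Python. This is only an illustration of point-in-cell matching and not the logic of the attached workspace: it assumes each table cell is an axis-aligned rectangle with a known row and column, and each text feature is an (x, y, string) insertion point. The `Cell` class and `assign_text_to_cells` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Axis-aligned table cell (hypothetical structure for illustration)."""
    row: int
    col: int
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, x, y):
        # True when the point lies inside or on the cell boundary.
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax


def assign_text_to_cells(cells, texts):
    """Map (row, col) to the text whose insertion point falls in that cell.

    texts is an iterable of (x, y, string). Strings landing in the same
    cell are joined with a space; points outside every cell are ignored.
    """
    table = {}
    for x, y, s in texts:
        for cell in cells:
            if cell.contains(x, y):
                key = (cell.row, cell.col)
                table[key] = table[key] + " " + s if key in table else s
                break  # a point is assigned to the first matching cell only
    return table


cells = [
    Cell(row=0, col=0, xmin=0, ymin=10, xmax=50, ymax=20),
    Cell(row=1, col=0, xmin=0, ymin=0, xmax=50, ymax=10),
]
texts = [(5, 15, "X error (cm)"), (5, 5, "0.275764"), (200, 200, "page footer")]
table = assign_text_to_cells(cells, texts)
```

Once every string has a row and column, writing the rows out to CSV or Excel is straightforward.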
