Question

HOW TO EXTRACT INFORMATION FROM PDF

May 21, 2019
4 replies
657 views

grazielatm

I am a new FME user and am trying to extract some text from a PDF. After reading this forum, I came up with this:

Using an AttributeFilter, I selected the page number I need to extract that information from, and the Inspector showed me this:

The yellow text is the information I need to extract. How do I extract it to an Excel or CSV file?


danilo_fme
Celebrity
May 21, 2019

Hi @grazielatm

Could you share the workspace template (.fmwt) or your PDF with us?

Thanks,

Danilo

Partner Solutial Brazil - www.solutial.com.br


grazielatm
Author
May 21, 2019

@danilo_fme, thank you very much. I am going to check on that. Since I am a new user, I'm still struggling with FME. :)


jakemolnar
May 21, 2019

One option that you could try would be to use the `Non-Spatial > Read Non-Spatial Text` mode.

This produces all of the text that can be found for each page, and it may be easier to extract the information you're looking for from that output.

In your case I would expect the feature text to contain lines like:

"X error (cm) Y error (cm) Z error (cm) XY error (cm) Total error (cm)"

"0.275764 0.699132 4.04833 0.751553 4.11799"

You could use an AttributeSplitter to split the page text into lines, and a second one to split each line on whitespace.
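For anyone who wants to see the same split-by-line, then split-by-whitespace idea outside of a workspace, here is a minimal Python sketch. It assumes the page text looks like the two lines quoted above; the function names `parse_error_table` and `to_csv` are made up for illustration and are not part of FME. Because the column labels themselves contain spaces, the header line is matched with a pattern rather than split on whitespace; only the numeric line is split that way.

```python
import csv
import io
import re

# Page text as it might come out of the reader (assumed layout, taken from the lines quoted above).
PAGE_TEXT = """X error (cm) Y error (cm) Z error (cm) XY error (cm) Total error (cm)
0.275764 0.699132 4.04833 0.751553 4.11799"""


def parse_error_table(page_text):
    """Pair each header label with the value at the same position on the next line."""
    # Step 1: split the text into non-empty lines.
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    for i, line in enumerate(lines[:-1]):
        if line.startswith("X error"):
            # Labels such as "Total error (cm)" contain spaces, so match them as units.
            labels = re.findall(r"\S+ error \(\w+\)", line)
            # Step 2: split the value line on whitespace.
            values = [float(v) for v in lines[i + 1].split()]
            if len(labels) == len(values):
                return dict(zip(labels, values))
    return {}


def to_csv(row):
    """Serialize one parsed row as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
    return buf.getvalue()


row = parse_error_table(PAGE_TEXT)
csv_text = to_csv(row)
```

In a workspace, the second AttributeSplitter plays the role of `lines[i + 1].split()`, and a CSV or Excel writer replaces `to_csv`.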


debbiatsafe
Safer
May 24, 2019

Hi @grazielatm

The PDF reader has a parameter (under Non-Spatial > Read Tagged Tables) which controls reading tagged tables as a feature type. If a tagged table is present in your PDF, features will be output from the pdf_table feature type.

You may want to confirm whether your input dataset contains tagged tables, as using this parameter would be the easiest way to extract the information. You may also want to try decompressing your PDF file before reading it, as suggested here, since this allows the PDF reader to read tagged tables from certain datasets.

As an aside, if you would like the PDF reader to support reading non-tagged tables, please feel free to vote on this Idea.

If none of the above suggestions work for you, one workaround would be relating the insertion points of text features to a table cell and extracting the text strings which fall within the cell areas. I have attached an example workspace demonstrating this workflow here: gettextfrompdftable.fmwt
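To illustrate the logic behind that workaround (this is not the attached workspace, just a rough Python sketch of the idea), the snippet below assigns each text string to the table cell whose rectangle contains the text's insertion point. The `Cell` class, the `texts_to_grid` function, and all coordinates are hypothetical; in practice the cell rectangles and insertion points would come from the features the PDF reader outputs.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """A table cell described by its row/column index and bounding rectangle."""
    row: int
    col: int
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, x, y):
        # Inclusive bounds: a point on the cell border counts as inside.
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax


def texts_to_grid(cells, texts):
    """Map (row, col) to the text whose insertion point falls inside that cell.

    `texts` is a list of (x, y, string) tuples. Text outside every cell is ignored,
    and several strings landing in one cell are joined with a space.
    """
    grid = {}
    for x, y, string in texts:
        for cell in cells:
            if cell.contains(x, y):
                grid.setdefault((cell.row, cell.col), []).append(string)
                break  # first matching cell wins
    return {key: " ".join(parts) for key, parts in grid.items()}


# Made-up 2 x 2 table: header row on top, value row below.
cells = [
    Cell(0, 0, 0.0, 10.0, 50.0, 20.0),
    Cell(0, 1, 50.0, 10.0, 100.0, 20.0),
    Cell(1, 0, 0.0, 0.0, 50.0, 10.0),
    Cell(1, 1, 50.0, 0.0, 100.0, 10.0),
]
texts = [
    (5.0, 14.0, "X error (cm)"),
    (55.0, 14.0, "Y error (cm)"),
    (5.0, 4.0, "0.275764"),
    (55.0, 4.0, "0.699132"),
    (200.0, 200.0, "page footer"),  # outside the table, so it is dropped
]
grid = texts_to_grid(cells, texts)
```

Once the strings are keyed by row and column, writing them out as rows of a CSV or Excel sheet is straightforward.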
