Question

Extracting text from PDF

  • July 20, 2018
  • 8 replies
  • 459 views

I'm looking to extract text from a specific area of a PDF. The box contains IDs and I would like those written to a simple Excel file. My source data is a PDF with roughly 100 pages, and the area I want scanned is in the same location on each page. I believe the best tool to use for this would be the new Adobe Geospatial PDF reader but I am unsure how to proceed from there.


8 replies

stalknecht
  • Contributor
  • 305 replies
  • July 22, 2018

Please share the PDF, or a part of it, and describe the info you would like to get.


  • Author
  • 7 replies
  • July 22, 2018

Since I do not have permission to share the exact PDF, I have made a figure showing the basic template of each page. I am looking to extract the text circled in red here.

[Figure: basic template of a page, with the text to extract circled in red]

stalknecht
  • Contributor
  • 305 replies
  • July 22, 2018
If you open the PDF in the Data Inspector, you will find the min and max extents of the part you want.

Use a Creator to create a box with these extents, then use a SpatialFilter to get the desired data.
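
For anyone who finds it easier to reason about this in code: the box-plus-filter idea boils down to a point-in-box test that is repeated for every page, since the target area sits in the same place each time. Below is a minimal plain-Python sketch of that logic. It is not FME code, and the `TextFeature` record, its fields, the sample values, and the box extents are all made up for illustration; in practice the extents would be the min and max values read off in the Data Inspector.

```python
from dataclasses import dataclass


@dataclass
class TextFeature:
    """Hypothetical stand-in for one piece of text read from a PDF page."""
    page: int
    x: float
    y: float
    text: str


def in_box(feature, xmin, ymin, xmax, ymax):
    """Return True if the feature's anchor point lies inside the box."""
    return xmin <= feature.x <= xmax and ymin <= feature.y <= ymax


def filter_by_box(features, xmin, ymin, xmax, ymax):
    """Keep only the features that fall inside the box, on any page."""
    return [f for f in features if in_box(f, xmin, ymin, xmax, ymax)]


# Made-up sample data: two pages that share the same layout.
features = [
    TextFeature(1, 50.0, 700.0, "Report title"),
    TextFeature(1, 480.0, 60.0, "ID-0001"),
    TextFeature(2, 50.0, 700.0, "Report title"),
    TextFeature(2, 480.0, 60.0, "ID-0002"),
]

# Assumed extents of the area of interest (xmin, ymin, xmax, ymax).
BOX = (450.0, 40.0, 560.0, 80.0)

hits = filter_by_box(features, *BOX)
print([(f.page, f.text) for f in hits])
```

Keeping the page number on each record is a deliberate choice here: it lets the results be told apart later even though every page uses the same coordinates.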

 

 



  • Author
  • 7 replies
  • July 22, 2018

Thank you for your quick response. This method has gotten me closer to my goal, but I'm now struggling to turn the filtered result into an attribute holding the contained text that I can write to an Excel or CSV file. Since the PDF contains over 100 pages, the filtered result is difficult to read in the Data Inspector because the data from every page draws on top of itself.

stalknecht
  • Contributor
  • 305 replies
  • July 22, 2018

 

Just add an Excel or CSV writer and voilà.

 

 


  • Author
  • 7 replies
  • July 22, 2018

 

 

The resulting table does not have an attribute with the text contained in the filtered area and I'm not sure how to produce that.

 

 


  • Author
  • 7 replies
  • July 22, 2018

 

 

 

I was able to pull these out by using an AttributeCreator and mapping a new attribute to the value of "fme_text_string". Thanks again for your help!
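
Judging by this last post, the missing piece was that the text had to be copied into an ordinary, named attribute before the writer would output it. As a rough illustration of that final step, here is a plain-Python sketch (again, not FME code) that copies each record's text into a named column and writes the rows as CSV. The record layout, the `page` and `id` column names, and the sample values are assumptions for the example.

```python
import csv
import io


def to_rows(features):
    """Copy each feature's text into a named 'id' column, ordered by page."""
    ordered = sorted(features, key=lambda f: f["page"])
    return [{"page": f["page"], "id": f["text"]} for f in ordered]


def write_csv(rows, out):
    """Write the rows to any file-like object as CSV with a header line."""
    writer = csv.DictWriter(out, fieldnames=["page", "id"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Made-up filtered result: one text record per page, in arbitrary order.
filtered = [
    {"page": 2, "text": "ID-0002"},
    {"page": 1, "text": "ID-0001"},
]

buffer = io.StringIO()
write_csv(to_rows(filtered), buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a handle opened with `open("ids.csv", "w", newline="")` (the file name is arbitrary). A CSV like this opens directly in Excel.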