Question

Extracting text from PDF


I'm looking to extract text from a specific area of a PDF. The area is a box containing IDs, and I would like those IDs written to a simple Excel file. My source data is a PDF with roughly 100 pages, and the area I want to read is in the same location on each page. I believe the best tool for this is the new Adobe Geospatial PDF reader, but I am unsure how to proceed from there.

8 replies

stalknecht
Contributor
  • Contributor
  • July 22, 2018

Please share the pdf or a part of it and describe the info you would like to get.


  • Author
  • July 22, 2018
stalknecht wrote:

Please share the pdf or a part of it and describe the info you would like to get.

Since I do not have permission to share the exact PDF, I have made a figure showing the basic template of each page. I am looking to extract the text circled in red here.

 

 


stalknecht
Contributor
  • Contributor
  • July 22, 2018
If you open the PDF in the Data Inspector, you will find the min and max extents of the part you want.

Use a Creator to create a box with these extents, then use a SpatialFilter to keep only the features that fall inside that box.
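To make the idea concrete outside of FME, here is a minimal Python sketch of what the box-plus-filter step does conceptually. It is not FME's API: the `TextFeature` and `Box` classes, the coordinates, and the sample IDs are all made up for illustration. The only assumption carried over from the steps above is that each text feature has a position and that the box is defined by its min and max extents.

```python
from dataclasses import dataclass


@dataclass
class TextFeature:
    """Hypothetical stand-in for one text feature read from a PDF page."""
    page: int
    x: float
    y: float
    text: str


@dataclass
class Box:
    """Rectangle defined by the min and max extents noted in the inspector."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def contains(self, x: float, y: float) -> bool:
        # A point is inside when both coordinates lie within the extents.
        return self.min_x <= x <= self.max_x and self.min_y <= y <= self.max_y


def filter_by_box(features, box):
    """Keep only the features whose anchor point falls inside the box."""
    return [f for f in features if box.contains(f.x, f.y)]


# Made-up sample data: two pages, each with a title and an ID in the same spot.
features = [
    TextFeature(1, 50.0, 700.0, "Page title"),
    TextFeature(1, 420.0, 90.0, "ID-0001"),
    TextFeature(2, 50.0, 700.0, "Page title"),
    TextFeature(2, 421.5, 88.0, "ID-0002"),
]
id_box = Box(min_x=400.0, min_y=80.0, max_x=500.0, max_y=100.0)

ids = [f.text for f in filter_by_box(features, id_box)]
print(ids)
```

Because the box sits at the same location on every page, one set of extents is enough to pick out the ID from each page.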

 

 



  • Author
  • July 22, 2018
stalknecht wrote:
If you open the PDF in the Data Inspector, you will find the min and max extents of the part you want.

Use a Creator to create a box with these extents, then use a SpatialFilter to keep only the features that fall inside that box.

 

 

 

Thank you for your quick response. This method has gotten me closer to my goal, but I'm now struggling to narrow the filtered result down to an attribute holding the contained text, which I could then write to an Excel or CSV file. Since the PDF contains over 100 pages, the filtered result is difficult to read in the Data Inspector because the data from all pages is drawn on top of itself.

stalknecht
Contributor
  • Contributor
  • July 22, 2018
amf88 wrote:

 

Thank you for your quick response. This method has gotten me closer to my goal, but I'm now struggling to take the filtered result and narrow it down to an attribute of the contained text I can write to an Excel or CSV file. Since the PDF contains over 100 pages, when I view the filtered result in Data Inspector it is difficult to read as all the data prints over itself.

Just add an Excel or CSV writer and voilà.

 

 


  • Author
  • July 22, 2018
stalknecht wrote:
Just add an excel or csv writer and voila

 

 

The resulting table does not have an attribute with the text contained in the filtered area and I'm not sure how to produce that.

 

 


  • Author
  • July 22, 2018
amf88 wrote:
The resulting table does not have an attribute with the text contained in the filtered area and I'm not sure how to produce that.

 

 

 

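I was able to pull these out by using an AttributeCreator to create a new attribute and setting its value to "fme_text_string". Writing that new attribute to the output table gave me the text from the filtered area. Thanks again for your help!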
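For anyone following along, here is a rough Python sketch of that last step: copying the text into a named attribute and writing it to a CSV. It is only an illustration of the logic, not FME code. The features are modeled as plain dictionaries, the `page` key and the `id` column name are hypothetical, and the only name taken from this thread is `fme_text_string`.

```python
import csv
import io


def add_id_attribute(feature, name="id"):
    """Return a copy of the feature with the text copied into a named attribute."""
    out = dict(feature)
    out[name] = feature.get("fme_text_string", "")
    return out


def write_ids_csv(features, stream, columns=("page", "id")):
    """Write only the chosen columns, one row per filtered feature."""
    writer = csv.DictWriter(
        stream,
        fieldnames=list(columns),
        extrasaction="ignore",  # skip keys that are not listed in columns
        lineterminator="\n",
    )
    writer.writeheader()
    for feature in features:
        writer.writerow(add_id_attribute(feature))


# Made-up filtered features, one per page.
filtered = [
    {"page": 1, "fme_text_string": "ID-0001"},
    {"page": 2, "fme_text_string": "ID-0002"},
]

buffer = io.StringIO()
write_ids_csv(filtered, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The sketch writes to an in-memory buffer so it runs anywhere; passing a file opened with `open("ids.csv", "w", newline="")` instead would produce a file that Excel can open.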


