Question

Read from PDF and rename file


jmhomza
Contributor

I need to take stacks of documents, upload them to SharePoint Online, and organize them into their respective Libraries and Doc Sets.

My idea is to read the contents of the first page of each scanned document, which is saved as a PDF, and then rename the file based on the text found on that page. Uploading the renamed files into a Content Organizer, which works from the file name, would then take care of the SharePoint side.

Any ideas on how to read the content of a file and then rename the file based on that content?

2 replies

thijsknapen
Contributor
  • March 29, 2020

I'm not sure whether it is possible in FME to read text from scanned PDF documents (i.e. find text in rasters).

If the PDF file is a 'non raster' document, you can use an 'Adobe Geospatial PDF' reader to obtain 'text features', either per 'block'/'item' or per page. I think obtaining text features per page is most useful for this purpose. You could then use general test/search conditions to find the text you want, and use a filecopy transformer to copy the document to a new location with a new name (based on the content found).
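For anyone who wants to see the read, search and copy steps spelled out, here is a minimal sketch in plain Python rather than in an FME workspace. It is not the FME approach itself: it assumes the third-party `pypdf` package for the text layer, and the `Invoice No.` pattern and all function names are hypothetical placeholders to swap for whatever appears on your first page.

```python
import re
import shutil
from pathlib import Path


def first_page_text(pdf_path):
    """Return the text layer of page 1 (empty string if the page has none)."""
    # Assumption: pypdf is installed. Imported lazily so the helpers
    # below can be used without it.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    return reader.pages[0].extract_text() or ""


def name_from_text(text, pattern=r"Invoice\s+No\.?\s*(\w+)"):
    """Build a file-name-safe stem from the first match, or None if no match.

    The default pattern is a hypothetical example, not something from the
    original documents.
    """
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if match is None:
        return None
    # Keep letters, digits, dash and underscore; replace anything else.
    stem = re.sub(r"[^A-Za-z0-9_-]+", "_", match.group(1)).strip("_")
    return stem or None


def copy_renamed(pdf_path, out_dir, stem):
    """Copy the PDF into out_dir as <stem>.pdf without overwriting anything."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{stem}.pdf"
    counter = 1
    while target.exists():
        # Two documents produced the same name: add a numeric suffix.
        target = out_dir / f"{stem}_{counter}.pdf"
        counter += 1
    shutil.copy2(pdf_path, target)
    return target
```

Typical use would be `stem = name_from_text(first_page_text(pdf))` followed by `copy_renamed(pdf, "renamed", stem)` when `stem` is not `None`. Copying instead of renaming in place keeps the original scan untouched if the search picks up the wrong text.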


jakemolnar

If @jmhomza needs to get text from raster images (i.e. PDFs that are only scans, with no text), then they could try the TesseractCaller. It takes a bit of setup, since you need to download and install Tesseract OCR separately (FME can't ship it due to licensing), but it can recognize text in images.
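To get a feel for what OCR returns before wiring up the TesseractCaller, a rough sketch like the one below calls the Tesseract command-line tool directly from Python. It is a stand-in, not how the transformer works internally. It assumes `tesseract` is on the PATH, and also Poppler's `pdftoppm` to turn page 1 of the scan into an image, since Tesseract reads images rather than PDFs. The function names are made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf_path, out_prefix, dpi=300):
    """pdftoppm command that renders only page 1 to <out_prefix>.png."""
    return ["pdftoppm", "-f", "1", "-l", "1", "-r", str(dpi),
            "-png", "-singlefile", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, lang="eng"):
    """tesseract command that prints the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_first_page(pdf_path):
    """Rasterize page 1 of a scanned PDF and return the raw OCR text.

    Assumption: both pdftoppm and tesseract are installed and on the PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page1"
        subprocess.run(rasterize_cmd(pdf_path, prefix), check=True)
        result = subprocess.run(ocr_cmd(prefix.with_suffix(".png")),
                                check=True, capture_output=True, text=True)
    return result.stdout


def tidy_ocr_text(text):
    """Collapse runs of whitespace and drop empty lines from OCR output."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

OCR output tends to be noisy, so passing it through something like `tidy_ocr_text` before searching for the text that drives the new file name makes the match less fragile.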

