Question

Read from PDF and rename file

March 27, 2020

jmhomza
Contributor

I need to take stacks of documents, upload them to SharePoint Online, and organize them in their respective Libraries and Doc Sets.

My idea is to read the contents of the 1st page of each scanned document, which is saved as a PDF, and then rename the file based on the text found on that page. Uploading those files into a Content Organizer, which sorts them based on file name, will take care of the SharePoint challenge.

Any ideas on how to read the content of a file and then rename the file based on that content?

thijsknapen
Contributor
March 29, 2020

I'm not sure whether it is possible in FME to read text from scanned PDF documents (i.e. find text in rasters).

If the PDF file is a 'non raster' document (i.e. it contains real text rather than just a scanned image), you can use an 'adobe geospatial PDF' reader to obtain 'text features', either per 'block'/'item' or per page. I think obtaining text features per page is most useful for this purpose. You could then use general test/search conditions to find the text you want, and use a filecopy transformer to copy the document to a new location with a new name (based on the content found).
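For anyone scripting this outside FME, the search-and-copy step described above could look roughly like the Python sketch below. It assumes the first-page text has already been extracted by some other means. The `Document No:` label, the `ID_PATTERN` regex, and the function names are all hypothetical, so adjust them to whatever actually appears on your first page.

```python
import re
import shutil
from pathlib import Path
from typing import Optional

# Hypothetical pattern: assumes the first page carries a label such as
# "Document No: AB-1234". Replace it with whatever your documents contain.
ID_PATTERN = re.compile(r"Document No[:.]?\s*(\S+)")


def name_from_text(page_text: str) -> Optional[str]:
    """Return a file-name-safe identifier found in the page text, or None."""
    match = ID_PATTERN.search(page_text)
    if match is None:
        return None
    # Replace characters that are risky in file names with underscores.
    return re.sub(r"[^A-Za-z0-9._-]", "_", match.group(1))


def copy_with_new_name(source: Path, page_text: str, target_dir: Path) -> Optional[Path]:
    """Copy source into target_dir, named after the identifier in page_text."""
    new_stem = name_from_text(page_text)
    if new_stem is None:
        return None  # nothing matched; leave the file for manual review
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{new_stem}{source.suffix}"
    shutil.copy2(source, target)  # copy rather than move, so the original stays put
    return target
```

Copying instead of renaming in place keeps the original scan untouched if the match turns out to be wrong.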

jakemolnar
March 30, 2020

If @jmhomza needs to get text from raster images (i.e. PDFs that are only scans, with no text), then they could try the TesseractCaller. It takes a bit of setup, since you need to download and install Tesseract OCR separately (FME can't ship it due to licensing), but it can recognize text in images.
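As a rough illustration of the OCR step, here is a Python sketch that calls the Tesseract command line directly rather than going through FME's TesseractCaller. It assumes both `tesseract` and Poppler's `pdftoppm` (used here to turn page 1 into an image) are installed and on the PATH. The function names are made up for this example.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List, Tuple


def first_page_commands(pdf: Path, image_base: Path) -> Tuple[List[str], List[str]]:
    """Build two command lines: rasterize page 1 of the PDF, then OCR the image."""
    # pdftoppm writes <image_base>.png when -png and -singlefile are given.
    rasterize = ["pdftoppm", "-f", "1", "-l", "1", "-r", "300",
                 "-png", "-singlefile", str(pdf), str(image_base)]
    # Passing "stdout" as the output base makes tesseract print the text.
    ocr = ["tesseract", f"{image_base}.png", "stdout"]
    return rasterize, ocr


def ocr_first_page(pdf: Path) -> str:
    """Return the OCR text of the first page of a scanned PDF."""
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} was not found on PATH")
    with tempfile.TemporaryDirectory() as tmp:
        rasterize, ocr = first_page_commands(pdf, Path(tmp) / "page")
        subprocess.run(rasterize, check=True)
        result = subprocess.run(ocr, check=True, capture_output=True, text=True)
    return result.stdout
```

The text returned by `ocr_first_page` can then be searched for the value you want in the new file name. OCR output is often noisy, so expect to tolerate stray characters when matching.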
