Question

Read from PDF and rename file

  • 27 March 2020
  • 2 replies
  • 14 views

I need to take stacks of documents, upload them to SharePoint Online, and organize them into their respective Libraries and Doc Sets.

My idea is to read the contents of the 1st page of each scanned document, which is saved as a PDF, and then rename the file based on the text found on that page. Uploading those files into a Content Organizer, which routes them by file name, would solve the SharePoint side of the challenge.

Any ideas on how to read the content of a file and then rename the file based on that content?


2 replies

I'm not sure whether FME can read text from scanned PDF documents (i.e. find text in rasters).

If the PDF is not a raster document, you can use the Adobe Geospatial PDF reader to obtain text features, either per block/item or per page. I think text features per page are the most useful for this purpose. You could then use ordinary test/search conditions to find the text you want, and a File Copy writer to copy the document to a new location under a new name (based on the content found).
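Outside of FME, the same "find the text, then copy under a new name" logic could look roughly like the Python sketch below. It assumes the first-page text has already been extracted by some other step, and the document-ID pattern (something like `INV-2020-0042`) and both function names are made up for illustration; your documents will need their own pattern.

```python
import re
import shutil
from pathlib import Path

# Hypothetical pattern: an ID such as "INV-2020-0042" printed on page 1.
DOC_ID_PATTERN = re.compile(r"\b[A-Z]{2,5}-\d{4}-\d{3,6}\b")


def name_from_text(page_text, pattern=DOC_ID_PATTERN):
    """Return a file-name stem built from the first match, or None."""
    match = pattern.search(page_text)
    if match is None:
        return None
    # Replace anything that is risky in a file name with an underscore.
    return re.sub(r"[^A-Za-z0-9._-]+", "_", match.group(0))


def copy_with_new_name(source, dest_dir, page_text):
    """Copy `source` into `dest_dir`, named after the text found on page 1.

    Returns the new path, or None when no usable text was found
    (the source file is left untouched either way).
    """
    source = Path(source)
    stem = name_from_text(page_text)
    if stem is None:
        return None
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / f"{stem}{source.suffix}"
    shutil.copy2(source, target)
    return target
```

Copying instead of renaming in place keeps the original scan intact if the pattern matches the wrong text.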

If @jmhomza needs to get text from raster images (i.e. PDFs that are only scans, with no text layer), then they could try the TesseractCaller. It takes a bit of setup, since you need to download and install Tesseract OCR separately (FME can't ship it due to licensing), but it can recognize text in images.
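To check that the Tesseract install works before wiring it into a workspace, a small Python sketch like the one below can call the `tesseract` command line directly. It assumes `tesseract` is on the PATH and that the first page of the scan has already been exported as an image (Tesseract reads image files such as PNG or TIFF, not PDFs); the function names are made up for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the CLI call; 'stdout' as the output base prints the text."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on the PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract fails
    )
    return result.stdout
```

The text returned by `ocr_image` can then be searched for the value that should become the file name.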
