Question

How to read PDF inspection reports with one detail per line and transform those details into table format in FME?

3 years ago
March 30, 2022
3 replies
209 views

ppp19
27 replies

Dear Community,

I want to read a few CCTV pipe inspection reports. Each report contains details such as the city, asset number, inspection information, pipe information and defect information, each on a different line. I want to read these PDFs in FME and convert those details into table format. Is there a detailed tutorial that explains how to do this in FME?


ctredinnick
Supporter
222 replies
3 years ago
March 31, 2022

The first thing to understand is whether the PDFs contain images of the CCTV report, or whether the report is formatted as lines and text. That'll determine how you have to read it.

If they're images, you'll probably have to use OCR (TesseractCaller) to extract text from the image, figure out the right settings for it, and use the location of the text on the page to classify it into attributes. For example, the pipe ID/name and the from/to nodes will always be in the same location on the page.

If the text in the PDF is stored as text, the PDF reader will be able to pull it out. The features will have coordinates, so again use testers that look for text at certain coordinates on the page to set the report's attributes.
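The coordinate-testing idea can be sketched outside FME in plain Python. This is only an illustration of the logic, not an FME workspace: the `TextFeature` class, the region names and every coordinate below are hypothetical and would have to be measured from your own report layout.

```python
from dataclasses import dataclass


@dataclass
class TextFeature:
    """A piece of text with its position on the page (hypothetical structure)."""
    text: str
    x: float
    y: float
    page: int = 1


# Hypothetical page regions as (x_min, y_min, x_max, y_max) in page units.
# Real values depend entirely on the contractor's report layout.
REGIONS = {
    "city": (40, 740, 300, 760),
    "asset_number": (40, 710, 300, 730),
    "from_node": (320, 710, 450, 730),
    "to_node": (460, 710, 590, 730),
}


def classify(features, regions=REGIONS):
    """Assign each text feature to the first region that contains its point."""
    record = {}
    for f in features:
        for name, (x0, y0, x1, y1) in regions.items():
            if x0 <= f.x <= x1 and y0 <= f.y <= y1:
                # Join with a space if several text pieces fall in one region.
                record[name] = (record.get(name, "") + " " + f.text).strip()
                break
    return record


sample = [
    TextFeature("Springfield", 45, 750),
    TextFeature("P-1042", 45, 720),
    TextFeature("MH-17", 330, 720),
    TextFeature("MH-18", 470, 720),
    TextFeature("Page 1 of 3", 500, 20),  # outside every region, so ignored
]
record = classify(sample)
```

In a workspace the same tests would be done with transformers on the feature coordinates; the sketch just shows that each attribute is a simple bounding-box check.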

Complications arise where reports run over multiple pages, and the page layout could be different for each inspection contractor. I'm not sure of an easy way to read the defect information part.
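One possible way to handle reports that continue over several pages, sketched in plain Python: assume each page has already been parsed into a header value plus a list of defect rows, and carry the last seen asset number forward onto continuation pages. The dictionary keys, asset numbers and defect codes here are made up for illustration, and the assumption that continuation pages lack a header may not hold for every contractor's layout.

```python
def merge_pages(pages):
    """Combine per-page results into one flat table of defect rows.

    Each page is a dict with 'page', 'asset_number' (None on a
    continuation page) and 'defects' (a list of dicts).
    """
    inspections = {}
    current = None
    for p in sorted(pages, key=lambda p: p["page"]):
        if p.get("asset_number"):
            current = p["asset_number"]
        if current is None:
            continue  # continuation page before any header: skipped in this sketch
        inspections.setdefault(current, []).extend(p["defects"])
    # One output row per defect, with the asset number repeated on each row.
    return [
        {"asset_number": asset, **defect}
        for asset, defects in inspections.items()
        for defect in defects
    ]


pages = [
    {"page": 2, "asset_number": None, "defects": [{"distance_m": 12.4, "code": "CC"}]},
    {"page": 1, "asset_number": "P-1042", "defects": [{"distance_m": 3.1, "code": "RF"}]},
    {"page": 3, "asset_number": "P-2001", "defects": [{"distance_m": 0.8, "code": "DE"}]},
]
rows = merge_pages(pages)
```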

Also, in my experience, it's worth trying to get the digital data from the inspector producing the CCTV reports, if that's an option. It may be easier than trying to read the PDFs.


ppp19
Author
27 replies
3 years ago
March 31, 2022

Hi @ctredinnick,

Thanks for your reply on this.

(1) Yes, this CCTV report contains images as well as lines and text.

(2) Do you mean that I first have to read the PDF file in FME using the "Adobe Geospatial PDF Reader" and then use OCR (TesseractCaller)? Is there a tutorial or example showing how to use this OCR?

(3) How can I tell, just by looking at the PDF, whether its text is stored as text or in a table format? And how do I look at individual features once the PDF is read into FME?

(4) Yes, some of our reports run over multiple pages too... :(

Any initial help with the points above would be great.

Regards,


ctredinnick
Supporter
222 replies
3 years ago
March 31, 2022

Sorry, yes, you'd use the Geospatial PDF Reader in every case to bring the data into FME. Start with a single page. Several useful format attributes are added, such as fme_text_string and pdf_page_number. Text is represented as a point location carrying the text string and a font size. It'll be useful to use a GeometryFilter to get a feel for what all the parts of the PDF are. Text and raster are geometry types, so you can use it to find everything that's text and everything that's an image.
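As a rough illustration of what splitting by geometry type tells you, here is a plain-Python sketch (not FME code). Features are modelled as dictionaries, and the `fme_type` values in the sample data are assumptions for the example. A page that has raster features but no text features is a likely candidate for the OCR route.

```python
from collections import defaultdict


def split_by_geometry(features):
    """Group features by geometry type, like one output port per type."""
    groups = defaultdict(list)
    for f in features:
        groups[f.get("fme_type", "unknown")].append(f)
    return dict(groups)


def pages_needing_ocr(features):
    """Return page numbers that contain raster features but no text features."""
    text_pages = {f["pdf_page_number"] for f in features if f.get("fme_type") == "fme_text"}
    raster_pages = {f["pdf_page_number"] for f in features if f.get("fme_type") == "fme_raster"}
    return sorted(raster_pages - text_pages)


# Made-up sample: page 1 has real text, page 2 is a scanned image only.
sample = [
    {"fme_type": "fme_text", "fme_text_string": "Asset: P-1042", "pdf_page_number": 1},
    {"fme_type": "fme_line", "pdf_page_number": 1},
    {"fme_type": "fme_raster", "pdf_page_number": 1},
    {"fme_type": "fme_raster", "pdf_page_number": 2},
]
groups = split_by_geometry(sample)
ocr_pages = pages_needing_ocr(sample)
```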

If you need to go down the OCR route:

There's an old blog describing OCR here: https://www.safe.com/blog/2016/10/ocr-for-fme-now-i-know-my-abc/

I think TesseractCaller (v1) works with Tesseract version 3, and TesseractCaller (v2) works with Tesseract version 4. The download page for all Tesseract versions is here: https://digi.bib.uni-mannheim.de/tesseract/

The main parameter for the TesseractCaller is the output level: whether the output features represent words, lines, paragraphs or pages. It's not an exact science. You'll have to experiment to get it to work, then to work right, and possibly even modify how the TesseractCaller parses the output from Tesseract.
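If word-level output turns out to be the most reliable, the words can be regrouped into report lines by their vertical position. The sketch below is plain Python with made-up word positions; it assumes larger y values are higher on the page, and the `y_tolerance` value is a guess that would need tuning against real OCR output.

```python
def words_to_lines(words, y_tolerance=3.0):
    """Group word features into text lines by similar y, reading top to bottom."""
    lines = []
    for w in sorted(words, key=lambda w: (-w["y"], w["x"])):
        # Compare against the y of the first word that started the current line.
        if lines and abs(lines[-1]["y"] - w["y"]) <= y_tolerance:
            lines[-1]["words"].append(w)
        else:
            lines.append({"y": w["y"], "words": [w]})
    # Within each line, order the words left to right before joining.
    return [
        " ".join(w["text"] for w in sorted(line["words"], key=lambda w: w["x"]))
        for line in lines
    ]


words = [
    {"text": "Pipe", "x": 10, "y": 100.0},
    {"text": "ID:", "x": 40, "y": 101.0},
    {"text": "P-1042", "x": 70, "y": 99.5},
    {"text": "Material:", "x": 10, "y": 80.0},
    {"text": "PVC", "x": 75, "y": 80.5},
]
lines = words_to_lines(words)
```

Each resulting line can then be split on the label (for example at the colon) to get an attribute name and value for the table.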
