The first thing to understand is whether the PDFs contain images of the CCTV report, or whether the report is formatted as lines and text. That'll determine how you have to read it.
If they're images, then you'll probably have to use OCR (TesseractCaller) to extract text from the image, figure out the right settings for that, and use the location of the text on the page to classify it into attributes. E.g. the pipe ID/name and the from/to nodes should always be in the same location relative to the page.
If the text in the PDF is stored as text, the PDF reader will be able to pull the text out. The features will have coordinates, so again use Testers to look for text at certain coordinates on the page and set the report's attributes from it.
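To make the "text at certain coordinates" idea concrete, here's a minimal sketch in plain Python (for example, the kind of logic you might put in a PythonCaller, or just use to prototype outside FME). The region boxes, attribute names and sample values are all made up for illustration; you'd measure the real ones from your own reports.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


@dataclass
class TextItem:
    """One piece of text with the coordinates of its insertion point."""
    text: str
    x: float
    y: float


# Hypothetical layout: attribute name -> box on the page, in page units.
# These numbers are placeholders, not taken from any real report.
REGIONS: Dict[str, Box] = {
    "pipe_id": (50, 700, 200, 720),
    "from_node": (50, 670, 200, 690),
    "to_node": (250, 670, 400, 690),
}


def classify(item: TextItem, regions: Dict[str, Box]) -> Optional[str]:
    """Return the name of the first region containing the item, else None."""
    for name, (min_x, min_y, max_x, max_y) in regions.items():
        if min_x <= item.x <= max_x and min_y <= item.y <= max_y:
            return name
    return None


def extract_attributes(items: List[TextItem], regions: Dict[str, Box]) -> Dict[str, str]:
    """Join the text found in each region, reading left to right."""
    found: Dict[str, List[str]] = {}
    for item in sorted(items, key=lambda i: i.x):
        name = classify(item, regions)
        if name is not None:
            found.setdefault(name, []).append(item.text)
    return {name: " ".join(parts) for name, parts in found.items()}
```

Text that falls outside every box (page footers, logos and so on) is simply ignored. If each contractor has a different layout, one option would be to keep a separate region dictionary per contractor.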
Complications will be where reports go over multiple pages, and the page layout could be different for each inspection contractor. I'm not sure of an easy way to read the defect information part.
It's also worth trying to get the digital data from the inspector producing the CCTV reports, if that's an option. It may be easier than trying to read the PDFs.
Hi @ctredinnick,
Thanks for your reply on this.
(1) Yes, this CCTV report contains images and lines and text.
(2) Do you mean that I first have to read the PDF file in FME using the "Adobe Geospatial PDF Reader" and then use OCR (TesseractCaller)? Is there a tutorial or example on how to use this OCR?
(3) How can I tell, just by looking at the PDF, whether the text is stored as text or in a table format? And how do I look at the individual features when reading a PDF in FME?
(4) Yes, we have these reports in multiple pages too... :(
Any initial help with the points above would be great.
Regards,
Sorry, yes, you'd use the Geospatial PDF Reader in every case to bring the data into FME. Start with a single page. Several useful format attributes will be added, like fme_text_string and pdf_page_number. Text is represented as a point location with the text string and a font size attached. It's useful to run the features through a geometry filter to get a feel for what all the parts of the PDF are. Text and raster are geometry types, so this lets you separate everything that's text from everything that's an image.
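As a rough illustration of what that split tells you, here's a small Python sketch that counts geometry types per page and flags pages that look image-only. It uses plain dictionaries as stand-ins for FME features, and it assumes the types show up as fme_type values of "fme_text" and "fme_raster"; check the actual values in your own data before relying on that.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Mapping


def summarize_pages(features: Iterable[Mapping]) -> Dict[int, Counter]:
    """Count features of each geometry type per page."""
    pages: Dict[int, Counter] = defaultdict(Counter)
    for feature in features:
        pages[feature["pdf_page_number"]][feature["fme_type"]] += 1
    return pages


def needs_ocr(type_counts: Mapping[str, int]) -> bool:
    """Guess that a page needs OCR if it has raster content but no text.

    This is a heuristic only: a page can mix real text with scanned images.
    """
    return type_counts.get("fme_raster", 0) > 0 and type_counts.get("fme_text", 0) == 0
```

A page with text features can go straight to the coordinate checks; a page with only raster features is a candidate for the OCR route.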
If you need to go down the OCR route:
There's an old blog post describing OCR here: https://www.safe.com/blog/2016/10/ocr-for-fme-now-i-know-my-abc/
I think TesseractCaller (v1) works with Tesseract version 3, and TesseractCaller (v2) works with Tesseract version 4. The download page for all Tesseract versions is here: https://digi.bib.uni-mannheim.de/tesseract/
The main parameter for the TesseractCaller is the output level, i.e. whether the output features represent words, lines, paragraphs or pages. It's not an exact science: you'll have to experiment to get it to work, then to get it to work right, and possibly even modify how the TesseractCaller parses the output from Tesseract.
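To show what the word/line/paragraph/page levels look like in practice, here's a sketch that works on Tesseract's own TSV output rather than on the TesseractCaller (I'm not assuming anything about how the transformer parses it internally). With Tesseract 4 you can get TSV from the command line with something like `tesseract page.png stdout tsv`. Each row has a level column, where 5 is a word, and the block/paragraph/line numbers let you regroup words into lines. The function name is my own.

```python
import csv
import io
from typing import Dict, List


def words_to_lines(tsv_text: str) -> List[Dict]:
    """Group word-level rows of Tesseract TSV output into text lines.

    Positions are in image pixels, measured from the top-left corner.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    lines: Dict[tuple, Dict] = {}
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; other levels are containers with no text.
        if row["level"] != "5" or not text:
            continue
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        left, top = int(row["left"]), int(row["top"])
        entry = lines.setdefault(key, {"words": [], "left": left, "top": top})
        entry["words"].append(text)
        entry["left"] = min(entry["left"], left)
        entry["top"] = min(entry["top"], top)
    # Dictionaries keep insertion order, so lines come out in reading order.
    return [
        {"text": " ".join(e["words"]), "left": e["left"], "top": e["top"]}
        for e in lines.values()
    ]
```

Each returned line keeps the left/top position of its first word, so the same "where is it on the page" checks used for native PDF text can be applied to OCR results too.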