I am trying to extract text from a table in a PDF image using the TesseractCaller, but only part of the table values are returned. The text uses the same font and size throughout, and the contrast is the same across the entire document. Does anyone have any hints or tricks to get Tesseract to return the complete table?
Is this happening with every image?
Hi @fdharris13:
Are you able to share your FME workspace and PDF with us?
To some degree it happens with every file. The attached example is one of the better outputs.
Marley Park Phase 4&5 Parcels 1A and 1B Address Map.pdf
Hi @fdharris13
I tried your file and got a pretty good result. I believe I get 100% of the values.
Splitting into words does not work properly in some cases, but other than that, it's all there.
I found two problems -
1) Tesseract 3.05 didn't return any results for the hOCR option (maybe the syntax of the command line changed; I didn't check that). My output was made with 3.02.
2) With the original size of the image, Tesseract loses some numbers: the first dozen, and then occasionally it drops a number here and there (for example, it does not see #96).
I reduced the size of the image to 50% of the original and got the picture above. This sounds counter-intuitive, but it helped. Why it helped is probably a question for the Tesseract developers; I can only guess.
I am attaching my workspace. Within the transformer, I made one more change - I checked the option 'Process Duplicate Suppliers' in FeatureMergers.
I hope this helps.
Dmitri
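For anyone who wants to try the 50% reduction outside FME before changing a workspace, here is a minimal sketch in Python. It assumes the third-party Pillow library is installed and that the PDF page has already been exported to a raster image; the function names `scaled_size` and `resize_for_ocr` are made up for illustration and are not part of FME or Tesseract.

```python
def scaled_size(width, height, factor=0.5):
    """Return the (width, height) of an image scaled by `factor`, at least 1x1."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return max(1, round(width * factor)), max(1, round(height * factor))


def resize_for_ocr(src_path, dst_path, factor=0.5):
    """Write a scaled copy of the image at src_path to dst_path.

    Assumption: Pillow is available. It is imported here so that
    scaled_size() can still be used without it.
    """
    from PIL import Image  # third-party: pip install Pillow

    with Image.open(src_path) as img:
        new_size = scaled_size(img.width, img.height, factor)
        # LANCZOS is a high-quality downsampling filter; other filters may
        # give different OCR results, so this choice is only a starting point.
        img.resize(new_size, Image.LANCZOS).save(dst_path)
    return new_size
```

The 0.5 default simply mirrors the 50% that worked in this thread; other scanned documents may need a different factor, so it is worth trying a few values and comparing the OCR output.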
Thanks, Dmitri, that works. Tesseract 3.05 does have an issue with the hOCR option, but there is another Tesseract question that has a workaround. I used both of these answers in my workspace with great results.
Hi @fdharris13 and @dmitribagh,
I am currently trying to use the TesseractCaller to get text out of some TIFF files. I downloaded the workspace you posted here in the forum, as well as the PDF, to see what output I get, but the TesseractCaller always ends up rejecting features and doesn't give any output. I recently installed the latest version of Tesseract (version 5) and have upgraded to FME 2019. Any help or guidance would be appreciated.
Thank you,
Gerardo Rodriguez
Hi @gerardor,
I just tried the workspace with the PDF attached to this thread, and everything works correctly. My guess is that Tesseract 5 has some new command line syntax, which leads to rejected features. The same thing happened with the migration from Tesseract 3 to 4. I haven't tried v. 5 yet, but this is where I would begin investigating. What error or message do you see in your log file?
Dmitri
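To compare command line syntax between Tesseract versions, it can help to write out the exact hOCR invocations and try each one by hand. The sketch below only builds the argument lists and does not run anything. It relies on the general usage `tesseract imagename outputbase [options...] [configfile...]`, on the `hocr` config file shipped with Tesseract, and on the `tessedit_create_hocr` variable. Which form the TesseractCaller actually issues is not known here, and whether a given Tesseract version accepts both is an assumption to verify. The function name `hocr_commands` is hypothetical.

```python
def hocr_commands(exe, image, outbase):
    """Build two candidate hOCR command lines as argument lists.

    Nothing is executed; the lists can be passed to subprocess.run()
    or joined into a string for a manual test.
    """
    return {
        # Form 1: name the bundled 'hocr' config file after the output base.
        "config_file": [exe, image, outbase, "hocr"],
        # Form 2: set the config variable directly with -c.
        "config_var": [exe, image, outbase, "-c", "tessedit_create_hocr=1"],
    }


if __name__ == "__main__":
    for name, cmd in hocr_commands("tesseract", "page.tif", "page_out").items():
        print(name, "->", " ".join(cmd))
```

Running each form against the same image with the old and the new Tesseract install should show quickly whether a syntax change is what causes the rejected features.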
Hi @dmitribagh,
I'm not very familiar with the TesseractCaller, but I did see that if you choose "edit", a tab opens next to the Start and Main tabs at the top left of the canvas window. I ran the same workspace you just ran, and mine rejects the features at the AttributeFileReader (see image below). I have also attached the log file.
Hi @gerardor,
I see the following line in your log file:
TesseractCaller_2_SystemCaller: Failed to Execute `""C:\Users\gerardo.rodriguez\AppData\Local\Tesseract-OCR\tesseract.exe"
So basically, Tesseract never starts, and this is why nothing happens afterwards.
Can you run it from a command line?
Dmitri
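If a command prompt is not handy, the same check can be sketched in Python: confirm that the executable exists at the logged path and that it starts at all. This is only a diagnostic sketch, not something the TesseractCaller does itself. The function name `check_tesseract` is made up, and the path in the usage example is copied from the log line above.

```python
import os
import subprocess


def check_tesseract(exe_path):
    """Return (ok, message) describing whether the Tesseract executable starts."""
    if not os.path.isfile(exe_path):
        return False, "file not found: " + exe_path
    try:
        result = subprocess.run(
            [exe_path, "--version"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except OSError as exc:
        # Raised when the file exists but cannot be launched
        # (permissions, wrong architecture, blocked by policy, ...).
        return False, "could not start: " + str(exc)
    except subprocess.TimeoutExpired:
        return False, "started but did not finish within 30 seconds"
    # Some Tesseract versions print the version banner to stderr.
    output = (result.stdout or result.stderr).strip()
    return result.returncode == 0, output


if __name__ == "__main__":
    path = r"C:\Users\gerardo.rodriguez\AppData\Local\Tesseract-OCR\tesseract.exe"
    print(check_tesseract(path))
```

A "file not found" result points at the install location or the path configured in the transformer, while "could not start" suggests the executable is present but blocked or broken.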
Hi @dmitribagh,
Unfortunately, the log file I attached earlier was from a different workspace where I was also trying to get Tesseract to work. The log file attached now really is from the FME workspace @fdharris13 attached in this forum, and it still contains the same line you saw in the other log file. I know you mentioned that Tesseract never starts, but according to the log file it is doing something and then suddenly gives up, at which point the message you quoted above appears. The PNG file I attached above shows where Tesseract fails.
Unfortunately, I cannot run it from the command line due to city policies (an administrative password is required to use the command prompt). Are there any other suggestions or tests I can try to find out why Tesseract fails to execute?