Question

The TesseractCaller only returns 75% of the image. Are there any parameters I can adjust to get all of it?



I am trying to extract text from a table in a PDF image using the TesseractCaller. But only part of the table values are being returned. The text for the document is the same font and size and the contrast is the same for the entire document. Does anyone have any hints or tricks to get tesseract to return the complete table?

10 replies

redgeographics
Celebrity

Is this happening with every image?


nampreetatsafe
Safer

Hi @fdharris13:

Are you able to share your FME workspace and PDF with us?


  • Author
  • November 14, 2018

To some degree it happens with every file. The attached example is one of the better outputs.

Marley Park Phase 4&5 Parcels 1A and 1B Address Map.pdf

MP3.fmw


dmitribagh
Safer
  • Safer
  • November 17, 2018

Hi @fdharris13

I tried your file and got a pretty good result. I believe I get 100%.

[Image: tesseract]

Splitting into words does not work properly in some cases, but other than that, it's all there.

 

I found two problems:

 

1) Tesseract 3.05 didn't return any results for the hOCR option (maybe the command-line syntax changed; I didn't check). My output was made with 3.02.

 

 

2) At the original image size, Tesseract loses some numbers: the first dozen, and then occasionally a number here and there (for example, it does not see #96).

 

 

I reduced the size of the image to 50% of the original and got the picture above. This sounds counter-intuitive, but it helped. Why is probably a question for the Tesseract developers; I can only guess.
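For anyone who wants to try the same 50% downscale outside FME before handing the image to Tesseract, here is a minimal Python sketch. It assumes the Pillow library is installed; `scaled_size` and `downscale_image` are illustrative names, not part of FME or Tesseract, and the 0.5 factor is simply the value that helped with this file.

```python
def scaled_size(size, factor=0.5):
    """Return (width, height) scaled by factor, never smaller than 1 pixel."""
    width, height = size
    return max(1, round(width * factor)), max(1, round(height * factor))


def downscale_image(src_path, dst_path, factor=0.5):
    """Write a resampled copy of src_path to dst_path.

    Assumption: Pillow is installed. The import is kept inside the
    function so the size helper above works without it.
    """
    from PIL import Image

    with Image.open(src_path) as img:
        resized = img.resize(scaled_size(img.size, factor), Image.LANCZOS)
        resized.save(dst_path)


# Example with placeholder file names:
# downscale_image("page.png", "page_half.png", factor=0.5)
```

Other factors may suit other scans better, so it is probably worth trying a few values and comparing the OCR output.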

 

 

I am attaching my workspace. Within the transformer, I made one more change: I checked the option 'Process Duplicate Suppliers' in the FeatureMergers.

 

 

Tesseract.fmw

I hope this helps.

 

 

Dmitri


  • Author
  • November 19, 2018

Thanks, Dmitri, that works. Tesseract 3.05 does have an issue with the hOCR option, but another Tesseract question has a workaround. I used both of these answers in my workspace with great results.
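For reference, a minimal Python sketch of asking Tesseract for hOCR output directly from a script. This is not the workaround from the other thread, only an illustration under stated assumptions: `build_hocr_command` and `run_hocr` are hypothetical helper names, and the paths are placeholders. The trailing `hocr` argument is the config name the Tesseract command line accepts for hOCR output. The output file extension appears to differ between releases (`.html` in older 3.x builds, `.hocr` in later ones), so it is worth checking which file actually appears.

```python
import subprocess


def build_hocr_command(tesseract_exe, image_path, output_base):
    """Return the argument list for a Tesseract run that writes hOCR.

    Tesseract's usage is: tesseract imagename outputbase [configfile...]
    and "hocr" is the config file that switches the output to hOCR.
    """
    return [tesseract_exe, image_path, output_base, "hocr"]


def run_hocr(tesseract_exe, image_path, output_base):
    """Run Tesseract and return (exit_code, stderr) so failures are visible."""
    cmd = build_hocr_command(tesseract_exe, image_path, output_base)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stderr


# Example with placeholder paths:
# code, err = run_hocr("tesseract", "page_half.png", "page_half")
# print(code, err)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.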


  • September 10, 2019

Hi @fdharris13 and @dmitribagh,

I am currently trying to use the TesseractCaller to get text out of some TIFF files. I downloaded the workspace you posted here in the forum, as well as the PDF, to see what output I get, but the TesseractCaller always ends up rejecting features and doesn't produce any output. I recently installed the latest version of TesseractCaller, version 5, and have upgraded to FME 2019. Any help or guidance would be appreciated.

Thank you,

Gerardo Rodriguez


dmitribagh
Safer
  • Safer
  • September 10, 2019
Hi @gerardor,

I just tried the workspace with the PDF attached to this thread, and everything works correctly. My guess is that Tesseract 5 has some new command-line syntax, which leads to the rejected features; the same thing happened with the migration from Tesseract 3 to 4. I haven't tried v. 5 yet, but this is where I would begin investigating. What error or message do you see in your log file?

 

Dmitri

 


  • September 10, 2019
Hi @dmitribagh,

I'm not very familiar with the TesseractCaller, but I did see that if you "edit" it, a tab opens up next to the Start and Main tabs at the top left of the canvas window. I ran the same workspace you just ran, and mine rejects the features in the AttributeFileReader (see image below). I have also attached the log file.

23443-tesseractcaller

23443-tesseract log file.txt

 


dmitribagh
Safer
  • Safer
  • September 10, 2019
Hi @gerardor, 

I see the following line in your log file:

TesseractCaller_2_SystemCaller: Failed to Execute `""C:\Users\gerardo.rodriguez\AppData\Local\Tesseract-OCR\tesseract.exe

So basically, Tesseract never starts, and this is why nothing happens afterwards.
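A small script can pull lines like this out of a long FME log. The sketch below is only an illustration: the pattern is based on the single log line quoted above, and `find_failed_commands` is a hypothetical helper, not an FME function.

```python
import re

# Matches the "Failed to Execute" message and captures what follows,
# skipping the backtick and double quotes that precede the command.
FAILED_EXEC = re.compile(r'Failed to Execute\s*`?"*(?P<command>.*)')


def find_failed_commands(log_lines):
    """Return the command text from every 'Failed to Execute' log line."""
    commands = []
    for line in log_lines:
        match = FAILED_EXEC.search(line)
        if match:
            commands.append(match.group("command").strip())
    return commands


# Example with a placeholder log file name:
# with open("workspace.log", encoding="utf-8", errors="replace") as log:
#     for command in find_failed_commands(log):
#         print(command)
```

If the list is non-empty, the executable path it shows is the first thing to verify.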

 

Can you run it from a command line?
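Where a command prompt is not convenient, the same check can be sketched in Python. This is a hedged example: `check_tesseract` is an illustrative name, and it only asks the executable to print its version via `--version`, which Tesseract supports. It does not reproduce what the TesseractCaller runs.

```python
import os
import shutil
import subprocess


def check_tesseract(exe_path="tesseract"):
    """Return (ok, message) describing whether the executable can start."""
    # Accept either a full path to the file or a name to look up on PATH.
    resolved = exe_path if os.path.isfile(exe_path) else shutil.which(exe_path)
    if resolved is None:
        return False, f"not found: {exe_path}"
    try:
        result = subprocess.run(
            [resolved, "--version"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except OSError as exc:
        return False, f"could not start {resolved}: {exc}"
    except subprocess.TimeoutExpired:
        return False, f"timed out waiting for {resolved}"
    # Some releases print the version banner on stderr, others on stdout.
    banner = (result.stdout or result.stderr).strip()
    first_line = banner.splitlines()[0] if banner else ""
    return result.returncode == 0, first_line


# Example with a placeholder path:
# print(check_tesseract(r"C:\path\to\tesseract.exe"))
```

A "not found" or "could not start" result points at the installation or the path rather than at the workspace.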

Dmitri 

 

 


  • September 12, 2019
Hi @dmitribagh,

Unfortunately, the log file I attached earlier was from a different workspace where I was trying to get Tesseract to work. The log file attached now really is from the FME workspace @fdharris13 attached in this forum, and it still has the same line you saw in the other log file. I know you mentioned that Tesseract never starts, but according to the log file Tesseract is doing something, then suddenly gives up, and the message you quoted above appears. The PNG file I attached above shows where Tesseract fails.

Unfortunately, I cannot run it from the command line due to city policies (an administrative password is required to use the command prompt). Are there any other suggestions or tests I can try to see what is causing Tesseract to fail to execute?

 

22879-mp3 log file.txt

