Question

The TesseractCaller only returns 75% of the image. Are there any parameters I can adjust to get all of it?

6 years ago
November 6, 2018
10 replies
77 views

fdharris13
4 replies

I am trying to extract text from a table in a PDF image using the TesseractCaller, but only some of the table values are being returned. The text has the same font and size throughout, and the contrast is uniform across the whole document. Does anyone have any hints or tricks to get Tesseract to return the complete table?

+50

redgeographics
Celebrity
3633 replies
6 years ago
November 9, 2018

Is this happening with every image?

+13

nampreetatsafe
Safer
384 replies
6 years ago
November 14, 2018

Hi @fdharris13:

Are you able to share your FME workspace and PDF with us?

fdharris13
Author
4 replies
6 years ago
November 14, 2018

To some degree it happens with every file. The attached example is one of the better outputs.

Marley Park Phase 4&5 Parcels 1A and 1B Address Map.pdf

MP3.fmw

+17

dmitribagh
Safer
104 replies
6 years ago
November 17, 2018

Hi @fdharris13

I tried your file and got a pretty good result. I believe I get 100% of the text from Tesseract.

Splitting into words does not work properly in some cases, but other than that, it's all there.

I found two problems -

1) Tesseract 3.05 didn't return any results for the hOCR option (maybe the command-line syntax changed; I didn't check that). My output was made with 3.02.

2) With the original size of the image, Tesseract loses some numbers - the first dozen, and then occasionally, it drops a number here and there (for example, it does not see #96).
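For anyone checking the hOCR output by hand, here is a minimal sketch of pulling words and their bounding boxes out of it. It relies only on the hOCR convention that each word is a `span` with class `ocrx_word` whose `title` carries a `bbox x0 y0 x1 y1` property. The class and function names are made up for illustration, and this is not what the TesseractCaller does internally.

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from ocrx_word spans in hOCR markup."""

    def __init__(self):
        super().__init__()
        self.words = []     # list of (text, (x0, y0, x1, y1))
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []     # text fragments seen inside that span

    @staticmethod
    def _parse_bbox(title):
        # The title holds ';'-separated properties, e.g. "bbox 36 92 96 116; x_wconf 95".
        for prop in title.split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                return tuple(int(p) for p in parts[1:])
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            # Words without a parsable bbox are skipped (bbox stays None).
            self._bbox = self._parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of (word, bbox) tuples found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Comparing the list of recognized words against the table makes it easier to see exactly which numbers were dropped.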

I reduced the image to 50% of its original size and got the picture above. This sounds counter-intuitive, but it helped. Why it helps is probably a question for the Tesseract developers; I can only guess.
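The post does not say how the resize was done. As a rough sketch of repeating the same experiment outside FME, the Python below halves an image before it is passed to OCR. Using the Pillow library is an assumption, and the function names are illustrative only.

```python
def scaled_size(width, height, factor=0.5):
    """Return the (width, height) after scaling, never smaller than 1 pixel."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return max(1, round(width * factor)), max(1, round(height * factor))


def downscale_image(src_path, dst_path, factor=0.5):
    """Write a resized copy of src_path to dst_path (requires Pillow)."""
    # Pillow is a third-party package (assumed here); it is imported lazily
    # so that scaled_size() can be used without it.
    from PIL import Image

    with Image.open(src_path) as img:
        new_size = scaled_size(img.width, img.height, factor)
        resized = img.resize(new_size, Image.LANCZOS)
        resized.save(dst_path)
    return dst_path
```

Trying a few factors (for example 0.75, 0.5 and 0.33) and comparing the number of recognized values is a cheap way to find a size that works for a given scan.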

I am attaching my workspace. Within the transformer, I made one more change - I checked the option 'Process Duplicate Suppliers' in FeatureMergers.

Tesseract.fmw

I hope this helps.

Dmitri

fdharris13
Author
4 replies
6 years ago
November 19, 2018

Thanks, Dmitri, that works. Tesseract 3.05 does have an issue with the hOCR option, but there is another Tesseract question that has a workaround. I used both of these answers in my workspace with great results.

gerardor
15 replies
5 years ago
September 10, 2019

Hi @fdharris13 and @dmitribagh,

I am currently trying to use the TesseractCaller to get text out of some TIFF files. I downloaded the workspace and the PDF you posted in this thread to see what output I get, but the TesseractCaller always ends up rejecting features and doesn't give any output. I recently installed the latest version of the TesseractCaller (version 5) and have upgraded to FME 2019. Any help or guidance would be appreciated.

Thank you,

Gerardo Rodriguez

+17

dmitribagh
Safer
104 replies
5 years ago
September 10, 2019

Hi @gerardor,

I just tried the workspace with the PDF attached to this thread, and everything works correctly. My guess is that Tesseract 5 has some new command-line syntax, which leads to the rejected features; the same thing happened with the migration from Tesseract 3 to 4. I haven't tried version 5 yet, but this is where I would begin the investigation. What error or message do you see in your log file?
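Since the behavior seems to depend on the Tesseract version, a quick way to see which major version is actually installed is to run `tesseract --version` and read the first number. The sketch below assumes the executable is on the PATH (or that its full path is passed in); the helper names are made up for illustration.

```python
import re
import subprocess


def parse_major_version(version_output):
    """Extract the major version from text such as 'tesseract 4.1.1' or 'tesseract v5.3.0'."""
    match = re.search(r"tesseract\s+v?(\d+)\.", version_output, re.IGNORECASE)
    return int(match.group(1)) if match else None


def installed_major_version(exe="tesseract"):
    """Run '<exe> --version' and return the major version, or None if unknown."""
    try:
        result = subprocess.run(
            [exe, "--version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some builds print the version banner to stderr, so check both streams.
    return parse_major_version(result.stdout + result.stderr)
```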

Dmitri

gerardor
15 replies
5 years ago
September 10, 2019

Hi @dmitribagh,

I'm not very familiar with the TesseractCaller, but I did see that if you "edit" it, a tab opens next to the Start and Main tabs at the top left of the canvas window. I ran the same workspace you just ran, and mine rejects the features at the AttributeFileReader (see the image below). I have also attached the log file.

23443-tesseractcaller

23443-tesseract log file.txt

+17

dmitribagh
Safer
104 replies
5 years ago
September 10, 2019

Hi @gerardor,

I see the following line in your log file:

TesseractCaller_2_SystemCaller: Failed to Execute `""C:\Users\gerardo.rodriguez\AppData\Local\Tesseract-OCR\tesseract.exe"

So basically, Tesseract never starts, and this is why nothing happens afterwards.

Can you run it from a command line?
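If opening a command prompt is awkward, the same check can be scripted. This is only a sketch: it tests whether the path from the log exists and whether the program starts when called with `--version`. The function name is made up, and whether such a script is allowed to run on a locked-down machine is an assumption.

```python
import os
import subprocess


def check_executable(path):
    """Return a short diagnosis of whether an executable exists and starts."""
    # Log lines often wrap the path in quotes; remove them before testing.
    cleaned = path.strip().strip('"')
    if not os.path.isfile(cleaned):
        return f"not found: {cleaned}"
    try:
        result = subprocess.run(
            [cleaned, "--version"], capture_output=True, text=True, timeout=30
        )
    except OSError as exc:
        # Raised, for example, when the file exists but cannot be executed.
        return f"could not start: {exc}"
    except subprocess.TimeoutExpired:
        return "started but did not exit within 30 seconds"
    return f"started, exit code {result.returncode}"
```

A "not found" result points to a wrong path in the transformer, while "could not start" suggests a permissions or installation problem.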

Dmitri

gerardor
15 replies
5 years ago
September 12, 2019

Hi @dmitribagh,

Unfortunately, the log file I attached earlier was from a different workspace in which I was trying to get Tesseract to work. The log file attached now really is from the FME workspace that @fdharris13 posted in this thread, and it still contains the same line you saw in the other log file. I know you mentioned that Tesseract never starts, but according to the log file the TesseractCaller does some work before it suddenly gives up and the message you quoted appears. The PNG file I attached above shows where it fails.

Unfortunately, I cannot run it from the command line because of city policies (an administrative password is required to use the command prompt). Are there any other suggestions or tests I can try to find out why Tesseract fails to execute?

22879-mp3 log file.txt
