I am trying to extract text from a table in a PDF image using the TesseractCaller. But only part of the table values are being returned. The text for the document is the same font and size and the contrast is the same for the entire document. Does anyone have any hints or tricks to get tesseract to return the complete table?
The TesseractCaller only returns 75% of the image. Are there any parameters I can adjust to get all of it?
- November 6, 2018
- 10 replies
- 81 views
10 replies
- Celebrity
- 3668 replies
- November 9, 2018
Is this happening with every image?
- Safer
- 383 replies
- November 14, 2018
Hi @fdharris13:
Are you able to share your FME workspace and PDF with us?
- Author
- 4 replies
- November 14, 2018
To some degree it happens to every file. The attached example (Marley Park Phase 4&5 Parcels 1A and 1B Address Map.pdf) is one of the better outputs.
- Safer
- 105 replies
- November 17, 2018
Hi @fdharris13
I tried your file and got a pretty good result; I believe I get 100% of the text.
Splitting into words does not work properly in some cases, but other than that, it's all there.
I found two problems:
1) Tesseract 3.05 didn't return any results for the hOCR option (maybe the command line syntax changed; I didn't check). My output was made with 3.02.
2) At the original image size, Tesseract loses some numbers: the first dozen, and then occasionally it drops a number here and there (for example, it does not see #96).
I reduced the image to 50% of its original size and got the picture above. This sounds counter-intuitive, but it helped. Why it helps is probably a question for the Tesseract developers; I can only guess.
I am attaching my workspace. Within the transformer, I made one more change: I checked the option 'Process Duplicate Suppliers' in the FeatureMergers.
I hope this helps.
Dmitri
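For anyone who wants to try the 50% reduction outside FME before running OCR, here is a minimal Python sketch. It is not taken from the attached workspace: the function names are made up for illustration, and the resize step assumes the third-party Pillow library is installed. The 0.5 default simply mirrors the factor mentioned above and may need tuning for other scans.

```python
from __future__ import annotations

from pathlib import Path


def scaled_size(width: int, height: int, factor: float = 0.5) -> tuple[int, int]:
    """Return the pixel dimensions after scaling, never smaller than 1x1."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return max(1, round(width * factor)), max(1, round(height * factor))


def downscale_image(src: Path, dst: Path, factor: float = 0.5) -> tuple[int, int]:
    """Write a resized copy of src to dst and return the new size.

    Pillow is imported lazily so scaled_size() can be used without it.
    """
    from PIL import Image  # assumption: Pillow is installed (pip install Pillow)

    with Image.open(src) as img:
        new_size = scaled_size(img.width, img.height, factor)
        # LANCZOS resampling tends to keep small printed digits legible.
        img.resize(new_size, Image.LANCZOS).save(dst)
    return new_size
```

Running the OCR on both the original and the reduced copy and comparing the results is a cheap way to check whether image size is the cause of the dropped numbers.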
- Author
- 4 replies
- November 19, 2018
Thanks, Dmitri, that works. Tesseract 3.05 does have an issue with the hOCR option, but there is another Tesseract question that has a workaround. I used both of these answers in my workspace with great results.
- 15 replies
- September 10, 2019
Hi @fdharris13 and @dmitribagh,
I am currently trying to use the TesseractCaller to get text out of some TIFF files. I downloaded the workspace you posted here in the forum, as well as the PDF, to see what output I get, but the TesseractCaller always ends up rejecting features and doesn't produce any output. I recently installed the latest version of Tesseract (version 5) and have upgraded to FME 2019. Any help or guidance would be appreciated.
Thank you,
Gerardo Rodriguez
- Safer
- 105 replies
- September 10, 2019
Hi @gerardor,
I just tried the workspace with the PDF attached to this thread, and everything works correctly. My guess is that Tesseract 5 has some new command line syntax, which leads to the rejected features; the same thing happened with the migration from Tesseract 3 to 4. I haven't tried v5 yet, but this is where I would begin investigating. What error or message do you see in your log file?
Dmitri
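As one concrete example of a syntax change between major versions: Tesseract 3.x accepted the page segmentation option as `-psm`, while 4.x and later expect `--psm`. The Python sketch below picks the spelling from the text printed by `tesseract --version`. Whether the TesseractCaller passes this particular option is an assumption, the function names are hypothetical, and the version strings in the usage note are made-up samples rather than output from this thread.

```python
import re


def parse_major_version(version_output: str) -> int:
    """Extract the major version from text such as 'tesseract 4.1.1'."""
    # Some builds print a leading 'v' before the number, so allow it.
    match = re.search(r"tesseract\s+v?(\d+)\.", version_output)
    if match is None:
        raise ValueError("no Tesseract version found in the given text")
    return int(match.group(1))


def page_seg_flag(major: int) -> str:
    """Return the page segmentation flag spelling for a major version."""
    return "--psm" if major >= 4 else "-psm"
```

For example, `page_seg_flag(parse_major_version("tesseract 3.05.01"))` gives `-psm`, while a string starting with `tesseract v5.0.0` gives `--psm`. Comparing the arguments shown in the FME log against the spelling the installed version expects is one way to confirm or rule out this guess.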
- 15 replies
- September 10, 2019
Hi @dmitribagh,
I'm not very familiar with the TesseractCaller, but I did see that if you "edit" it, a tab opens up next to the Start and Main tabs at the top left of the canvas window. I ran the same workspace you just ran, and mine rejects the features at the AttributeFileReader (see image below). I also attach the log file.
- Safer
- 105 replies
- September 10, 2019
Hi @gerardor,
I see the following line in your log file:
TesseractCaller_2_SystemCaller: Failed to Execute `""C:\Users\gerardo.rodriguez\AppData\Local\Tesseract-OCR\tesseract.exe"
So basically, Tesseract never starts, and this is why nothing happens afterwards.
Can you run it from a command line?
Dmitri
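If a command prompt is not available, the same check can be approximated from Python, for instance in a small script or a PythonCaller. This is only a sketch: `probe_executable` is a hypothetical helper, and a locked-down machine may block process launches from Python as well. It distinguishes a missing file from an executable that exists but cannot be started.

```python
import os
import subprocess


def probe_executable(exe_path: str, timeout: float = 15.0) -> str:
    """Classify whether a program can be launched with '--version'.

    Returns 'missing', 'blocked', 'failed' or 'ok'.
    """
    if not os.path.isfile(exe_path):
        return "missing"  # wrong path, or the program is not installed there
    try:
        result = subprocess.run(
            [exe_path, "--version"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except PermissionError:
        return "blocked"  # the file exists but the OS refused to start it
    except (OSError, subprocess.TimeoutExpired):
        return "failed"  # could not start, or did not finish in time
    return "ok" if result.returncode == 0 else "failed"
```

Calling it with the full path to tesseract.exe from the log line tells you which case applies: 'missing' points to the path configured in the transformer, while 'blocked' or 'failed' points to permissions or the installation itself.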
- 15 replies
- September 12, 2019
Hi @dmitribagh,
Unfortunately, the log file I attached earlier was from a different workspace where I was trying to get Tesseract to work. The log file attached now really is from the FME workspace @fdharris13 attached in this thread, and it still has the same line you saw in the other log file. I know you mentioned that Tesseract never starts, but according to the log file Tesseract is doing something and then suddenly gives up, at which point the message you quoted above appears. The PNG file I attached above shows where Tesseract fails.
Unfortunately, I cannot run it from the command line due to city policies (an administrative password is required to use the command prompt). Are there any other suggestions or tests I can try to see what is causing Tesseract to fail to execute?