Solved

How would I read a table from multiple PDFs if the tables differ and the files are not tagged?

September 2, 2021
7 replies
155 views

juandiegoboh

Hello, I am trying to create a workspace that batch-extracts the information contained in tables within some PDF files. The PDFs also contain other information as plain text, so I am a little confused about how to achieve that. I am also unsure whether it will work with different tables: the dataset was made by multiple authors, so the table in each PDF is different. I am wondering if it would be better to use an external application or a PythonCaller, but I would like to confirm before trying that out. I am attaching a sample PDF from my dataset so you can see the table in it. Thank you!

Best answer by geomancer

Hi @juandiegoboh , as I had a slow day at the office today, I decided to take on this challenge. And with some success!

pdf2d2none

I couldn't have done this without the basics provided by @hkingsbury.

There were several hurdles to be taken along the way:

Recreate the cells of the table;
Create a single text from the multiple texts in one cell;
Think of a way to handle the merged cells;
Think of a way to create the attributes to write the texts to.
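To make the second hurdle concrete, here is a minimal Python sketch of one way to combine several text fragments into a single cell text. It is not taken from the attached workspace. It assumes each fragment is available as `(x, y, text)` with y increasing upward (as is usual in PDF page coordinates) and each cell as a box `(xmin, ymin, xmax, ymax)`; the function names and the `line_tol` value are hypothetical.

```python
def texts_in_cell(cell, fragments):
    """Return the fragments whose anchor point lies inside the cell box."""
    xmin, ymin, xmax, ymax = cell
    return [f for f in fragments if xmin <= f[0] <= xmax and ymin <= f[1] <= ymax]


def merge_fragments(fragments, line_tol=2.0):
    """Join (x, y, text) fragments in reading order: top to bottom, left to right."""
    lines = []  # each entry: [y of the line, [(x, text), ...]]
    # Largest y first, because the top of the page has the highest y value.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= line_tol:
            lines[-1][1].append((x, text))  # same text line as the previous fragment
        else:
            lines.append([y, [(x, text)]])  # start a new text line
    return " ".join(
        " ".join(t for _, t in sorted(parts)) for _, parts in lines
    )


if __name__ == "__main__":
    cell = (0, 30, 25, 60)
    fragments = [(12, 50, "depth"), (2, 50, "Sample"), (2, 40, "(m)"), (30, 5, "outside")]
    print(merge_fragments(texts_in_cell(cell, fragments)))
```

The tolerance decides when two fragments count as being on the same text line, so it would need tuning to the font size of the PDFs at hand.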

These challenges led to many wrong turns and dead ends along the way.

For merged cells, the text is written only into the attribute representing the topmost, leftmost cell; the other attributes are left empty.
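The merged-cell rule can be sketched in a few lines of Python. This is only an illustration of the rule as described, not the logic of the workspace itself. It assumes every cell is already known as `(row, col, rowspan, colspan, text)` with rows counted from the top; `fill_grid` is a hypothetical name.

```python
def fill_grid(cells, n_rows, n_cols):
    """Build an n_rows x n_cols grid of strings from (row, col, rowspan, colspan, text).

    A merged cell contributes its text at its top-left position only. The spans
    are deliberately ignored, so the other positions it covers stay empty.
    """
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for row, col, _rowspan, _colspan, text in cells:
        grid[row][col] = text
    return grid


if __name__ == "__main__":
    cells = [
        (0, 0, 1, 1, "Id"),
        (0, 1, 1, 1, "Name"),
        (1, 0, 1, 2, "merged"),  # spans both columns of the second row
    ]
    for line in fill_grid(cells, 2, 2):
        print(line)
```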

Please dive into the attached workspace to find out exactly how it all works. I am not sure whether this solution is optimal, or whether it is usable for other PDF files, but I had a lot of fun today.

Now it's high time I get on with my regular work!

hkingsbury
Celebrity
September 2, 2021

I've attached a sample workbench that will identify the table and the text in the table. Turning the 'visual PDF table' into an 'attribute' table will be a bit harder.

juandiegoboh
Author
September 3, 2021

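Thanks for the sample, @hkingsbury. I can follow the process, but when it comes to setting up an 'attribute' table based on an incoming table, what should I keep in mind? I have already filtered the text that is supposed to be the content of the table, as the sample shows me.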

hkingsbury
Celebrity
September 5, 2021

That's going to be quite a bit harder. You're going to have to identify each individual cell and work out which column and row it belongs to. I also note that your example has some merged cells, so that's going to add even more complexity.
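As a rough illustration of the "which column and row" step, here is a minimal Python sketch that could run inside a PythonCaller or standalone. It assumes each cell is already available as a bounding box `(xmin, ymin, xmax, ymax)` in page coordinates with y increasing upward; the function names and the tolerance are hypothetical and not part of FME or the attached workbench.

```python
def cluster(values, tol=1.0):
    """Collapse nearly equal coordinates into one representative per group."""
    reps = []
    for v in sorted(values):
        if not reps or v - reps[-1] > tol:
            reps.append(v)
    return reps


def nearest_index(reps, value):
    """Index of the representative closest to value."""
    return min(range(len(reps)), key=lambda i: abs(reps[i] - value))


def grid_positions(cells, tol=1.0):
    """Map each cell box to (row, col, rowspan, colspan).

    Rows are counted from the top of the page, columns from the left.
    A span greater than 1 indicates a merged cell.
    """
    xs = cluster([x for c in cells for x in (c[0], c[2])], tol)
    ys = cluster([y for c in cells for y in (c[1], c[3])], tol)
    ys.reverse()  # y grows upward, so the largest y is the top edge of the table
    result = []
    for xmin, ymin, xmax, ymax in cells:
        col = nearest_index(xs, xmin)
        row = nearest_index(ys, ymax)
        colspan = nearest_index(xs, xmax) - col
        rowspan = nearest_index(ys, ymin) - row
        result.append((row, col, rowspan, colspan))
    return result


if __name__ == "__main__":
    cells = [
        (0, 10, 20, 20),   # top row, merged across two columns
        (0, 0, 10, 10),    # bottom left
        (10, 0, 20, 10),   # bottom right
    ]
    print(grid_positions(cells))
```

The idea is that the distinct x and y edges of all cells define the grid lines, so a cell's position and span follow from which grid lines its edges snap to.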

juandiegoboh
Author
September 7, 2021

Thank you @hkingsbury I'll keep working to see if I can do it.

geomancer
Evangelist
Best Answer
September 8, 2021

juandiegoboh
Author
September 8, 2021

Thanks for the ideas you shared, @geomancer. They have been very helpful; it would have been much more difficult to work out on my own. Now I am trying to make the workflow a bit more dynamic, because I expect to process multiple PDFs containing tables in this workspace, and they will surely not always have 7 columns as in this case. I did some tweaking at the beginning to filter the page that contains the table (I only shared the page with the table, but the PDFs have more than one page). I will keep working until I get the desired result.

geomancer
Evangelist
September 9, 2021

Hi @juandiegoboh , glad to read you can use some of my ideas.

When your table has more than 7 columns, you can expose more attributes in AttributeExposer_H, and of course you have to add more ListSearcher_N, ListElementExtractor_N combinations.

I expect the rest of the workspace would continue to work, regardless of the number of columns.

There is one caveat I can think of: the workspace assumes the first row of the table (header row) contains no merged cells.

I'm not sure how this could be made more dynamic, but I'm very interested in your improvements. Maybe throw in some Python scripting?
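For what it's worth, a Python approach might look roughly like the sketch below. It is untested against real PDFs and assumes the table has already been turned into a list of rows of strings, with a header row that contains no merged cells (the same caveat as above). The function names and the `col_N` fallback are made up for the example.

```python
import re


def attribute_names(header):
    """Turn header texts into unique, attribute-safe names; blank headers get col_N."""
    names, seen = [], {}
    for i, text in enumerate(header, start=1):
        # Replace runs of non-word characters with "_" and trim the ends.
        name = re.sub(r"\W+", "_", text.strip()).strip("_") or f"col_{i}"
        seen[name] = seen.get(name, 0) + 1
        if seen[name] > 1:
            name = f"{name}_{seen[name]}"  # simple suffix for repeated headers
        names.append(name)
    return names


def rows_to_records(grid):
    """Use the first row as header and return one dict per remaining row."""
    header, *body = grid
    names = attribute_names(header)
    return [dict(zip(names, row)) for row in body]


if __name__ == "__main__":
    grid = [
        ["Sample ID", "Depth (m)", "", "Depth (m)"],
        ["A1", "2.5", "x", "3.0"],
    ]
    print(rows_to_records(grid))
```

Because the names come from the header row itself, the number of columns does not have to be known in advance. In a PythonCaller, each name and value pair could then be written with `feature.setAttribute(name, value)`; that part is left out here.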

Good luck!
