Solved

How would I read tables from multiple PDFs if the tables differ and the PDFs are not tagged?

  • 2 September 2021
  • 7 replies

Badge +1

Hello, I am trying to create a workspace that batch-extracts the information contained in tables within some PDF files. The PDFs also contain other information as text, so I am a little confused about how to achieve that, and also about whether it will work with different tables: the dataset was made by multiple authors, so the table differs in each PDF. I am wondering whether it would be better to use an external application or a PythonCaller, but I would like to confirm before trying that out. I am attaching a sample PDF from my dataset so you can see the table in it. Thank you!

Best answer by geomancer 8 September 2021, 16:29

Userlevel 5
Badge +29

I've attached a sample workbench that will identify the table and the text in the table. Turning the 'visual PDF table' into an 'attribute' table will be a bit harder.

Badge +1

Thanks for the sample, @hkingsbury. I can follow the process, but when it comes to setting up an 'attribute' table based on an incoming table, what should I keep in mind? I have already filtered the text that is supposed to be the content of the table.

Userlevel 5
Badge +29

That's going to be quite a bit harder. You're going to have to identify each individual cell and work out which column and row it belongs to. I also note that your example has some merged cells, so that's going to add even more complexity.
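To make the row/column step concrete, here is a minimal Python sketch of one way it could be done, for example inside a PythonCaller. It assumes that the positions of the table's grid lines and the anchor point of each text have already been extracted; the function names, the `(text, x, y)` tuple layout, and the convention that rows are numbered in order of increasing y are all assumptions made for illustration, not something taken from the attached workbench.

```python
from bisect import bisect_right


def locate(boundaries, value):
    """Return the index of the interval in sorted `boundaries` that contains
    `value`, or None if the value lies outside the grid."""
    i = bisect_right(boundaries, value) - 1
    if 0 <= i < len(boundaries) - 1:
        return i
    return None


def assign_cells(texts, x_lines, y_lines):
    """Map each (text, x, y) tuple to a (row, col) grid cell.

    x_lines / y_lines are the positions of the vertical / horizontal grid
    lines (hypothetical inputs). Rows are numbered in order of increasing y,
    so flip the y values first if your coordinates run bottom-up.
    """
    x_lines = sorted(x_lines)
    y_lines = sorted(y_lines)
    cells = {}
    for text, x, y in texts:
        row = locate(y_lines, y)
        col = locate(x_lines, x)
        if row is None or col is None:
            continue  # text outside the table, e.g. body text on the page
        cells.setdefault((row, col), []).append(text)
    return cells


if __name__ == "__main__":
    sample = [("Name", 10, 5), ("Depth", 110, 5), ("A1", 10, 25),
              ("12", 110, 25), ("m", 150, 25), ("footer", 10, 90)]
    print(assign_cells(sample, [0, 100, 200], [0, 20, 40]))
```

Several texts that fall in the same cell end up in one list, so they can be joined later. Texts outside the outer grid lines are skipped, which may also help to separate the table from the surrounding body text.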

Badge +1

Thank you, @hkingsbury. I'll keep working to see if I can do it.

Userlevel 3
Badge +33

Hi @juandiegoboh, as I had a slow day at the office today, I decided to take on this challenge. And with some success!

I couldn't have done this without the basics provided by @hkingsbury.

There were several hurdles to overcome along the way:

  • Recreate the cells of the table;
  • Create a single text from the multiple texts in one cell;
  • Think of a way to handle the merged cells;
  • Think of a way to create the attributes to write the texts to.

These challenges led to many wrong turns and dead ends along the way.

For merged cells, the text is written only into the attribute representing the topmost, leftmost cell; the other attributes are left empty.
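For readers who prefer code, the merged-cell rule and the joining of multiple texts per cell can be sketched in Python roughly as follows. This is only an illustration of the idea, not how the attached workspace does it (that uses transformers). The `cells` dictionary, the `merged_spans` tuples, and the function name are hypothetical, and detecting which cells are merged is assumed to have happened already.

```python
def build_table(cells, n_rows, n_cols, merged_spans=()):
    """Turn {(row, col): [text, ...]} into a list of rows of strings.

    merged_spans is a sequence of (top, left, bottom, right) tuples with
    inclusive indices, one per merged cell (hypothetical input).
    """
    # Create a single text from the multiple texts in one cell.
    table = [[" ".join(cells.get((r, c), [])) for c in range(n_cols)]
             for r in range(n_rows)]

    for top, left, bottom, right in merged_spans:
        parts = []
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                if table[r][c]:
                    parts.append(table[r][c])
                table[r][c] = ""  # every cell of the span starts out empty
        # Only the topmost, leftmost cell of the span receives the text.
        table[top][left] = " ".join(parts)
    return table


if __name__ == "__main__":
    cells = {(0, 0): ["Site"], (0, 1): ["Depth"],
             (1, 1): ["12", "m"], (2, 0): ["A1"], (2, 1): ["15 m"]}
    for row in build_table(cells, 3, 2, merged_spans=[(1, 0, 2, 0)]):
        print(row)
```

In the example, the text "A1" sits in the lower half of a cell spanning two rows; after processing it appears in the upper row only, and the lower row is left empty.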

 

Please dive into the attached workspace to find out exactly how it all works. I am not sure whether this solution is optimal, or whether it is usable for other PDF files, but I had a lot of fun today.

 

Now it's high time I get on with my regular work!

Badge +1

Thanks for the ideas you shared, @geomancer. They have been very helpful; it would have been much more difficult to understand on my own. Now I am trying to make the workflow a bit more dynamic, because I expect to process multiple PDFs containing tables in this workspace, and they will surely not always have 7 columns as in this case. I did some tweaking at the beginning to filter the page that contains the table (I only shared the page with the table, but the PDFs have more than one page). I will keep working until I get the desired result.

Userlevel 3
Badge +33

Hi @juandiegoboh, glad to read that you can use some of my ideas.

When your table has more than 7 columns, you can expose more attributes in AttributeExposer_H, and of course you have to add more ListSearcher_N, ListElementExtractor_N combinations.

I expect the rest of the workspace would continue to work, regardless of the number of columns.

 

There is one caveat I can think of: the workspace assumes that the first row of the table (the header row) contains no merged cells.

I'm not sure how this could be made more dynamic, but I'm very interested in your improvements. Maybe throw in some Python scripting?
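As a starting point for the scripting route, here is a minimal Python sketch of how the attribute names could be derived from the header row, so that the number of columns no longer has to be fixed at 7. It assumes the table is already available as a list of rows of strings and, like the workspace, that the header row contains no merged cells. The function name and the `column_N` fallback for empty headers are made up for illustration.

```python
def rows_to_records(table):
    """Use the first row as attribute names and return one dict per data row.

    Works for any number of columns. Empty header cells get a generated name
    and repeated header names get a numeric suffix (both are assumptions).
    """
    if not table:
        return []
    header, *body = table

    names = []
    seen = {}
    for i, raw in enumerate(header):
        name = raw.strip() or f"column_{i + 1}"
        seen[name] = seen.get(name, 0) + 1
        names.append(name if seen[name] == 1 else f"{name}_{seen[name]}")

    records = []
    for row in body:
        # Pad short rows so that every record has every attribute.
        padded = list(row) + [""] * (len(names) - len(row))
        records.append(dict(zip(names, padded)))
    return records


if __name__ == "__main__":
    table = [["Site", "Depth", ""],
             ["A1", "12 m", "x"],
             ["A2", "15 m"]]
    for record in rows_to_records(table):
        print(record)
```

Inside a PythonCaller, each record could then presumably be written to an output feature with `feature.setAttribute(name, value)`. The attributes would still need to be exposed downstream, since their names are only known at run time.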

Good luck!
