
I have been using a workflow that was previously posted here to try to extract text representing a table from a PDF and write out a worksheet (one per table) to Excel. I have been able to extract and isolate the desired text, but I cannot figure out how to organize the records properly so they can be written to the Excel cells.

Hi @mcgregrr

Unfortunately, extracting information from a PDF (and in particular from tables in a PDF) can be quite difficult, as the information is often not grouped in a logical manner. A workflow that works for one portion of your file may not work for another, so some manual editing is likely required.

For the table information you are trying to extract, I would recommend organizing it by the rows and columns of each table. For rows, you could group by the _y position using a transformer such as an Aggregator.
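Outside of FME, the same row-grouping idea can be sketched in Python. This is only an illustration: the fragment dictionaries with `text`, `x` and `y` keys, the `group_rows` name and the `tolerance` value are all assumptions, and it also assumes that larger y values are nearer the top of the page. A tolerance is used because text on the same visual row often has slightly different y values.

```python
def group_rows(fragments, tolerance=2.0):
    """Cluster text fragments into rows by their y position.

    fragments: list of dicts with hypothetical keys "text", "x", "y".
    tolerance: maximum y difference (assumed units) within one row.
    """
    rows = []
    # Assumption: larger y is nearer the top of the page.
    for frag in sorted(fragments, key=lambda f: -f["y"]):
        if rows and abs(rows[-1]["y"] - frag["y"]) <= tolerance:
            rows[-1]["items"].append(frag)
        else:
            # Start a new row anchored at this fragment's y.
            rows.append({"y": frag["y"], "items": [frag]})
    # Within each row, order fragments left to right.
    for row in rows:
        row["items"].sort(key=lambda f: f["x"])
    return [[f["text"] for f in row["items"]] for row in rows]


fragments = [
    {"text": "Name", "x": 10, "y": 100.0},
    {"text": "Qty", "x": 80, "y": 100.4},
    {"text": "Bolt", "x": 10, "y": 88.0},
    {"text": "12", "x": 84, "y": 87.6},
]
rows = group_rows(fragments)
```

If the y values in your data are exactly equal within a row, a plain group-by on _y (which is what the Aggregator does) is enough and the tolerance is unnecessary.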

It will be slightly trickier to figure out the columns, as the data within a column will not all share the same _x position. You could use the _x positions of the column headers to work out a range of _x values that contains the data for each column.
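One way to turn header positions into column ranges is to place a boundary halfway between each pair of neighbouring headers. The sketch below is only one possible rule, not necessarily what the attached workspace does; the function names and sample x values are made up for illustration.

```python
import bisect


def column_bounds(header_xs):
    """Return boundaries halfway between consecutive header x positions."""
    xs = sorted(header_xs)
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]


def column_index(x, bounds):
    """Return the 0-based column whose x range contains x."""
    return bisect.bisect_right(bounds, x)


# Hypothetical header x positions for a three-column table.
bounds = column_bounds([10, 80, 150])
first = column_index(12, bounds)
second = column_index(84, bounds)
third = column_index(160, bounds)
```

The midpoint rule can misplace values in right-aligned or very wide columns, so the boundaries may need manual adjustment for each table.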

With this row and column information, it should be possible to arrange the data into a logical grid, which will make it easier to write out. I have attached an example workspace that attempts to do this for one of the tables in your PDF: pdftableextracttoexcel.fmw. I hope this helps.
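As a rough illustration of the 'organize' step, once every text fragment has a row index and a column index it can be placed into a grid in which each inner list corresponds to one worksheet row. The `build_grid` name and the `(row, column, text)` tuple layout are assumptions made for this sketch.

```python
def build_grid(cells, n_cols):
    """Place (row_index, col_index, text) tuples into a row-by-column grid.

    cells must be a list, since it is iterated twice.
    """
    if not cells:
        return []
    n_rows = max(r for r, _, _ in cells) + 1
    grid = [[""] * n_cols for _ in range(n_rows)]
    for r, c, text in cells:
        # Fragments that land in the same cell are joined with a space.
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


cells = [
    (0, 0, "Name"), (0, 1, "Qty"),
    (1, 0, "Hex"), (1, 0, "bolt"), (1, 1, "12"),
]
grid = build_grid(cells, n_cols=2)
```

Each inner list can then be written out as one row, with one value per cell.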

