I'm struggling to transform some pdfs. I am using the PDFtoTEXT transformer as well as a textwriter, stringsearcher, maybe calling to python, and then writing it to a database. I'm flummoxed at the first step.

Sometimes it'll write a pdf to a textfile. But often the pdf2text program box pops up. That's useless for automation. What I want is for it to read a whole folder of texts - found the place to specify that - and process each of them into a complex row in my database with many different values. The pdf contains a couple values in the headers and many more in a table-like layout. I will need a couple string parsing processes to get through all that. No idea how to get it to read the pdf name and put that in one cell in my db.

But the data inspectors are useless. The interface says the transformation was successful - but nothing was written to the output.txt.

So - issues:

- how to process multiple texts

- how to see what's going on so I know it's working properly

- how to write to my database.

I understand textfiles and regex. That's not the problem. FME is driving me nuts. I am trying to learn FME.

Thanks,

Marsh

I haven't worked with the PDF-reader yet, but getting the PDF (file)name is simple:

You can turn on the attribute 'fme_basename' in the Format Attributes of your reader, or use an AttributeExposer to expose it. It should show you the filename of the PDF.
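If you end up doing part of the work in Python anyway, the same value can be derived from a full path. This is only a sketch: the function name and sample paths are made up, and it assumes the attribute holds the file name without folder or extension, which is worth confirming in the Data Inspector for your FME version.

```python
from pathlib import PureWindowsPath


def basename_without_extension(path: str) -> str:
    """Return the file name with the folder and the final extension removed.

    PureWindowsPath accepts both backslashes and forward slashes, so it
    works for Windows-style paths regardless of where the script runs.
    """
    return PureWindowsPath(path).stem


# Hypothetical sample paths, for illustration only.
backslash_result = basename_without_extension(r"C:\data\reports\site_0042.pdf")
forward_result = basename_without_extension("C:/data/reports/site_0042.pdf")
```

Both calls give `site_0042`, which is the value you would want in the file name column of your table.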


One good test for whether PDFs are easily readable is to try to copy-paste the text in a PDF viewer (such as Chrome, Firefox, or Adobe Acrobat).

If the pasted text is garbled (or if you can't select the text at all), then you're probably going to have to use OCR like Tesseract (there's a transformer for that). Otherwise if the pasted text looks good, then tools such as pdf2text should probably work.

By the way, if you post your workspace and data here (or a sample workspace with a similar idea to what you're trying to do), we can probably help you a lot more.

A note: it seems like you aren't using FME 2018.0, right? There's a new PDFReader in 2018.0 (trial here) that should hopefully give you more feedback than an external tool like pdf2text; I'd encourage you to give it a try!


The pdf is very readable. I used the physical layout setting. Multiple times I have gotten it to create a pdf. The new pdfreader is just the old pdfreader in a slightly new wrapper. It is still based on the xpdf tool pdfTOtext. I got 2018 because I was told the pdf tool was different. It really isn't. And the xpdf interface pops up and gets in the way quite often. It doesn't even show the same parameters I set in the FME wrapper (see image). Maybe it's not even doing anything. No idea why it pops up. It seems I have to dismiss it for the process to complete.

I've added a featurewriter, hoping to see the output of this process, but it just overwrites the specified output file with a blank file. I thought it was doing nothing at all, but after putting some test text in there, I see it overwrote it.

I'm trying to develop the process with a single pdf, but ultimately I have hundreds or thousands of them. I see the dropdown tab in the pdf2text reader, where I can specify a folder rather than an individual file. Although I now know what the file name attribute is, I can't see how to write it somewhere that will ultimately become the value in the first field of my Access database table. I want to parse the text, build that list of values, and lay them down as a new record, ultimately.
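For what it's worth, the "parse the text, build the list of values, write one record" step could look roughly like this in plain Python. It is a sketch under several assumptions: the `Site:` and `Date:` header labels, the `reports` table, and its column names are all invented for illustration, and an in-memory SQLite database stands in for Access so the example runs anywhere.

```python
import re
import sqlite3

# Hypothetical header patterns; the real PDFs will need their own.
HEADER_PATTERNS = {
    "site": re.compile(r"^Site:\s*(.+?)\s*$", re.MULTILINE),
    "report_date": re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE),
}


def parse_report(source_file, text):
    """Build one record: the file name first, then each header value (or None)."""
    record = {"source_file": source_file}
    for field, pattern in HEADER_PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1) if match else None
    return record


def write_record(conn, record):
    """Insert one parsed record as a new row."""
    conn.execute(
        "INSERT INTO reports (source_file, site, report_date) "
        "VALUES (:source_file, :site, :report_date)",
        record,
    )


# SQLite is a stand-in for the Access table (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (source_file TEXT, site TEXT, report_date TEXT)")

# Invented sample of what the extracted text might look like.
sample_text = "Site: North Marsh\nDate: 2018-03-14\n\nParameter   Value\npH          7.2\n"
write_record(conn, parse_report("site_0042", sample_text))
conn.commit()

rows = conn.execute("SELECT source_file, site, report_date FROM reports").fetchall()
```

The point of the design is that the file name travels with the parsed values in the same record from the start, so it lands in the first column without any extra join or lookup later.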

@marshwetland I promise the new PDFReader is not a wrapper around pdf2text. I should know: I wrote the PDFReader :)

Maybe you're still using the PDF2TextReader custom transformer from FMEHub?

If you still want to use that transformer, and your problem is not being able to specify a directory, you might want to give the "Directory and File Pathnames" reader a try. For each path it returns, you can then invoke a child workspace that publishes a parameter named something like "PDF_FILENAME", and use that parameter in the PDF2TextReader's "PDF File(s)" parameter.

To do this with the new PDFReader is probably easier: you can just send the pathname features to a FeatureReader that calls the PDFReader (where the PDFReader dataset is the value of the path_unix attribute).
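If it helps to picture what the pathname features are, here is a rough Python equivalent of the "one feature per file in a folder" idea. It is a sketch only, not how FME implements it: the function names are made up, the `path_unix` key just mirrors the attribute name mentioned above, and the demo folder is a throwaway temporary directory.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside folder (extension match ignores case)."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def describe(path):
    """Build one record per file: a forward-slash path plus the bare file name."""
    return {"path_unix": path.as_posix(), "basename": path.stem}


# Demo on a temporary folder with two PDFs and one unrelated file.
with tempfile.TemporaryDirectory() as folder:
    for name in ("site_0001.pdf", "SITE_0002.PDF", "notes.txt"):
        (Path(folder) / name).write_bytes(b"")
    records = [describe(p) for p in list_pdfs(folder)]

basenames = {r["basename"] for r in records}
```

Each record would then be handed to whatever reads the PDF, which is the role the FeatureReader plays in the workspace.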

edit: Just for clarity, the full name of the new reader is the "Adobe Geospatial PDF Reader". It reads many kinds of PDFs though, not solely geospatial ones.