
Hi,

 

I have a table containing IDs and URLs that refer to online PDFs. For each record in the table I want to search for a specific string in the associated PDF. If the string is found in the PDF, I want to retrieve the ID of the record that holds the URL to that PDF. I tried using the FeatureReader and the PDF reader but didn't get far, so any help would be welcome.

Idris

 

Hi @idrispeiren,

Could you please provide some sample data to help us understand exactly what you are trying to achieve?


Hi Holly,

Here's a screenshot of the table:

 

For each record in the table I want to search for a specified string (e.g. bedrijvigheid) in the PDF referenced by the attribute "Stedenbouwkundige voorschriften" (sorry for the Dutch terms). If the string is found in the PDF, I want to keep that record. A sample file is included as an attachment: Link_pdf.gdb.zip


For some reason it won't work if I plug the URL directly into a FeatureReader, but if I use an HTTPCaller to save a local copy of the PDF and then open that copy with the FeatureReader, it does work.

pdf_searching.fmw

Note that I strongly recommend a Decelerator. You will be hitting the web server that hosts the PDFs once per feature, so that's over 1700 times for this dataset. If you do that at FME's regular speed, you might overload the server or be mistaken for a DDoS attack (I've done that once).
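
To make the idea behind the Decelerator concrete, here is a minimal Python sketch of the same pattern outside FME: one request per record, with a fixed pause between requests. The function names, the one-second default delay and the record layout are all assumptions for illustration; this is not what the workspace does internally.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable, Tuple


def fetch_with_urllib(url: str) -> bytes:
    """Download one document and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def fetch_all_throttled(
    records: Iterable[Tuple[str, str]],
    fetch: Callable[[str], bytes] = fetch_with_urllib,
    delay_seconds: float = 1.0,  # assumed value, tune to what the server tolerates
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Fetch one document per (record_id, url) pair, pausing between requests."""
    results: Dict[str, bytes] = {}
    for index, (record_id, url) in enumerate(records):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results[record_id] = fetch(url)
    return results
```

With roughly 1700 records and a one-second pause, the downloads alone would take close to half an hour, which is the trade-off for not hammering the server.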

You're also very much dependent on how the PDF is structured. The first one that I've used as a sample appears to be a fairly good one, but there's no guarantee they'll all be like that. If it's a scanned form you're out of luck.

One very important parameter is in the FeatureReader: in the PDF parameters there, make sure you set Spatial Text to "Feature Per Block". That way it tries to make one text object per line.
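
As a rough illustration of the matching step, here is a small Python sketch. It assumes the text has already been extracted from each PDF as a list of text blocks per record ID (for example, one string per text feature), and that a case-insensitive substring match is good enough. The function name and the sample IDs are made up for the example.

```python
from typing import Dict, Iterable, List


def ids_with_match(text_blocks_by_id: Dict[str, Iterable[str]], needle: str) -> List[str]:
    """Return the IDs of records where any text block contains the search string."""
    needle_folded = needle.casefold()  # assumption: matching ignores case
    return [
        record_id
        for record_id, blocks in text_blocks_by_id.items()
        if any(needle_folded in block.casefold() for block in blocks)
    ]


# Hypothetical sample data: record ID -> text blocks extracted from its PDF
sample = {
    "1": ["Zone voor bedrijvigheid", "Artikel 2"],
    "2": ["Woongebied"],
    "3": ["BEDRIJVIGHEID toegelaten"],
}
print(ids_with_match(sample, "bedrijvigheid"))
```

Because each block is tested on its own, a word that is split across two text blocks will not be found, which is another reason the structure of the PDF matters.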


Thanks for the sample, that's what I wanted to achieve! One more question: how can I keep the initial attributes for the matched records (seems to be in the initiator port)?


Check the Accumulation Mode of the FeatureReader: if you set it to "Merge Initiator and Result", that should do the trick.
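
Conceptually, merging means that each feature read from the PDF also carries the attributes of the table record that triggered the read. A minimal Python sketch of that idea, using plain dictionaries as stand-ins for features; the attribute names are hypothetical, and letting the result win on a name clash is a choice made for this sketch, not a statement about how FME resolves conflicts.

```python
from typing import Any, Dict


def merge_initiator_and_result(
    initiator: Dict[str, Any], result: Dict[str, Any]
) -> Dict[str, Any]:
    """Combine the attributes of the triggering record with those of a read feature."""
    merged = dict(initiator)  # start from the original table attributes
    merged.update(result)     # sketch assumption: result values win on a name clash
    return merged


# Hypothetical attribute names for illustration only
initiator = {"Id": 42, "url": "https://example.com/plan.pdf"}
result = {"text_string": "Zone voor bedrijvigheid"}
print(merge_initiator_and_result(initiator, result))
```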


Indeed! Thanks again, you've been a great help!


Hi, I'm deploying the model for all of the data and I'm running into another problem. The FME model stops for some reason, even though "Ignore Failed Readers" is set to "Yes".

Any suggestions here?

Idris

DSI_terreinen_in_planning_v6_stringsearch_categoriebedrijvigheid.fmw

 


The translation fails around feature 487.000 coming from the FeatureReader.


Try setting the "Read Images" parameter on the FeatureReader to No


Ok, that did the trick!


Something different now: "PDF Reader: Failed to open document 'xxx.pdf' because the file is not in PDF format, or because it is corrupted."

"Ignore Failed Readers" is set to Yes, but the translation is still aborted.


Hi @idrispeiren1, Thank you for pointing this out!

 

It looks like the "Ignore Failed Readers" setting may only take standalone readers into consideration and may not have control over the FeatureReader transformer (or other transformers that read data). I found this idea, which suggests redesigning the behavior of this parameter: https://knowledge.safe.com/content/idea/41137/revisit-ignore-failed-readers-setting.html

 

It would be appreciated if you could add your use case and desired behavior to that idea.

 

It doesn't appear that this parameter is very widely used, so I don't know when we will have the resources to work on this improvement. But your feedback on it will help us understand how people want to use it. Thank you!
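
In the meantime, one possible workaround is to filter out files that are clearly not PDFs before they reach the FeatureReader. This assumes the files are first saved locally, as in the HTTPCaller approach earlier in this thread. A PDF normally starts with the signature %PDF-, whereas an error page saved under a .pdf name does not. Below is a minimal Python sketch of such a check; the function names and the 1024-byte search window are assumptions for the example, and a file that passes can still be corrupted further on.

```python
from typing import Dict, List, Tuple

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes, search_window: int = 1024) -> bool:
    """Return True if the PDF signature appears near the start of the data."""
    return PDF_MAGIC in data[:search_window]


def split_downloads(downloads: Dict[str, bytes]) -> Tuple[List[str], List[str]]:
    """Split record IDs into (accepted, rejected) based on the header check."""
    accepted: List[str] = []
    rejected: List[str] = []
    for record_id, data in downloads.items():
        (accepted if looks_like_pdf(data) else rejected).append(record_id)
    return accepted, rejected
```

Only the accepted records would then be passed on for reading, while the rejected ones could be logged for manual inspection.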

Hi Xiaomeng, hopefully your suggestion is picked up in the near future. You have my vote!

