Solved

Search string in pdf

November 9, 2018

idrispeiren

Hi,

I have a table containing IDs and URLs that refer to online PDFs. For each record in the table I want to search for a specific string in the associated PDF. If the string is found in the PDF, I want to retrieve the Id of the record containing the URL to that PDF. I tried using the FeatureReader and the PDF reader but didn't get far, so any help would be welcome.

Idris

Best answer by redgeographics

For some reason it won't work if I plug the URL directly into a FeatureReader, but if I use an HTTPCaller to save a local copy of the PDF and then open that copy with the FeatureReader, it does work.

pdf_searching.fmw

Note that I strongly recommend a Decelerator. You will be hitting the web server that hosts the PDFs once per feature, so that's over 1700 times for this dataset. If you do that at FME's regular speed, you might overload the server or be mistaken for a DDoS attack (I've done that once).
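
As a rough illustration of the same idea outside FME (save a local copy first, and pause between requests the way a Decelerator would), here is a minimal Python sketch. The `(record_id, url)` record layout, the function names, and the one-second default delay are assumptions made for this example; none of them come from pdf_searching.fmw.

```python
import time
import urllib.request
from pathlib import Path


def fetch_bytes(url, timeout=30):
    """Download one URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_throttled(records, target_dir, delay_seconds=1.0,
                       fetch=fetch_bytes, sleep=time.sleep):
    """Save each record's PDF locally, pausing between requests.

    records: iterable of (record_id, url) pairs (hypothetical layout).
    Returns a dict mapping record_id to the local file path.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = {}
    for index, (record_id, url) in enumerate(records):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        path = target / f"{record_id}.pdf"
        path.write_bytes(fetch(url))
        saved[record_id] = path
    return saved
```

The `fetch` and `sleep` arguments are injectable only so the function can be tried without touching a real server. A suitable delay depends on the host, so the default here is a placeholder rather than a recommendation.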

You're also very much dependent on how the PDF is structured. The first one that I've used as a sample appears to be a fairly good one, but there's no guarantee they'll all be like that. If it's a scanned form you're out of luck.

A very important parameter is in the FeatureReader: in the PDF parameters there, make sure you set the Spatial Text one to "Feature Per Block". That way it tries to make one text object per line.
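
Once each PDF has been split into text blocks, the matching step itself is small. The sketch below is plain Python, not FME, and assumes the text has already been extracted into a list of strings per record. The record layout and the sample values are made up for illustration.

```python
def ids_with_match(records, needle):
    """Return the Ids of records whose extracted text contains `needle`.

    records: iterable of (record_id, text_blocks) pairs, where text_blocks
    is a list of strings, one per extracted block (hypothetical layout).
    The comparison is a case-insensitive substring test.
    """
    wanted = needle.casefold()
    matches = []
    for record_id, text_blocks in records:
        if any(wanted in block.casefold() for block in text_blocks):
            matches.append(record_id)
    return matches


sample = [
    (101, ["Artikel 1", "Zone voor bedrijvigheid"]),
    (102, ["Artikel 1", "Woongebied"]),
    (103, []),  # e.g. a scanned PDF that yields no text at all
]
print(ids_with_match(sample, "Bedrijvigheid"))  # prints [101]
```

Because each block is tested separately, a phrase that is split across two blocks will not match. That is one reason the block setting matters when you search for more than a single word.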

hollyatsafe
November 9, 2018

Hi @idrispeiren,

Could you please provide some sample data to help us understand exactly what you are trying to achieve?

idrispeiren
November 12, 2018

Hi Holly,

Here's a screenshot of the table:

For each record in the table I want to search for a specified string (e.g. bedrijvigheid) in the PDF referenced by the attribute "Stedenbouwkundige voorschriften" (sorry for the Dutch terms). If the string has a match in the PDF, I want to keep that record. A sample file is included as an attachment: Link_pdf.gdb.zip

redgeographics
November 12, 2018

For some reason it won't work if I plug the URL directly into a FeatureReader, but if I use an HTTPCaller to save a local copy of the PDF and then open that copy with the FeatureReader, it does work.

pdf_searching.fmw

A very important parameter is in the FeatureReader: in the PDF parameters there, make sure you set the Spatial Text one to "Feature Per Block". That way it tries to make one text object per line.

FME rocks! \m/

idrispeiren
November 13, 2018

Thanks for the sample, that's what I wanted to achieve! One more question: how can I keep the initial attributes for the matched records (seems to be in the initiator port)?

redgeographics
November 13, 2018

Check the accumulation mode of the FeatureReader: if you set it to "Merge initiator and result", it should do the trick.
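
Conceptually, merging means each feature that comes out of the reader also carries the attributes of the record that triggered the read. The Python sketch below only illustrates that idea with dictionaries. It is not FME's implementation, and the attribute names, the `prefer` argument, and the choice of which side wins on a name clash are assumptions made for the example.

```python
def merge_initiator_and_result(initiator, result, prefer="result"):
    """Combine two attribute dicts into one (conceptual illustration only).

    prefer: which side keeps its value when both dicts share a key.
    This is a design choice in this sketch, not a statement about FME.
    """
    if prefer == "result":
        return {**initiator, **result}
    if prefer == "initiator":
        return {**result, **initiator}
    raise ValueError("prefer must be 'result' or 'initiator'")


# Hypothetical attributes: the table record and one text block read from its PDF.
initiator = {"Id": 101, "url": "https://example.com/101.pdf"}
result = {"text": "Zone voor bedrijvigheid"}
merged = merge_initiator_and_result(initiator, result)
print(merged["Id"], merged["text"])  # prints 101 Zone voor bedrijvigheid
```

With the Id carried along on every text feature, a match on the text is enough to identify the original record.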

idrispeiren
November 13, 2018

Indeed! Thanks again, you've been a great help!

idrispeiren1
December 20, 2018

Hi, I'm deploying the model for all of the data and I'm running into another problem. The FME model stops for some reason, although "Ignore Failed Readers" is set to "Yes".

Any suggestions here?

Idris

DSI_terreinen_in_planning_v6_stringsearch_categoriebedrijvigheid.fmw

idrispeiren1
December 20, 2018

The translation fails around feature 487,000 from the FeatureReader.

redgeographics
December 20, 2018

Try setting the "Read Images" parameter on the FeatureReader to No

idrispeiren1
December 21, 2018

Ok, that did the trick!

idrispeiren1
December 24, 2018

Something different now: "PDF Reader: Failed to open document xxx.pdf' because the file is not in PDF format, or because it is corrupted."

Ignore failed readers is set to yes, but the translation is aborted.
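
One possible workaround, sketched here in plain Python and not taken from this thread, is to check each downloaded file before it reaches the reader and skip anything that does not look like a PDF. A PDF file begins with the marker `%PDF-`; this sketch looks for it within the first 1024 bytes to tolerate a few leading bytes. The function name and the idea of filtering before reading are assumptions, and the check will not catch a file that is damaged further in.

```python
PDF_MARKER = b"%PDF-"


def looks_like_pdf(path, probe_bytes=1024):
    """Return True if the PDF marker appears near the start of the file.

    This only inspects the header; it cannot tell whether the rest of the
    document is intact. Unreadable or missing files count as "not a PDF".
    """
    try:
        with open(path, "rb") as handle:
            head = handle.read(probe_bytes)
    except OSError:
        return False
    return PDF_MARKER in head
```

A file that fails this test (for example, an HTML error page saved under a .pdf name) could then be routed aside and logged rather than sent to the reader.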

xiaomengatsafe
January 8, 2019

Hi @idrispeiren1, Thank you for pointing this out!

It looks like the "Ignore Failed Readers" setting may only take standalone readers into consideration and may not have control over the FeatureReader transformer (or other transformers that read data). I found this idea, which suggests redesigning the behavior of this parameter: https://knowledge.safe.com/content/idea/41137/revisit-ignore-failed-readers-setting.html

It would be appreciated if you could add your use case and desired behavior to that idea.

It doesn't appear that this parameter is very widely used, so I don't know when we will have the resources to work on this improvement. But your feedback on it will help us understand how people want to use it. Thank you!

idrispeiren
January 9, 2019

Hi Xiaomeng, hopefully your suggestion is picked up in the near future. You have my vote!
