Solved

How to extract URLs from HTML and create features?

Forum|Forum|6 years ago
March 12, 2019
10 replies
402 views

edhere

Hi FME ninjas,

I'm using the HTTPCaller to call a website containing multiple URLs in its HTML.

I'm trying to extract all the URLs that have the format below. Based on these, I'd like to create an attribute (URL) that shows all unique URLs as features.

What I have:

What I want:

Does this make any sense ;-) ?

Thanks,

Eduard

nielsgerrits
Best Answer
Forum|Forum|6 years ago
March 12, 2019

**Update**

With the addition of @takashi, something like this might be what you need.

htmlextractor2018.fmwt

**Original**

Looks like this question?

HTMLExtractor, with a[href] as the CSS Selector. Does this work in your situation? I have not tried it yet...
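
For readers who want to see what the `a[href]` selector picks up, here is a minimal Python sketch using only the standard library. It is not how the HTMLExtractor works internally; the `LinkCollector` class, the `extract_hrefs` function and the sample HTML are made up for illustration. It collects the `href` value of every `<a>` element that carries one, which approximates the selector.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> element that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Only anchor elements are of interest, mirroring the "a" part of a[href].
        if tag != "a":
            return
        for name, value in attrs:
            # Assumption: an href without a value is skipped in this sketch.
            if name == "href" and value is not None:
                self.hrefs.append(value)
                break  # use the first href if the attribute is repeated


def extract_hrefs(html_text):
    """Return the href values of all <a href=...> elements, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs


# Hypothetical sample input, not taken from the page in the question.
sample = '<p><a href="https://example.com/a">A</a> <a name="x">no link</a> <a href="/b">B</a></p>'
print(extract_hrefs(sample))
```

On the sample string this prints `['https://example.com/a', '/b']`; the anchor without an `href` is skipped, as it would be by `a[href]`.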

Buckle your seatbelt Dorothy, cause Kansas is going bye-bye...

takashi
Forum|Forum|6 years ago
March 12, 2019

Yes, the HTMLExtractor does the trick. You can extract the URLs as a list attribute directly with this setting.

Why not inspect features with Visual/Data Preview and Feature/Record Information before writing them into a destination dataset?

edhere
Author
Forum|Forum|6 years ago
March 12, 2019

Mmm... I can't seem to get any data using the HTMLExtractors above.

Can you perhaps have a look?

This is the page I'm trying to read: https://www.primera.nl/winkels/

My workbench so far:

Cheers,

takashi
Forum|Forum|6 years ago
March 12, 2019

Yes I can.

edhere
Author
Forum|Forum|6 years ago
March 12, 2019

Oh oh... I forgot to explode the list huh?

nielsgerrits
Forum|Forum|6 years ago
March 12, 2019

Learned something new today, thanks.

edhere
Author
Forum|Forum|6 years ago
March 12, 2019

Thank you @nielsgerrits / @takashi for your quick responses.

I have exploded the URL list and I now have the data I was looking for.

Best,

jkr_wrk
Forum|Forum|6 years ago
March 12, 2019

With the following settings I get a list of 511 shops.

Indeed with a ListExploder. I always forget it is called that.
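
A rough Python sketch of the explode step, for comparison: the extracted URLs arrive as one list on a single feature, and exploding the list yields one feature per element. The sketch also drops repeated URLs, since the question asks for unique ones; that deduplication is an addition here, and `explode_unique` and the `URL` key are illustrative names, not FME API.

```python
def explode_unique(urls):
    """Turn a list of URLs into one record per unique URL, keeping first-seen order."""
    seen = set()
    features = []
    for url in urls:
        if url in seen:
            continue  # skip URLs that were already emitted
        seen.add(url)
        # Each record stands in for one feature with a single URL attribute.
        features.append({"URL": url})
    return features


# Hypothetical list, standing in for the list attribute on the extracted feature.
url_list = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
for feature in explode_unique(url_list):
    print(feature)
```

With the three sample entries this yields two records, because the repeated URL is emitted only once.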

💡 Did you know... The FeatureWriter is more intuitive. 😏

edhere
Author
Forum|Forum|6 years ago
March 12, 2019

I currently have 493.

I will try your approach, thanks for sharing.

takashi
Forum|Forum|6 years ago
March 12, 2019

Some pattern matching is available in CSS selector expressions. For example, this setting extracts only URLs which begin with "https". See here to learn more: CSS Selector Reference

a[href^="https"]
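
To illustrate the `^=` operator: it keeps an element only when the attribute value begins with the given string. Below is a small Python sketch of the same test, applied to `href` values that have already been extracted. The function name `hrefs_with_prefix` and the sample values are assumptions for illustration.

```python
def hrefs_with_prefix(hrefs, prefix="https"):
    """Keep only the href values that begin with the given prefix."""
    # str.startswith is case-sensitive, so "HTTPS://..." would not pass here.
    return [href for href in hrefs if href.startswith(prefix)]


# Hypothetical href values, not taken from the page in the question.
candidates = ["https://example.com/shop", "http://example.com/old", "/relative/path"]
print(hrefs_with_prefix(candidates))
```

This prints `['https://example.com/shop']`: plain `http://` links and relative links do not begin with "https", so they are filtered out.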

In addition, you can also download the source HTML directly with the HTMLExtractor.
