Solved

How to extract URLs from HTML and create features?

6 years ago
March 12, 2019
10 replies
321 views

edhere
60 replies

Hi FME ninjas,

I'm using the HTTPCaller to call a website containing multiple URLs in its HTML.

I'm trying to extract all the URLs that have the format below. Based on these, I'd like to create an attribute (URL) and output each unique URL as a separate feature.

What I have:

What I want:

Does this make any sense ;-) ?

Thanks,

Eduard

+54

nielsgerrits
2847 replies
Best Answer
6 years ago
March 12, 2019

**Update**

With the addition of @takashi, something like this might be what you need.

htmlextractor2018.fmwt

**Original**

Looks like this question?

Try the HTMLExtractor with a[href] as the CSS Selector. Does this work in your situation? I haven't tried it yet...

takashi
7704 replies
6 years ago
March 12, 2019

Yes, the HTMLExtractor does the trick. You can extract the URLs as a list attribute directly with this setting.
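For readers who want to see what the a[href] selector amounts to outside FME, here is a minimal sketch in plain Python. It only approximates the idea (collect the href of every anchor tag into a list) and is not how the HTMLExtractor is implemented. The class name, function name, and sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href values from <a> tags, roughly like the selector a[href]."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Anchors without an href never reach this branch; a bare
            # valueless href (value is None) is skipped in this sketch.
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(html):
    """Return all anchor href values in document order (duplicates kept)."""
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


# Hypothetical sample page, not the real HTML from the thread.
sample_html = """
<ul>
  <li><a href="https://example.com/shops/a">Shop A</a></li>
  <li><a href="/shops/b">Shop B</a></li>
  <li><a name="no-link">No href</a></li>
</ul>
"""

print(extract_hrefs(sample_html))
```

The result is one list of values, which matches the "list attribute" described above: a single record holding many URLs.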

edhere
Author
60 replies
6 years ago
March 12, 2019

Mmm... I can't seem to get any data using the HTMLExtractors above.

Can you perhaps have a look?

This is the page I'm trying to read: https://www.primera.nl/winkels/

My workbench so far:

Cheers,

takashi
7704 replies
6 years ago
March 12, 2019

Yes I can.

edhere
Author
60 replies
6 years ago
March 12, 2019

takashi wrote:

Yes I can.

Oh oh... I forgot to explode the list huh?

+54

nielsgerrits
2847 replies
6 years ago
March 12, 2019

takashi wrote:

Yes, the HTMLExtractor does the trick. You can extract the URLs as a list attribute directly with this setting.

Learned something new today, thanks.

edhere
Author
60 replies
6 years ago
March 12, 2019

Thank you @nielsgerrits / @takashi for your quick responses.

I have exploded the URL list and I now have the data I was looking for.

Best,

+29

jkr_wrk
381 replies
6 years ago
March 12, 2019

edhere wrote:

Oh oh... I forgot to explode the list huh?

With the following settings I get a list of 511 shops.

Indeed with a ListExploder. I always forget it is called that.
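As a rough illustration of what exploding the list means, here is a small Python sketch: one record carrying a list of URLs becomes one record per URL. The duplicate removal is an extra step added here because the original question asked for unique URLs; it is not something the thread attributes to the ListExploder. The function name, dictionary keys, and sample paths are all hypothetical.

```python
def explode_unique(feature, list_key="urls", out_key="URL"):
    """Turn one feature holding a list into one feature per unique element."""
    seen = set()
    rows = []
    for url in feature[list_key]:
        if url in seen:
            continue  # drop repeats, keeping the first occurrence
        seen.add(url)
        # Copy every other attribute so each new feature keeps its context.
        row = {k: v for k, v in feature.items() if k != list_key}
        row[out_key] = url
        rows.append(row)
    return rows


# Made-up sample: the shop paths are placeholders, not real page content.
feature = {
    "source": "https://www.primera.nl/winkels/",
    "urls": ["/winkels/a", "/winkels/b", "/winkels/a"],
}

rows = explode_unique(feature)
for row in rows:
    print(row)
```

Differences in totals such as the ones mentioned in this thread could come from whether duplicates are removed or which links the selector matches, but that is a guess and not something verified here.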

💡 Did you know... The FeatureWriter is more intuitive. 😏

edhere
Author
60 replies
6 years ago
March 12, 2019

jkr_da wrote:

With the following settings I get a list of 511 shops.

Indeed with a ListExploder. I always forget it is called that.

I currently have 493.

I will try your approach, thanks for sharing.

takashi
7704 replies
6 years ago
March 12, 2019

edhere wrote:

Mmm... I can't seem to get any data using the HTMLExtractors above.

Can you perhaps have a look?

This is the page I'm trying to read: https://www.primera.nl/winkels/

My workbench so far:

Cheers,

Some pattern matching is available in CSS selector expressions. For example, this setting extracts only the URLs that begin with "https". See here to learn more: CSS Selector Reference

a[href^="https"]

You can also download the source HTML directly with the HTMLExtractor.
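To make the pattern matching concrete, here is a small Python sketch of the three substring operators that CSS attribute selectors offer: ^= (begins with), $= (ends with), and *= (contains). It works on a plain list of href strings and is only meant to show what each operator tests. The function name and the sample list are assumptions for illustration, and only the first URL comes from this thread.

```python
def attr_matches(value, operator, operand):
    """Mimic the CSS attribute selector operators ^=, $= and *=."""
    if not operand:
        # In CSS, an empty operand matches nothing for these operators.
        return False
    if operator == "^=":
        return value.startswith(operand)
    if operator == "$=":
        return value.endswith(operand)
    if operator == "*=":
        return operand in value
    raise ValueError(f"unsupported operator: {operator}")


# The second and third entries are made-up examples.
hrefs = [
    "https://www.primera.nl/winkels/",
    "/winkels/",
    "http://example.com/page.html",
]

# Same idea as a[href^="https"]: keep only values starting with "https".
https_only = [h for h in hrefs if attr_matches(h, "^=", "https")]
print(https_only)
```

Note that relative links such as /winkels/ are dropped by the "https" prefix test, so a contains or ends-with test may suit pages that use relative URLs.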
