Solved

How to extract URLs from HTML and create features?



Hi FME ninjas,

I'm using the HTTPCaller to call a website containing multiple URLs in its HTML.

 

I'm trying to extract all the URLs that have the format below. Based on these, I'd like to create an attribute (URL) that holds each unique URL as a separate feature.

 

What I have:

 

What I want:

 

 

Does this make any sense ;-) ?

 

 

Thanks,

Eduard


Best answer by nielsgerrits 12 March 2019, 09:30


10 replies


**Update**

With the addition of @takashi, something like this might be what you need.

htmlextractor2018.fmwt

**Original**

Looks like this question?

Try the HTMLExtractor with a[href] as the CSS Selector. Does this work in your situation? I have not tried it yet...

Yes, the HTMLExtractor does the trick. You can extract the URLs as a list attribute directly with this setting.
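
For anyone who wants to see what the a[href] selector boils down to outside of FME, here is a minimal sketch in plain Python using only the standard library. It is not how the HTMLExtractor works internally, just an illustration of "take the href of every a element that has one". The sample HTML and the names HrefCollector and extract_hrefs are made up for this example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> element that carries one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            # Skip valueless attributes such as <a href>; keep the first href only.
            if name == "href" and value is not None:
                self.hrefs.append(value)
                break


def extract_hrefs(html):
    """Return all href values found on <a> elements, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Hypothetical sample page, not the real site from this thread.
sample = """
<ul>
  <li><a href="https://example.com/shops/amsterdam">Amsterdam</a></li>
  <li><a href="/shops/utrecht">Utrecht</a></li>
  <li><a name="anchor-only">No link</a></li>
</ul>
"""

print(extract_hrefs(sample))
```

The result is a single list of URLs, which is roughly comparable to the list attribute that the HTMLExtractor puts on one feature.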


Mmm... I can't seem to get any data using the HTMLExtractors above.

 

Can you perhaps have a look?

This is the page I'm trying to read: https://www.primera.nl/winkels/

My workbench so far:

 

 

Cheers,

 

Ed

 

Yes I can.

Oh oh... I forgot to explode the list huh?

Learned something new today, thanks.


Thank you @nielsgerrits / @takashi for your quick responses.

 

I have exploded the URL list and I now have the data I was looking for.

 

Best,

 

Ed

With the following settings I get a list of 511 shops.

Indeed with a ListExploder. I always forget it is called that.
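
To make the "explode the list" idea concrete, here is a rough sketch in plain Python (not the FME API) of turning one list of URLs into one record per unique URL. The function name explode_unique and the sample URLs are hypothetical, and the deduplication step is my own assumption about how "unique" would be handled; the thread itself only mentions the ListExploder.

```python
def explode_unique(url_list):
    """Turn a list of URLs into one record per unique URL, keeping first-seen order."""
    seen = set()
    features = []
    for url in url_list:
        if url in seen:
            # Assumption: repeated URLs should yield only one feature.
            continue
        seen.add(url)
        # Each record stands in for one feature with a single URL attribute.
        features.append({"URL": url})
    return features


# Hypothetical list, standing in for the list attribute on a single feature.
url_list = ["/shops/amsterdam", "/shops/utrecht", "/shops/amsterdam"]

for feature in explode_unique(url_list):
    print(feature)
```

Without the deduplication step, the output would simply have one record per list element, which is what exploding a list gives you on its own.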

 


I currently have 493.

 

I will try your approach, thanks for sharing.

Some pattern matching is available in CSS selector expressions (see the CSS Selector Reference to learn more). For example, this selector extracts only URLs that begin with "https":

a[href^="https"]
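
As a quick illustration of what the ^= operator means, the sketch below applies the same "value starts with" test to a list of href values in plain Python. It only mimics the matching rule; the function name hrefs_with_prefix and the sample values are invented for this example.

```python
def hrefs_with_prefix(hrefs, prefix):
    """Keep only the href values that start with the given prefix (case-sensitive)."""
    return [href for href in hrefs if href.startswith(prefix)]


# Hypothetical href values as they might appear on a page.
hrefs = [
    "https://example.com/shops/amsterdam",
    "/shops/utrecht",
    "http://example.com/old-page",
    "https://example.com/shops/utrecht",
]

print(hrefs_with_prefix(hrefs, "https"))
```

Relative links and plain http links are dropped, so a longer prefix can be used in the same way to narrow the result to one section of a site.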

In addition, you can download the source HTML directly with the HTMLExtractor.

0684Q00000ArMbuQAF.png
