Solved

Extract the href from Website

8 years ago
January 27, 2017
7 replies
44 views

+45

danilo_fme
Evangelist
2054 replies

Hello all,

I have a website that contains links to XLS files:

http://www.economia-sniim.gob.mx/nuevo/

How can I read this website and automatically extract these XLS files?

Attached is my workspace, which uses the Creator and HTTPCaller transformers.

Thanks in advance,

Best answer by david_r

What you need is called web scraping. It may be possible to accomplish your goal using an HTTPCaller combined with a StringSearcher and regular expressions, but I don't think it will be straightforward, since the site in question uses iframes and some server-side redirections just to make your life difficult.

For your specific web site you may have to resort to Python and some fancy parsing to get to your links.
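
As a rough illustration of the Python route, here is a minimal sketch that pulls `.xls`/`.xlsx` links out of an HTML string using only the standard library. The sample markup, the `example.com` base URL, and the names `XlsLinkParser` and `extract_xls_links` are assumptions made for illustration; the real page's markup is not shown in this thread and may need different handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class XlsLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at .xls/.xlsx files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith((".xls", ".xlsx")):
            # Resolve relative paths against the URL of the page.
            self.links.append(urljoin(self.base_url, href))


def extract_xls_links(html, base_url):
    """Return absolute URLs of all spreadsheet links found in html."""
    parser = XlsLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical markup, for illustration only.
SAMPLE_HTML = """
<html><body>
  <a href="files/prices.xls">Prices</a>
  <a href="/nuevo/docs/report.XLS?v=2">Report</a>
  <a href="about.html">About</a>
</body></html>
"""

links = extract_xls_links(SAMPLE_HTML, "http://www.example.com/nuevo/")
for link in links:
    print(link)
```

The HTML itself still has to be fetched first, which is the hard part on this particular site because of the iframes and redirections.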


jeroenstiers
178 replies
8 years ago
January 27, 2017

Hi @danilo_inovacao

I've taken a look at the URL you provided. Once you can extract the paths to the .xls files, the issue is solved. But since the page uses an iframe, the HTTPCaller doesn't seem to find those paths.

I can, however, view the page with the .xls paths in the following way:

view-source:http://www.economia-sniim.gob.mx/nuevo/AdministracionSitio/ListaServicios.aspx

However, I haven't succeeded in parsing this page in FME, either via the HTTPCaller or via Python's urllib2 module. Maybe someone else knows how to do this?
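
The iframe indirection described above can be sketched as follows: parse the outer page, collect the `src` of each `iframe`, and resolve it against the outer page's URL to get the address of the page that actually holds the links. The markup, the `example.com` URL, and the names `IframeSrcParser` and `iframe_urls` are hypothetical. This only locates the inner page; it does not get around any redirection the server applies when that page is requested directly.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcParser(HTMLParser):
    """Collect the src attribute of every <iframe> and <frame> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("iframe", "frame"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(html, page_url):
    """Return absolute URLs of the documents embedded in html."""
    parser = IframeSrcParser()
    parser.feed(html)
    return [urljoin(page_url, src) for src in parser.sources]


# Hypothetical outer page, for illustration only.
OUTER_HTML = '<html><body><iframe src="inner/list.aspx"></iframe></body></html>'

frames = iframe_urls(OUTER_HTML, "http://www.example.com/nuevo/")
print(frames)
```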

david_r
8332 replies
Best Answer
8 years ago
January 27, 2017

For your specific web site you may have to resort to Python and some fancy parsing to get to your links.

david_r
8332 replies
8 years ago
January 27, 2017

jeroenstiers wrote:

Hi @danilo_inovacao

I've taken a look at the URL you provide. Once you can extract the paths to the .xls file, the issue is solved. But since the page uses an iframe, the HTTPCaller doesn't seem to find those paths.

I can however visualise the page, with the xls-paths in the following way:

If you try to read that URL using the HTTPCaller, you won't get what you expect. The site automatically redirects you to its site map, which is what the HTTPCaller returns.

It looks like the site admins made an effort to make it difficult to scrape...
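
One way to confirm a redirection like this from Python is to compare the requested URL with the URL the response finally came from. The sketch below uses the standard library's `urllib.request` (the Python 3 successor of `urllib2`); the helper names `fetch_with_final_url` and `was_redirected` are made up for illustration. It only detects HTTP-level redirects that `urlopen` follows; a redirect done in JavaScript or with a meta refresh would not show up this way.

```python
import urllib.request


def fetch_with_final_url(url, timeout=30):
    """Fetch url and return (final_url, body_bytes).

    urlopen follows HTTP 3xx redirects, and geturl() reports the URL
    of the resource that was actually retrieved.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl(), response.read()


def was_redirected(requested_url, final_url):
    """Return True if the final URL differs from the requested one.

    A trailing slash is ignored so that /nuevo and /nuevo/ compare equal.
    """
    return requested_url.rstrip("/") != final_url.rstrip("/")
```

Usage would look like `final_url, body = fetch_with_final_url(page_url)` followed by `was_redirected(page_url, final_url)`, where `page_url` is the address you are trying to scrape.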

+17

itay
Supporter
1441 replies
8 years ago
January 27, 2017

Have a look at the HTMLExtractor transformer in the FME 2017 RC.

+45

danilo_fme
Author
Evangelist
2054 replies
8 years ago
January 27, 2017

jeroenstiers wrote:

Hi @danilo_inovacao

I've taken a look at the URL you provide. Once you can extract the paths to the .xls file, the issue is solved. But since the page uses an iframe, the HTTPCaller doesn't seem to find those paths.

I can however visualise the page, with the xls-paths in the following way:

Hi @jeroenstiers, thanks for your help.

The link you sent me is what I need to pass on to my customer for now, but I will try using Python. Thanks :)

+45

danilo_fme
Author
Evangelist
2054 replies
8 years ago
January 27, 2017

david_r wrote:

For your specific web site you may have to resort to Python and some fancy parsing to get to your links.

Hi @david_r, thanks for your help too. You are right: the site in question uses iframes, and it was very difficult to extract any information. I will look into web scraping in Python, as you suggested.

Thanks :)

+45

danilo_fme
Author
Evangelist
2054 replies
8 years ago
January 27, 2017

itay wrote:

Have a look at the HTMLExtractor transformer in the FME 2017 RC.

Hi @itay, thanks for your help. I will look at this new transformer in FME 2017. Thank you :)
