Solved

Extract information from Web Page

7 years ago
July 7, 2017
8 replies
266 views

+45

danilo_fme
Evangelist
2056 replies

Hello all,

I'm trying to extract information from a web page using the HTMLExtractor transformer, but I didn't get a good result.

The website is http://www.brasil.gov.br/ and the information I need to extract is the href from this part of the page:

Fale com o Governo

My template file is attached.

Thanks in advance,

Danilo de Lima

+18

erik_jan
Contributor
2181 replies
7 years ago
July 7, 2017

Hi @danilo_inovacao

Have you tried the HTMLTable reader?

+45

danilo_fme
Author
Evangelist
2056 replies
7 years ago
July 7, 2017

Hi @erik_jan, thanks for your help.

My customer will not use the Reader. He'll start the workspace with the Creator transformer.

Thanks in advance,

+18

erik_jan
Contributor
2181 replies
7 years ago
July 7, 2017

Can you use a FeatureReader with this format?

+45

danilo_fme
Author
Evangelist
2056 replies
7 years ago
July 7, 2017

I tried to use this transformer, but an error occurred:

mapping file Keyword: `R_4_+FME_DEBUG' occurs 5 time(s)

UniversalReader -- readSchema resulted in 0 schema features being returned

Failed to obtain any schemas from reader 'HTMLTABLE' from 1 datasets. This may be due to invalid datasets or format accessibility issues due to licensing, dependencies, or module loading. See logfile for more information

Failed to read schema features from dataset 'http://www.brasil.gov.br/ ' using the 'HTMLTABLE' reader

The dataset 'http://www.brasil.gov.br/ ' was closed successfully

The 'HTMLTABLE' reader was destroyed successfully

Thanks in advance,

takashi
7665 replies
Best Answer
7 years ago
July 7, 2017

I took a peek at the HTML source of the web page. There was no table tag, which I think is why the error occurred.

You can get the HTML doc with the HTTPCaller and then parse the doc to extract the desired element. The HTMLExtractor might help you.
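
As a rough illustration of that fetch-then-parse idea outside FME, here is a minimal Python sketch using only the standard library. The names `LinkFinder` and `find_href` and the sample HTML (including its href values) are hypothetical and not taken from the actual page; in a workspace, the HTML string would come from the HTTPCaller's response attribute instead.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect (href, text) pairs for every <a> element in a document."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._href = None      # href of the <a> currently open, if any
        self._inside = False   # True while we are between <a> and </a>
        self._text = []        # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            # Collapse runs of whitespace so the text can be compared reliably.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._inside = False


def find_href(html, link_text):
    """Return the href of the first link whose visible text equals link_text."""
    parser = LinkFinder()
    parser.feed(html)
    parser.close()
    for href, text in parser.links:
        if text == link_text:
            return href
    return None


# Hypothetical sample document; the real page's markup will differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/noticias">Notícias</a></li>
  <li><a href="/fale-com-o-governo"><span>Fale com o</span> Governo</a></li>
</ul>
"""

print(find_href(SAMPLE_HTML, "Fale com o Governo"))  # /fale-com-o-governo
```

Matching on the visible link text rather than on a position in the document keeps the lookup working if the surrounding layout changes, though it only works when the link is present in the HTML the server actually returns.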

takashi
7665 replies
7 years ago
July 7, 2017

The web page (HTML document) seems to be generated dynamically by JavaScript. It could be hard to extract the element with the HTTPCaller + HTMLExtractor.

+21

davideagle
Contributor
578 replies
7 years ago
July 7, 2017

I think that's the case. If you try to read the URL with the Text reader you get barely any content at all, something like 32 lines of HTML. So it appears to be a sub page that you might need to try to programmatically get to...
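
One quick way to check this suspicion is to look for the target text in the raw HTML, before any JavaScript has run. The Python sketch below is only a heuristic; the function name `contains_text` and both sample pages are hypothetical. A negative result is a hint rather than proof, since the text could also be split across nested tags.

```python
def contains_text(html, needle):
    """Return True if needle occurs in the raw HTML, ignoring case and
    differences in whitespace (line breaks, indentation)."""
    normalized_html = " ".join(html.split()).lower()
    normalized_needle = " ".join(needle.split()).lower()
    return normalized_needle in normalized_html


# Hypothetical "shell" page whose content is filled in later by a script.
SHELL_PAGE = (
    "<html><body><div id='app'></div>"
    "<script src='app.js'></script></body></html>"
)

# Hypothetical static page where the link is already in the markup.
STATIC_PAGE = "<html><body><a href='/contato'>Fale com o\n Governo</a></body></html>"

print(contains_text(SHELL_PAGE, "Fale com o Governo"))   # False
print(contains_text(STATIC_PAGE, "Fale com o Governo"))  # True
```

If the text is missing from the raw response, no HTML parser will find it there, and the content has to be obtained from whatever sub page or request the script loads it from.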

+44

mark2atsafe
Safer
2520 replies
7 years ago
July 10, 2017

As far as I know, the HTMLExtractor won't read directly from a web site. The HTML has to come from a file, an attribute, or manually entered content. The Help doc suggests that and the error I get seems to confirm it:

'"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)

So your template will definitely not work as it is. Instead, add an HTTPCaller first and save the response to an attribute. But as @takashi says, I don't know if you'll get anything, because the web page doesn't seem to return the content. Maybe try a scraping tool to see if it's actually possible to get some data? Otherwise it seems unlikely FME will be able to do anything.
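
The same "fetch first, then parse" rule can be sketched in plain Python. This is an illustration only, using the standard library rather than FME or Beautiful Soup; the helper names `looks_like_url`, `download`, and `resolve_markup` are hypothetical.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def looks_like_url(text):
    """Return True if text is a single http(s) URL rather than HTML markup."""
    candidate = text.strip()
    if any(ch.isspace() for ch in candidate):
        return False  # markup normally contains whitespace; a URL does not
    parts = urlparse(candidate)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def download(url):
    """Fetch the document behind a URL (no JavaScript is executed)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def resolve_markup(source, fetch=download):
    """Return HTML markup: fetch it when source is a URL, else pass it through."""
    if looks_like_url(source):
        return fetch(source.strip())
    return source


# A stand-in fetcher so the example runs without network access.
fake_fetch = lambda url: "<html><body>fetched from %s</body></html>" % url

print(resolve_markup("<p>already markup</p>", fetch=fake_fetch))
print(resolve_markup("http://www.brasil.gov.br/ ", fetch=fake_fetch))
```

The parser only ever receives a document, never an address, which mirrors placing the HTTPCaller ahead of the HTMLExtractor in the workspace.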
