Hello all,

I'm trying to extract information from a web page using the HTMLExtractor transformer, but I haven't gotten a good result.

The website is http://www.brasil.gov.br/ and the information I need to extract is the href from this part of the page:

Fale com o Governo

My template file is attached.

Thanks in Advance,

Danilo de Lima

Hi @danilo_inovacao

Have you tried the HTMLTable reader?


Hi @erik_jan, thanks for your help.

My customer will not use the Reader. He'll start the workspace with the Creator transformer.

Thanks in advance,

Can you use a FeatureReader with this format?

I tried to use this transformer, but an error occurred:

mapping file Keyword: `R_4_+FME_DEBUG' occurs 5 time(s)
UniversalReader -- readSchema resulted in 0 schema features being returned
Failed to obtain any schemas from reader 'HTMLTABLE' from 1 datasets. This may be due to invalid datasets or format accessibility issues due to licensing, dependencies, or module loading. See logfile for more information
Failed to read schema features from dataset 'http://www.brasil.gov.br/ ' using the 'HTMLTABLE' reader
The dataset 'http://www.brasil.gov.br/ ' was closed successfully
The 'HTMLTABLE' reader was destroyed successfully

Thanks in advance,

I took a peek at the HTML source of the web page. There is no table tag, which I think is why the error occurred.

You can get the HTML doc with the HTTPCaller and then parse the doc to extract the desired element. The HTMLExtractor might help you.
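
To make the "parse the doc" step concrete, here is a minimal sketch in plain Python of pulling the href for a link by its visible text. It uses only the standard library rather than FME transformers, and it assumes the HTML is already in a string (for example, the response body from the HTTPCaller). The sample markup, the example.org URL, and the names `LinkFinder` and `find_href` are made up for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser


class LinkFinder(HTMLParser):
    """Collect (text, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse line breaks and indentation inside the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def find_href(html, link_text):
    """Return the href of the first link whose text equals link_text, else None."""
    finder = LinkFinder()
    finder.feed(html)
    for text, href in finder.links:
        if text == link_text:
            return href
    return None


# Hypothetical markup; the real page's structure and URL may differ.
SAMPLE_HTML = """
<ul>
  <li><a href="/servicos">Serviços</a></li>
  <li><a href="https://example.org/fale-com-o-governo">
        Fale com o Governo
      </a></li>
</ul>
"""

print(find_href(SAMPLE_HTML, "Fale com o Governo"))
```

If `find_href` returns None for the real response body, the link is probably not present in the HTML that the server sends back.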


The web page (HTML document) seems to be generated dynamically by JavaScript. It could be hard to extract the element with the HTTPCaller + HTMLExtractor.

I think that's the case. If you read the URL with the Text reader, you get barely any content at all, something like 32 lines of HTML. So the content appears to be on a sub page that you might need to reach programmatically...

As far as I know, the HTMLExtractor won't read directly from a web site. The HTML has to come from a file, an attribute, or manually entered content. The Help doc suggests that and the error I get seems to confirm it:

'"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)

So your template will not work for sure. Instead, add an HTTPCaller first and save the content to an attribute. But like @takashi says, I don't know if you'll get anything, because the web page doesn't seem to want to return it. Maybe try a scraping tool to see if it's actually possible to get some data? Otherwise it seems unlikely FME will be able to do anything.
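
As a quick way to test that outside FME, the sketch below fetches the raw HTML without running any JavaScript and reports how much markup came back and whether the link text is in it. This is only a rough diagnostic under my own assumptions: the function names `fetch_html` and `summarize` are made up, the User-Agent header is a guess at what the server might want, and the stub document stands in for a JavaScript-only page so the example runs without network access.

```python
import urllib.request


def fetch_html(url, timeout=30):
    """Download a page and return its body as text (no JavaScript is run)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def summarize(html, needle):
    """Report how much markup came back and whether the target text is in it."""
    lowered = html.lower()
    return {
        "lines": len(html.splitlines()),
        "anchors": lowered.count("<a "),
        "has_needle": needle.lower() in lowered,
    }


# Stand-in for a page whose content is filled in later by a script.
STUB = (
    "<html>\n"
    '<head><script src="app.js"></script></head>\n'
    '<body><div id="root"></div></body>\n'
    "</html>"
)

print(summarize(STUB, "Fale com o Governo"))

# Against the live site (requires network access):
# print(summarize(fetch_html("http://www.brasil.gov.br/"), "Fale com o Governo"))
```

If the live page comes back with only a few lines, no anchors, and no match for the link text, the HTTPCaller will most likely see the same thing, and the link would have to be read from whatever URL the script loads or from a tool that renders JavaScript.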

