Solved

Extract information from Web Page


danilo_fme
Evangelist

Hello all,

I'm trying to extract information from a web page using the HTMLExtractor transformer, but I didn't get a good result.

The website is http://www.brasil.gov.br/ and the information I need to extract is the href from this part of the page:

Fale com o Governo

My template file is attached.

Thanks in advance,

Danilo de Lima


8 replies

erik_jan
Contributor
  • Contributor
  • July 7, 2017

Hi @danilo_inovacao

Have you tried the HTMLTable reader?


danilo_fme
Evangelist
  • Author
  • Evangelist
  • July 7, 2017

Hi @erik_jan, thanks for your help.

My customer will not use a reader. He will start the workspace with the Creator transformer.

Thanks in advance,

erik_jan
Contributor
  • Contributor
  • July 7, 2017

Can you use a FeatureReader with this format?

danilo_fme
Evangelist
  • Author
  • Evangelist
  • July 7, 2017

I tried this transformer, but an error occurred:

mapping file Keyword: `R_4_+FME_DEBUG' occurs 5 time(s)
UniversalReader -- readSchema resulted in 0 schema features being returned
Failed to obtain any schemas from reader 'HTMLTABLE' from 1 datasets. This may be due to invalid datasets or format accessibility issues due to licensing, dependencies, or module loading. See logfile for more information
Failed to read schema features from dataset 'http://www.brasil.gov.br/ ' using the 'HTMLTABLE' reader
The dataset 'http://www.brasil.gov.br/ ' was closed successfully
The 'HTMLTABLE' reader was destroyed successfully

Thanks in advance,

takashi
Contributor
  • Contributor
  • Best Answer
  • July 7, 2017

I sneaked a peek at the HTML source of the web page. There was no table tag, which I think is why the HTMLTable reader raised the error.

You can get the HTML document with the HTTPCaller and then parse it to extract the desired element. The HTMLExtractor might help with that.
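
The "fetch, then parse" idea above can be sketched outside FME in plain Python. This is only an illustration under stated assumptions: the markup in `SAMPLE_HTML` is invented (the real page structure is not shown in this thread), and `LinkTextFinder` and `find_hrefs_by_link_text` are hypothetical names. It looks for `<a>` elements whose visible text contains "Fale com o Governo" and returns their href values.

```python
from html.parser import HTMLParser


class LinkTextFinder(HTMLParser):
    """Collect the href of every <a> element whose visible text contains a target string."""

    def __init__(self, target_text):
        super().__init__()
        self.target_text = target_text.lower()
        self.matches = []          # hrefs of matching links, in document order
        self._in_anchor = False    # True while an <a> element is open
        self._current_href = None  # href of the <a> currently open, if any
        self._text_parts = []      # text fragments seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            # Normalize whitespace so text split across child tags still matches.
            text = " ".join("".join(self._text_parts).split()).lower()
            if self.target_text in text and self._current_href:
                self.matches.append(self._current_href)
            self._in_anchor = False
            self._current_href = None


def find_hrefs_by_link_text(html, target_text):
    """Return the hrefs of all links whose text contains target_text (case-insensitive)."""
    parser = LinkTextFinder(target_text)
    parser.feed(html)
    parser.close()
    return parser.matches


# Invented markup for illustration only; not the real page source.
SAMPLE_HTML = """
<ul>
  <li><a href="/servicos">Serviços</a></li>
  <li><a href="/fale-com-o-governo"><span>Fale com o</span> Governo</a></li>
</ul>
"""

print(find_hrefs_by_link_text(SAMPLE_HTML, "Fale com o Governo"))
```

In a workspace, the HTML string would be the response body that the HTTPCaller stores in an attribute. Note that this only works if the link is present in the HTML the server returns.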


takashi
Contributor
  • Contributor
  • July 7, 2017

The web page (HTML document) seems to be generated dynamically by JavaScript, so it could be hard to extract the element with the HTTPCaller + HTMLExtractor.

davideagle
Contributor
  • Contributor
  • July 7, 2017

I think that's the case. If you read the URL with the Text reader, you get barely any content at all, something like 32 lines of HTML. So the content appears to live on a sub page that you might need to reach programmatically...
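
One rough way to confirm this kind of observation is to probe the raw HTML: count its lines, count its script tags, and check whether the text you are after appears at all. The sketch below is a heuristic, not a definitive test; `StaticContentProbe`, `probe_static_html`, and the `SHELL_HTML` sample are hypothetical and only mimic a page that is filled in by JavaScript.

```python
from html.parser import HTMLParser


class StaticContentProbe(HTMLParser):
    """Count <script> tags and collect text that is visible without running JavaScript."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.visible_text = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.visible_text.append(data.strip())


def probe_static_html(html, target_text):
    """Summarize how much usable content the raw HTML carries."""
    probe = StaticContentProbe()
    probe.feed(html)
    probe.close()
    # Normalize whitespace before searching for the target text.
    text = " ".join(" ".join(probe.visible_text).split())
    return {
        "line_count": len(html.splitlines()),
        "script_count": probe.script_count,
        "target_found": target_text.lower() in text.lower(),
    }


# Invented example of a nearly empty page shell; not the real page source.
SHELL_HTML = """<html>
<head><script src="app.js"></script></head>
<body><div id="root"></div></body>
</html>"""

print(probe_static_html(SHELL_HTML, "Fale com o Governo"))
```

A short document with script tags and no trace of the target text suggests the link is added in the browser, in which case parsing the HTTP response alone will not find it.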

 

 


mark2atsafe
Safer

As far as I know, the HTMLExtractor won't read directly from a website. The HTML has to come from a file, an attribute, or manually entered content. The Help doc suggests that, and the error I get seems to confirm it:

'"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)

So your template will definitely not work as it is. Instead, add an HTTPCaller first and save the response to an attribute. But as @takashi says, I don't know if you'll get anything, because the web page doesn't seem to return the content. Maybe try a scraping tool to see whether it's actually possible to get the data? Otherwise it seems unlikely FME will be able to do anything.
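
The same "fetch first, then hand the document to the parser" order can be sketched in plain Python. This is an illustration only: `looks_like_url`, `fetch_with_urllib`, and `resolve_markup` are hypothetical helper names, and the standard-library `urllib` client stands in for the HTTPCaller.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def looks_like_url(value):
    """Return True if the string is an http(s) URL rather than HTML markup."""
    parts = urlparse(value.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def fetch_with_urllib(url, timeout=30):
    """Download the document behind a URL and decode it to text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def resolve_markup(source, fetch=fetch_with_urllib):
    """Return HTML text: fetch it if source is a URL, otherwise pass it through."""
    if looks_like_url(source):
        return fetch(source.strip())
    return source
```

The `fetch` parameter is injectable so the logic can be tried without network access, for example `resolve_markup("http://www.brasil.gov.br/", fetch=lambda url: "<html></html>")`. Whatever `resolve_markup` returns is what a parser should receive, never the URL itself.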

