Question

How to parse HTML file

  • 15 September 2016
  • 16 replies
  • 71 views

Badge

Hi,

I would like to use FME transformers to parse an HTML file.

I have the following scenario: navigate through various HTML pages, parse the HTML content of these pages, and then store the returned data in a specific format for further processing.

I have used the HTTPCaller to call the web page and retrieved the HTML content in the 'response_body' field. I then tried to connect 'response_body' to the HTMLToXHTMLConverter, but unfortunately it didn't work.

So I wonder: is there a way to parse the HTML content of web pages?

Thanks for your help!


16 replies

Userlevel 5
Badge +25

It kinda depends on how the HTML is set up, really. Can you tell us a bit more or show a sample?

The problem with HTML is that a lot of markup that violates the rules still gets rendered just fine by a browser. XML is much stricter.
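
To illustrate the point outside FME, here is a minimal Python sketch using only the standard library. The sample markup is a made-up snippet, not the poster's page: it contains an unclosed `<br>` and `<p>`, which a strict XML parser rejects while a lenient HTML parser still reads.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippet: unclosed <br> and <p> tags are common in real-world HTML.
SLOPPY_HTML = "<html><body><p>First line<br>Second line</body></html>"


class TextCollector(HTMLParser):
    """Collects text nodes; html.parser does not require balanced tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def collect_text(markup):
    """Return the non-empty text chunks found by the lenient HTML parser."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks
```

Calling `parses_as_xml(SLOPPY_HTML)` fails on the mismatched tags, whereas `collect_text(SLOPPY_HTML)` still recovers both lines of text.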

Badge +16

Hi @fouly, did you change the response body encoding to system default? That usually does the trick.

Userlevel 2
Badge +17

Hi @fouly, the current HTMLToXHTMLConverter doesn't support HTML5 tags. If the HTML document was created according to the HTML5 specification, the transformer unfortunately cannot be used, and you will have to parse the document with string-operation transformers instead. The approach depends on the structure of the document and your requirements.

Badge

Good! Indeed, changing the response body encoding to system default did the trick!

Thanks:)

Userlevel 4
Badge +25

This question might also be of interest (for parsing HTML in general), and in the FME 2017 beta there is already an HTMLExtractor transformer to parse HTML content.

Badge

Hi Mark, I have downloaded FME 2017 to try out the HTMLExtractor transformer but couldn't find documentation on how to use it.

For example, I have an HTML file like the attached one, and I would like to extract these vars:

dataEnergyOut, dataPowerOut, dataPowerAvg, etc...

Do you know how that might be possible?

view-source-pvoutputorg-intradayjsp-id38342sid3508.txt

Userlevel 4
Badge +25
OK. So I checked and found some preliminary documentation. If I have content like <mytag details="abcd">.......</mytag> then I can fetch that tag by entering mytag into the CSS Selector column.

 

If I enter Whole in the Tag Part column I get all of the tag. If I enter Value I get what is between the tags. If I enter "details" then I get "abcd" back.
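
As a rough illustration of those three tag parts outside FME, here is a minimal Python sketch. It uses the standard-library `html.parser` rather than the transformer itself or BeautifulSoup; the `tag_part` helper and its `part` argument are hypothetical names that merely mirror the description above, and the sketch handles only a bare tag name, not full CSS selectors.

```python
from html.parser import HTMLParser

# Sample markup modelled on the example in the post above.
SAMPLE = '<html><body><mytag details="abcd">some text</mytag></body></html>'


class TagPartExtractor(HTMLParser):
    """Captures the first occurrence of one tag name (nested same-name tags are not handled)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.attrs = None      # attribute dict of the matched tag
        self.start_text = ""   # raw start tag as written in the source
        self.value = ""        # text between the start and end tag
        self._inside = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self._done and not self._inside:
            self._inside = True
            self.attrs = dict(attrs)
            self.start_text = self.get_starttag_text()

    def handle_data(self, data):
        if self._inside:
            self.value += data

    def handle_endtag(self, tag):
        if tag == self.tag and self._inside:
            self._inside = False
            self._done = True


def tag_part(markup, tag, part):
    """part is 'Whole', 'Value', or an attribute name (hypothetical helper)."""
    parser = TagPartExtractor(tag)
    parser.feed(markup)
    parser.close()
    if parser.attrs is None:
        return None  # tag not found
    if part == "Whole":
        # Rebuilt from start tag + text + end tag; nested markup is dropped.
        return f"{parser.start_text}{parser.value}</{tag}>"
    if part == "Value":
        return parser.value
    return parser.attrs.get(part)
```

With the sample above, `tag_part(SAMPLE, "mytag", "details")` looks up the `details` attribute and `tag_part(SAMPLE, "mytag", "Value")` returns the text between the tags.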

 

 

I think this is going to be harder for you because the content of your HTML is a dynamic script, not plain tags. For example, that display of information is scripted, not a simple 'table' tag. So the best you are going to get from here is probably to enter script as the tag. To get the actual data values you'll need to parse it some more. StringSearcher transformers would probably help.

Alternatively, since it is JavaScript, the data is probably in a JSON-like form, so you could try a JSON transformer on it.
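
To make the "search the script, then treat the value as JSON" idea concrete, here is a minimal Python sketch. The page content below is invented for illustration (the real attachment's layout is not shown in this thread), and it assumes each variable is declared on one line as `var name = <literal>;`. Values that are not valid JSON are returned as raw text.

```python
import json
import re

# Hypothetical page: the variable names come from the question,
# but the values and layout are assumptions.
SAMPLE_PAGE = """
<html><body>
<script type="text/javascript">
var dataEnergyOut = [0.5, 1.25, 2.0];
var dataPowerOut = [120, 340, 410];
var chartTitle = "Intraday";
</script>
</body></html>
"""

# Matches single-line declarations of the form: var name = literal;
VAR_PATTERN = re.compile(r"var\s+(\w+)\s*=\s*(.+?);\s*$", re.MULTILINE)


def extract_js_vars(page, wanted):
    """Return {name: value} for the wanted variable names found in the page."""
    found = {}
    for name, literal in VAR_PATTERN.findall(page):
        if name not in wanted:
            continue
        try:
            found[name] = json.loads(literal)
        except json.JSONDecodeError:
            # Not valid JSON (e.g. unquoted keys, single quotes): keep raw text.
            found[name] = literal
    return found
```

The fallback branch matters because a JavaScript literal is not always valid JSON, so a JSON parser may reject it even when a browser accepts it.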

 

 

Hope this helps.

 

Userlevel 4
Badge +25
Or... the transformer is based around BeautifulSoup - so if you are familiar with that you might be able to use a specialist syntax inside those parameters that I'm not aware of.

 

 

Badge

I tried the HTMLExtractor, but it didn't seem to work: the response_body contained exactly the same content as the response_body from the previous transformer. So I think I will give a JSON transformer a try.

 

 

 

Userlevel 2
Badge +17
Hi @fouly, in my quick test using FME 2017.0 Beta build 17156, I was able to extract the first <script> element from the HTML document you posted, with this setting.

@Mark2AtSafe, since the help doc on this transformer has not been bundled yet, I don't know whether this setting is correct. The column name "CSS Selector" seems a little odd to me. Is it correct that the user specifies, in that column, the HTML tag to be extracted?

Also, the transformer seems to treat tag names as case-sensitive. It might be better if tag names were treated as case-insensitive.

One more thing: it would be nice if the transformer could access a web page directly when the user sets a URL in the HTML File parameter.

Userlevel 4
Badge +25
Hi @takashi

 

Yes, I think that is the correct way of using this transformer. The problem is that there are several scripts in here, so the return format will need to be a list - and then finding the data will be awkward. It's not going to be simple in this particular example.

 

Those are some good ideas so I will pass them on to the developers.

 

Userlevel 1
Badge +22

Hi,

I've tried to use the HTMLExtractor transformer, but it seems unable to do what I deem to be a very common need.

I want to extract a list of elements (= city names) which is encased in a bullet list in HTML. This is the relevant part of the HTML:

<ul>
<li class="subfirst"><a href="/side.asp?Id=217481" style="height:14px;line-height:14px;"><span style="font-family: Arial; font-size: 14px;">Auning (54)</span></a>
</li>
<li class="sublast"><a href="/side.asp?Id=216102" style="height:14px;line-height:14px;"><span style="font-family: Arial; font-size: 14px;">Ebeltoft (2)</span></a>
</li> (...) </ul>

I just want the city names, i.e. "Auning (54)", "Ebeltoft (2)" etc.

How do I set up the transformer for this?

Preferably as multiple features, but a list will also work.

Cheers

Lars I.

Userlevel 1
Badge +22

Ah, found a way.

1st: HTMLExtractor to extract the "li" elements (using class="subfirst")
2nd: HTMLToXHTMLConverter to convert the result to XML
3rd: XMLFragmenter to extract the "li" elements (splitting the list)
4th: XMLFragmenter to extract the "span" element with parent and grandparent attributes (in the Flattening options)

Result: elements with attributes "span" (the text) and "a.href" (the URL link) :-)
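
For comparison, the same end result (one record per list item, holding the span text and the link) can be sketched in plain Python with the standard-library `html.parser`. This is not how the FME transformers work internally; the class and function names are made up for illustration, and the sample reuses the two list items quoted earlier in the thread.

```python
from html.parser import HTMLParser

# The two list items quoted in the question, without the elided "(...)" part.
SAMPLE_LIST = """<ul>
<li class="subfirst"><a href="/side.asp?Id=217481" style="height:14px;line-height:14px;"><span style="font-family: Arial; font-size: 14px;">Auning (54)</span></a>
</li>
<li class="sublast"><a href="/side.asp?Id=216102" style="height:14px;line-height:14px;"><span style="font-family: Arial; font-size: 14px;">Ebeltoft (2)</span></a>
</li></ul>"""


class ListItemExtractor(HTMLParser):
    """Collects the span text and the enclosing link's href for each <li>."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None
        self._in_li = False
        self._in_span = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self._href = None
        elif tag == "a" and self._in_li:
            self._href = dict(attrs).get("href")
        elif tag == "span" and self._in_li:
            self._in_span = True
            self._text = []

    def handle_data(self, data):
        if self._in_span:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_span:
            self._in_span = False
            self.items.append(
                {"span": "".join(self._text).strip(), "a.href": self._href}
            )
        elif tag == "li":
            self._in_li = False


def extract_items(markup):
    """Return a list of {'span': text, 'a.href': url} dicts, one per list item."""
    parser = ListItemExtractor()
    parser.feed(markup)
    parser.close()
    return parser.items
```

The dictionary keys deliberately reuse the attribute names from the result above ("span" and "a.href") so the two approaches are easy to compare.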

 

 

Just thought I'd share this with y'all :-)

 

 

Cheers

 

Lars I.

 

 

Userlevel 2
Badge +17

Hi @lifalin2016, if the city names are always enclosed in a span tag within an li tag, the HTMLExtractor with this setting creates a list attribute called "_item{}" that stores all the city names.

 

FME 2017.0 Beta build 17190

 

0684Q00000ArMMGQA3.png

Userlevel 4
Badge +25
Hi @takashi

 

As a quick follow-up, we've updated the HTMLExtractor to not be case sensitive. So I think it is good now (build 17196). We checked and CSS Selector is the best name for that column. The updated documentation helps to explain why. I also asked if we can access a web page directly. That's a bit more effort because we then need to incorporate all the other web fields (like proxies) in there - but it is under consideration.

 

Userlevel 2
Badge +17

Hi @Mark2AtSafe, thanks for the update. I understood why the column is named "CSS Selector" once I read the updated help doc in the latest beta build. I agree that this is the best name.

 
