It really depends on how the HTML is structured. Can you tell us a bit more or post a sample?
The problem with HTML is that a lot of markup that violates the rules still gets rendered just fine by a browser. XML is much stricter.
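To illustrate the point about HTML being more forgiving than XML, here is a minimal Python sketch using only the standard library. The sample markup and the helper names (`SLOPPY_HTML`, `lenient_tags`, `is_well_formed_xml`) are made up for illustration and have nothing to do with FME itself.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample: the <li> and <p> elements are never closed.
SLOPPY_HTML = "<ul><li>one<li>two</ul><p>unclosed paragraph"


class TagCollector(HTMLParser):
    """Record every start tag the lenient HTML parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def lenient_tags(markup):
    """Parse markup as HTML and return the start tags that were seen."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def is_well_formed_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(lenient_tags(SLOPPY_HTML))        # the HTML parser does not complain
print(is_well_formed_xml(SLOPPY_HTML))  # the XML parser rejects the same text
```

The HTML parser walks through the unclosed tags without raising an error, while the XML parser refuses the same input. That is roughly why an HTML-to-XHTML conversion step can fail on pages a browser displays without trouble.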
Hi @fouly, did you change the response body encoding to system default? That usually does the trick.
Hi @fouly, the current HTMLToXHTMLConverter doesn't support HTML5 tags. If the HTML document was created to the HTML5 specification, the transformer unfortunately cannot be used, and you will have to parse the document with string-handling transformers instead. The approach depends on the structure of the document and your requirements.
Good! Indeed, this did the trick!
Thanks:)
This question might also be of interest (for parsing HTML in general), and the FME 2017 beta already includes an HTMLExtractor transformer to parse HTML content.
Hi Mark, I have downloaded FME 2017 to try out the HTMLExtractor transformer but couldn't find documentation on how to use it.
For example, I have an HTML file like the attached one, and I would like to extract the vars:
dataEnergyOut, dataPowerOut, dataPowerAvg, etc...
Do you know how that could be done?
view-source-pvoutputorg-intradayjsp-id38342sid3508.txt
OK. So I checked and found some preliminary documentation. If I have content like <mytag details="abcd">.......</mytag> then I can fetch that tag by entering mytag into the CSS Selector column.
If I enter Whole in the Tag Part column I get all of the tag. If I enter Value I get what is between the tags. If I enter "details" then I get "abcd" back.
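As a rough analogy for those three Tag Part choices (and not the transformer's actual implementation, which I haven't seen), here is a small Python sketch. It assumes the sample is well-formed enough for an XML parser, and `tag_part` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The sample element from the post above.
SAMPLE = '<mytag details="abcd">.......</mytag>'


def tag_part(markup, tag, part):
    """Return one part of the first element named `tag`.

    part is 'Whole' (the serialized element), 'Value' (the text between
    the tags), or otherwise the name of an attribute to look up.
    """
    root = ET.fromstring(markup)
    elem = root if root.tag == tag else root.find(f".//{tag}")
    if elem is None:
        return None
    if part == "Whole":
        return ET.tostring(elem, encoding="unicode")
    if part == "Value":
        return elem.text
    return elem.get(part)


print(tag_part(SAMPLE, "mytag", "Whole"))
print(tag_part(SAMPLE, "mytag", "Value"))
print(tag_part(SAMPLE, "mytag", "details"))
```

Note that for an element nested inside other content, `ET.tostring` also includes any text that trails the closing tag, so "Whole" here is only an approximation.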
I think this is going to be harder for you because the content of your HTML is a dynamic script. It's not content like plain tags. For example, that display of information is scripted, not a simple 'table' tag. So the best you are going to get from here is probably to enter script as the tag. To get the actual data values you'll need to parse it some more. Probably StringSearcher transformers would help.
Alternatively, since it is JavaScript, the data may well be in a form close to JSON. So you could try a JSON transformer on it.
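A minimal Python sketch of that idea, for example inside a scripted step. It rests on a big assumption: that the script declares each variable as `var name = [ ... ];` with an array literal that happens to be valid JSON. The sample text and values below are invented, since the real attachment's contents are not shown here, and `extract_array_vars` is a hypothetical helper.

```python
import json
import re

# Hypothetical script text; the layout and values are made up.
SCRIPT_TEXT = """
var dataEnergyOut = [0, 0.5, 1.25];
var dataPowerOut = [0, 120, 340];
"""

# Matches "var <name> = [ ... ];" and captures the name and the array literal.
VAR_PATTERN = re.compile(r"var\s+(\w+)\s*=\s*(\[.*?\])\s*;", re.DOTALL)


def extract_array_vars(script_text):
    """Return {variable name: parsed list} for JSON-compatible array literals."""
    result = {}
    for name, literal in VAR_PATTERN.findall(script_text):
        try:
            result[name] = json.loads(literal)
        except json.JSONDecodeError:
            # Not valid JSON (e.g. single quotes or trailing commas); skip it.
            continue
    return result


print(extract_array_vars(SCRIPT_TEXT))
```

JavaScript literals are not always valid JSON, so anything with single-quoted strings, trailing commas, or expressions would need extra string handling first.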
Hope this helps.
Or... the transformer is based around BeautifulSoup, so if you are familiar with that you might be able to use a specialist syntax inside those parameters that I'm not aware of.
I tried the HTMLExtractor but it didn't seem to work. The response_body contained exactly the same content as the response_body from the previous transformer. So I think I will give a JSON transformer a try.
Hi @fouly, in my quick test using FME 2017.0 Beta build 17156, I was able to extract the first <script> element from the HTML document you posted, with this setting.
@Mark2AtSafe, since the help doc for this transformer has not been bundled yet, I don't know whether this setting is correct. The column name "CSS Selector" seems a little strange to me. Is it correct that the user specifies the HTML tag to be extracted in that column?
Also, the transformer seems to treat tag names as case-sensitive. It might be better if tag names were treated case-insensitively.
One more thing: it would be nice if the transformer could access a web page directly when the user sets a URL in the HTML File parameter.
Hi @takashi
Yes, I think that is the correct way of using this transformer. The problem is that there are several scripts in here, so the return format will need to be a list - and then finding the data will be awkward. It's not going to be simple in this particular example.
Those are some good ideas so I will pass them on to the developers.
Hi,
I've tried to use the HTMLExtractor transformer, but it seems unable to do what I deem to be a very common need.
I want to extract a list of elements (= city names) which is encased in a bullet list in HTML. This is the relevant part of the HTML:
<ul>
<li class="subfirst"><a href="/side.asp?Id=217481" style="height:14px;line-height:14px;"><span style="font-family: Arial; font-size: 14px;">Auning (54)</span></a>
</li>
<li class="sublast"><a href="/side.asp?Id=216102" style="height:14px;line-height:14px;"><span style="font-family: Arial; font-size: 14px;">Ebeltoft (2)</span></a>
</li> (...) </ul>
I just want the city names, i.e. "Auning (54)", "Ebeltoft (2)" etc.
How do I set up the transformer for this?
Preferably as multiple features, but a list will also work.
Cheers
Lars I.
Ah, found a way.
1st: HTMLExtractor to extract the "li" elements (using class="subfirst")
2nd: HTMLToXHTMLConverter to convert it to XML
3rd: XMLFragmenter to extract the "li" elements (splitting the list)
4th: XMLFragmenter to extract the "span" element with parent and grandparent attributes (using the Flattening option).
Result: elements with attributes "span" (the text) and "a.href" (the url link) :-)
Just thought I'd share this with y'all :-)
Cheers
Lars I.
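For anyone who wants to do the same thing in a script, here is a minimal Python sketch of the equivalent logic outside FME. It assumes the fragment is well-formed XHTML (which the snippet from the question is, once the elided "(...)" part is left out) and uses only the standard library. The function name `extract_cities` is hypothetical, and the output keys simply mirror the "span" and "a.href" attribute names mentioned above.

```python
import xml.etree.ElementTree as ET

# The fragment from the question, with the elided "(...)" items left out.
LIST_XHTML = """<ul>
<li class="subfirst"><a href="/side.asp?Id=217481" style="height:14px;line-height:14px;"><span style="font-family: Arial; font-size: 14px;">Auning (54)</span></a>
</li>
<li class="sublast"><a href="/side.asp?Id=216102" style="height:14px;line-height:14px;"><span style="font-family: Arial; font-size: 14px;">Ebeltoft (2)</span></a>
</li>
</ul>"""


def extract_cities(xhtml):
    """Return one record per <li>, holding the span text and the link target."""
    root = ET.fromstring(xhtml)
    records = []
    for li in root.iter("li"):
        link = li.find("a")
        span = link.find("span") if link is not None else None
        if span is None:
            continue  # skip list items that do not follow the li > a > span layout
        records.append({"span": span.text, "a.href": link.get("href")})
    return records


for record in extract_cities(LIST_XHTML):
    print(record["span"], record["a.href"])
```

Each record corresponds to one list item, much like the separate features produced by the fragmenting steps. Real-world pages that are not well-formed would need an HTML-tolerant parser or a conversion step first.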
Hi @lifalin2016, if the city names are always enclosed in a span tag within an li tag, the HTMLExtractor with this setting creates a list attribute called "_item{}" that stores all the city names.
FME 2017.0 Beta build 17190
Hi @takashi
As a quick follow-up, we've updated the HTMLExtractor to not be case sensitive. So I think it is good now (build 17196). We checked and CSS Selector is the best name for that column. The updated documentation helps to explain why. I also asked if we can access a web page directly. That's a bit more effort because we then need to incorporate all the other web fields (like proxies) in there - but it is under consideration.
Hi @Mark2AtSafe, thanks for the update. I understood why the column is named "CSS Selector" once I read the updated help doc in the latest beta build. I agree that this is the best name.