Question

How to parse HTML file

Forum|Forum|9 years ago
September 15, 2016
16 replies
680 views

fouly

Hi,

I would like to use FME elements to parse an HTML file.

My scenario is the following: navigate through various HTML pages, parse the content of these pages, and then store the returned data in a specific format for further processing.

I used the HTTPCaller to call the web page and retrieved the HTML content in the 'response_body' attribute. I then tried to connect 'response_body' to the HTMLToXHTMLConverter, but unfortunately it didn't work.

So I wonder if there's a way to parse the HTML content of web pages?

Thanks for your help!

+62

redgeographics
Celebrity
Forum|Forum|9 years ago
September 15, 2016

It kinda depends on how the HTML is set up, really. Can you tell us a bit more or show a sample?

The problem with HTML is that a lot of markup that violates the rules still gets rendered just fine by a browser. XML is a lot stricter about structure.
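
To illustrate the point about HTML being more forgiving than XML, here is a minimal Python sketch (outside FME, standard library only). The sample markup and the helper names `parse_as_xml` and `parse_as_html` are made up for this illustration. A strict XML parser rejects the unclosed tags, while a lenient HTML parser still recovers the text.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical snippet: <p> and <br> are never closed, which browsers tolerate.
SLOPPY_HTML = "<html><body><p>First line<br>Second line</body></html>"


class TextCollector(HTMLParser):
    """Collects non-empty text nodes; html.parser does not require balanced tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parse_as_xml(markup):
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def parse_as_html(markup):
    """Return the text chunks that a lenient HTML parser recovers."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


if __name__ == "__main__":
    print(parse_as_xml(SLOPPY_HTML))   # strict parser refuses the document
    print(parse_as_html(SLOPPY_HTML))  # lenient parser still yields the text
```

This is why an HTML-to-XML conversion step can fail on pages that look perfectly fine in a browser.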

FME rocks! \m/

+18

itay
Supporter
Forum|Forum|9 years ago
September 15, 2016

Hi @fouly, did you change the response body encoding to system default? That usually does the trick.

takashi
Forum|Forum|9 years ago
September 15, 2016

Hi @fouly, the current HTMLToXHTMLConverter doesn't support HTML5 tags. If the HTML document was written to the HTML5 specification, the transformer unfortunately cannot be used, and you will have to parse the document with string-handling transformers instead. The right approach depends on the structure of the document and on your requirements.

Why not inspect features with Visual/Data Preview and Feature/Record Information before writing them into a destination dataset?

fouly
Author
Forum|Forum|9 years ago
September 15, 2016

@itay Good! Indeed, changing the response body encoding to system default did the trick!

Thanks:)

+59

mark2atsafe
Safer
Forum|Forum|9 years ago
September 15, 2016

This question might also be of interest (for parsing HTML in general), and the FME 2017 beta already has an HTMLExtractor transformer to parse HTML content.

FME Evangelist to the Rich and Famous!!

fouly
Author
Forum|Forum|9 years ago
September 26, 2016

Hi Mark, I have downloaded FME 2017 to try out the HTMLExtractor transformer but couldn't find documentation on how to use it.

For example, I have an HTML file like the attached one, and I would like to extract the vars:

dataEnergyOut, dataPowerOut, dataPowerAvg, etc...

Do you know how that is possible?

Attachment: view-source-pvoutputorg-intradayjsp-id38342sid3508.txt

+59

mark2atsafe
Safer
Forum|Forum|9 years ago
September 26, 2016

OK. So I checked and found some preliminary documentation. If I have content like <mytag details="abcd">.......</mytag> then I can fetch that tag by entering mytag into the CSS Selector column.

If I enter Whole in the Tag Part column I get all of the tag. If I enter Value I get what is between the tags. If I enter "details" then I get "abcd" back.
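
The three tag parts described above (Whole, Value, and a named attribute) can be sketched in plain Python. This is only an illustration of the idea using the standard library's `html.parser`, not how the HTMLExtractor is implemented. The class `TagPartExtractor` and the function `extract_tag_parts` are hypothetical names, and the sketch only handles a tag whose content is plain text.

```python
from html.parser import HTMLParser


class TagPartExtractor(HTMLParser):
    """Captures the first occurrence of one tag: whole markup, inner text, attributes.

    Simplification: markup nested inside the tag is not reproduced in "whole".
    """

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()  # html.parser reports tag names in lower case
        self.attributes = {}
        self.value = ""
        self.whole = ""
        self._inside = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name and not self._done and not self._inside:
            self._inside = True
            self.attributes = dict(attrs)
            self.whole = self.get_starttag_text()

    def handle_data(self, data):
        if self._inside:
            self.value += data
            self.whole += data

    def handle_endtag(self, tag):
        if tag == self.tag_name and self._inside:
            self.whole += "</" + tag + ">"
            self._inside = False
            self._done = True


def extract_tag_parts(html, tag_name):
    """Return a dict with "Whole", "Value", and one key per attribute of the tag."""
    parser = TagPartExtractor(tag_name)
    parser.feed(html)
    parser.close()
    return {"Whole": parser.whole, "Value": parser.value, **parser.attributes}


if __name__ == "__main__":
    sample = '<div><mytag details="abcd">some text</mytag></div>'
    print(extract_tag_parts(sample, "mytag"))
```

Here "Whole" corresponds to the full tag, "Value" to the text between the tags, and the `details` key to the attribute value.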

I think this is going to be harder for you because the data in your HTML lives inside a dynamic script rather than in plain tags. For example, the display of information is scripted, not a simple 'table' tag. So the best you are going to get from here is probably to enter script as the tag. To get the actual data values you'll need to parse the result some more; StringSearcher transformers would probably help.

Alternatively, it is JavaScript, and so the data is probably a form of JSON. So you could always use a JSON transformer on it.
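
As a rough sketch of that idea outside FME: pull the script bodies out of the page, find the variable assignments with a regular expression, and hand the array literals to a JSON parser. The sample page below is invented, since the layout of the real attachment is not shown in this thread, and the function name `extract_array_vars` is hypothetical. It assumes the variables are assigned flat array literals that happen to be valid JSON.

```python
import json
import re

# Hypothetical page fragment; the real attachment's layout is not shown in the thread.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = "x";</script>
<script>
var dataEnergyOut = [120, 340, 560];
var dataPowerOut = [1.5, 2.25, 0];
</script>
</body></html>
"""

# Captures the body of every <script> element (there may be several per page).
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.IGNORECASE | re.DOTALL)
# Matches "var NAME = [ ... ];" where the array contains no nested brackets.
ARRAY_VAR_RE = re.compile(r"var\s+(\w+)\s*=\s*(\[[^\[\]]*\])\s*;")


def extract_array_vars(html):
    """Return {variable name: list} for every JSON-compatible array literal found."""
    found = {}
    for script_body in SCRIPT_RE.findall(html):
        for name, literal in ARRAY_VAR_RE.findall(script_body):
            try:
                found[name] = json.loads(literal)
            except json.JSONDecodeError:
                # JavaScript allows literals that are not valid JSON
                # (single quotes, trailing commas); skip those.
                continue
    return found


if __name__ == "__main__":
    print(extract_array_vars(SAMPLE_HTML))
```

JavaScript literals are not always valid JSON, so a real page may need extra string clean-up before the JSON step.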

Hope this helps.

+59

mark2atsafe
Safer
Forum|Forum|9 years ago
September 26, 2016

Or... the transformer is based around BeautifulSoup - so if you are familiar with that you might be able to use a specialist syntax inside those parameters that I'm not aware of.

fouly
Author
Forum|Forum|9 years ago
September 27, 2016

I tried the HTMLExtractor, but it didn't seem to work: the response_body contained exactly the same content as the response_body from the previous transformer. So I think I will try a JSON transformer instead.

takashi
Forum|Forum|9 years ago
September 27, 2016

Hi @fouly, In my quick test using FME 2017.0 Beta build 17156, I was able to extract the first <script> element from the html document you posted, with this setting.

@Mark2AtSafe, since the help doc for this transformer has not been bundled yet, I don't know whether this setting is correct. The column name "CSS Selector" feels a little strange to me. Is it correct that the user specifies in this column the HTML tag to be extracted?

Also, the transformer seems to treat tag names as case-sensitive. It might be better if tag names were treated as case-insensitive.

One more thing: it would be nice if the transformer could access a web page directly when the user sets a URL in the HTML File parameter.

+59

mark2atsafe
Safer
Forum|Forum|9 years ago
September 27, 2016

Hi @takashi

Yes, I think that is the correct way of using this transformer. The problem is that there are several scripts in here, so the return format will need to be a list - and then finding the data will be awkward. It's not going to be simple in this particular example.

Those are some good ideas so I will pass them on to the developers.

+40

lifalin2016
Supporter
Forum|Forum|9 years ago
November 17, 2016

Hi,

I've tried to use the HTMLExtractor transformer, but it seems unable to do what I deem to be a very common need.

I want to extract a list of elements (= city names) which is encased in a bullet list in HTML. This is the relevant part of the HTML:

<ul>
<li class="subfirst"><a href="/side.asp?Id=217481" style="height:14px;line-height:14px;"><span style="font-family: Arial; font-size: 14px;">Auning (54)</span></a>
</li>
<li class="sublast"><a href="/side.asp?Id=216102" style="height:14px;line-height:14px;"><span style="font-family: Arial; font-size: 14px;">Ebeltoft (2)</span></a>
</li> (...) </ul>

I just want the city names, i.e. "Auning (54)", "Ebeltoft (2)" etc.

How do I set up the transformer for this?

Preferably as multiple features, but a list will also work.

Cheers

Lars I.

-- Cheers, Lars I.

+40

lifalin2016
Supporter
Forum|Forum|9 years ago
November 17, 2016

Ah, found a way.

1st: HTMLExtractor to extract the "li" elements (using class="subfirst")

2nd: HTMLToXHTMLConverter to convert it to XML

3rd: XMLFragmenter to extract the "li" elements (splitting the list)

4th: XMLFragmenter to extract the "span" element with parent and grandparent attributes (in Flattening option).

Result: elements with attributes "span" (the text) and "a.href" (the url link) :-)
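
For comparison, the same result (one record per list item, holding the span text and the link URL) can be sketched outside FME with Python's standard `html.parser`. This is not what the transformers do internally. The class `ListItemLinkParser` and the function `extract_city_links` are hypothetical names, and the sample below is a shortened version of the HTML from the question with the style attributes dropped.

```python
from html.parser import HTMLParser

# Shortened version of the bullet list from the question (style attributes omitted).
SAMPLE_HTML = """
<ul>
<li class="subfirst"><a href="/side.asp?Id=217481"><span>Auning (54)</span></a>
</li>
<li class="sublast"><a href="/side.asp?Id=216102"><span>Ebeltoft (2)</span></a>
</li>
</ul>
"""


class ListItemLinkParser(HTMLParser):
    """Emits one record per <li>: the <span> text and the enclosing <a> href."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record for the <li> being read, or None
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._current = {"span": "", "a.href": None}
        elif self._current is not None and tag == "a":
            self._current["a.href"] = dict(attrs).get("href")
        elif self._current is not None and tag == "span":
            self._in_span = True

    def handle_data(self, data):
        # Only text inside a <span> within an <li> belongs to the record.
        if self._current is not None and self._in_span:
            self._current["span"] += data

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False
        elif tag == "li" and self._current is not None:
            self.records.append(self._current)
            self._current = None


def extract_city_links(html):
    """Return a list of {"span": text, "a.href": url} dicts, one per list item."""
    parser = ListItemLinkParser()
    parser.feed(html)
    parser.close()
    return parser.records


if __name__ == "__main__":
    for record in extract_city_links(SAMPLE_HTML):
        print(record)
```

The keys "span" and "a.href" mirror the attribute names mentioned in the result above.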

Just thought I'd share this with y'all :-)

Cheers

Lars I.

takashi
Forum|Forum|9 years ago
November 17, 2016

Hi @lifalin2016, if the city names are always surrounded by a span tag within an li tag, the HTMLExtractor with this setting creates a list attribute called "_item{}" that stores all the city names.

FME 2017.0 Beta build 17190

+59

mark2atsafe
Safer
Forum|Forum|9 years ago
November 23, 2016

Hi @takashi

As a quick follow-up, we've updated the HTMLExtractor to not be case sensitive. So I think it is good now (build 17196). We checked and CSS Selector is the best name for that column. The updated documentation helps to explain why. I also asked if we can access a web page directly. That's a bit more effort because we then need to incorporate all the other web fields (like proxies) in there - but it is under consideration.

takashi
Forum|Forum|9 years ago
November 24, 2016

Hi @Mark2AtSafe, thanks for the update. After reading the updated help doc in the latest beta build, I understand why the column name is "CSS Selector". I agree that this is the best name.
