Solved

HTMLExtractor - How to extract a specific value of a table in HTML page

July 9, 2021
6 replies
627 views

topotoma
Contributor

Hello Community,

I am trying to extract attribute values from a table in an HTML page.

The table is updated roughly every 20 minutes at this URL:

https://www.hydrodaten.admin.ch/fr/2174.html

The table looks like this:

[Screenshot: Station]

I have tried to use an HTML Table reader, but the attribute names also contain a date and time that change with every update. If I extract the table at regular intervals, it stops working because the field names have changed in the meantime:

[Screenshot: Table extracted]

Then I tried the HTTPCaller + HTMLExtractor, but I can't figure out which CSS selector to use to extract the values from the table.

[Screenshot: HTMLExtractor]

Here is the part of the web page's HTML that contains the information I am looking for:

<div class="horizontal-scroll-wrapper">

<table class="table table-bordered table-narrow">

<thead>

<tr>

<th width="30%" scope="col">Mesures</th>

<th class="text-center" scope="col">Débit<br><small class="text-muted">09.07.2021 20:50</small><br><small class="text-muted">m<sup>3</sup>/s</small></th>

<th class="text-center" scope="col">Niveau d'eau<br><small class="text-muted">09.07.2021 20:50</small><br><small class="text-muted">m</small></th>

<th class="text-center" scope="col">Température<br><small class="text-muted">09.07.2021 20:50</small><br><small class="text-muted">°C</small></th>

</tr>

</thead>

<tbody>

<tr>

<th scope="row">Dernière mesure</th>

<td class="text-center">744</td>

<td class="text-center">336.06</td>

<td class="text-center">17.4</td>

</tr>

I don't know how to get any further than this. It is disappointing, because the information is there but I can't manage to extract it.

If some of you might have any pointers, I would be very grateful for it.

Many thanks for your time, which is indeed valuable.

Best regards.

Thomas



topotoma
Author
Contributor
Best Answer
July 9, 2021

In the end I found a way. It may not be the best, but here is what I did.

I first inspected the web page and noted the selector of the element that contains the table, div.horizontal-scroll-wrapper:

[Screenshot: Details webpage]

I entered that selector in the HTMLExtractor:

This returned the table markup in an attribute, as follows:

<thead>

<tr>

<th scope="col" width="30%">Mesures</th>

<th class="text-center" scope="col">Niveau d'eau<br/><small class="text-muted">09.07.2021 21:40</small><br/><small class="text-muted">m</small></th>

<th class="text-center" scope="col">Température<br/><small class="text-muted">09.07.2021 21:40</small><br/><small class="text-muted">°C</small></th>

</tr>

</thead>

<tbody>

<tr>

<th scope="row">Dernière mesure</th>

</tr>

<tr>

<th scope="row">Moyenne 24h</th>

</tr>

<tr>

<th scope="row">Maximum 24h</th>

</tr>

</tbody>

</table>

</div>

Then I added a second HTMLExtractor to parse this new attribute, and used a list:

[Screenshot: HTMLExtractor Parameters]

That gave me all the records, and I renamed the list with an AttributeManager:

[Screenshot: FME Workbench 2021.0 workspace]

I add this in case it might help someone else. It might not be the best method though.
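For readers who want to check the logic outside FME, here is a minimal Python sketch of the same idea: take the table fragment and read the cells by position instead of by header name, so the changing timestamp no longer matters. It uses only the standard library. The names `TableRows` and `latest_measurements` are made up for this illustration, the sample markup is the excerpt from the question with the closing tags added as an assumption, and the input is assumed to be the fragment already selected by div.horizontal-scroll-wrapper.

```python
from html.parser import HTMLParser

# Excerpt from the question; the closing tags are an assumption,
# because the posted excerpt stops after the first body row.
SAMPLE = """
<div class="horizontal-scroll-wrapper">
<table class="table table-bordered table-narrow">
<thead><tr>
<th width="30%" scope="col">Mesures</th>
<th class="text-center" scope="col">Débit<br><small class="text-muted">09.07.2021 20:50</small><br><small class="text-muted">m<sup>3</sup>/s</small></th>
<th class="text-center" scope="col">Niveau d'eau<br><small class="text-muted">09.07.2021 20:50</small><br><small class="text-muted">m</small></th>
<th class="text-center" scope="col">Température<br><small class="text-muted">09.07.2021 20:50</small><br><small class="text-muted">°C</small></th>
</tr></thead>
<tbody><tr>
<th scope="row">Dernière mesure</th>
<td class="text-center">744</td>
<td class="text-center">336.06</td>
<td class="text-center">17.4</td>
</tr></tbody>
</table>
</div>
"""


class TableRows(HTMLParser):
    """Collect table rows; each cell is kept as a list of its text pieces."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None and self.rows:
            self.rows[-1].append(self._cell)
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None and data.strip():
            self._cell.append(data.strip())


def latest_measurements(html):
    """Pair each header column with the value in the first body row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    header, first_row = parser.rows[0], parser.rows[1]
    result = {}
    # Skip the first column, which holds the row label ("Dernière mesure").
    for head, cell in zip(header[1:], first_row[1:]):
        result[head[0]] = {
            "value": float(cell[0]),
            "timestamp": head[1],        # second text piece of the header
            "unit": "".join(head[2:]),   # remaining pieces, e.g. m + 3 + /s
        }
    return result


if __name__ == "__main__":
    for name, info in latest_measurements(SAMPLE).items():
        print(name, info)
```

Because the header is split into name, timestamp and unit, the original measurement time is kept as a value rather than being baked into a field name.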


chrisatsafe
Safer
July 9, 2021

Hi @topotoma ,

How about this: use a FeatureReader to read the HTML table in Single Output Port mode, then use a BulkAttributeRenamer to remove the date/time portion of each attribute name (and everything after it), using a regular expression built from today's date stamp followed by a wildcard:

@DateTimeFormat(@DateTimeNow(),%d\.%m\.%Y).*

From there you can attach an AttributeManager to expose the attributes and continue processing, or just attach it to your writer :)
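To see what that expression does to a name, here is a rough Python equivalent. It is only a sketch: the function name `strip_datestamp` is made up, and the example attribute names are hypothetical, since the exact names produced by the HTML Table reader are not shown in this thread.

```python
import re
from datetime import datetime


def strip_datestamp(names, today=None):
    """Remove today's date stamp (dd.mm.yyyy) and everything after it."""
    today = today or datetime.now()
    # Same idea as @DateTimeFormat(@DateTimeNow(),%d\.%m\.%Y).* :
    # the literal date with escaped dots, followed by a wildcard.
    pattern = re.compile(re.escape(today.strftime("%d.%m.%Y")) + ".*")
    return [pattern.sub("", name).strip() for name in names]


if __name__ == "__main__":
    # Hypothetical attribute names, for illustration only.
    names = ["Mesures", "Débit 09.07.2021 20:50 m3/s", "Température 09.07.2021 20:50 °C"]
    print(strip_datestamp(names, today=datetime(2021, 7, 9)))
```

Note that the pattern only matches when the date on the machine running it is the same as the date shown on the page.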


topotoma
Author
Contributor
July 9, 2021


Hi Chrisatsafe,

That's pretty neat. I tried to do that, but I didn't know the syntax for renaming the attributes, nor did I have a clue how to get it right.

However, if I use your fmw, the table has missing values at the end:

[Screenshot: Results]

Maybe that's because the timestamp has changed, but I couldn't see why, as your parameters don't interfere with it.

Nevertheless, many thanks for your quick reply and guidance to an elegant solution.

I also need to keep the original timestamp from the web page. It is useful to know the exact time of the river level measurements, and my version with the HTMLExtractor does that at the moment.

Thanks again and best regards.

Thomas


chrisatsafe
Safer
July 9, 2021


Ah perfect! Didn't see your update before posting that - glad you were able to get it all working 🙂

I just re-ran it and also noticed the missing values. That appears to be caused by the time difference between my machine (Pacific time) and the website (CEST): it had already rolled over to July 10 on that site. I had to update the expression as follows to make up the time difference (9 hours in my case):

@DateTimeFormat(@DateTimeAdd(@DateTimeNow(),PT9H),%d\.%m\.%Y).*

BUT not important since you have it all working now ;) great solution!
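For anyone hitting the same rollover, the effect can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the function name `site_date_stamp` is made up, and the two fixed offsets (UTC+2 for the site, UTC-7 for the Pacific machine in July) are assumptions chosen to match the 9-hour difference mentioned above.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for July 2021.
CEST = timezone(timedelta(hours=2), "CEST")
PDT = timezone(timedelta(hours=-7), "PDT")


def site_date_stamp(moment, tz=CEST):
    """Format a timezone-aware moment as dd.mm.yyyy in the given time zone."""
    return moment.astimezone(tz).strftime("%d.%m.%Y")


if __name__ == "__main__":
    run = datetime(2021, 7, 9, 23, 0, tzinfo=timezone.utc)
    print(site_date_stamp(run, PDT))   # 09.07.2021 on the Pacific machine
    print(site_date_stamp(run, CEST))  # 10.07.2021 on the website
```

A fixed offset stops being right when daylight saving time changes. In Python 3.9+ `zoneinfo.ZoneInfo("Europe/Zurich")` could replace the fixed CEST offset, provided time zone data is available on the machine.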


topotoma
Author
Contributor
July 10, 2021


Thank you Chrisatsafe, that explains it all.

It would be great to be able to copy the part of the attribute name that contains the date/time and reuse it later as a parameter.

Maybe with a Schema reader, followed by another transformer that extracts a certain number of characters from the text (a bit like the RIGHT formula in Excel).

I haven't figured out how to do it yet.

Anyway, many thanks for your help and have a good day.

Best regards,

Thomas


chrisatsafe
Safer
July 13, 2021


That is possible using the SubstringExtractor, or a string function like GetWord in the AttributeManager (or even a StringExtractor, since we know the regex for it). The string functions behave similarly to the string functions in Excel, so you could certainly extract the info from the header and store it in an attribute if needed.
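As an illustration of the regex route, here is a small Python sketch that pulls the timestamp out of a header string. The function name `extract_timestamp` and the example header text are hypothetical. The pattern assumes the dd.mm.yyyy hh:mm layout seen in the table headers above.

```python
import re
from datetime import datetime

# dd.mm.yyyy hh:mm, as it appears in the table headers.
STAMP = re.compile(r"\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}")


def extract_timestamp(header):
    """Return the measurement time found in a header string, or None."""
    match = STAMP.search(header)
    if match is None:
        return None
    return datetime.strptime(match.group(), "%d.%m.%Y %H:%M")


if __name__ == "__main__":
    # Hypothetical header text, for illustration only.
    print(extract_timestamp("Niveau d'eau 09.07.2021 21:40 m"))
```

Unlike a pattern built from today's date, this one does not depend on the clock of the machine that runs it.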
