Solved

HTMLExtractor - How to extract a specific value of a table in HTML page


Hello Community,

 

I am trying to extract attribute values from a table in an HTML page.

The table is updated every 20 minutes or so at this URL:

 

https://www.hydrodaten.admin.ch/fr/2174.html

 

The table looks like this:

[Screenshot: Station]

I have tried to use an HTML Table reader, but unfortunately the attribute names also contain a date/time that changes all the time. If I extract the table at regular intervals, it stops working because the field names will have changed in between:

[Screenshot: Table extracted]

Then I tried the HTTPCaller + HTMLExtractor, but I can't figure out which CSS selector I should use to extract the values from the table.

[Screenshot: HTMLExtractor]

Here is the part of the web page's HTML that contains the information I am looking for:

 

<div class="horizontal-scroll-wrapper">

 <table class="table table-bordered table-narrow">

  <thead>

  <tr>

   <th width="30%" scope="col">Mesures</th>

    <th class="text-center" scope="col">Débit<br><small class="text-muted">09.07.2021 20:50</small><br><small class="text-muted">m<sup>3</sup>/s</small></th>

    <th class="text-center" scope="col">Niveau d&#39;eau<br><small class="text-muted">09.07.2021 20:50</small><br><small class="text-muted">m</small></th>

    <th class="text-center" scope="col">Température<br><small class="text-muted">09.07.2021 20:50</small><br><small class="text-muted">°C</small></th>

  </tr>

  </thead>

  <tbody>

  <tr>

   <th scope="row">Dernière mesure</th>

   <td class="text-center">744</td>

   <td class="text-center">336.06</td>

   <td class="text-center">17.4</td>

  </tr>

 

I don't know how to get any further than this. It is disappointing, as the information is there but I can't manage to extract it.

If any of you have pointers, I would be very grateful.

Many thanks for your time, which is indeed valuable.

 

Best regards.

Thomas


Best answer by topotoma 9 July 2021, 23:50


6 replies

In the end, it seems I found a way. It may not be the best, but I did it as follows.

 

I first inspected the web page and noted the selector of the element that wraps the table, div.horizontal-scroll-wrapper:

[Screenshot: Details webpage]

I entered this selector in the HTMLExtractor:

[Screenshot: 1]

This returned the table's HTML in an attribute, as follows:

<div class="horizontal-scroll-wrapper">

<table class="table table-bordered table-narrow">

<thead>

<tr>

<th scope="col" width="30%">Mesures</th>

<th class="text-center" scope="col">Débit<br/><small class="text-muted">09.07.2021 21:40</small><br/><small class="text-muted">m<sup>3</sup>/s</small></th>

<th class="text-center" scope="col">Niveau d'eau<br/><small class="text-muted">09.07.2021 21:40</small><br/><small class="text-muted">m</small></th>

<th class="text-center" scope="col">Température<br/><small class="text-muted">09.07.2021 21:40</small><br/><small class="text-muted">°C</small></th>

</tr>

</thead>

<tbody>

<tr>

<th scope="row">Dernière mesure</th>

<td class="text-center">739</td>

<td class="text-center">336.05</td>

<td class="text-center">17.5</td>

</tr>

<tr>

<th scope="row">Moyenne 24h</th>

<td class="text-center">828</td>

<td class="text-center">336.30</td>

<td class="text-center">16.5</td>

</tr>

<tr>

<th scope="row">Maximum 24h</th>

<td class="text-center">897</td>

<td class="text-center">336.50</td>

<td class="text-center">17.5</td>

</tr>

</tbody>

</table>

</div>

Then I added a second HTMLExtractor to parse this new attribute, and output the results as a list:

[Screenshot: HTMLExtractor Parameters]

That gave me all the records, and I renamed the list attributes with an AttributeManager:

[Screenshot: FME Workbench 2021.0]

I am adding this in case it helps someone else. It might not be the best method, though.
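For comparison outside FME, the same two-step idea (isolate the table, then read its cells) can be sketched in Python with only the standard library. This is a minimal sketch, not the workspace described above: the sample HTML is abridged from the extract in this post, and the names `TableRows` and `table_to_records` are hypothetical. It keys each value by the first text fragment of its header cell, which is the part that does not change between updates.

```python
from html.parser import HTMLParser

# Abridged sample copied from the extract above (two columns, two rows).
SAMPLE_HTML = """
<div class="horizontal-scroll-wrapper">
<table class="table table-bordered table-narrow">
<thead>
<tr>
<th scope="col" width="30%">Mesures</th>
<th class="text-center" scope="col">Débit<br/><small class="text-muted">09.07.2021 21:40</small><br/><small class="text-muted">m<sup>3</sup>/s</small></th>
<th class="text-center" scope="col">Niveau d'eau<br/><small class="text-muted">09.07.2021 21:40</small><br/><small class="text-muted">m</small></th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">Dernière mesure</th>
<td class="text-center">739</td>
<td class="text-center">336.05</td>
</tr>
<tr>
<th scope="row">Moyenne 24h</th>
<td class="text-center">828</td>
<td class="text-center">336.30</td>
</tr>
</tbody>
</table>
</div>
"""


class TableRows(HTMLParser):
    """Collect each <tr> as a list of cells; a cell is a list of text fragments."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # fragments of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self.rows[-1].append(self._cell)
            self._cell = None

    def handle_data(self, data):
        # Ignore whitespace between tags; keep text found inside a cell.
        if self._cell is not None and data.strip():
            self._cell.append(data.strip())


def table_to_records(html):
    parser = TableRows()
    parser.feed(html)
    header, *body = parser.rows
    # The first fragment of each header cell ("Débit", ...) is stable between updates;
    # the date/time and unit fragments that follow it are ignored here.
    names = [cell[0] for cell in header]
    return [dict(zip(names, (" ".join(cell) for cell in row))) for row in body]


records = table_to_records(SAMPLE_HTML)
for record in records:
    print(record)
```

Because the column names come from the stable part of the header, the records keep the same keys even when the timestamp in the header changes.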

 

 

 


Hi @topotoma,

How about this: use a FeatureReader to read the HTML table in Single Output Port mode, then use a BulkAttributeRenamer to remove the date/time (and everything after it) from each attribute name, using a regular expression built from a date stamp followed by a wildcard:

@DateTimeFormat(@DateTimeNow(),%d\.%m\.%Y).*

 

[Screenshot: 2021-07-09_14-53-43]

From there you can attach an AttributeManager to expose the attributes and continue processing, or just attach it to your writer :)
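To see what such a pattern does to the names, here is a minimal Python sketch of the same idea: build a regular expression from a date stamp plus a wildcard, and remove whatever it matches. The function name `strip_date_suffix` and the sample attribute names are assumptions made for illustration; the names actually produced by the HTML Table reader may be laid out differently.

```python
import re
from datetime import date


def strip_date_suffix(name, day):
    """Remove the date stamp and everything after it from an attribute name."""
    # Builds a pattern such as r"09\.07\.2021.*": the escaped date, then a wildcard.
    pattern = re.escape(day.strftime("%d.%m.%Y")) + r".*"
    return re.sub(pattern, "", name).strip()


# Hypothetical attribute names, modelled on the table header shown in the question.
names = [
    "Débit 09.07.2021 20:50 m3/s",
    "Niveau d'eau 09.07.2021 20:50 m",
    "Température 09.07.2021 20:50 °C",
]

renamed = [strip_date_suffix(name, date(2021, 7, 9)) for name in names]
print(renamed)
```

If the date used to build the pattern does not match the date in the attribute name, nothing is removed and the name is left unchanged.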


Hi Chrisatsafe,

That's pretty neat. I tried to do that, but I didn't know the syntax for renaming the attributes, nor did I have a clue how to get it right.

However, if I use your .fmw, the table has missing values at the end:

[Screenshot: Results]

Maybe that's because the timestamp has changed, but I couldn't see why, as your parameters don't interfere with it.

Nevertheless, many thanks for your quick reply and guidance to an elegant solution.

I also need to keep the original timestamp from the web page. It is useful information for knowing the exact time of the river level measurements, so my version with the HTMLExtractor does that at the moment.

 

Thanks again and best regards.

Thomas

 

 

 

 


 

 

 

 

Ah perfect! Didn't see your update before posting that - glad you were able to get it all working 🙂 

 

I just re-ran it and also noticed the missing values. That appears to be because of the time difference between my machine (PST) and the website (CEST): the date had already rolled over to July 10 on that site, so I had to update the expression as follows to make up the time difference (9 hours in my case):

@DateTimeFormat(@DateTimeAdd(@DateTimeNow(),PT9H),%d\.%m\.%Y).*

 

BUT not important since you have it all working now ;) great solution!
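The date rollover described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the two fixed UTC offsets are chosen to reproduce the 9-hour difference and ignore daylight-saving changes, the run time is hypothetical, and `date_stamp` is a made-up helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets (no daylight-saving handling).
MACHINE_TZ = timezone(timedelta(hours=-7))  # machine on the US west coast, 9 hours behind the site
SITE_TZ = timezone(timedelta(hours=2))      # CEST, the time shown on the website


def date_stamp(moment, tz):
    """Format the calendar date of `moment` as seen in time zone `tz`."""
    return moment.astimezone(tz).strftime("%d.%m.%Y")


# A hypothetical run at 15:30 local time on 9 July 2021.
run_time = datetime(2021, 7, 9, 15, 30, tzinfo=MACHINE_TZ)

local_stamp = date_stamp(run_time, MACHINE_TZ)
site_stamp = date_stamp(run_time, SITE_TZ)
print(local_stamp, site_stamp)
```

A pattern built from the local date stamp no longer matches headers that already carry the next day's date, which is consistent with the missing values reported above.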


 

 

 

 

Thank you Chrisatsafe, that explains it all.

 

It would have been great to be able to copy the portion of the attribute name that contains the date/time and reuse it later as a parameter.

Maybe with a Schema reader, followed by another transformer to extract a certain number of characters from the text (a bit like the RIGHT function in Excel).

I haven't figured out how to do it yet.

 

Anyway, many thanks for your help and have a good day.

 

Best regards,

Thomas

 

 

 


 

 

 

 

That is possible using the SubstringExtractor, or a string function like GetWord in the AttributeManager (or even a StringExtractor, since we know the regex for it). The string functions behave much like the string functions in Excel, so you could certainly extract the info from the header and include it in an attribute if needed.

[Screenshot: 2021-07-13_8-50-38]
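As a rough illustration of the regex route, the following Python sketch pulls the timestamp out of a header string and keeps the label that precedes it. The header layout, the pattern, and the helper name `split_header` are assumptions for this example; they are not taken from the workspace in the screenshot.

```python
import re
from datetime import datetime

# Matches a stamp such as "09.07.2021 21:40" (day.month.year hour:minute).
STAMP = re.compile(r"\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}")


def split_header(name):
    """Return (label, measurement time) for a header, or (label, None) if no stamp is found."""
    match = STAMP.search(name)
    if match is None:
        return name.strip(), None
    measured_at = datetime.strptime(match.group(), "%d.%m.%Y %H:%M")
    return name[: match.start()].strip(), measured_at


# Hypothetical header text, modelled on the table shown in the question.
label, measured_at = split_header("Débit 09.07.2021 21:40 m3/s")
print(label, measured_at.isoformat())
```

Keeping the parsed timestamp as its own value preserves the measurement time even after the header has been shortened to a stable name.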
