
Hello FME-Experts!

I have no experience reading HTML code, and I am having trouble reading data from a simple HTML file.

If I use the HTML Table reader, I get only the first record (which contains the attribute names of the data) and not the 4 records that I need.

I think the workflow should be simple, but I tried several options without success.

I have attached an example HTML file.

Thank you very much for the help!

Okay, I finally figured it out. There was a similar question (same sample data actually) 2 weeks ago but I can't seem to find that one anymore. I do remember looking into it and being frustrated by it.

So the problem lies with the first row of your table:

<tr>
         <td><b>Nome_comune</b></td>
         <td><b>Numero_comune</b></td>
         <td><b>Numero_sezione</b></td>
         <td><b>Particella</b></td>
         <td><b>Superficie_m&#178</b></td>
         <td><b>Tipo</font></b></td>
         <td><b>Descrizione</b></td>
         <td><b>eGRID</b></td>
       </tr>
<tr>

There is a stray </font> tag on the Tipo line, with no matching opening <font> tag. That caused the HTML Table reader to bork out after reading that first row.

When I manually removed that tag, it read 5 rows of data: the table header plus the 4 data records. The header row got turned into a feature, though, and was not used to populate the attribute names.

Changing the <td></td> tags in that first row to <th></th> tags (table data to table header) fixed that.
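For anyone who wants to script those two manual fixes, here is a minimal sketch in plain Python. It is not part of FME and not how the HTML Table reader works; the function name preprocess_table_html and the shortened sample table are made up for illustration. It assumes the header is the first <tr> in the file and that any </font> is stray whenever the document has no opening <font> tag at all.

```python
import re

def preprocess_table_html(html: str) -> str:
    """Drop stray </font> tags and turn the first row's <td> cells into <th>."""
    # Assumption: if there is no opening <font> tag anywhere,
    # every closing </font> is stray and can be removed.
    if not re.search(r"<font\b", html, flags=re.IGNORECASE):
        html = re.sub(r"</font\s*>", "", html, flags=re.IGNORECASE)

    # Assumption: the first <tr>...</tr> block is the header row.
    match = re.search(r"<tr\b[^>]*>.*?</tr\s*>", html,
                      flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return html  # no table row found, nothing more to do

    # Rewrite <td ...> and </td> to <th ...> and </th> in that row only.
    header = re.sub(r"<(/?)td\b", r"<\1th", match.group(0),
                    flags=re.IGNORECASE)
    return html[:match.start()] + header + html[match.end():]

# Shortened, hypothetical sample modelled on the table in this thread.
sample = """<table>
<tr>
  <td><b>Tipo</font></b></td>
  <td><b>eGRID</b></td>
</tr>
<tr>
  <td>A</td>
  <td>B</td>
</tr>
</table>"""

cleaned = preprocess_table_html(sample)
print(cleaned)
```

Regular expressions are fragile on arbitrary HTML, so treat this as a stopgap for files with this one known defect rather than a general cleaner.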

As for the solution... I don't think I have one, to be honest. Both issues require pre-processing of the data, which isn't always feasible with HTML tables. I think your best bet is to submit ideas for 2 feature requests for the HTML Table reader:

  1. An optional parameter to ignore stray tags.
  2. An optional parameter to assume the first row of a table contains the column names.
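To make the two requested behaviours concrete, here is a rough sketch using Python's standard html.parser module. It only illustrates the idea and does not describe anything the HTML Table reader does; the names TableRows, read_table and first_row_is_header are hypothetical, and the data values in the sample are invented. The parser reacts only to <tr>, <td> and <th>, so a stray </font> is skipped, and the first row can optionally be used as column names.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each table cell; every other tag is ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def _end_row(self):
        if self._row:
            self.rows.append(self._row)
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()  # tolerate a missing </tr> on the previous row
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            self._end_row()
        # Any other end tag, such as a stray </font>, falls through here.

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

def read_table(html, first_row_is_header=True):
    """Return dicts keyed by the first row, or raw rows if the flag is off."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    rows = parser.rows
    if not first_row_is_header or not rows:
        return rows
    header, *body = rows
    return [dict(zip(header, row)) for row in body]

# Hypothetical sample: header cells use <td>, and one has a stray </font>.
sample = """<table>
<tr><td><b>Nome_comune</b></td><td><b>Tipo</font></b></td></tr>
<tr><td>Lugano</td><td>Bosco</td></tr>
<tr><td>Locarno</td><td>Prato</td></tr>
</table>"""

records = read_table(sample)
print(records)
```

With first_row_is_header=False the same call returns every row as a plain list, header included, which is the behaviour described above after the tag was removed by hand.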

Hope this helps. Not the answer you were looking for, I'm afraid, but at least now we know what the problem is.


Thank you very much for your answer and help! I contacted the data producer and he removed the </font> tag. After that I could read the data, skip the first row and add the attribute names manually. I also submitted an idea for the feature request about reading HTML tables.


For your information, in some cases you can use the HTMLToXHTMLConverter to clean up an HTML document that contains invalid syntax. In your case, applying the transformer would remove the </font> tag that has no corresponding opening <font> tag. You can then parse the resulting XHTML document with the HTML Table reader.


I also created a bug report for our developers to look into this. It would be great if we could avoid failing in this scenario. FYI, the reference number is FMEENGINE-66731.

