Solved

Problem reading an HTML table


Hello FME-Experts!

I have no experience with reading HTML code, and I am having problems reading data from a simple HTML file.

If I use the HTML Table reader, I get only the first record (which contains the attribute names of the data) and not the 4 records that I need.

I think the workflow should be simple, but I tried several options without success.

I have attached an example HTML file.

Thank you very much for the help!


4 replies

redgeographics
  • Celebrity
  • Best Answer
  • July 27, 2020

Okay, I finally figured it out. There was a similar question (same sample data actually) 2 weeks ago but I can't seem to find that one anymore. I do remember looking into it and being frustrated by it.

So the problem lies with the first row of your table:

<tr>
         <td><b>Nome_comune</b></td>
         <td><b>Numero_comune</b></td>
         <td><b>Numero_sezione</b></td>
         <td><b>Particella</b></td>
         <td><b>Superficie_m&#178</b></td>
         <td><b>Tipo</font></b></td>
         <td><b>Descrizione</b></td>
         <td><b>eGRID</b></td>
       </tr>
<tr>

There is a stray </font> tag on the Tipo line. That caused the HTML Table reader to give up after reading that first row.

When I manually removed that tag, the reader returned 5 rows: the table header plus the 4 data rows. The header row got turned into a feature, though, and was not used to populate the attribute names.

Changing the <td></td> tags in that first row to <th></th> tags (table data to table header) fixed that.
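If pre-processing the file is an option, the td-to-th change can be scripted. The following is only a minimal sketch in plain Python, not part of FME: the function name `promote_first_row_to_header` is made up for illustration, and it assumes the first <tr>...</tr> block in the document is the header row and that there are no nested tables.

```python
import re


def promote_first_row_to_header(html: str) -> str:
    """Rewrite the <td> cells of the first <tr>...</tr> block as <th> cells.

    Assumption: the first complete row in the document is the header row.
    """
    # Find the first complete row, non-greedy so we stop at the first </tr>.
    match = re.search(r"<tr\b[^>]*>.*?</tr\s*>", html,
                      flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        return html  # no complete row found; leave the input untouched

    # Turn both opening (<td ...>) and closing (</td>) tags into th tags.
    row = re.sub(r"<(/?)td\b", r"<\1th", match.group(0), flags=re.IGNORECASE)
    return html[:match.start()] + row + html[match.end():]


if __name__ == "__main__":
    sample = ("<table><tr><td><b>Tipo</b></td></tr>"
              "<tr><td>1</td></tr></table>")
    print(promote_first_row_to_header(sample))
```

Only the first row is touched; the data rows keep their <td> cells. A regular expression is enough for a flat table like this one, but it is not a general HTML parser.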

As for the solution... I don't think I have one, to be honest. Both issues require pre-processing of the data, which isn't always feasible with HTML tables. I think your best bet is to submit two ideas (feature requests) for the HTML Table reader:

  1. An optional parameter to ignore stray tags.
  2. An optional parameter to assume the first row of a table contains the column names.

Hope this helps. Not the answer you were looking for, I'm afraid, but at least now we know what the problem is.


  • Author
  • August 13, 2020

Thank you very much for your answer and help! I contacted the data producer and he removed the </font> tag. That let me read the data, skip the first row and add the attribute names manually. I also submitted an idea (feature request) for reading HTML tables.
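For anyone who wants to do the "skip the first row and name the attributes" step outside FME, here is a rough sketch using only the Python standard library. The class `TableRows` and the function `table_to_records` are hypothetical names, the cell values in the example are invented, and the code assumes a single flat table whose first row holds the column names.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            if self._row is not None:
                self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_records(html: str) -> list:
    """Use the first row as attribute names and return one dict per data row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    html = ("<table><tr><td><b>Nome_comune</b></td><td><b>Particella</b></td></tr>"
            "<tr><td>A</td><td>12</td></tr></table>")
    print(table_to_records(html))
```

Because the header row is consumed as names, it no longer shows up as a record of its own, regardless of whether the source used <td> or <th> for it.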


takashi
Supporter
  • August 16, 2020

For your information, in some cases you can use the HTMLToXHTMLConverter to clean up an HTML document that contains invalid syntax. In your case, the </font> tag without a corresponding opening <font> tag would be removed if you applied the transformer. You can then parse the resulting XHTML document with the HTML Table reader.
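To illustrate the kind of cleanup involved, here is a small standalone Python sketch that drops closing tags with no matching opening tag. It is not how the HTMLToXHTMLConverter works internally (that is not described in this thread), and the names `StrayEndTagFilter` and `drop_stray_end_tags` are invented for the example. It uses only `html.parser` from the standard library and does not attempt any other repairs.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StrayEndTagFilter(HTMLParser):
    """Re-emit the document, skipping end tags that have no open start tag."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.open_tags = []  # stack of currently open element names
        self.out = []        # output fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag such as </font>: drop it
        # Close everything up to and including the matching open tag.
        while self.open_tags.pop() != tag:
            pass
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def drop_stray_end_tags(html: str) -> str:
    parser = StrayEndTagFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(drop_stray_end_tags("<tr><td><b>Tipo</font></b></td></tr>"))
```

Note that end tags are written back in lower case and that elements left open are not closed automatically, so this is far less thorough than a real HTML tidy step.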


mark2atsafe
  • Safer
  • August 17, 2020
franco wrote:

Thank you very much for your answer and help! I contacted the data producer and he removed the </font> tag. That let me read the data, skip the first row and add the attribute names manually. I also submitted an idea (feature request) for reading HTML tables.

I also created a bug report for our developers to look into this. It would be great if we could avoid failing in this scenario. FYI, the reference number is FMEENGINE-66731.

