
Hello,

I am trying to parse an HTML page with FME 2014 but, after a number of tests, I have not succeeded in decoding the rather simple <table> section of my HTML input. I fear that my trouble comes from the FME version I am currently using.

I have read many posts dealing with the HTMLExtractor and the HTML Table Reader, but these are only available in the 2017 version (thanks to @takashi for all of them!).

Can someone tell me whether there is a way to decode an HTML table section with the 2014 version of FME, and how? I attach a sample of the file I have been trying to decode (the <table> part is the only interesting section of this file).

Any help would be really appreciated.

Thanks a lot.

HTML is just a subset of XML, so the XML transformers will work.


Hi @gleberre2012, although HTML and XML syntax overlap heavily, HTML permits some syntax that is not well-formed XML (unclosed tags, for example), so there can be cases where an entire HTML document cannot be processed with XML transformers. Fortunately, the <table> section usually conforms to XML syntax, so I would try extracting the <table> section first.

  1. Text File reader: Read each line from the HTML document.
  2. Aggregator: Concatenate all the lines to form a single string (it must not contain newline characters).

  3. StringSearcher: Extract the <table> section using this regular expression.
<table.*?>.+?</table>
If you can extract the <table> section successfully with the procedure above, then parse it with XML transformers. However, there is no <table> section in the HTML document you have posted. Is it the actual source data?
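
For readers who want to check the logic outside FME, the three steps above can be sketched in plain Python. This is only an analogue of the workflow, not FME code: the sample lines are made up (they are not the page from this thread), and `extract_table` is a hypothetical helper name. It joins the lines into one string without newlines, applies the same regular expression, and hands the result to an XML parser.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample input, NOT the actual page discussed in the thread.
html_lines = [
    "<html><body>\n",
    "<p>Some text<br></p>\n",  # an unclosed <br> is not well-formed XML
    '<table class="state">\n',
    "<tr><td>Row 1</td><td>A</td></tr>\n",
    "<tr><td>Row 2</td><td>B</td></tr>\n",
    "</table>\n",
    "</body></html>\n",
]


def extract_table(lines):
    """Join the lines into one newline-free string and return the first
    <table> section, or None if there is no match."""
    text = "".join(line.strip() for line in lines)
    match = re.search(r"<table.*?>.+?</table>", text)
    return match.group(0) if match else None


table_xml = extract_table(html_lines)

# The whole document would fail to parse as XML because of <br>,
# but the extracted <table> section alone is well-formed.
root = ET.fromstring(table_xml)
rows = [[cell.text for cell in tr.findall("td")] for tr in root.findall("tr")]
print(rows)
```

The lines are joined first because `.` in a regular expression does not match a newline by default, which is why the pattern needs a single-line string.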

Hi @gleberre2012,

I can't find a <table> tag in your HTML file when I open the page source; it is just text.


Thank you for your answers.

First of all, the previously attached file was the output of an HTMLtoXHTMLConverter transformer, so it is true that it is not a real HTML file. The URL I have to decode is the following: http://www.alertepollens.org/gardens/garden/1/state/ (its source code, in fact).

Then, I must admit that I do not clearly understand your explanations: in my 2014 version of FME, there is no TextFileReader transformer, only a TextDecoder one. Is that the one you were thinking of?

When I tried it, the latter let me put the whole content of the above URL into an attribute, but with all the newline characters, not as a single line as @takashi advised. I then did not succeed in configuring an Aggregator to eliminate the newlines and concatenate all the lines.

I fear I do not have the background to understand your explanations... but if someone has enough time to explain, I am interested. In any case, I need to decode this stream, and I will keep looking for a solution.

Gerard


Hi Gerard,

1. It does not matter that your page is source code, because it will most probably be displayed as HTML.

2. You are using an old version of FME (2014), so some of the transformers described by other users may not have existed then.

3. Please do me a favor: I looked at your page and it contains an HTML table. May I know which data you want to retrieve?

Is it, for example:

Herbace --- emission en cours

Armoise -- emission en cours

...etc

Thank you.

"Text File Reader" is not a transformer name; it is a regular reader for reading plain text files. This screenshot illustrates my intention.

If you set "Yes" to the "Read Whole at Once" parameter of the Text File Reader, the Aggregator can be removed.

In addition, the HTTPFetcher could also be used instead of the Text File Reader, to fetch the HTML document from the URL directly.


like this.

 

 


Hi @takashi, I cannot recall whether the HTTPFetcher was available in 2014 (?)

 


Yes, the HTTPFetcher is definitely available in FME 2014.

 


Hello,

Once again, thank you very much for your answers and your help.

@gisinnovationsb: exactly, those are the pieces of information I want to extract.

@takashi: I followed your instructions and finally succeeded in decoding the relevant elements of the <table> structure. It is more or less a workflow that uses the following transformers in sequence:

- HTTPFetcher to connect to my source and recover the stream

- AttributeSplitter to get the interesting part of the stream

- XMLFragmenter to parse <table> section

- and finally StringSearcher to extract all relevant data

I am sure my script is not very efficient but, as a first version, the job is done. So it is ok for now.

Gerard
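
As a rough illustration of the last three stages of the workflow Gerard describes (splitting out the interesting part, fragmenting the table, then searching each fragment), here is a minimal Python sketch. It is not his actual workspace: the page fragment is hypothetical and only modelled on the "Herbace" and "Armoise" examples quoted earlier, `table_rows` is an invented helper name, and the sketch assumes a bare `<table>` tag with no attributes. The fetch stage is left out so the example runs without network access.

```python
import re

# Hypothetical page fragment modelled on the examples quoted in the thread;
# the real page's markup may differ.
page = (
    "<html><body><h1>Garden 1</h1>\n"
    "<table>\n"
    "<tr><td>Herbace</td><td>emission en cours</td></tr>\n"
    "<tr><td>Armoise</td><td>emission en cours</td></tr>\n"
    "</table>\n"
    "</body></html>"
)


def table_rows(html):
    """Return one tuple of cell texts per table row.

    Assumes the page contains exactly '<table>' (no attributes);
    raises IndexError if that tag is missing.
    """
    # Stage 1 (AttributeSplitter analogue): keep what lies between the table tags.
    body = html.split("<table>", 1)[1].split("</table>", 1)[0]
    # Stage 2 (XMLFragmenter analogue): one fragment per <tr> element.
    fragments = re.findall(r"<tr>.*?</tr>", body, flags=re.DOTALL)
    # Stage 3 (StringSearcher analogue): pull the cell texts out of each fragment.
    return [
        tuple(re.findall(r"<td>(.*?)</td>", frag, flags=re.DOTALL))
        for frag in fragments
    ]


rows = table_rows(page)
for name, state in rows:
    print(f"{name}: {state}")
```

Here `re.DOTALL` lets `.` match newlines, so the text does not have to be flattened to a single line first.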

