Question

HTML table section parsing

7 years ago
August 22, 2017
10 replies
74 views

gleberre2012
2 replies

Hello,

I am trying to parse an HTML page with FME 2014 but, after a number of tests, I have not managed to decode the rather simple <table> section of my HTML input. I suspect that my troubles come from the FME version I am currently using.

I have indeed read many posts dealing with the HTMLExtractor and the HTML Table Reader, but these are only available in the 2017 version (thanks to @takashi for all of them!).

Can someone tell me whether there is a way to decode an HTML table section with the 2014 version of FME, and how? I attach a sample of the file I have been trying to decode (the <table> part is the only interesting section of this file).

Any help would be really appreciated.

Thanks a lot.

+28

jdh
Contributor
1981 replies
7 years ago
August 22, 2017

Well-formed HTML (XHTML) is valid XML, so the XML transformers will work.

takashi
7683 replies
7 years ago
August 23, 2017

Hi @gleberre2012, although HTML looks a lot like XML, HTML allows syntax that is not valid XML (unclosed tags, for example), so there can be cases where an entire HTML document cannot be processed with the XML transformers. Fortunately, the <table> section usually conforms to XML syntax, so I would try extracting the <table> section first.

Text File reader: Read each line of the HTML document.
Aggregator: Concatenate all the lines into a single string (containing no newline characters).
StringSearcher: Extract the <table> section using this regular expression:

<table.*?>.+?</table>

If you can extract the <table> section successfully with the procedure above, then parse it with XML transformers. However, there is no <table> section in the HTML document you have posted. Is it the actual source data?
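
For readers who want to try the same idea outside FME, the three steps in this reply can be sketched in Python. This is only an illustration under stated assumptions: the sample lines and the function name `extract_table` are hypothetical, and only the regular expression is taken from the post.

```python
import re
from typing import Optional

# The regular expression from the post, unchanged.
TABLE_PATTERN = re.compile(r"<table.*?>.+?</table>")


def extract_table(html: str) -> Optional[str]:
    """Return the first <table>...</table> block in html, or None if absent."""
    match = TABLE_PATTERN.search(html)
    return match.group(0) if match else None


# Hypothetical sample input, not the page discussed in this thread.
sample_lines = [
    "<html><body><p>intro</p>",
    "<table class='state'>",
    "<tr><td>A</td><td>1</td></tr>",
    "</table>",
    "</body></html>",
]

# Steps 1 and 2: take the lines and concatenate them into one string
# that contains no newline characters.
document = "".join(sample_lines)

# Step 3: extract the <table> section.
table_fragment = extract_table(document)
print(table_fragment)
```

In Python, `.` does not match a newline by default, which is one reason the concatenation step matters in this sketch; whether the FME StringSearcher behaves the same way is not something this example demonstrates.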

+13

mygis
Supporter
307 replies
7 years ago
August 23, 2017

Hi @gleberre2012,

I can't find the <table> tag in your HTML file. When I open the page source, it is just text.

gleberre2012
Author
2 replies
7 years ago
August 23, 2017

Thank you for your answers.

First of all, the file attached previously was the output of an HTMLtoXHTMLConverter transformer, so it is true that it is not a real HTML file. The URL I have to decode is the following: http://www.alertepollens.org/gardens/garden/1/state/ (its source code, in fact).

Next, I must admit that I do not fully understand your explanations: in my 2014 version of FME, there is no TextFileReader transformer, only a TextDecoder. Is that the one you were thinking of?

When I try it, the latter lets me put the whole content of the above URL into an attribute, but with all its newline characters, not as a single line as @takashi advised. I then cannot work out how to configure an Aggregator to remove the newlines and concatenate all the lines.

I fear I do not have the background to understand your explanations, but if someone has the time to explain, I am interested. In any case, I need to decode this stream and I will keep looking for a solution.

Gerard

+13

mygis
Supporter
307 replies
7 years ago
August 23, 2017

Hi Gerard,

1. It does not matter that your page is source code, because it will most probably be rendered as HTML.

2. You are using an old version of FME (2014), so some of the transformers mentioned by other users may not have existed then.

3. I looked at your page and it does contain an HTML table. May I ask which data you want to retrieve?

Is it, for example:

Herbace --- emission en cours

Armoise -- emission en cours

...etc

Thank you.

takashi
7683 replies
7 years ago
August 23, 2017

The Text File reader is not a transformer. It is a regular reader for plain text files. This screenshot illustrates my intention.

If you set the "Read Whole at Once" parameter of the Text File reader to "Yes", the Aggregator can be removed.

In addition, the HTTPFetcher could be used instead of the Text File reader to fetch the HTML document directly from the URL.
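
As a rough Python analogue of these two options, the sketch below reads a whole document at once, either from a local file or from a URL, and then removes the newline characters so that the result is a single line. The helper names `read_whole` and `to_single_line`, the UTF-8 default, and the temporary sample file are assumptions made for illustration; they are not FME features.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def read_whole(source: str, timeout: float = 10.0) -> str:
    """Read an entire document at once from an http(s) URL or a local path."""
    if source.startswith(("http://", "https://")):
        with urlopen(source, timeout=timeout) as response:
            # Fall back to UTF-8 when the server declares no charset (assumption).
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    return Path(source).read_text(encoding="utf-8")


def to_single_line(text: str) -> str:
    """Concatenate all lines of text, dropping the newline characters."""
    return "".join(text.splitlines())


# Local demonstration with a hypothetical file; no network access is needed.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text("<table>\n<tr><td>A</td></tr>\n</table>\n", encoding="utf-8")
    whole = read_whole(str(page))
    single_line = to_single_line(whole)

print(single_line)
```

Passing a URL such as the one mentioned earlier in the thread to `read_whole` would take the network branch instead; that branch is not exercised here.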

takashi
7683 replies
7 years ago
August 23, 2017

like this.

+13

mygis
Supporter
307 replies
7 years ago
August 23, 2017

Hi @takashi, I cannot recall whether the HTTPFetcher was available in FME 2014.

takashi
7683 replies
7 years ago
August 23, 2017

Yes, the HTTPFetcher is definitely available in FME 2014.

gleberre2012
Author
2 replies
7 years ago
August 24, 2017

Hello,

Once again, I really thank you for your answers and your help.

@gisinnovationsb: exactly, those are the pieces of information I want to extract.

@takashi: I followed your instructions and finally succeeded in decoding the relevant elements of the <table> structure. It more or less uses the following transformers in sequence:

- HTTPFetcher to connect to my source and retrieve the stream

- AttributeSplitter to isolate the interesting part of the stream

- XMLFragmenter to parse the <table> section

- and finally StringSearcher to extract all the relevant data

I am sure my script is not very efficient but, as a first version, the job is done. So it is ok for now.

Gerard
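
For comparison, the same sequence (isolate the table, parse it as XML, pull out the cell text) can be sketched in Python with the standard library. This is not the author's workspace: the sample markup is hypothetical, the real page structure is not shown in this thread, and the names `isolate_table` and `table_rows` are invented for the example. Fetching the page is left out so that the sketch runs offline.

```python
import xml.etree.ElementTree as ET
from typing import List


def isolate_table(html: str) -> str:
    """Cut the text down to the first <table>...</table> block by splitting on the tags."""
    _, start_tag, rest = html.partition("<table")
    body, end_tag, _ = rest.partition("</table>")
    if not start_tag or not end_tag:
        raise ValueError("no <table> section found")
    return start_tag + body + end_tag


def table_rows(table_xml: str) -> List[List[str]]:
    """Parse the table as XML and return the text of each cell, row by row."""
    root = ET.fromstring(table_xml)
    rows = []
    for tr in root.iter("tr"):
        cells = ["".join(cell.itertext()).strip()
                 for cell in tr if cell.tag in ("td", "th")]
        if cells:
            rows.append(cells)
    return rows


# Hypothetical markup; the two data rows reuse the example values quoted earlier.
sample_html = (
    "<html><body><h1>Garden</h1>"
    "<table><tr><th>Plant</th><th>State</th></tr>"
    "<tr><td>Herbace</td><td>emission en cours</td></tr>"
    "<tr><td>Armoise</td><td>emission en cours</td></tr>"
    "</table></body></html>"
)

rows = table_rows(isolate_table(sample_html))
for row in rows:
    print(" --- ".join(row))
```

`ET.fromstring` only accepts well-formed XML, so a table containing unclosed tags or HTML-only entities such as `&nbsp;` would raise a parse error and would need cleaning first.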
