
Hi, I would like to extract data from a password-protected web page. I already have the password. The URL is https://www.portail-nextgen-telecom.tdf.fr. I need to read the page, find the field "Document Contractuel", and get its value if there is one. I tried the HTMLExtractor, but the response is a script. All I need is a yes or no: is the Document Contractuel there? Can anyone help me?

If the response is a script or an HTML page with <script> tags, the HTMLExtractor won't work as expected (at least in my experience).

 

Without being able to take a look at the response, I can't help that much, but... have you tried the StringSearcher?
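To make the StringSearcher idea concrete: that transformer matches a regular expression against an attribute, so a pattern for the label alone may already give the yes/no answer, even when the response is a script. Below is a minimal sketch of the same check in Python. The pattern and the function name `has_contract_document` are assumptions for illustration, and Python's regex flavor may differ slightly from the one FME uses.

```python
import re

# Assumed label text. \s+ tolerates line breaks or repeated spaces between
# the two words; IGNORECASE tolerates different capitalisation.
LABEL_PATTERN = re.compile(r"document\s+contractuel", re.IGNORECASE)


def has_contract_document(response_text: str) -> str:
    """Return "yes" if the label occurs anywhere in the text, else "no"."""
    return "yes" if LABEL_PATTERN.search(response_text) else "no"


if __name__ == "__main__":
    # Hypothetical response body, only to show the call.
    sample = '<script>var row = {type: "Document Contractuel"};</script>'
    print(has_contract_document(sample))
```

This only tells you that the label is present somewhere in the response. It does not tell you which file belongs to it.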

Yes, I already did, but the response is still a script. After many tries, I managed to extract all the HTML from the website. Take a look. Now I want to extract the file name of the Document Contractuel from these HTML files.

 

Thanks for your help.

I guess you could now use the "Directory and File Pathnames" reader to read the folder where you stored the HTML files.

After that, use the HTMLExtractor and set the HTML input to a file. The path of that file is in the "path_windows" attribute.
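If you want to check the saved files outside FME first, here is a rough Python sketch of the same two steps: list the HTML files in a folder, then test each one for the label. The function name `scan_html_folder` and the folder path are hypothetical, and the files are assumed to be UTF-8 (undecodable bytes are replaced instead of raising an error).

```python
from pathlib import Path


def scan_html_folder(folder, needle="Document Contractuel"):
    """Map each .htm/.html file under `folder` to True/False for `needle`."""
    results = {}
    for path in sorted(Path(folder).rglob("*.htm*")):
        if not path.is_file():
            continue
        # Assumption: files are UTF-8; bad bytes are replaced, not fatal.
        text = path.read_text(encoding="utf-8", errors="replace")
        results[str(path)] = needle.lower() in text.lower()
    return results


if __name__ == "__main__":
    # Hypothetical folder; replace with where the HTML files were saved.
    for file_path, found in scan_html_folder("saved_pages").items():
        print(file_path, "yes" if found else "no")
```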

What should I enter for the HTMLExtractor parameters?


First you will have to open the HTML with a text editor like Notepad++ and locate where "Document Contractuel" is. Check which tags it is written between and use that info to build the query in FME.

 

 

It can get pretty complex depending on the HTML. I have found the "Help" of that transformer pretty useful, so take a look at it before doing anything. There are examples that could inspire you :)

I opened the HTML. This is what we find:

The file that I want to extract is DC_POUR_VALIDATION...docx. The problem is that the file is listed under an attribute, TYPE. So what should I do to extract only this document?


You are opening the HTML with a browser. You need to open it with a text editor like Notepad++ to take a look at the HTML code, so you can check how the name of the document is written. Its tags, the divs, some class... Anything that lets you build a Query for the HTMLExtractor.
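If the source is too long to search by eye, a small script can report which tags enclose the label. This is only a sketch using Python's standard `html.parser`. The class `TextLocator` and the function `locate` are made-up names, and the sample HTML is invented because the real page structure is unknown.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextLocator(HTMLParser):
    """Record the chain of open tags around every text node containing `needle`."""

    def __init__(self, needle):
        super().__init__(convert_charrefs=True)
        self.needle = needle.lower()
        self.stack = []   # (tag, label) pairs for the currently open tags
        self.hits = []    # (tag path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attributes = dict(attrs)
        label = tag
        if attributes.get("id"):
            label += "#" + attributes["id"]
        if attributes.get("class"):
            label += "." + ".".join(attributes["class"].split())
        self.stack.append((tag, label))

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; tolerates sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.needle in data.lower():
            path = " > ".join(label for _, label in self.stack)
            self.hits.append((path, data.strip()))


def locate(html_text, needle):
    parser = TextLocator(needle)
    parser.feed(html_text)
    parser.close()
    return parser.hits


if __name__ == "__main__":
    # Invented sample; the real page will look different.
    sample = ('<table id="docs"><tr><td class="type">Document Contractuel</td>'
              '<td><a href="file.docx">file.docx</a></td></tr></table>')
    for path, text in locate(sample, "Document Contractuel"):
        print(path, "->", text)
```

The printed path reads like a CSS selector, so it can be a starting point for the HTMLExtractor query. If the path ends in `script`, the label sits inside JavaScript rather than in the markup, which would explain why the HTMLExtractor returns nothing useful.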


Oh ok. I will fix that.

Thank you for your help