Question

Extract data from a web page

  • September 3, 2019
  • 8 replies
  • 220 views

Hi, I would like to extract data from a password-protected web page. I already have the password. The URL is: https://www.portail-nextgen-telecom.tdf.fr. I need to read the URL, find the entry "Document Contractuel", and read it. If there is a value, I want to get it. I tried the HTMLExtractor, but the response is a script. What I need is a yes/no value: is the Document Contractuel there or not? Can anyone help me?

8 replies

oscard
  • Influencer
  • September 3, 2019

If the response is a script, or HTML with <script> tags, the HTMLExtractor won't work as expected (at least in my experience).

Without being able to take a look at the response, I can't help that much, but... have you tried the StringSearcher?
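
To illustrate the StringSearcher idea outside FME: it boils down to a regular-expression search over the raw response text, which works even when the response is a script rather than clean HTML. A minimal Python sketch, assuming the label appears literally in the response; the function name and the sample response are made up for illustration:

```python
import re

# Match the label regardless of case and of the amount of whitespace between the words.
LABEL_PATTERN = re.compile(r"document\s+contractuel", re.IGNORECASE)

def has_document_contractuel(response_text):
    """Return 'yes' if the label occurs anywhere in the raw response, else 'no'."""
    return "yes" if LABEL_PATTERN.search(response_text) else "no"

# Hypothetical response body, not taken from the real site.
sample = "<script>var rows = [{type: 'Document Contractuel', file: 'example.docx'}];</script>"
print(has_document_contractuel(sample))
```

This only tells you whether the label is present; it does not tell you whether a file is attached to it.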

  • Author
  • September 3, 2019
Yes, I already did, but the response is still a script. After many tries, I succeeded in extracting all the HTML from the website. Take a look. Now I want to extract the file name for "Document Contractuel" from these HTML files.

Thanks for your help.

oscard
  • Influencer
  • September 3, 2019
I guess you could now use the "Directory and File Pathnames" reader to read the folder where you have stored the HTML files.

After that, use the HTMLExtractor and set the HTML input to a file. The path of that file is in the "path_windows" attribute.
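
For comparison, the same two steps (list the files in a folder, then open each one) can be sketched in plain Python. This is only an illustration of the workflow, not how the FME reader is implemented; the function names and the demo file names are assumptions:

```python
import tempfile
from pathlib import Path

HTML_SUFFIXES = {".html", ".htm"}

def list_html_files(folder):
    """Return every HTML file under `folder` (recursively), sorted by path."""
    root = Path(folder)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in HTML_SUFFIXES
    )

def read_html(path):
    # errors="replace" keeps going if a saved page is not valid UTF-8.
    return Path(path).read_text(encoding="utf-8", errors="replace")

# Demo on a throwaway folder; the file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "page1.html").write_text("<p>Document Contractuel</p>", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("not HTML", encoding="utf-8")
    for html_path in list_html_files(tmp):
        print(html_path.name, len(read_html(html_path)))
```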

  • Author
  • September 3, 2019

What should I enter for the parameters of the HTMLExtractor?


oscard
  • Influencer
  • September 3, 2019
First, open the HTML in a text editor such as Notepad++ and locate where "Document Contractuel" appears. Check which tags it is written between, and use that information to build the query in FME.

It can get pretty complex depending on the HTML. I have found the Help for that transformer pretty useful, so take a look at it before doing anything. There are examples that could inspire you :)
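
If searching by eye in a large file is tedious, a small script can report which tags enclose the label. A rough sketch using Python's standard `html.parser`; the class name, the helper and the sample markup are invented for illustration, and the real page will almost certainly be structured differently:

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class LabelLocator(HTMLParser):
    """Record the chain of open tags around every text node containing `label`."""

    def __init__(self, label):
        super().__init__()
        self.label = label.lower()
        self.stack = []   # currently open tags as (tag, attrs) pairs
        self.hits = []    # one copy of the stack per match

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.label in data.lower():
            self.hits.append(list(self.stack))

def describe(stack):
    """Format a tag chain like 'table.docs > tr > td.label'."""
    parts = []
    for tag, attrs in stack:
        cls = attrs.get("class")
        parts.append(f"{tag}.{cls.replace(' ', '.')}" if cls else tag)
    return " > ".join(parts)

# Hypothetical markup, for illustration only.
sample = """
<html><body>
<table class="docs">
  <tr><td class="label">Document Contractuel</td><td>example.docx</td></tr>
</table>
</body></html>
"""
locator = LabelLocator("Document Contractuel")
locator.feed(sample)
locator.close()
for hit in locator.hits:
    print(describe(hit))
```

The printed chain of tags and classes is the kind of information you would then turn into the query.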

  • Author
  • September 3, 2019

I opened the HTML.

The file that I want to extract is: DC_POUR_VALIDATION...docx. The problem is that the file is located behind an attribute: TYPE. So what should I do to extract only this document?


oscard
  • Influencer
  • September 3, 2019
You are opening the HTML with a browser. You need to open it with a text editor like Notepad++ to look at the HTML code, so you can check how the name of the document is written: its tags, the divs, a class... anything that lets you build a query for the HTMLExtractor.
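
As a purely hypothetical illustration of what that can lead to: suppose the source turns out to be a plain table with a TYPE column and a file-name column. The real markup is not shown in this thread, so the column names, the sample rows and the function below are all assumptions. Filtering on the TYPE value could then look like this in Python:

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def files_of_type(html, wanted_type, type_col="TYPE", file_col="FICHIER"):
    """Return the file names of rows whose TYPE cell equals `wanted_type`.

    The column names are assumptions; a ValueError is raised if the header
    row does not contain them.
    """
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    t, f = header.index(type_col), header.index(file_col)
    return [row[f] for row in body if len(row) > max(t, f) and row[t] == wanted_type]

# Hypothetical table, not copied from the real page.
sample = """
<table>
  <tr><th>TYPE</th><th>FICHIER</th></tr>
  <tr><td>Document Contractuel</td><td>contrat_exemple.docx</td></tr>
  <tr><td>Plan</td><td>plan_exemple.pdf</td></tr>
</table>
"""
matches = files_of_type(sample, "Document Contractuel")
print("yes" if matches else "no", matches)
```

An empty result would be the "no" answer asked for in the original question.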


  • Author
  • September 3, 2019
Oh ok. I will fix that.

Thank you for your help

