Question

Extract data from a web page

  • September 3, 2019
  • 8 replies
  • 298 views

Hi, I would like to extract data from a password-protected web page. I already have the password. The URL is: https://www.portail-nextgen-telecom.tdf.fr. I need to read the URL, find the "Document Contractuel" entry, and read it. If it has a value, I want to get it. I tried the HTMLExtractor, but the response is a script. All I need is a yes or no: is the Document Contractuel there? Can anyone help me?

8 replies

oscard
  • Influencer
  • 344 replies
  • September 3, 2019

If the response is a script, or HTML with <script> tags, the HTMLExtractor won't work as expected (at least in my experience).

Without being able to look at the response I can't help much, but have you tried the StringSearcher?
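
To make the StringSearcher idea concrete: that transformer matches a regular expression against an attribute value. The same yes/no check can be sketched in plain Python. This is only an illustration of the technique, not how the transformer works internally; the function name and the sample response are made up, since the real response was not posted.

```python
import re

# Hypothetical helper: answers "yes" if the label occurs anywhere in the
# response text, even when it sits inside a <script> block.
def has_document_contractuel(response_text: str) -> str:
    # Ignore case and allow any run of whitespace between the two words.
    pattern = re.compile(r"document\s+contractuel", re.IGNORECASE)
    return "yes" if pattern.search(response_text) else "no"

# Made-up sample response, not taken from the real portal.
sample = "<script>var rows = [{type: 'Document Contractuel'}];</script>"
print(has_document_contractuel(sample))
```

Because the search runs on the raw text, it does not matter whether the label is in markup or in script code, which is why a string search can work where an HTML query does not.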

  • Author
  • 8 replies
  • September 3, 2019

Yes, I already did, but the response is still a script. After many tries, I managed to extract all the HTML from the website. Take a look. Now I want to extract the file name, Document Contractuel, from these HTML files.

Thanks for your help.

oscard
  • Influencer
  • 344 replies
  • September 3, 2019

I guess you could now use the "Directory and File Pathnames" reader to read the folder where you stored the HTML files.

After that, use the HTMLExtractor and set the HTML input to a file. The path of that file is in the "path_windows" attribute.
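
For anyone who wants to try the same two steps outside FME (list the files in a folder, then inspect each one), here is a minimal Python sketch. It is an assumption-laden stand-in, not the reader or the transformer: the function name, the `.html`/`.htm` filter, and the UTF-8 decoding are my own choices.

```python
from pathlib import Path
from typing import List

def html_files_containing(folder: str, label: str = "Document Contractuel") -> List[str]:
    """Return the paths of .html/.htm files under `folder` whose text contains `label`."""
    matches = []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            # errors="replace" keeps going if a page is not valid UTF-8.
            text = path.read_text(encoding="utf-8", errors="replace")
            if label.lower() in text.lower():
                matches.append(str(path))
    return matches
```

Calling `html_files_containing("path/to/saved/pages")` (a placeholder path) returns the files worth opening in a text editor; an empty list means the label was not found in any of them.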

  • Author
  • 8 replies
  • September 3, 2019

What should I enter for the HTMLExtractor parameters?


oscard
  • Influencer
  • 344 replies
  • September 3, 2019

First you will have to open the HTML in a text editor like Notepad++ and locate where "Document Contractuel" appears. Check which tags it is written between and use that information to build the query in FME.

It can get pretty complex depending on the HTML. I have found the Help for that transformer pretty useful, so take a look at it before doing anything. There are examples that could inspire you :)
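
If hunting through the source by eye is tedious, the "which tags is it between" step can be roughly automated. The sketch below uses Python's standard `html.parser` to print the chain of open tags (with their classes) around every occurrence of a search text. The class name and the sample markup are hypothetical; the real page will look different.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class TagPathFinder(HTMLParser):
    """Record the chain of open tags around each text node containing `needle`."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle.lower()
        self.stack = []   # currently open elements, e.g. ["html", "body", "td.type"]
        self.paths = []   # one "a > b > c" string per match

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append(".".join([tag, *classes]))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.needle in data.lower():
            self.paths.append(" > ".join(self.stack))

# Made-up markup for illustration only.
sample = ('<html><body><table class="docs"><tr>'
          '<td class="type">Document Contractuel</td><td>file</td>'
          '</tr></table></body></html>')
finder = TagPathFinder("Document Contractuel")
finder.feed(sample)
finder.close()
print(finder.paths)
```

A path such as `table.docs > tr > td.type` is a starting point for a CSS-style query. If the path ends in `script`, the text lives inside script code, and a string search is likely the better tool.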

  • Author
  • 8 replies
  • September 3, 2019

I opened the HTML. This is what we find:

The file that I want to extract is DC_POUR_VALIDATION...docx. The problem is that the file sits behind an attribute called TYPE. So what should I do to extract only this document?


oscard
  • Influencer
  • 344 replies
  • September 3, 2019

You are opening the HTML with a browser. You need to open it with a text editor like Notepad++ to look at the HTML code itself, so you can check how the name of the document is written: its tags, the divs, a class, anything that lets you build a query for the HTMLExtractor.
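
As a rough illustration of what such a query does, suppose (and this is purely an assumption, since the page source was not posted) that the documents are listed in a table with a TYPE cell and a file name cell in each row. A Python sketch that keeps only the rows whose type is "Document Contractuel" could look like this; the class, the function, and the file names are all hypothetical.

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect every table row as a list of cell strings (no nested tables)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, None outside <tr>
        self._cell = None   # text fragments of the cell being read

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()          # copes with a missing </td>
            if self._row is not None:
                self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None

def files_for_type(html, doc_type="Document Contractuel", suffix=".docx"):
    """Return file-name cells from rows that have a cell equal to `doc_type`."""
    collector = RowCollector()
    collector.feed(html)
    collector.close()
    found = []
    for row in collector.rows:
        if any(cell.lower() == doc_type.lower() for cell in row):
            found.extend(c for c in row if c.lower().endswith(suffix))
    return found

# Made-up table; the real column layout and file names are unknown.
sample = ("<table>"
          "<tr><th>TYPE</th><th>FILE</th></tr>"
          "<tr><td>Document Contractuel</td><td>example_contract.docx</td></tr>"
          "<tr><td>Other</td><td>example_other.docx</td></tr>"
          "</table>")
print(files_for_type(sample))
```

An empty result is the "no" answer and a non-empty one is the "yes". If the real page uses divs or lists instead of a table, the same idea applies but the tag names to watch for change.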


  • Author
  • 8 replies
  • September 3, 2019

Oh, OK. I will fix that.

Thank you for your help