Solved

How can I use HTTPCaller and HTMLExtractor to get past header tag and script?

  • March 24, 2022
  • 2 replies
  • 113 views

I am trying to extract text from a data catalog website to use for updating metadata. I think I have all the pieces needed to get a line of text with the correct CSS selector. The problem is that the HTTPCaller doesn't seem to register anything after the <header> tag, so I can't extract anything from the main body. My output always comes out null.

Here are a few screenshots of trying to just extract the title of the dataset page:

[Screenshots: Workspace, Website, Inspector]

The _response_body returned from the HTTPCaller doesn't even have the main <body> section.

[Screenshot: response_body]

Is there any way to force the HTTPCaller to recognize anything past the header tag, or is this a security feature of the government website?

 

Thank you for your help!

2 replies

debbiatsafe
  • Safer
  • 648 replies
  • Best Answer
  • March 25, 2022

Hi @joelavigueur

The output you're seeing is typical of dynamic webpages, where the content is generated by scripts after the page loads, so the HTML returned to the HTTPCaller does not contain it. Normally, you would have to work around this behaviour using various methods, as shown in this Q&A.
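One quick way to confirm this diagnosis outside FME is to count the tags in the raw response. The sketch below uses only the Python standard library; the `static_html` string is a made-up stand-in for a response body, not the actual output of the site in question, and `TagCounter` and `count_tags` are illustrative names.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each opening tag appears in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    """Return a dict mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical response body: a head with a script, but no rendered content.
static_html = (
    "<html><head><title>Catalog</title>"
    '<script src="app.js"></script></head></html>'
)

counts = count_tags(static_html)
print("script tags:", counts.get("script", 0))
print("h1 tags:", counts.get("h1", 0))
```

If the response contains script tags but none of the elements your CSS selector targets, the content is most likely being added in the browser, and no selector will find it in the raw response.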

 

However, the data portal you're working with is powered by CKAN, which offers API access to the metadata as mentioned here. Instead of using the webpage URL in the HTTPCaller, access the metadata of the dataset via the API and parse the JSON returned to get the metadata information required.
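For illustration, the same idea can be sketched in Python. This is not the attached workspace, only a rough equivalent under some assumptions: the portal exposes the standard CKAN Action API at `/api/3/action/package_show`, and `https://example.org` and `my-dataset` are placeholders for the real portal address and dataset identifier. The fields picked out (`title`, `notes`, `metadata_modified`) are common CKAN dataset fields; the ones you need may differ.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def package_show_url(portal_base, dataset_id):
    """Build the CKAN Action API URL that returns one dataset's metadata."""
    query = urlencode({"id": dataset_id})
    return f"{portal_base.rstrip('/')}/api/3/action/package_show?{query}"


def extract_metadata(response_body):
    """Parse a package_show JSON response and pick out a few fields."""
    payload = json.loads(response_body)
    if not payload.get("success"):
        raise ValueError(f"CKAN request failed: {payload.get('error')}")
    result = payload["result"]
    return {
        "title": result.get("title"),
        "notes": result.get("notes"),
        "metadata_modified": result.get("metadata_modified"),
    }


def fetch_metadata(portal_base, dataset_id):
    """Call the API over HTTP and return the selected metadata fields."""
    with urlopen(package_show_url(portal_base, dataset_id), timeout=30) as resp:
        return extract_metadata(resp.read().decode("utf-8"))


# Made-up response used to show the parsing step without a network call.
sample_body = json.dumps(
    {
        "success": True,
        "result": {
            "title": "Example dataset",
            "notes": "Example description.",
            "metadata_modified": "2022-03-01T00:00:00",
        },
    }
)

print(package_show_url("https://example.org", "my-dataset"))
print(extract_metadata(sample_body)["title"])
```

In FME, the URL built by `package_show_url` is what would go into the HTTPCaller in place of the webpage URL, with the JSON parsing done downstream on the `_response_body` attribute.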

 

I've attached a workspace demonstrating this approach and I hope it helps.


  • Author
  • 1 reply
  • March 25, 2022

Thank you @debbiatsafe​ 

This helps a lot!