Solved

How can I use HTTPCaller and HTMLExtractor to get past header tag and script?

3 years ago
March 24, 2022

joelavigueur
1 reply

I am trying to extract text from a data catalog website to use for metadata updating. I think I have all the pieces to get a line of text with the correct CSS selector. The problem is that the HTTPCaller doesn't seem to register anything after the <header> tag, so I can't extract anything from the main body. My output always comes out null.

Here are a few screenshots of trying to just extract the title of the dataset page:

[Screenshots: Workspace, Website Inspector]

The _response_body returned from the HTTPCaller doesn't even have the main <body> section.

[Screenshot: response_body]

Is there any way to force the HTTPCaller to recognize anything past the header tag, or is this a security feature of the government website?

Thank you for your help!

debbiatsafe
Safer
648 replies
Best Answer
3 years ago
March 25, 2022

Hi @joelavigueur

The output you're seeing is typical of dynamic webpages, where the content is generated by scripts. The HTTPCaller does not run those scripts, so the HTML it receives does not yet contain the main content. Normally, you would have to work around this behaviour using various methods as shown in this Q&A.

However, the data portal you're working with is powered by CKAN, which offers API access to the metadata as mentioned here. Instead of using the webpage URL in the HTTPCaller, access the metadata of the dataset via the API and parse the JSON returned to get the metadata information required.

I've attached a workspace demonstrating this approach and I hope it helps.
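
For readers without access to the attached workspace, here is a minimal Python sketch of the same idea: request the dataset's metadata from the CKAN Action API instead of the webpage, then read the fields out of the returned JSON. It is not the workspace from this answer. The portal address, dataset id, and sample response are hypothetical, and it assumes the portal exposes the standard `package_show` action at `/api/3/action/`, which some portals mount under a different path.

```python
import json
from urllib.parse import urlencode


def build_package_show_url(portal_base: str, dataset_id: str) -> str:
    """Build the CKAN Action API URL that returns one dataset's metadata.

    Assumes the standard /api/3/action/ mount point; check the portal's
    own API documentation if it differs.
    """
    query = urlencode({"id": dataset_id})
    return f"{portal_base.rstrip('/')}/api/3/action/package_show?{query}"


def extract_metadata(response_body: str,
                     fields=("title", "notes", "metadata_modified")) -> dict:
    """Pick selected metadata fields out of a package_show JSON response."""
    payload = json.loads(response_body)
    # CKAN wraps every Action API response in {"success": ..., "result": ...}.
    if not payload.get("success"):
        raise ValueError(f"CKAN API call failed: {payload.get('error')}")
    result = payload["result"]
    # Fields missing from the response come back as None.
    return {name: result.get(name) for name in fields}


# Hypothetical portal and dataset id, for illustration only.
url = build_package_show_url("https://data.example.org/", "sample-dataset")

# Hand-written stand-in for the response body the HTTPCaller would return.
sample_body = json.dumps({
    "success": True,
    "result": {"title": "Sample dataset", "notes": "Illustrative description."},
})

metadata = extract_metadata(sample_body)
print(url)
print(metadata["title"])
```

In FME terms, the URL built here would go into the HTTPCaller's Request URL, and the field lookup corresponds to parsing _response_body with a JSON transformer such as the JSONExtractor or JSONFlattener rather than the HTMLExtractor.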

joelavigueur
Author
1 reply
3 years ago
March 25, 2022

Thank you @debbiatsafe

This helps a lot!
