
I am trying to extract text from a data catalog website to use for metadata updating. I think I have all the pieces to get a line of text with the correct CSS selector. The problem is that the HTTPCaller doesn't even seem to register anything after the <header> tag so I can't extract anything in the main body. My output always comes out null.

Here are a few screenshots of trying to just extract the title of the dataset page:

[Screenshots: Workspace, Website, Inspector]

The _response_body returned from the HTTPCaller doesn't even have the main <body> section.

[Screenshot: response_body]

Is there any way to force the HTTPCaller to recognize anything past the header tag or is this a security feature of the government website?

 

Thank you for your help!

Hi @joelavigueur

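The output you're seeing is typical of dynamic webpages, where the content is generated by scripts after the page loads. The HTTPCaller only fetches the initial HTML and does not run those scripts, so the main body is missing from the response. Normally, you would have to work around this behaviour using various methods as shown in this Q&A.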

 

However, the data portal you're working with is powered by CKAN, which offers API access to the metadata as mentioned here. Instead of using the webpage URL in the HTTPCaller, access the metadata of the dataset via the API and parse the JSON returned to get the metadata information required.
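
For readers who want to see the shape of the request and the JSON outside of a workspace, here is a minimal Python sketch of the same idea. It assumes the portal exposes the standard CKAN Action API at `/api/3/action/package_show` directly under its base URL (some portals mount it under a different path), and the base URL, dataset id, and helper names below are placeholders invented for illustration, not values from the portal in question.

```python
import json
from urllib.parse import urlencode


def build_package_show_url(portal_base: str, dataset_id: str) -> str:
    """Build the CKAN Action API URL that returns one dataset's metadata.

    Assumes the API lives at <portal_base>/api/3/action/package_show.
    """
    base = portal_base.rstrip("/")
    query = urlencode({"id": dataset_id})
    return f"{base}/api/3/action/package_show?{query}"


def extract_metadata(response_body: str, fields=("title", "notes")) -> dict:
    """Pull selected metadata fields out of a package_show JSON response.

    CKAN wraps the dataset in {"success": ..., "result": {...}}.
    Fields missing from the result come back as None.
    """
    payload = json.loads(response_body)
    if not payload.get("success"):
        raise ValueError(f"CKAN request failed: {payload.get('error')}")
    result = payload["result"]
    return {name: result.get(name) for name in fields}


if __name__ == "__main__":
    # Hypothetical portal and dataset id, for illustration only.
    print(build_package_show_url("https://catalog.example.org", "my-dataset"))

    # Hand-written sample in the shape CKAN returns, not real portal output.
    sample = '{"success": true, "result": {"title": "Sample title", "notes": "Sample description"}}'
    print(extract_metadata(sample))
```

In a workspace, the equivalent would be to put the API URL in the HTTPCaller instead of the webpage URL and then parse `_response_body` with a JSON transformer; the exact attribute paths depend on which fields of `result` you need.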

 

I've attached a workspace demonstrating this approach and I hope it helps.


Thank you @debbiatsafe​ 

This helps a lot!

