
I have an issue where my HTMLExtractor does not give the page time to load before it returns a null value for the DIV I'm trying to scrape. It works well on pages that load quickly, but this particular page can take up to 10 seconds to load.

 

How do I work around this?  Thank you!

Hello @vsalazar 

Can you try using the HTTPCaller before the HTMLExtractor to retrieve the page? The HTTPCaller has various options such as multipart response handling and concurrent requests that may help.


It may also be that the content is loaded by JavaScript after the initial HTML arrives. JavaScript has to be executed client-side in a browser, so a plain server request via the HTMLExtractor/HTTPCaller will not run it, and the DIV will come back empty however long the request waits.
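To illustrate what that looks like, here is a minimal Python sketch (standard library only, nothing FME-specific). The sample page, the `results` id and the `/api/results` path are made up for the example. It parses the HTML exactly as the server would send it, before any script has run, and shows that the target DIV has no text in it yet.

```python
from html.parser import HTMLParser


class DivText(HTMLParser):
    """Collect the text found inside the <div> with a given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.depth = 0    # > 0 while we are inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1           # nested div inside the target
        elif dict(attrs).get("id") == self.div_id:
            self.depth = 1            # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def div_text(html, div_id):
    """Return the stripped text content of the div with id `div_id`."""
    parser = DivText(div_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical page as delivered by the server: the div is an empty
# placeholder that the script would fill in later, inside a browser.
RAW_HTML = """
<html><body>
  <div id="results"></div>
  <script>
    fetch("/api/results").then(r => r.json()).then(d => {
      document.getElementById("results").textContent = d.value;
    });
  </script>
</body></html>
"""

if __name__ == "__main__":
    print(repr(div_text(RAW_HTML, "results")))
```

If the page you are scraping looks like this when you view its raw source, a longer timeout will not help, because the data is simply not in the HTML response.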

The upside of JavaScript, though, is that the data is then very likely being pulled from an API, which is a much more elegant and structured source than the rendered web page.

If you open your browser's dev tools (Network tab) and refresh the page, you'll hopefully see some calls that return the data you want. There may be hundreds of calls, but once you know what you're looking for it will be easy to find.

You can then use the HTTPCaller to call the API directly. This of course assumes there are no security measures in place, in which case it gets a bit harder.
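If you want to test the endpoint outside FME first, a rough Python sketch of the same idea is below. The URL is a placeholder for whatever call you find in dev tools, and the timeout, retry count and `opener` parameter are assumptions for the example rather than anything the HTTPCaller exposes. It requests JSON with a timeout longer than the 10 seconds the page can take, and retries a couple of times on network errors.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=15.0, retries=3, backoff=2.0,
               opener=urllib.request.urlopen):
    """GET `url` and return the decoded JSON body.

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    function can be exercised without a real network connection.
    """
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    last_error = None
    for attempt in range(retries):
        try:
            with opener(request, timeout=timeout) as response:
                return json.loads(response.read().decode("utf-8"))
        except OSError as exc:  # URLError, HTTPError and timeouts are OSErrors
            last_error = exc
            if attempt < retries - 1:
                time.sleep(backoff * (attempt + 1))  # wait a bit longer each time
    raise last_error


if __name__ == "__main__":
    # Hypothetical endpoint: replace with the call seen in the Network tab.
    data = fetch_json("https://example.com/api/results")
    print(data)
```

Once the call returns what you expect, the same URL and headers can go into the HTTPCaller, and the response can be parsed as JSON instead of being scraped from HTML.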

