Solved

Scraping dynamic web page data

6 years ago
October 10, 2018

dbaldacchino1
Enthusiast
136 replies

I've seen similar questions asked here but cannot find any real solutions, so here's what I'm trying to do.

I would like to get information from this page as an example: https://help.autodesk.com/view/CONNECT/ENU/?guid=GUID-03D59AAD-65B0-45E3-84F2-A12AAA5BB267.

If you look at the page source, you'll see there isn't much valuable information in it. The site appears to use JavaScript to serve up the page (which updates when a new version is released, while keeping the original URL). If I use an HTTPCaller with a GET method, I get that same page source. However, I'm interested in the text you see in the loaded page. Lots of sites are built this way nowadays and work in a very similar, dynamic way.
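
One quick way to confirm that a page is a JavaScript shell is to strip the markup from the raw GET response and look at what text is left. The following is only a minimal sketch using the Python standard library; the function name `visible_text` and the two sample HTML strings are made up for illustration and are not taken from the Autodesk page.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes that sit outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html):
    """Return the human-readable text contained in an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical samples: a script-driven shell versus a fully rendered page.
shell = ('<html><head><title>Help</title><script>loadApp();</script></head>'
         '<body><div id="app"></div></body></html>')
rendered = '<html><body><h2>Release notes</h2><p>Version 2</p></body></html>'

print(visible_text(shell))
print(visible_text(rendered))
```

If the raw response yields little more than a title, the content is being filled in by script after the page loads, and a plain GET will not see it.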

Is there a solution to get data from this kind of page? The only plausible (and costly) way would be to use some service such as parsehub.com (which I have not used or tested) where you could use their REST API to get data into FME: https://www.parsehub.com/docs/ref/api/v2/#introduction

Any other thoughts or tips would be greatly appreciated. Thanks!

revesz
Contributor
116 replies
Best Answer
6 years ago
October 12, 2018

Unfortunately, this is a job for a web browser, or at least a rendering engine.

There are Python solutions that drive a browser to do the rendering and then read the required information from the rendered DOM.

One of them is the Selenium package. It can even run web browsers in headless mode. A headless Chrome discussion is here as a starting point.

This solution requires installing the package into FME's Python and placing the ChromeDriver executable in a folder that is visible to the Python script.

I'm working on a rendering solution, but it is not a high priority, so it may take a week or so. I'm happy to share it once it is reasonably stable.
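
For anyone who wants to try the Selenium route in the meantime, here is a minimal sketch of the idea, not the rendering solution mentioned above. It assumes Selenium 4.6 or later (which can locate a matching driver on its own) and a local Chrome install; the function names, the window size, and the timeout are illustrative choices.

```python
def headless_chrome_arguments():
    """Command-line switches for a windowless Chrome session."""
    # "--headless=new" is understood by recent Chrome builds;
    # older builds take plain "--headless" instead.
    return ["--headless=new", "--window-size=1280,1024"]


def fetch_rendered_html(url, css_selector, timeout=10):
    """Load `url` in headless Chrome and return the DOM once `css_selector` exists."""
    # Imported lazily so the helper above works even where Selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait for the script-generated element rather than sleeping a fixed time.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

A call such as `fetch_rendered_html(page_url, "h2")` would return the rendered HTML once an `h2` element is present; the selector to wait for depends on the page being scraped.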

dbaldacchino1
Author
Enthusiast
136 replies
6 years ago
October 12, 2018

Thanks a lot @revesz. My Python skills are...uhm..copy & paste mostly :) So I would definitely appreciate you sharing any of your content, and I could perhaps use it to learn more. My goal for this project is to scrape the text so I know when a new version of the software is available. I have done this on other sites, but they either expose a table that I can read directly in FME, or they have an API that I was able to figure out from the page source itself.

David Baldacchino | HOK.com

dbaldacchino1
Author
Enthusiast
136 replies
6 years ago
October 13, 2018

Ok so I was able to get to the data without additional tools. Lucky? Maybe :) Took some sleuthing, but here's how I got to what I needed for the example above:

1. I saved the web page using the "Webpage, Complete" option in Chrome, which I had only just noticed.
2. I opened the saved file in Notepad++ and read through it looking for anything helpful. I noticed that the tag I wanted was "h2", but most importantly, I noticed this around the middle of the saved page:
3. I tested by reconstructing the URL as "https://help.autodesk.com/cloudhelp/ENU/CONNECT/files/GUID-03D59AAD-65B0-45E3-84F2-A12AAA5BB267.htm" and the page loaded (the URL redirected to the original one).
4. In FME I used an HTMLExtractor on this new URL:
5. And BAAM! As happy as can be :)
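
The same steps can be sketched outside FME with the Python standard library. This is only an illustration: the URL pattern is generalised from the single example above and may not hold for other Autodesk help pages, and the function names and the sample HTML are made up.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


def static_help_url(viewer_url):
    """Rebuild the static .htm address from a /view/<PRODUCT>/<LANG>/?guid=... URL.

    Assumption: the pattern seen for the CONNECT page applies to the given URL.
    """
    parts = urlparse(viewer_url)
    _, product, lang = [segment for segment in parts.path.split("/") if segment]
    guid = parse_qs(parts.query)["guid"][0]
    return f"{parts.scheme}://{parts.netloc}/cloudhelp/{lang}/{product}/files/{guid}.htm"


class HeadingCollector(HTMLParser):
    """Collects the text of every <h2> element, including nested inline tags."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._buffer = None  # list of text fragments while inside an <h2>

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._buffer is not None:
            # Collapse runs of whitespace left over from the markup.
            self.headings.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def extract_headings(html):
    """Return the text of all <h2> elements in an HTML string."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


url = static_help_url(
    "https://help.autodesk.com/view/CONNECT/ENU/"
    "?guid=GUID-03D59AAD-65B0-45E3-84F2-A12AAA5BB267"
)
print(url)
```

The HTML fetched from that address (for example with `urllib.request.urlopen`) can then be passed to `extract_headings` to pull out the "h2" text, which is roughly what the HTMLExtractor does in the workspace.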

I guess when there's a will, there's a way!

David Baldacchino | HOK.com

spatialcase
Contributor
13 replies
4 years ago
August 16, 2020

I have run into the same problem when attempting to scrape data from a site that uses JavaScript to load the data into the page. The HTTPCaller transformer returns only the initial HTML; it does not run JavaScript, so the function attached to the page's onload event never fires and the data never appears in the response.

The only way that I could find to address this was to use the PythonCaller, together with the Selenium browser automation module (https://www.selenium.dev/documentation/en/). For a useful guide on using Selenium with Python to scrape data from web pages, see https://www.scrapingbee.com/blog/selenium-python/

The Python Selenium module needs to be installed on your machine - see https://docs.safe.com/fme/html/FME_Desktop_Documentation/FME_Workbench/Workbench/Installing-Python-Packages.htm, and https://pypi.org/project/selenium/

You will also need to decide which web browser you wish to use and what version of the browser you have, then install the matching webdriver application. For example, for the Chrome web browser, refer to this site: https://chromedriver.chromium.org/getting-started

I set my PythonCaller up as follows:

import fme
import fmeobjects
from seleniumwire import webdriver  # selenium-wire wraps Selenium's webdriver so request headers can be modified
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
 
 
class FeatureProcessor(object):
    def __init__(self):
        
        print("PythonCaller: initialising web browser")
        
        access_token =  fme.macroValues['OAuth2_token']
        
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.page_load_strategy = 'normal'
 
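        # Launch the headless web browser; the ChromeDriver executable location must be specified
        self.web_browser = webdriver.Chrome(options=chrome_options, executable_path=r"C:\Apps\chromedriver_win32\chromedriver.exe")

        # selenium-wire documents header_overrides as an attribute of the driver instance,
        # so it is set on the browser object rather than on the webdriver module
        self.web_browser.header_overrides = {'Authorization': 'Bearer ' + access_token}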
 
        
    def input(self,feature):
 
        calendar = feature.getAttribute("userPrincipalName")
        booking_url = "https://outlook.office365.com/owa/calendar/"+calendar+"/bookings/"
        feature_id = str(feature.getAttribute("id"))  # str() guards against numeric ids; avoids shadowing the built-in id()
        out_html = "C:/Temp/"+feature_id+".html"
        
        print("Getting data for: " + calendar)
        
        self.web_browser.get(booking_url)
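        delay = 5  # seconds to wait for the JavaScript-rendered content
        try:
            # Block until the target div exists in the DOM, i.e. the page script has populated it
            WebDriverWait(self.web_browser, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.focusable.dates")))
            print('Page is fully loaded')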
            
            html = self.web_browser.page_source
 
            with open(out_html, "w+", encoding="utf-8") as f:
                f.write(html)
                
            feature.setAttribute("html_file",out_html)
            print('html file is ready')
                        
        except TimeoutException:
            print('page ' + calendar + ' is taking too long')
            feature.setAttribute("html_file","Error")   
 
        self.pyoutput(feature)
 
    def close(self):
        print("PythonCaller - Closing")
        self.web_browser.quit()

The PythonCaller __init__() method sets up the headless web browser.

The input() method processes every input feature: it requests the corresponding web page, waits for the onload JavaScript to complete (by waiting for a specific div to appear), and then saves the resulting HTML to a temporary file, which can be read and processed by the FME HTMLExtractor transformer.

The page I was calling also requires token authentication, so I obtain the token as a Python scripted parameter at startup and then read it in the __init__() method. I also needed the seleniumwire module in order to include the token in the call to the web server.
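
As a side note on the token handling: more recent selenium-wire releases document a request interceptor as the way to add headers. The following is a minimal sketch of that approach, assuming a selenium-wire version that supports `request_interceptor`; the helper name `make_bearer_interceptor` is hypothetical.

```python
def make_bearer_interceptor(access_token):
    """Return a selenium-wire request interceptor that adds a bearer token."""

    def interceptor(request):
        # The selenium-wire README advises deleting a header before setting it,
        # otherwise the new value is added alongside any existing one.
        del request.headers["Authorization"]
        request.headers["Authorization"] = "Bearer " + access_token

    return interceptor


# Usage with an existing selenium-wire driver (variable names are illustrative):
# web_browser.request_interceptor = make_bearer_interceptor(access_token)
```

Every request the browser makes then carries the Authorization header, including the calls made by the page's own JavaScript.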

This is working fine on my desktop but unfortunately I cannot publish this workbench to FME Cloud as there is no way to install the webdriver application onto FME Cloud. One step at a time...

dbaldacchino1
Author
Enthusiast
136 replies
4 years ago
August 17, 2020

Thanks for the detailed reply @spatialcase

David Baldacchino | HOK.com
