Configuring a Python App to Crawl JavaScript-Generated Web Content
Overview
The first goal is to create a Python app that fetches and parses remote website content.
The second goal is to accommodate sites whose content is generated by JavaScript frameworks such as Angular or React. A simple HTTP request to such a site will not necessarily retrieve the same content that a browser would display.
The final goal is to run the Python app as a cron job on a Synology NAS.
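To illustrate the second goal, the sketch below parses the kind of HTML that a plain HTTP request might return from a JavaScript-driven site. The sample markup (`STATIC_SHELL`) and the helper names are hypothetical and only stand in for a real response; the point is that the visible text is nearly empty until a browser runs the script.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the list of visible text fragments found in an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical body of a plain HTTP response from a React-style page:
# the job listings are absent because no JavaScript has run yet.
STATIC_SHELL = """
<html>
  <head><title>Jobs</title></head>
  <body>
    <div id="root"></div>
    <script>window.render("root");</script>
  </body>
</html>
"""

print(visible_text(STATIC_SHELL))  # only the page title is present
```

Only the title survives; the listings that a browser would show inside the empty `root` element never appear in the raw response.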
Headless browsers
A headless browser is needed to fetch web content after rendering it the way a browser such as Chrome or Firefox would.
Options include Selenium and Splash.
The main functional difference between Selenium and Splash is that Selenium is synchronous and can emulate user interaction with a webpage, whereas Splash is asynchronous and cannot emulate user interaction. If the app does not require user interaction, Splash is a good solution because it is lighter weight and asynchronous, and can therefore be faster than Selenium.
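For comparison, a minimal sketch of fetching rendered HTML through Splash's HTTP API follows. It assumes a Splash instance is already running and listening on its default port (8050) on localhost; the function names and the default `wait` value are illustrative, not part of the original setup.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash is reachable locally on its default port.
SPLASH_BASE = "http://localhost:8050"


def splash_render_url(page_url, wait=2.0, base=SPLASH_BASE):
    """Build a render.html request URL for the given page."""
    # 'wait' is the number of seconds Splash waits after the page loads.
    query = urlencode({"url": page_url, "wait": wait})
    return f"{base}/render.html?{query}"


def fetch_rendered_html(page_url, wait=2.0, timeout=30):
    """Return the page HTML after Splash has executed its JavaScript."""
    with urlopen(splash_render_url(page_url, wait), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Because the rendering happens in the Splash service, the Python side needs nothing beyond an HTTP client.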
Selenium requires a driver to interface with the selected browser (e.g. Chrome or Firefox).
Install Selenium for Python with:
$ pip install selenium
This provides the Python API, which in turn relies on a browser driver being available and running.
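Before writing any Selenium code, it can help to check whether a driver executable is installed at all. The sketch below looks for drivers on the PATH; the executable names are an assumption based on the usual names of the Chrome and Firefox drivers.

```python
import shutil

# Assumption: the usual executable names for the Chrome and Firefox drivers.
DRIVER_NAMES = ("chromedriver", "geckodriver")


def find_drivers(names=DRIVER_NAMES):
    """Map each driver name to its location on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


for name, path in find_drivers().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If every entry is reported as not found, the Selenium API has nothing to talk to on that machine, which is the situation the following sections work around.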
Browser drivers
A list of browser drivers can be found in the Selenium documentation. However, these drivers aren't necessarily supported by the Synology DSM.
A solution for browser support on DSM is Docker, which can run a container that serves as the browser for Selenium. See the Docker documentation for instructions on installing and getting started with Docker for Mac OS.
First install Docker, then pull a Docker image that implements the webdriver interface, e.g. joyzoursky/python-chromedriver, SeleniumHQ/docker-selenium, danielfrg/docker-selenium [1] [2]
Example: Testing with SeleniumHQ/docker-selenium
See Getting Started With Docker Compose from the docker-selenium wiki.
Example: Testing with joyzoursky/python-chromedriver
The following Docker command will download and run a Selenium Chrome driver interface for Python:
$ docker run -it -v $(pwd):/usr/workspace joyzoursky/python-chromedriver:3.6-alpine3.7-selenium sh
The driver can be tested in the Python console:
>>> from selenium import webdriver
>>> chrome_options = webdriver.ChromeOptions()
>>> chrome_options.add_argument('--no-sandbox')
>>> chrome_options.add_argument('--window-size=1420,1080')
>>> chrome_options.add_argument('--headless')
>>> chrome_options.add_argument('--disable-gpu')
>>> driver = webdriver.Chrome(chrome_options=chrome_options)
This configures and starts a headless Chrome instance. Next, fetch web content and extract some content:
>>> driver.get('https://sjobs.brassring.com/TGnewUI/Search/Home/HomeWithPreLoad?partnerid=25354&siteid=5108&PageType=searchResults&SearchType=linkquery&LinkID=3947569#keyWordSearch=&locationSearch=')
>>> el = driver.find_element_by_id('Job_7')
>>> el.text
'DreamWorks Technology - Software Engineer - Layout/Previz Tools'
>>> elements = driver.find_elements_by_xpath('//a[@class="jobProperty jobtitle"]')
>>> for element in elements:
...     print(element.text)
...
Director, Talent Development
Content Producer WWSI
Golfer Care Specialist - Seasonal
# etc...
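The console session above uses the Selenium API as shipped in that Docker image. As a hedged sketch, the same steps could be written as a script for newer Selenium releases (4.x is assumed here), where the `options=` keyword and `By` locators replace `chrome_options=` and the `find_element_by_*` helpers. The function names and the timeout are illustrative; the explicit wait is an addition that gives JavaScript time to insert the links.

```python
def build_chrome_arguments(width=1420, height=1080):
    """Return the same Chrome flags used in the console session."""
    return [
        "--no-sandbox",
        f"--window-size={width},{height}",
        "--headless",
        "--disable-gpu",
    ]


def fetch_job_titles(url, timeout=15):
    """Load a page in headless Chrome and return the job title link texts."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    for argument in build_chrome_arguments():
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until JavaScript has inserted at least one job title link.
        locator = (By.XPATH, '//a[@class="jobProperty jobtitle"]')
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located(locator)
        )
        return [element.text for element in elements]
    finally:
        driver.quit()  # Always release the browser process.
```

Calling `fetch_job_titles()` with the search URL from the session should return the same titles that the loop printed, assuming the page structure has not changed.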
Installation on Synology NAS
Install Docker through the DSM Package Center.
Create a docker-compose.yml file in the working directory:
selenium-hub:
  image: selenium/hub
  ports:
    - 4444:4444
chrome:
  image: selenium/node-chrome
  links:
    - selenium-hub:hub
  volumes:
    # Mitigates the Chromium issue described at https://code.google.com/p/chromium/issues/detail?id=519952
    - /dev/shm:/dev/shm
Starting the Selenium Docker containers requires sudo:
$ sudo docker-compose up -d --scale chrome=5
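Once the hub and Chrome nodes are running, the Python app can connect to the hub instead of starting a local browser. The following is a minimal sketch, assuming a Selenium 4 style client, the port mapping from the compose file (4444), and the conventional `/wd/hub` path; the function names are hypothetical.

```python
def hub_url(host="localhost", port=4444):
    """Return the WebDriver endpoint exposed by the selenium-hub container."""
    return f"http://{host}:{port}/wd/hub"


def page_title_via_hub(url, host="localhost", port=4444):
    """Open a page on a Chrome node through the hub and return its title."""
    # Imported here so hub_url() works without Selenium installed.
    from selenium import webdriver  # Selenium 4 style API is assumed.

    options = webdriver.ChromeOptions()
    driver = webdriver.Remote(
        command_executor=hub_url(host, port),
        options=options,
    )
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()  # Frees the node for the next session.
```

When the app runs on the NAS itself, `localhost` should reach the hub; from another machine, the NAS hostname or IP address would take its place.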
TODO: Confirm that the Docker containers restart if the system reboots, e.g. after a system update.
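One way to check this after a reboot is to poll the hub's status endpoint from Python. The sketch below assumes the hub answers at `/wd/hub/status` with a JSON document containing a `value.ready` flag; both the URL and the payload shape should be verified against the Selenium version actually deployed.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def is_ready(status_payload):
    """Return True if a hub status document reports that it is ready."""
    value = status_payload.get("value")
    return isinstance(value, dict) and value.get("ready") is True


def hub_is_up(status_url="http://localhost:4444/wd/hub/status", timeout=5):
    """Query the hub's status endpoint; treat any failure as 'down'."""
    try:
        with urlopen(status_url, timeout=timeout) as response:
            return is_ready(json.load(response))
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON response.
        return False
```

A cron job could call `hub_is_up()` before crawling and skip the run, or raise an alert, when it returns False.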
Notes
See also
- Retrieving JavaScript-Generated Web Content - job-alertsapp wiki on GitHub
References
- [1] A recipe for website automated tests with Python Selenium & Headless Chrome in Docker - freeCodeCamp
- [2] Crawling with Python, Selenium and Docker - Daniel Rodriguez