Configuring a Python App to Crawl JavaScript-Generated Web Content

== Overview ==

The first goal is to create a Python app that fetches and parses remote website content.

The second goal is to accommodate sites whose content is generated by JavaScript frameworks such as Angular or React. A simple HTTP request to such a site returns the page source before any scripts have run, which is not necessarily the content a visitor would see in a browser.
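
To see why a plain HTTP request falls short, consider what a JavaScript-driven page typically serves before any script runs. The sketch below uses only the standard library. The HTML string is a made-up stand-in for such a response (hypothetical markup, not taken from a real site), and `TextCollector` and `visible_text` are illustrative names.

```python
from html.parser import HTMLParser

# Hypothetical response body of a JavaScript-driven page, as a plain HTTP
# client would receive it: the container stays empty until a browser
# executes bundle.js.
STATIC_HTML = """
<html>
  <body>
    <div id="root"></div>
    <script src="bundle.js"></script>
  </body>
</html>
"""


class TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of <script> tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self.in_script:
            self.chunks.append(text)


def visible_text(html):
    """Return the list of non-empty text fragments found in html."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Nothing to scrape: the job listings (or any other content) are not there yet.
print(visible_text(STATIC_HTML))
```

The empty result is the reason a rendering step is needed before parsing.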

The final goal is to run the Python app as a cron job on a Synology NAS.

== Headless browsers ==

A headless browser is needed: software that loads a page and runs its scripts as Chrome or Firefox would, without displaying a window, so that the rendered content can be fetched.

Options include:

* [https://www.seleniumhq.org/ Selenium]
* [https://github.com/SeleniumHQ/docker-selenium docker-selenium]
* [http://phantomjs.org/ PhantomJS]
* [https://splash.readthedocs.io/en/stable/ Splash]

The main functional difference between Selenium and Splash is that Selenium drives a real browser synchronously and can emulate user interaction with a webpage. Splash is an asynchronous rendering service with an HTTP API, and its interaction support is limited to what can be scripted. If user interaction is not required by the app, Splash is a good solution because it is lighter weight and asynchronous, and is therefore typically faster than Selenium.
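
As a rough illustration of the Splash approach, the sketch below builds a request for Splash's `render.html` endpoint, which returns the page HTML after rendering. It assumes a Splash instance is already listening on `localhost` at its default port 8050. The function names and the example page URL are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def splash_render_url(page_url, wait=2.0, splash_base="http://localhost:8050"):
    """Build a URL for Splash's render.html endpoint.

    splash_base is an assumption: a local Splash instance on port 8050.
    wait is the number of seconds Splash waits after the page loads.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


def fetch_rendered_html(page_url, wait=2.0, timeout=30):
    """Ask Splash to render page_url and return the resulting HTML."""
    with urlopen(splash_render_url(page_url, wait), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


print(splash_render_url("https://example.com/jobs"))
```

Calling `fetch_rendered_html()` only works while Splash is running. Building the URL does not need it.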

Selenium requires a driver to interface with the selected browser (e.g. chromedriver for Chrome or geckodriver for Firefox).

Install the Selenium bindings for Python with pip:

<syntaxhighlight lang="bash">
$ pip install selenium
</syntaxhighlight>

This provides the Python API, which in turn relies on a browser driver being installed and reachable.

== Browser drivers == 

A list of browser drivers can be found in the [http://selenium-python.readthedocs.io/installation.html#drivers Selenium documentation]. However, these drivers aren't necessarily supported by Synology DSM.

A solution for browser support on DSM is [https://www.docker.com/ Docker]. Docker can run a container image that bundles a browser and its driver for Selenium to control. For local testing on a Mac, see the [https://docs.docker.com/docker-for-mac/ instructions for installing and getting started with Docker for Mac OS].

First install Docker, then pull a Docker image that implements the webdriver interface, e.g. [https://hub.docker.com/r/joyzoursky/python-chromedriver/ joyzoursky/python-chromedriver], [https://github.com/SeleniumHQ/docker-selenium SeleniumHQ/docker-selenium], [https://github.com/danielfrg/docker-selenium danielfrg/docker-selenium] <ref>[https://medium.freecodecamp.org/a-recipe-for-website-automated-tests-with-python-selenium-headless-chrome-in-docker-8d344a97afb5 A recipe for website automated tests with Python Selenium & Headless Chrome in Docker] - freeCodeCamp</ref> <ref>[http://danielfrg.com/blog/2015/09/28/crawling-python-selenium-docker/ Crawling with Python, Selenium and Docker] - Daniel Rodriguez</ref>

=== Installing chromedriver ===

==== Mac OS ====

Download the latest version of chromedriver from [https://sites.google.com/a/chromium.org/chromedriver/ chromium.org]. 

Unzip the download and copy the driver to `/usr/local/bin`.

Mac OS blocks newly downloaded executables by default, so permission to run the driver must be granted in System Preferences (Security & Privacy).
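
A quick way to confirm that the driver landed somewhere on `PATH` is to look it up from Python. This is a minimal sketch using only the standard library. The function names are illustrative, and the version check assumes the driver accepts `--version`, as chromedriver does.

```python
import shutil
import subprocess


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


def chromedriver_version(path):
    """Run `<path> --version` and return the line it prints."""
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


# Prints the resolved path, or a hint when the copy step has not worked.
print(find_chromedriver() or "chromedriver not found on PATH")
```

If a path is printed, `chromedriver_version(path)` also confirms that the system allows the binary to run.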

=== Example: Testing with `SeleniumHQ/docker-selenium` ===

See [https://github.com/SeleniumHQ/docker-selenium/wiki/Getting-Started-with-Docker-Compose Getting Started With Docker Compose] from the [https://github.com/SeleniumHQ/docker-selenium/wiki docker-selenium wiki].

=== Example: Testing with `joyzoursky/python-chromedriver` ===

The following Docker command downloads and runs an image that bundles Python, Selenium, and chromedriver, mounts the current directory at `/usr/workspace`, and opens a shell in the container:

<syntaxhighlight lang="bash">
$ docker run -it -v $(pwd):/usr/workspace joyzoursky/python-chromedriver:3.6-alpine3.7-selenium sh
</syntaxhighlight>

The driver can be tested in the Python console:

<syntaxhighlight lang="python">
>>> from selenium import webdriver
>>> chrome_options = webdriver.ChromeOptions()
>>> chrome_options.add_argument('--no-sandbox')
>>> chrome_options.add_argument('--window-size=1420,1080')
>>> chrome_options.add_argument('--headless')
>>> chrome_options.add_argument('--disable-gpu')
>>> # Selenium 3.8 and later take options=; older releases used chrome_options=
>>> driver = webdriver.Chrome(options=chrome_options)
</syntaxhighlight>

This configures and starts a headless Chrome instance. Next, load a page and extract some of its content:

<syntaxhighlight lang="python">
>>> from selenium.webdriver.common.by import By
>>> driver.get('https://sjobs.brassring.com/TGnewUI/Search/Home/HomeWithPreLoad?partnerid=25354&siteid=5108&PageType=searchResults&SearchType=linkquery&LinkID=3947569#keyWordSearch=&locationSearch=')
>>> el = driver.find_element(By.ID, 'Job_7')  # the find_element_by_* helpers were removed in Selenium 4
>>> el.text
'DreamWorks Technology - Software Engineer - Layout/Previz Tools'
>>> elements = driver.find_elements(By.XPATH, '//a[@class="jobProperty jobtitle"]')
>>> for element in elements:
...     print(element.text)
... 
Director, Talent Development
Content Producer WWSI
Golfer Care Specialist - Seasonal
# etc...
</syntaxhighlight>
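
On JavaScript-heavy pages the elements may not exist yet when `get()` returns, so a lookup made immediately can fail. One possible way to handle this is an explicit wait, sketched below with Selenium's `WebDriverWait` and `expected_conditions`. The function names, the 15-second timeout, and the clean-up step are illustrative assumptions. The XPath is the one used in the session above.

```python
def clean_titles(texts):
    """Strip whitespace and drop empty strings and duplicates, keeping order."""
    seen = set()
    titles = []
    for text in texts:
        title = text.strip()
        if title and title not in seen:
            seen.add(title)
            titles.append(title)
    return titles


def fetch_job_titles(driver, url,
                     xpath='//a[@class="jobProperty jobtitle"]',
                     timeout=15):
    """Load url and return the link texts once matching elements are present.

    Raises selenium.common.exceptions.TimeoutException if nothing matches
    within timeout seconds.
    """
    # Imported here so clean_titles() stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    elements = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.XPATH, xpath))
    )
    return clean_titles(element.text for element in elements)
```

With the `driver` created earlier, `fetch_job_titles(driver, url)` returns the titles as a list of strings. Call `driver.quit()` when finished so the headless Chrome process does not linger.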

== Installation on Synology NAS ==

Install Docker through the DSM Package Center.

[https://github.com/SeleniumHQ/docker-selenium/wiki/Getting-Started-with-Docker-Compose Starting the Selenium Docker image] on DSM requires `sudo`.

Create a `docker-compose.yml` file in the working directory:

<syntaxhighlight lang="yml">
selenium-hub:
  image: selenium/hub
  ports:
  - 4444:4444

chrome:
  image: selenium/node-chrome
  links:
  - selenium-hub:hub
  volumes:
  - /dev/shm:/dev/shm # Mitigates the Chromium issue described at https://code.google.com/p/chromium/issues/detail?id=519952
</syntaxhighlight>

Then start the hub and five Chrome nodes in the background:

<syntaxhighlight lang="bash">
$ sudo docker-compose up -d --scale chrome=5
</syntaxhighlight>
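
Once the containers are up, a Python script can request a browser session from the hub instead of starting Chrome locally. The following is a sketch under a few assumptions: the hub is reachable on `localhost` port 4444 as mapped in the compose file, it exposes its WebDriver endpoint at `/wd/hub`, and the installed Selenium release accepts `options=` in `webdriver.Remote`. The function names are illustrative.

```python
def hub_endpoint(host="localhost", port=4444):
    """Return the WebDriver endpoint assumed to be exposed by the hub container."""
    return f"http://{host}:{port}/wd/hub"


def connect_to_grid(host="localhost", port=4444):
    """Open a headless Chrome session on one of the grid's Chrome nodes."""
    # Imported lazily; requires `pip install selenium`.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    return webdriver.Remote(
        command_executor=hub_endpoint(host, port),
        options=options,
    )


# Usage (needs the grid started by docker-compose to be running):
#   driver = connect_to_grid()
#   try:
#       driver.get("https://example.com/")
#       print(driver.title)
#   finally:
#       driver.quit()  # frees the node for the next session
```

When the script runs on a machine other than the NAS, pass the NAS hostname or IP address as `host`.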

=== Diagnostics ===

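After running the `docker-compose up` command, it should be possible to view the grid console in a browser at http://localhost:4444/grid/console (substitute the NAS hostname or IP address for `localhost` when browsing from another machine). The console should list the five Chrome nodes. The `/grid/console` path applies to Selenium Grid 3; Grid 4 images serve their console at `/ui` instead.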
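
The same check can be scripted, which is convenient before a cron run. The sketch below queries the hub's status endpoint and reads the `ready` flag that the WebDriver status response carries under `value`. The endpoint path `/wd/hub/status`, the host, and the function names are assumptions to adjust for the installed grid version.

```python
import json
from urllib.request import urlopen


def hub_is_ready(status_payload):
    """Return True if a hub status document reports that it is ready."""
    value = status_payload.get("value") or {}
    return bool(value.get("ready"))


def fetch_hub_status(base_url="http://localhost:4444", timeout=5):
    """Fetch and decode the hub's status document (assumed path: /wd/hub/status)."""
    with urlopen(f"{base_url}/wd/hub/status", timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs the hub to be running):
#   if not hub_is_ready(fetch_hub_status()):
#       raise SystemExit("Selenium hub is not ready")
```

Exiting with an error when the hub is not ready lets the cron job fail fast instead of hanging on a session request.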

'''TODO:''' Confirm that the Docker containers restart if the system reboots, e.g. after a system update.

== Notes ==

=== See also ===

* [https://github.com/dbarchowsky/job-alerts/wiki/Retrieving-JavaScript-Generated-Web-Content Retrieving JavaScript-Generated Web Content] - `job-alerts` app wiki on GitHub

=== References ===
<references />

[[Category:Python]][[Category:Web Development]]