Configuring a Python App to Crawl JavaScript-Generated Web Content

Overview

The first goal is to create a Python app that fetches and parses remote website content.

The second goal is to accommodate sites whose content is generated by JavaScript frameworks such as Angular or React. A simple HTTP request to such a site will not necessarily retrieve the same content that a browser would display, because the browser runs the JavaScript that builds the page.
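To illustrate the gap, here is a minimal sketch using only the standard library. The `SHELL_HTML` string is a made-up stand-in for what a plain HTTP request might return from a single-page app, and `visible_text` is a hypothetical helper rather than part of any library.

```python
from html.parser import HTMLParser

# Hypothetical response body from a plain HTTP request to a single-page app.
SHELL_HTML = """<!DOCTYPE html>
<html>
  <head><title>Example app</title><script src="/bundle.js"></script></head>
  <body><div id="root"></div></body>
</html>"""


class _BodyText(HTMLParser):
    """Collects text that appears inside <body>, skipping script and style."""

    def __init__(self):
        super().__init__()
        self._in_body = False
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found in the body outside script/style.
        if self._in_body and not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the body text of an HTML document as a single string."""
    parser = _BodyText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    # The shell has an empty root element, so there is no body text to parse.
    print(repr(visible_text(SHELL_HTML)))
```

In this made-up case the body holds only an empty `div`, so a parser working on the raw response finds nothing to extract until the page's JavaScript has run.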

The final goal is to run the Python app as a cron job on a Synology NAS.
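A script run by a scheduler has no one watching its output, so it helps if it logs failures and reports them through its exit code. The following is a minimal sketch of such an entry point; `run_job` and `crawl` are hypothetical names, and `crawl` is only a placeholder for the real fetch-and-parse logic.

```python
import logging
import sys


def run_job(job, logger=None):
    """Run job() once and return a process exit code (0 = success, 1 = failure)."""
    logger = logger or logging.getLogger("crawler")
    try:
        job()
    except Exception:
        # Log the full traceback so the failure is visible in the job's log.
        logger.exception("crawl failed")
        return 1
    logger.info("crawl finished")
    return 0


def crawl():
    """Placeholder for the real fetch-and-parse logic."""


if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    sys.exit(run_job(crawl))
```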

Headless browsers

A headless browser is needed: software that fetches web content after rendering it as a browser such as Chrome or Firefox would.

Options include Selenium and Splash.

The main functional difference between Selenium and Splash is that Selenium is synchronous and can emulate user interaction with a webpage, while Splash is asynchronous and is geared toward rendering pages rather than emulating user interaction. If the app does not require user interaction, Splash is a good solution because it is lighter weight and asynchronous, and is therefore typically faster than Selenium.
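As a rough sketch of the Splash approach, the code below requests a rendered page through Splash's `render.html` HTTP endpoint using only the standard library. It assumes a Splash instance is already running at `http://localhost:8050`; that address, the `wait` value, and the helper names `splash_render_url` and `fetch_rendered_html` are illustrative assumptions.

```python
import urllib.parse
import urllib.request

# Assumed address of a locally running Splash instance.
SPLASH_BASE = "http://localhost:8050"


def splash_render_url(target_url, wait=0.5, base=SPLASH_BASE):
    """Build a render.html request URL asking Splash to render target_url."""
    query = urllib.parse.urlencode({"url": target_url, "wait": wait})
    return f"{base}/render.html?{query}"


def fetch_rendered_html(target_url, wait=0.5, base=SPLASH_BASE, timeout=30):
    """Return the HTML of target_url after Splash has run its JavaScript."""
    request_url = splash_render_url(target_url, wait, base)
    with urllib.request.urlopen(request_url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

The `wait` parameter gives the page time to finish running its scripts before Splash returns the HTML; a suitable value depends on the site.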

Selenium requires a driver to interface with the selected browser (e.g. Chrome or Firefox).

Browser drivers

A list of browser drivers can be found in the Selenium documentation. However, these drivers aren't necessarily supported by Synology DSM.

One solution for browser support on DSM is Docker, which can run a container that serves as the browser for Selenium, for example. Instructions for installing and getting started with Docker for Mac OS are available from Docker.
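The Docker approach can be sketched with Selenium's remote WebDriver, which talks to a browser running in a container instead of one installed locally. This assumes the `selenium` Python package is installed and that a Selenium server, for example one started from the `selenium/standalone-chrome` image, is reachable at `http://localhost:4444/wd/hub`. The address and the function name `fetch_with_remote_browser` are assumptions for illustration.

```python
# Assumed address of a Selenium server running in a Docker container,
# for example one started from the selenium/standalone-chrome image.
DEFAULT_HUB = "http://localhost:4444/wd/hub"


def fetch_with_remote_browser(url, hub_url=DEFAULT_HUB):
    """Load url in a remote headless Chrome and return the rendered page source."""
    # Imported here so the module can be loaded without selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Remote(command_executor=hub_url, options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always end the session so the container does not accumulate browsers.
        driver.quit()
```

Because the browser and its driver live inside the container, the host only needs the Python client library.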
