Configuring a Python App to Crawl JavaScript-Generated Web Content
== Overview ==
The first goal is to create a Python app that fetches and parses remote website content. The second goal is to accommodate sites whose content is generated by JavaScript frameworks such as Angular or React. A simple HTTP request to such a site will not necessarily retrieve the same content that a browser would display. The final goal is to run the Python app as a cron job on a Synology NAS.

== Headless browsers ==
A headless browser is needed to fetch web content after rendering it the way a browser such as Chrome or Firefox would. Options include:
* [https://www.seleniumhq.org/ Selenium]
* [https://github.com/SeleniumHQ/docker-selenium docker-selenium]
* [http://phantomjs.org/ PhantomJS]
* [https://splash.readthedocs.io/en/stable/ Splash]

The main functional difference between Selenium and Splash is that Selenium is synchronous and can emulate user interaction with a webpage. Splash is asynchronous and cannot emulate user interaction. If the app does not require user interaction, Splash is a good solution because it is lighter weight and asynchronous, and therefore faster than Selenium.

These frameworks require drivers to interface with a selected browser (e.g. Chrome or Firefox).

Install Selenium for Python with:
<syntaxhighlight lang="bash">
$ pip install selenium
</syntaxhighlight>

This provides the API, which in turn relies on a running driver.

== Browser drivers ==
A list of browser drivers can be found in the [http://selenium-python.readthedocs.io/installation.html#drivers Selenium documentation]. However, these drivers aren't necessarily supported by the Synology DSM.

A solution for browser support on DSM is [https://www.docker.com/ Docker]. Docker can run an image that serves as the browser for Selenium, for example.

See the [https://docs.docker.com/docker-for-mac/ instructions for installing and getting started with Docker for Mac OS].
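Before turning to Docker, it can be useful to check whether a driver executable is already available on the current machine. The following is a minimal sketch that uses only the Python standard library. `find_drivers` is a hypothetical helper name, not part of Selenium, and the default names `chromedriver` and `geckodriver` are assumed to be the executable names of the Chrome and Firefox drivers on the system being checked.

```python
import shutil


def find_drivers(names=("chromedriver", "geckodriver")):
    """Map each driver executable name to its full path, or None if absent.

    Hypothetical helper: it only looks the names up on PATH and does not
    start or validate the drivers.
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_drivers().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If every entry comes back as `None`, as may be the case on DSM, the Docker-based approach is the one to follow.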
First install Docker, then pull a Docker image that implements the webdriver interface, e.g. [https://hub.docker.com/r/joyzoursky/python-chromedriver/ joyzoursky/python-chromedriver], [https://github.com/SeleniumHQ/docker-selenium SeleniumHQ/docker-selenium], or [https://github.com/danielfrg/docker-selenium danielfrg/docker-selenium].<ref>[https://medium.freecodecamp.org/a-recipe-for-website-automated-tests-with-python-selenium-headless-chrome-in-docker-8d344a97afb5 A recipe for website automated tests with Python Selenium & Headless Chrome in Docker] - freeCodeCamp</ref><ref>[http://danielfrg.com/blog/2015/09/28/crawling-python-selenium-docker/ Crawling with Python, Selenium and Docker] - Daniel Rodriguez</ref>

=== Installing chromedriver ===

==== Mac OS ====
Download the latest version of chromedriver from [https://sites.google.com/a/chromium.org/chromedriver/ chromium.org]. Unzip it and copy the driver to `/usr/local/bin`.

A newly downloaded driver must be granted permission to run in System Preferences.

=== Example: Testing with `SeleniumHQ/docker-selenium` ===
See [https://github.com/SeleniumHQ/docker-selenium/wiki/Getting-Started-with-Docker-Compose Getting Started With Docker Compose] from the [https://github.com/SeleniumHQ/docker-selenium/wiki docker-selenium wiki].
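Once a hub from that guide is running, a Python client could connect to it along the following lines. This is a sketch under stated assumptions, not a tested recipe for DSM: the hub is assumed to listen on `localhost:4444` (the port used in the `docker-compose.yml` on this page), the `/wd/hub` path is assumed to be the hub's WebDriver endpoint and may differ between Selenium Grid versions, and the `options=` keyword assumes a Selenium release that accepts it. `hub_url` and `fetch_title` are hypothetical names.

```python
def hub_url(host="localhost", port=4444):
    """Build the WebDriver endpoint URL for a Selenium hub (path is an assumption)."""
    return f"http://{host}:{port}/wd/hub"


def fetch_title(page_url, hub=None):
    """Open page_url in a remote headless Chrome and return the page title."""
    # Imported here so this module still loads where selenium is not installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    # Remote() talks to the hub, which forwards the session to a Chrome node.
    driver = webdriver.Remote(command_executor=hub or hub_url(), options=options)
    try:
        driver.get(page_url)
        return driver.title
    finally:
        driver.quit()  # always release the browser session on the node
```

Calling `fetch_title('https://example.com')` requires both the `selenium` package and a reachable hub. The `try`/`finally` is there so that a failed page load does not leave a session occupying one of the Chrome nodes.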
=== Example: Testing with `joyzoursky/python-chromedriver` ===
The following Docker command downloads and runs a Selenium Chrome driver interface for Python:
<syntaxhighlight lang="bash">
$ docker run -it -v $(pwd):/usr/workspace joyzoursky/python-chromedriver:3.6-alpine3.7-selenium sh
</syntaxhighlight>

The driver can be tested in the Python console:
<syntaxhighlight lang="python">
>>> from selenium import webdriver
>>> chrome_options = webdriver.ChromeOptions()
>>> chrome_options.add_argument('--no-sandbox')
>>> chrome_options.add_argument('--window-size=1420,1080')
>>> chrome_options.add_argument('--headless')
>>> chrome_options.add_argument('--disable-gpu')
>>> # Newer Selenium releases expect options=chrome_options instead.
>>> driver = webdriver.Chrome(chrome_options=chrome_options)
</syntaxhighlight>

This configures and starts a headless Chrome instance.

Next, fetch a page and extract some content:
<syntaxhighlight lang="python">
>>> driver.get('https://sjobs.brassring.com/TGnewUI/Search/Home/HomeWithPreLoad?partnerid=25354&siteid=5108&PageType=searchResults&SearchType=linkquery&LinkID=3947569#keyWordSearch=&locationSearch=')
>>> # Newer Selenium releases use driver.find_element(By.ID, 'Job_7') instead.
>>> el = driver.find_element_by_id('Job_7')
>>> el.text
'DreamWorks Technology - Software Engineer - Layout/Previz Tools'
>>> elements = driver.find_elements_by_xpath('//a[@class="jobProperty jobtitle"]')
>>> for element in elements:
...     print(element.text)
...
Director, Talent Development
Content Producer WWSI
Golfer Care Specialist - Seasonal
# etc...
</syntaxhighlight>

== Installation on Synology NAS ==
Install Docker through the DSM Package Center.
[https://github.com/SeleniumHQ/docker-selenium/wiki/Getting-Started-with-Docker-Compose Starting the Selenium Docker image] requires `sudo`.

Create a `docker-compose.yml` file in the working directory:
<syntaxhighlight lang="yml">
selenium-hub:
  image: selenium/hub
  ports:
    - 4444:4444
chrome:
  image: selenium/node-chrome
  links:
    - selenium-hub:hub
  volumes:
    # Mitigates the Chromium issue described at https://code.google.com/p/chromium/issues/detail?id=519952
    - /dev/shm:/dev/shm
</syntaxhighlight>

Then start the hub and five Chrome nodes (prefix the command with `sudo` on DSM):
<syntaxhighlight lang="bash">
$ docker-compose up -d --scale chrome=5
</syntaxhighlight>

=== Diagnostics ===
After running the `docker-compose up` command, it should be possible to view the grid console in a browser at http://localhost:4444/grid/console. The console should display the 5 Chrome nodes.

'''TODO:''' Confirm that the Docker containers restart if the system reboots, e.g. after a system update.

== Notes ==

=== See also ===
* [https://github.com/dbarchowsky/job-alerts/wiki/Retrieving-JavaScript-Generated-Web-Content Retrieving JavaScript-Generated Web Content] - `job-alerts` app wiki on GitHub

=== References ===
<references />

[[Category:Python]][[Category:Web Development]]