Wednesday, April 18, 2018

Headless Firefox Scraping In AWS

As a follow-on to my Chrome post, here is the process needed for Headless Firefox.  It's the same 5 steps re-tailored to a different browser.  You should read that post first, or at a minimum keep it open in a separate browser window, as I refer back to it many times.

The Process

Get Firefox Working

We are going with Ubuntu again because of Firefox's dependencies.  And again I strongly advise against building things yourself.

Spin up an Ubuntu instance and just "apt-get install firefox" -- this is a lot easier than Chrome.

Get Scraping Working

Install selenium per the Chrome instructions.

Then we need to install geckodriver.  Find the latest version number at https://github.com/mozilla/geckodriver/releases, put it in a shell variable, and then:

driverversion="<latest release number>"
mkdir /geckodriver
wget -q --continue -P /geckodriver "https://github.com/mozilla/geckodriver/releases/download/v${driverversion}/geckodriver-v${driverversion}-linux64.tar.gz"
tar -zxvf /geckodriver/geckodriver-v${driverversion}-linux64.tar.gz -C /geckodriver
export PATH=/geckodriver:$PATH

Take a pause and make sure your scraper works with headful (non-headless) Firefox before proceeding.

If you aren’t fluent with selenium-python, start with this simple code snippet adapted from the official docs here:

from selenium import webdriver

driver = webdriver.Firefox()            # opens a visible Firefox window
driver.get("http://google.com/?hl=en")  # load the Google home page
driver.quit()                           # shut down the browser
quit()                                  # exit the interactive python shell

That should open a Firefox window, load the Google home page and then shut everything down.

Get Headless Working

As with Chrome, there are a bunch of options you need to set in/via selenium to get a fully-functioning environment.  You also need to install xvfb with "apt-get install xvfb" for some unknown reason.  Headless shouldn't require it, but in practice it appears to.

Create a temporary place to download files.  Handle this however you’d like.

import tempfile
downloadDirectory = tempfile.mkdtemp()

Create an options object so we can muck around.

from selenium.webdriver.firefox.options import Options
options = Options()

Set directory and make sure the browser will actually download.

profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.dir", downloadDirectory)
profile.set_preference("browser.download.folderList", 2)  # 2 = use browser.download.dir
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                       "application/pdf,application/x-pdf,application/octet-stream")


Now run headless.  Note that we need both a profile and an options object here, and the naming is different from before.

options.add_argument('-headless')
theDriver = webdriver.Firefox(firefox_profile=profile, options=options)

With that done we can use theDriver for headless scraping.  Rerun your tests from before.
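Putting the pieces above together, a full headless setup looks something like the sketch below.  The helper names are mine, not part of selenium, and it assumes Firefox and geckodriver are installed as described:

```python
import tempfile

def firefox_download_prefs(download_dir):
    """Preferences that make Firefox download silently into download_dir."""
    return {
        "browser.download.dir": download_dir,
        "browser.download.folderList": 2,  # 2 = use browser.download.dir
        "browser.helperApps.neverAsk.saveToDisk":
            "application/pdf,application/x-pdf,application/octet-stream",
    }

def make_headless_driver(download_dir):
    """Build a headless Firefox webdriver with the download prefs applied."""
    # selenium is imported inside so the prefs helper stays importable without it
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    options = Options()
    options.add_argument('-headless')
    profile = webdriver.FirefoxProfile()
    for key, value in firefox_download_prefs(download_dir).items():
        profile.set_preference(key, value)
    return webdriver.Firefox(firefox_profile=profile, options=options)

# Usage (needs Firefox and geckodriver on the PATH):
#   theDriver = make_headless_driver(tempfile.mkdtemp())
#   theDriver.get("http://google.com/?hl=en")
#   theDriver.quit()
```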

Containerize It

We use a Docker container here.  Again you can probably do this with other technologies.  This one works and plays well with AWS.  Put whatever you need in requirements.txt for python — including selenium — and your Dockerfile is:

FROM ubuntu:17.10

RUN apt-get update \
 && apt-get install -y sudo \
 && apt-get install -y wget \
 && apt-get install -y curl \
 && apt-get install -y dpkg \
 && apt-get install -y software-properties-common \
 && apt-get install -y gdebi unzip

RUN apt-get update \
  && apt-get install -y python3.6 python3.6-dev python3-pip python3.6-venv \
  && apt-get install -y xvfb firefox

# Pass the release number at build time: docker build --build-arg driverversion=<number> .
ARG driverversion
RUN mkdir /geckodriver
RUN wget -q --continue -P /geckodriver "https://github.com/mozilla/geckodriver/releases/download/v${driverversion}/geckodriver-v${driverversion}-linux64.tar.gz"
RUN tar -zxvf /geckodriver/geckodriver-v${driverversion}-linux64.tar.gz -C /geckodriver
ENV PATH /geckodriver:$PATH
ADD requirements.txt .
RUN pip3 install --trusted-host pypi.python.org -r ./requirements.txt

That container can fire up a working headless Firefox that can download files.  Use whatever CMD and/or ENTRYPOINT suits you.

Firefox requires a lot of space in /dev/shm.  If you are going to run docker yourself, be sure to add --shm-size 2g to the run command.  If you are going to deploy in AWS read this.  Prior to that feature we ran a customized machine image for a while that raised docker's default shm size with --default-shm-size in /etc/sysconfig/docker.  If your orchestration doesn't support shm-size you'll need to do something different.
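Run by hand, that looks something like the following.  The image tag is a made-up placeholder; substitute whatever you built the Dockerfile into:

```shell
# Firefox crashes in Docker's default 64 MB /dev/shm, so raise it to 2 GB.
# "my-scraper" is a hypothetical image tag -- substitute your own.
image="my-scraper"
run_cmd="docker run --shm-size 2g $image"
echo "$run_cmd"
```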

Deploy

With a working container in hand, deploy however you’d like.  We’ve run that image on AWS EC2 directly, via AWS ECS and Batch on EC2, and via Docker straight to macOS, various Linuxes and Windows.  The image is about the same size as the Chrome one.

Memory usage is also heavy and depends a lot on what sorts of pages you are loading.

Some Comments

Far Cleaner Than Chrome

Firefox is a lot cleaner to set up and run than Chrome.  With the shm issue ironed out there is nothing too weird to do here.  The profile/options/parameters documentation could use some work.  That is the main reason for this post: getting downloads to work required a lot of grepping and guessing.

Firefox Deals With Changing HTML Better

When a page uses javascript to modify HTML without page reloads Firefox seems to do a better job.  That's why we sometimes use it.  I have no strong preference between the two browsers, except to say there are some complex pages out there that only scrape on one of them.
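Whichever browser you use on those pages, the standard coping mechanism is an explicit wait.  A minimal sketch, where the helper name and the CSS selector in the usage note are made-up examples:

```python
def wait_for_element(driver, css_selector, timeout=10):
    """Block up to `timeout` seconds for an element matching css_selector."""
    # selenium is imported inside so this file stays importable without it
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
    )

# Usage against a page that builds its table with javascript:
#   table = wait_for_element(theDriver, "table.results", timeout=30)
```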

So no matter which browser you prefer it's worth getting both running.  I have found that where both browsers work the scraping code -- excluding the initialization discussed here -- is identical.  And when I can't get one of them to work it's not a matter of modifying the code; it's just not going to work.

Surely there are edge cases I haven't found.  If you come across anything please let me know.

