Friday, February 2, 2018

Headless Chrome Web Scraping In The Cloud

Headless Chrome is here.  But running a headless browser on your desktop is pretty useless.  What does it take to get a Chrome-based scraper up and running, in a container, with the ability to download files, scaled up in the cloud?  There were more wrinkles than I expected but in the end it's not too hard.

I'm not going to tell this story the way it unfolded -- that was too messy.  Rather I'll show you how to assemble a working environment.  It's a five-step process.

Step 1 - Get Chrome working on your preferred platform.  This forces you to understand and sort out Chrome's many (many, many) dependencies.  Short and sweet when it works and very much the opposite when it doesn't.

Step 2 - Get scraping working.  Chrome isn't a scraper -- even more parts are needed.

Step 3 - Get headless mode working on that platform with downloads.  There is more hair here than expected.

Step 4 - Containerize it.  Straightforward if steps 1-3 were smooth.

Step 5 - Deploy.

A Short(ish) Walkthrough

Get Chrome Working

Chrome, not surprisingly, has lots of dependencies.  If you are infinitely patient you can go download the source and start building everything for your preferred platform.  Don't do this.  Really don't.  Forget that it takes hours to compile and leaves behind a 75GB build directory.  Chrome gets released a lot.  Pick a platform with an official build and move on.  As we are targeting the cloud we chose Linux.

But which Linux?  Amazon Linux, arguably the most cloud-y of choices, doesn't include some of Chrome's dependencies.  You can go off and build them yourself if you want.  But this is headless web browsing -- it's pretty inconceivable the OS flavor will impact performance.  And we are going to run one application in its own container.  Just pick something that works.  We're going with ubuntu as it has a nice Chrome package already available.

Spin up an ubuntu instance (we are using version 17.10 as of this writing), make sure you've got curl, dpkg, wget and the rest of the basic piping in place, and:

wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list
apt-get update
apt-get install -y google-chrome-stable

That gets us a working Chrome fresh from Google.  If you run this on a desktop machine you've got a GUI and all the other bells and whistles.  At a minimum try "google-chrome --version" and make sure it runs.  As of this writing I get "Google Chrome 63.0.3239.108" back.

Get Scraping Working

We are going to use selenium on python 3.6 to scrape.  You can probably use other technologies -- we know this one works end-to-end.  Given some of the oddness uncovered below I'd recommend working through the selenium-python solution before trying anything else.

The python part is 100% standard:

apt-get install -y python3.6 python3.6-dev python3-pip python3.6-venv
pip3 install selenium

Then we need to get chromedriver -- the bridge between selenium and Chrome -- in place.  Find the latest version number at https://sites.google.com/a/chromium.org/chromedriver/downloads.  Then:

mkdir /chromedriver
wget -q --continue -P /chromedriver "http://chromedriver.storage.googleapis.com/VERSION/chromedriver_linux64.zip"
unzip /chromedriver/chromedriver*zip -d /chromedriver
export PATH=/chromedriver:$PATH

You probably don't want to use that path.  And note that chromedriver doesn't appear to have a generic "latest" download URL yet (you can grab the current version number from a file on the download site and assemble the URL yourself).  Installation is a bit more hands-on than Chrome itself.
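If you want to script that re-assembly, here is a minimal sketch in python.  It assumes the version file is called LATEST_RELEASE and that the linux64 zip naming stays stable -- check the downloads page if either assumption breaks:

import urllib.request

BASE = "http://chromedriver.storage.googleapis.com"

# Grab the current version string from the (assumed) LATEST_RELEASE file...
version = urllib.request.urlopen(BASE + "/LATEST_RELEASE").read().decode().strip()

# ...then assemble the download URL and fetch the zip into /chromedriver.
url = "%s/%s/chromedriver_linux64.zip" % (BASE, version)
urllib.request.urlretrieve(url, "/chromedriver/chromedriver_linux64.zip")
print("fetched chromedriver", version)

Unzip and adjust PATH exactly as above.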

Now we have a working python-selenium-chrome scraping environment.  Take a pause and make sure your scraper works with regular, headful Chrome before proceeding.  In the next step we leave the piste -- it is essential everything works now.

If you aren't selenium-python fluent start with this simple code snippet adapted from the official selenium-python docs:

from selenium import webdriver

# Launch Chrome, load a page, then shut the browser down.
driver = webdriver.Chrome()
driver.get("http://google.com/?hl=en")
driver.quit()

That should open a Chrome window, load the Google home page and then shut everything down.
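To check that you can actually pull data out of a page -- not just load it -- a slightly bigger test along these lines helps.  It uses selenium 3-style element lookups and Google's homepage purely as a stand-in for whatever you really scrape:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://google.com/?hl=en")

# Pull something off the page so we know element extraction works too.
links = driver.find_elements_by_tag_name("a")
print(driver.title, "-", len(links), "links found")
driver.quit()

If that prints a title and a link count your headful pipeline is sound.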

Chromedriver Logging

The chromedriver application has its own logging facility that is helpful when sorting out configuration problems.  But selenium launches chromedriver for you, so there's no obvious place to add those flags once you're in the middle of a debugging exercise.  So here's a little trick that helps out: rename chromedriver to chromedriver_binary.  And put a new shell script called chromedriver in your path:

#!/bin/bash
exec /path/to/chromedriver_binary --log-path=/tmp/cd.log --verbose "$@"

It's a blunt instrument but it works.  And there is a lot of information in those logs.
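If renaming binaries feels too blunt, selenium can also pass those flags straight through when it starts chromedriver.  A sketch, assuming a selenium 3.x-era service_args parameter and the same log path as above:

from selenium import webdriver

# Ask selenium to launch chromedriver with verbose logging turned on.
# service_args is forwarded onto the chromedriver command line.
driver = webdriver.Chrome(service_args=["--verbose", "--log-path=/tmp/cd.log"])
driver.get("http://google.com/?hl=en")
driver.quit()

Either way the interesting stuff ends up in /tmp/cd.log.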

Get Headless Working

The Chrome we've installed will work just fine if you pass it the --headless option on the command line.  But that's not enough to do useful scraping.  To hook up a headless browser to the python scraping environment and get full functionality we need to be a bit careful about how we initialize selenium.

Create a temporary place to download files.  Handle this however you'd like.

import tempfile
downloadDirectory = tempfile.mkdtemp()

Create an options object so we can muck around.

chromeOptions = webdriver.ChromeOptions()

Set directory and disable images as it's headless.

prefs = {"download.default_directory": downloadDirectory,
         "profile.managed_default_content_settings.images": 2}
chromeOptions.add_experimental_option("prefs", prefs)

Run Chrome headless and without the sandbox.  The sandbox flag is discussed below.  Some sources tell you to pass --disable-gpu as well but I've not found that to be necessary.

chromeOptions.add_argument('--headless')
chromeOptions.add_argument('--no-sandbox')
theDriver = webdriver.Chrome(chrome_options=chromeOptions)

We need to run this special code to get headless downloads working.

theDriver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': downloadDirectory}}
command_result = theDriver.execute("send_command", params)

With that done we can use theDriver for headless scraping.  Rerun your tests from before.
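For convenience, here is the whole recipe rolled up into one helper.  This is just a sketch under the assumptions above -- make_headless_driver is my own name, not anything from selenium:

import tempfile
from selenium import webdriver

def make_headless_driver():
    # Assemble the pieces above into one reusable constructor.
    downloadDirectory = tempfile.mkdtemp()

    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_experimental_option("prefs", {
        "download.default_directory": downloadDirectory,
        "profile.managed_default_content_settings.images": 2})
    chromeOptions.add_argument('--headless')
    chromeOptions.add_argument('--no-sandbox')
    driver = webdriver.Chrome(chrome_options=chromeOptions)

    # The special command that switches downloads on in headless mode.
    driver.command_executor._commands["send_command"] = (
        "POST", '/session/$sessionId/chromium/send_command')
    driver.execute("send_command", {
        'cmd': 'Page.setDownloadBehavior',
        'params': {'behavior': 'allow', 'downloadPath': downloadDirectory}})

    return driver, downloadDirectory

# Quick smoke test.
theDriver, downloadDirectory = make_headless_driver()
theDriver.get("http://google.com/?hl=en")
print(theDriver.title)
theDriver.quit()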

Containerize It

We use a Docker container here.  Again you can probably do this with other technologies.  This one works and plays well with AWS.  Put whatever you need in requirements.txt for python -- including selenium -- and your Dockerfile is:

FROM ubuntu:17.10
RUN apt-get update \
 && apt-get install -y wget \
 && apt-get install -y curl \
 && apt-get install -y dpkg \
 && apt-get install -y gdebi unzip
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list
RUN apt-get update \
 && apt-get install -y python3.6 python3.6-dev python3-pip python3.6-venv \
 && apt-get install -y google-chrome-stable
RUN mkdir /chromedriver
RUN wget -q --continue -P /chromedriver "http://chromedriver.storage.googleapis.com/VERSION/chromedriver_linux64.zip"
RUN unzip /chromedriver/chromedriver*zip -d /chromedriver
ENV PATH /chromedriver:$PATH
ADD requirements.txt .
RUN pip3 install --trusted-host pypi.python.org -r ./requirements.txt

That container can fire up a working headless chrome that can download files.  Use whatever CMD and/or ENTRYPOINT suits you.

Deploy

With a working container in hand deploy however you'd like.  We've run that image on AWS EC2 directly, via AWS ECS and AWS Batch on EC2, and with plain Docker on MacOS, various Linuxes and Windows.  It's not a small image -- Chrome adds ~300MB -- but it works.

Memory usage is also heavy and depends a lot on what sorts of pages you are loading.  Maybe someone can get a Raspberry Pi Chrome scraper working.  Performance is all network-limited anyway.  We've been using AWS instances between t2.micro and c4.large with no problems.

Some Comments

This Should Get Cleaner in 2018

A lot of this process is messy because we are cobbling together a big complex system with tools that aren't all ready for prime time.  Compare the Chrome and chromedriver download webpages -- one of these isn't polished.  Here's hoping for a post later in the year with a simpler scheme.

Why ubuntu?

It is possible to start from a thinner base image than ubuntu.  But Chrome has so many dependencies I don't think it's worth it.  Feel free to try -- the process will surely be longer even if the resulting image is a bit smaller.  Building everything yourself from scratch also isn't worth it.  Having been through that exercise once, I learned a lot -- about Chrome's build tools, but nothing related to scraping.  And maintaining a from-scratch setup is painful and will grow old in a hurry.

Does it matter that we might not have the latest version of everything?  Maybe.  Do I have a real preference for ubuntu?  No.  But there's a nice free Docker image and it works.  The real security problems are inside the random JavaScript your headless browser will be running.  If you want to use something else it'll be fine too.  This stack runs the gamut from super-solid to bleeding-edge.  The OS isn't the weak point.

Selenium Oddness

Most of the setup for the selenium webdriver is hacky.  The --no-sandbox option is dangerous -- except Chrome won't run as root without it.  If you try it will tell you directly that "Running as root without --no-sandbox is not supported. See https://crbug.com/638180."  If you run as a different user in the container then you can drop the sandbox flag.  Yes, I know root inside a container is bad practice -- but I'm trying to set up cloud-hosted headless Chrome, not teach a class on container security, and adding a non-root user complicates an already convoluted base image and process.  Change users in your own images if you want and adjust the flags accordingly.  Remember that both Chrome and chromedriver need to be in the PATH of whatever user you select.

Headless Downloads

Lastly, that special POST to get downloads working comes from a chromium bug report discussion.  In my opinion a lot of that chain misses the real issues.  There is concern about the safety of allowing headless Chrome to drop files locally.  Really?  Headless Chrome users are clearly at the savvier end of the web user spectrum.  It's unlikely lots of sensitive personal information is stored on machines used for automated scraping.  Meanwhile regular old with-GUI Chrome is pretty much guaranteed to be run by a user with no understanding of (web) security on a computer chock-full of sensitive information, images, emails and the like.  Plus they are likely typing in credit card numbers all the time.

If you can trick my scraper into downloading malware, somehow get it running, break out of the Docker container and then defeat my AWS security settings: congratulations.  I find it hard to believe this use case, and the browser's small role therein, should be high priority for the Chrome team.

Yes the whole how-to-handle-file-downloads thing is non-standard.  And yes file downloads via browsers are a massive security risk.  But adding some hoops for headless mode doesn't address any of that.  I want headless Chrome to work just like regular Chrome without a screen -- and all that security-related stuff should similarly be identical.

Anyway use the code snippet above and it'll work just fine.
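One practical wrinkle the snippet doesn't cover is knowing when a headless download has actually finished.  A simple approach -- and this is my own helper, not anything official -- is to poll the download directory until Chrome's partial-download files (the .crdownload suffix) disappear:

import glob
import os
import time

def wait_for_downloads(directory, timeout=60):
    # Poll until the directory has files and none of them are still partial.
    deadline = time.time() + timeout
    while time.time() < deadline:
        files = os.listdir(directory)
        partial = glob.glob(os.path.join(directory, "*.crdownload"))
        if files and not partial:
            return files
        time.sleep(0.5)
    raise TimeoutError("download did not finish within %s seconds" % timeout)

Call wait_for_downloads(downloadDirectory) after triggering the download and you'll get back the list of finished files.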

Security

I've not focused on security here.  It's important but also beyond the scope of this post.  Yes part of the motivation for the container is security -- but we are using Docker primarily as a deployment tool.  It is a great way of rolling up the not-so-clean installation process.  What's presented here isn't a secure setup -- it's a working setup that you still need to secure.

The largest security risk here is letting Chrome loose on the web, munching JavaScript and HTML5 to its heart's content.  Box up your cloud as tight as you can, don't attach any persistent filesystems to your containers, restart things often and monitor, monitor, monitor.  If you want a fleet of computers clicking JavaScript buttons in infinite loops it's all about limiting the damage that can be done.

Oh, and making sure you are up-to-date on versions.  Yet another reason to use pre-built Chrome and chromedriver binaries and a standard OS image -- you can rebuild the containers in seconds.

Happy scraping!
