Wednesday, April 18, 2018

Headless Firefox Scraping In AWS

As a follow-on to my Chrome post, here is the process needed for Headless Firefox.  It's the same 5 steps re-tailored to a different browser.  You should read that post first.  At a minimum, have it open in a separate browser window as I refer back to it many times.

The Process

Get Firefox Working

We are going with Ubuntu again because of Firefox dependencies.  And again I strongly advise against building things yourself.

Spin up an Ubuntu machine and just "apt-get install firefox" -- this is a lot easier than Chrome.

Get Scraping Working

Install selenium per the Chrome instructions.

Then we need to install the geckodriver.  Find the latest version number at https://github.com/mozilla/geckodriver/releases.  Then:

driverversion=X.YY.Z   # fill in the latest version number from the releases page
mkdir /geckodriver
wget -q --continue -P /geckodriver "https://github.com/mozilla/geckodriver/releases/download/v${driverversion}/geckodriver-v${driverversion}-linux64.tar.gz"
tar -zxvf /geckodriver/geckodriver-v${driverversion}-linux64.tar.gz -C /geckodriver
export PATH=/geckodriver:$PATH

Take a pause and make sure your scraper works with head-on Firefox before proceeding.

If you aren’t selenium-python fluent start with this simple code snippet adapted from the official docs here:

from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://google.com/?hl=en")
driver.quit()
quit()

That should open a Firefox window, load the Google home page and then shut everything down.

Get Headless Working

As with Chrome there are a bunch of options you need to set in/via selenium to get a fully-functioning environment.  Also you need to install xvfb with "apt-get install xvfb" for some unknown reason.  Headless shouldn't require it but it appears to.

Create a temporary place to download files.  Handle this however you’d like.

import tempfile
downloadDirectory = tempfile.mkdtemp()

Create an options object so we can muck around.

from selenium.webdriver.firefox.options import Options
options = Options()

Set directory and make sure the browser will actually download.

profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.dir", downloadDirectory)
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
 "application/pdf,application/x-pdf,application/octet-stream")


Run headless.  Unlike Chrome there is no sandbox flag to set.  Note that we need both a profile and an options object here, and the keyword naming is different from before.

options.add_argument('-headless')
theDriver = webdriver.Firefox(firefox_profile=profile, options=options)

With that done we can use theDriver for headless scraping.  Rerun your tests from before.
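
Putting the pieces together, here is a minimal end-to-end sketch.  It just reuses the preferences and test URL from above and assumes a selenium version that accepts the options keyword as shown:

import tempfile

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Temporary download location -- replace with whatever you actually use.
downloadDirectory = tempfile.mkdtemp()

# Profile: point downloads at our directory and skip the save dialog for PDFs etc.
profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.dir", downloadDirectory)
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                       "application/pdf,application/x-pdf,application/octet-stream")

# Options: run headless.
options = Options()
options.add_argument('-headless')

theDriver = webdriver.Firefox(firefox_profile=profile, options=options)
theDriver.get("http://google.com/?hl=en")
theDriver.quit()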

Containerize It

We use a Docker container here.  Again you can probably do this with other technologies.  This one works and plays well with AWS.  Put whatever you need in requirements.txt for python — including selenium — and your Dockerfile is:

FROM ubuntu:17.10

RUN apt-get update \
 && apt-get install -y sudo \
 && apt-get install -y wget \
 && apt-get install -y curl \
 && apt-get install -y dpkg \
 && apt-get install -y software-properties-common \
 && apt-get install -y gdebi unzip

RUN apt-get update \
  && apt-get install -y python3.6 python3.6-dev python3-pip python3.6-venv \
  && apt-get install -y xvfb firefox

# Pass the geckodriver version at build time, e.g. docker build --build-arg driverversion=X.YY.Z .
ARG driverversion
RUN mkdir /geckodriver
RUN wget -q --continue -P /geckodriver "https://github.com/mozilla/geckodriver/releases/download/v${driverversion}/geckodriver-v${driverversion}-linux64.tar.gz"
RUN tar -zxvf /geckodriver/geckodriver-v${driverversion}-linux64.tar.gz -C /geckodriver
ENV PATH /geckodriver:$PATH
ADD requirements.txt .
RUN pip3 install --trusted-host pypi.python.org -r ./requirements.txt

That container can fire up a working headless Firefox that can download files.  Use whatever CMD and/or ENTRYPOINT suits you.
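
For example, a placeholder entrypoint might be (scraper.py is a hypothetical script you would ADD or COPY into the image yourself):

CMD ["python3", "scraper.py"]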

Firefox requires a lot of space in /dev/shm.  If you are going to run docker yourself be sure to add --shm-size 2g to the command.  If you are going to deploy in AWS read this.  Prior to that feature we ran a customized machine image for a while that upped docker's default shm size with --default-shm-size in /etc/sysconfig/docker.  If your orchestration doesn't support shm-size you'll need to do something different.
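
For a local run that looks something like this (the image name is just a placeholder):

docker run --rm --shm-size 2g my-firefox-scraper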

Deploy

With a working container in hand deploy however you'd like.  We've run that image directly on AWS EC2, via AWS ECS and Batch onto EC2, and via Docker straight to MacOS, various Linuxes and Windows.  This image is about the same size as the Chrome one.

Memory usage is also heavy and depends a lot on what sorts of pages you are loading.

Some Comments

Far Cleaner Than Chrome

Firefox is a lot cleaner to set up and run than Chrome.  With the shm issue ironed out there is nothing too weird to do here.  The profile/options/parameters documentation could use some work.  That is the main reason for this post: getting downloads to work required a lot of grepping and guessing.

Firefox Deals With Changing HTML Better

When a page uses javascript to modify HTML without page reloads Firefox seems to do a better job.  That's why we use it sometimes.  I have no opinion between the two browsers except to say there are some complex pages out there that only scrape on one of them.

So no matter which browser you prefer it's worth getting both running.  I have found that where both browsers work the scraping code -- excluding the initialization discussed here -- is identical.  And when I can't get one of them to work it's not a matter of modifying the code...it's just not going to work.

Surely there are edge cases I haven't found.  If you come across anything please let me know.

Friday, April 6, 2018

Thoughts On Managing Docker Images With Makefiles

Containers are fantastic.  Docker is an excellent container platform.  And there are huge benefits to containerizing applications for production deployment.  We have realized large efficiency gains with cloud-hosted software by moving to a container-centric design.

However there is one large gap you will inevitably run across when assembling a system around Dockerfiles: applying all your cherished best practices simultaneously when it comes to building and distributing sets of containers.

Any reasonably-sized system is going to involve at least a few container images.  This is a pretty obvious consequence of following Docker's own best practices.  You're going to put all the dependencies and libraries into some base image or images.  And then you'll build specific components into their own images.  So you'll have a small hierarchy of inter-dependent Dockerfiles.  Those Dockerfiles probably rely on some code out of github and need to publish images into some cloud service like AWS ECR, dockerhub or some other cloud-y container repository.

As far as I can tell no one build tool can handle all of these dependencies cleanly.  With that in mind once I'd accepted a partial / kludgy solution was unavoidable I started down the road of using good old reliable Makefiles to manage things.

It's important to say here that our in-container code is in python.  The build process here does not involve any compilation.  Some of what we do for container builds is a bad idea for compiled code.  And properly getting compiled code into a docker container has its own set of issues.  All of that is outside the scope of this discussion.

So let's take this step-by-step.

Step 1: Assembling Your Components

Let's assume your code is hosted in github.  So you have two separate build cases right off the bat: based on a tag in github or based on some locally-stored collection of code.  It is essential you can build from locally-modified code to test things without running through the whole check-in and tagging process.  And we need a single solution that handles both cases -- if the environment starts fracturing on step #1 all is lost.

The simplest solution I've found is to build the container around a single archive containing all your code -- and then write two separate ways to generate that file.  Let's call that file "git.tar.gz" to keep it simple.  The makefile then looks like:

.PHONY: local git clean
local:
     tar zcvf git.tar.gz $(PATH_TO_CODE)
git:
     mkdir -p git
     git -C git clone --branch $(GIT_BRANCH) https://github.com/...
     tar zcvf git.tar.gz git
clean:
     rm -rf git git.tar.gz

It's up to you to ensure git.tar.gz exists before any subsequent rules try to access it.  That's terrible because it side-steps the best thing about Makefiles -- smart dependency resolution.  But there is no clean way to hook git timestamps into make.

You can write a rule to build git.tar.gz directly and declare it as a pre-requisite for some downstream targets if you want (sketched below).  You can also try ADDing your local source directory instead of rolling it into an archive.  And you can directly download a git release archive -- unless you want to try this with an un-released branch tag.  All of those partial solutions lead to parallel streams of targets in the Makefile.  Maybe those are better sometimes.  This all depends on whether you'd rather be careful about Makefile maintenance or about stating target names.  I've found running it all through a common archive target makes this harder to misuse a few months down the road when none of this is fresh in my mind.
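
A minimal sketch of that first option (the image target name is illustrative):

git.tar.gz:
     tar zcvf git.tar.gz $(PATH_TO_CODE)
some_image: git.tar.gz
     docker build -t some_image .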

Docker does a great job handling the .tar.gz file so we know things are smoother once that's built.  All you need is an "ADD..." in the relevant Dockerfile and the code is ready to use.  And because it's archived-up you won't ever get confused by local changes which aren't in the container because you edited the file after (or during!) the build.
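
As a rough sketch, the relevant Dockerfile lines might be (the /app path is arbitrary; ADD unpacks a local .tar.gz automatically):

ADD git.tar.gz /app/
WORKDIR /app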

Step 2: Building Containers

If your project has any size to it you've got a hierarchy of containers.  That lends itself well to a hierarchy of directories and a tested set of Makefiles.  Docker does an admirable job of handling dependency issues for targets stated clearly in the Dockerfile.  Our goal here is to wrap up the "docker build...." commands in such a way that typos and cross-wires across images are not possible.

We start by borrowing from Makefile best practices.  Create a config.mk file at the top of your directory structure and set all config variables here. Then in each and every Makefile begin with:

# TOP is whatever the path back to the top level is
TOP=../../../
include $(TOP)/config.mk
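
For concreteness, a minimal config.mk might look something like this (all of the values are placeholders):

# config.mk -- single source of truth for the whole build tree
GIT_BRANCH=master
PATH_TO_CODE=../src
IMAGE_VERSION=1.0.0
REPO_ADDRESS=123456789012.dkr.ecr.us-east-1.amazonaws.com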

That config file should contain all the GIT_BRANCH info and code locations etc.  From the top level set up a directory structure that maps onto your dependencies.  Remember that make sorts out dependencies and build order internally for named targets -- but you can assume total control over how make works through a directory structure simply by writing recursive targets.  Imagine we have two directories off the top: base and app.  Then in the top level Makefile we can write:

.PHONY: all base app
all: base app

base:
     $(MAKE) -C base base_1
     $(MAKE) -C base base_2
     $(MAKE) -C base base_3
app:
     $(MAKE) -C app app_1
     $(MAKE) -C app app_2

With this structure we can safely have base_2's Docker image depend on base_1 and know it will always be built second.  We can also have our app images depend on the base images and never worry that things are built out of order.

This additional make-based machinery allows us to handle complex cross-Dockerfile builds.  Docker's inter-container dependency logic avoids a lot of unnecessary rebuilding within each "docker build" command.  All that remains is to write the Makefile targets to invoke those commands.  The file for base_1 looks like:

TOP=../../
include $(TOP)/config.mk
THIS_DOCKER_IMAGE_NAME=base_1
.PHONY: default docker_image
default: docker_image
docker_image:
     docker build -t $(THIS_DOCKER_IMAGE_NAME):$(IMAGE_VERSION) .

Make sure you keep the image name set properly in each directory and pull consistent version tags from config.mk.  Everything is clean.  And this infrastructure gets us a consistent, reliably rebuildable set of images on whatever Docker server we are attached to.  You can write tests with targets that call "docker run" in place of whatever you'd normally expect from a "make check."
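
A minimal sketch of such a test target (the pytest invocation is just an assumption about how your tests run):

.PHONY: test
test: docker_image
     docker run --rm $(THIS_DOCKER_IMAGE_NAME):$(IMAGE_VERSION) python3 -m pytest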

Step 3: Log In To Container Repository

What you need to do here depends on where your containers are ultimately going.  We can handle this with a "repo-login" target.

repo-login-aws:
     `aws ecr get-login ...`
repo-login-dockerhub:
     docker login ...
repo-login: repo-login-aws repo-login-dockerhub

If you are working with a single repo this is pretty simple.  But the beauty of this scheme is that you can hook into any number of repos at once.  Good practice dictates relying on interactive password entry or letting the authentication commands pull credentials from private files.  Either way don't put passwords in the makefiles and you are ok.

Step 4: Publishing Containers

All that remains is to get our set of containers off the "local" Docker and into whatever archive they belong in.  Once signed in we can push images with some simple make coding:

tag-base_1: repo-login
     docker tag base_1:$(IMAGE_VERSION) $(REPO_ADDRESS)/base_1:$(IMAGE_VERSION)
push-base_1: repo-login
     docker push $(REPO_ADDRESS)/base_1:$(IMAGE_VERSION)
do: tag-base_1 push-base_1

You can either place all of this in the top level Makefile or build a recursive descent similar to the build process discussed above.  If you have a variety of repositories this isn't much harder: you've then got a set of REPO_ADDRESS_1, REPO_ADDRESS_2,... and need to ensure the repo-login target hooks into all of them.  The makefiles get a bit sloppier but for a multi-repo scheme there probably isn't a way around some mess.
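
As a rough sketch of that multi-repo case (REPO_ADDRESS_1 and REPO_ADDRESS_2 are illustrative):

push-base_1: repo-login
     docker tag base_1:$(IMAGE_VERSION) $(REPO_ADDRESS_1)/base_1:$(IMAGE_VERSION)
     docker tag base_1:$(IMAGE_VERSION) $(REPO_ADDRESS_2)/base_1:$(IMAGE_VERSION)
     docker push $(REPO_ADDRESS_1)/base_1:$(IMAGE_VERSION)
     docker push $(REPO_ADDRESS_2)/base_1:$(IMAGE_VERSION)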

Testing Dependency Changes

One of the best parts of this approach is how easy it is to test against new versions of dependencies.  Let's assume you have a test target which runs your test code inside a freshly-built container.  Testing with an updated dependency is as simple as checking out a fresh version of the makefile repo, updating the requirements.txt or Dockerfile in-situ and running make test.
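
In shell terms the whole loop is roughly this (the repository URL and paths are placeholders):

git clone https://github.com/yourorg/build-repo.git
cd build-repo
$EDITOR app/requirements.txt   # or the relevant Dockerfile
make test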

This doesn't feel like a huge win when it comes to updating the version of a module in requirements.txt.  But let's say you depend on Chrome for some web scraping.  Somewhere in your setup is a Dockerfile that pulls in versions of Chrome and chromedriver and whatever they depend on in your world.  Now you can modify that Dockerfile and rerun the test safely.  It's not so easy to temporarily upgrade Chrome on your development machine.  And you may lose an entire afternoon if the test fails.  Just search for "how to downgrade chrome version" and you'll be sold.

Once the test passes you just check in the modified Dockerfile and you're done.  That's as easy as it gets.  Oh, and it goes without saying that every requirements.txt file should list version numbers explicitly using name==A.B.C.
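
For example (the version numbers here are placeholders, not recommendations):

selenium==3.11.0
requests==2.18.4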

Version Controlling All Of This

At this point it's a good idea to go back and ensure you have a good "make clean" that traverses the whole directory structure and removes everything.  Then it's simple enough to import the whole hierarchy of Dockerfiles, Makefiles and anything else you need along the way into github.

We've got this kind of scheme in place for a large python project.  All the requirements.txt files are in github and the make and Docker commands which use them are right next door.  The dependency management isn't perfect -- but it's not too bad either.

Other Build Systems (Gradle, ANT, etc)

Yes you can do this with a lot of more "modern" build tools.  I have been through those exercises in some detail.  If I can get a minimal version of a task completed with a dozen lines of make code I have 0 interest in hearing about plugins.

For particularly large or complex projects -- or life inside a project already using those tools -- they offer great power.  If you mainly want a way to version control your requirements.txt files and automate the docker build/tag/push process they require too much effort.

I'll also note that the considerations around build environment depend (ha!) heavily on whether there is a user-invoked compiler step.  With python I can safely toss around archives containing 10k lines of code and not care about timestamps.  That's not an advantage of python per se (it's also terrible in some ways).  But it's a fact, and this scheme would never work for C++.

Moving Blog

The blog is moving to https://datafinnovation.medium.com/