Friday, April 6, 2018

Thoughts On Managing Docker Images With Makefiles

Containers are fantastic.  Docker is an excellent container platform.  And there are huge benefits to containerizing applications for production deployment.  We have realized large efficiency gains with cloud-hosted software by moving to a container-centric design.

However, there is one large gap you will inevitably run across when assembling a system around Dockerfiles: applying all your cherished best practices simultaneously when it comes to building and distributing sets of containers.

Any reasonably-sized system is going to involve at least a few container images.  This is a pretty obvious consequence of following Docker's own best practices.  You're going to put all the dependencies and libraries into some base image or images.  And then you'll build specific components into their own images.  So you'll have a small hierarchy of inter-dependent Dockerfiles.  Those Dockerfiles probably rely on some code out of GitHub and need to publish images into some cloud service like AWS ECR, Docker Hub or some other cloud-y container repository.

As far as I can tell, no one build tool can handle all of these dependencies cleanly.  With that in mind, once I'd accepted that a partial, kludgy solution was unavoidable, I started down the road of using good old reliable Makefiles to manage things.

It's important to say here that our in-container code is in Python.  The build process here does not involve any compilation.  Some of what we do for container builds is a bad idea for compiled code.  And properly getting compiled code into a Docker container has its own set of issues.  All of that is outside the scope of this discussion.

So let's take this step-by-step.

Step 1: Assembling Your Components

Let's assume your code is hosted on GitHub.  So you have two separate build cases right off the bat: building from a tag in GitHub or building from some locally-stored collection of code.  It is essential that you can build from locally-modified code to test things without running through the whole check-in and tagging process.  And we need a single solution that handles both cases -- if the environment starts fracturing at step #1, all is lost.

The simplest solution I've found is to build the container around a single archive containing all your code -- and then write two separate ways to generate that file.  Let's call the file "git.tar.gz" to keep it simple.  The Makefile then looks like:

.PHONY: local git clean

# build git.tar.gz from the local working copy
local:
     tar zcvf git.tar.gz $(PATH_TO_CODE)
# build git.tar.gz from a fresh clone of the requested branch
git:
     git clone --branch $(GIT_BRANCH) https://github.com/... git
     tar zcvf git.tar.gz git
clean:
     rm -rf git git.tar.gz

It's up to you to ensure git.tar.gz exists before any subsequent rules try to access it.  That's terrible because it side-steps the best thing about Makefiles -- smart dependency resolution.  But there is no clean way to hook git timestamps into make.

You can write a rule to build git.tar.gz directly and declare it as a prerequisite for downstream targets if you want.   You can also try ADDing your local source directory instead of rolling it into an archive.  And you can download a git release archive directly -- unless you want to try this with an un-released branch or tag.  All of those partial solutions lead to parallel streams of targets in the Makefile.  Maybe those are better sometimes.  It all depends on whether you'd rather be careful about Makefile maintenance or careful about stating target names.  I've found that running everything through a common archive target makes this harder to misuse a few months down the road when none of it is fresh in my mind.
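
Here is a minimal sketch of that first alternative, assuming your local source is pure Python under $(PATH_TO_CODE) and borrowing the image-build command from Step 2 below; SRC_FILES is just an illustrative name:

# SRC_FILES is an illustrative name -- adjust the find to your layout
SRC_FILES := $(shell find $(PATH_TO_CODE) -type f -name '*.py')

# the archive is rebuilt only when a local source file changes
git.tar.gz: $(SRC_FILES)
     tar zcvf git.tar.gz $(PATH_TO_CODE)

# downstream image builds simply list the archive as a prerequisite
docker_image: git.tar.gz
     docker build -t base_1:$(IMAGE_VERSION) .

This keeps make's timestamp checking in play for local builds, though it still does nothing for the clone-from-GitHub case.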

Docker does a great job handling the .tar.gz file, so we know things are smoother once that's built.  All you need is an "ADD..." line in the relevant Dockerfile and the code is ready to use.  And because it's all rolled into an archive you won't ever get confused by local changes that aren't in the container because you edited a file after (or during!) the build.

Step 2: Building Containers

If your project has any size to it you've got a hierarchy of containers.  That lends itself well to a hierarchy of directories and a tested set of Makefiles.  Docker does an admirable job of handling dependency issues for targets stated clearly in the Dockerfile.  Our goal here is to wrap up the "docker build...." commands in such a way that typos and cross-wires across images are not possible.

We start by borrowing from Makefile best practices.  Create a config.mk file at the top of your directory structure and set all configuration variables there. Then begin each and every Makefile with:

# TOP is whatever the path back to the top of the tree is
TOP=../../../
include $(TOP)/config.mk
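
For illustration, a minimal sketch of what config.mk might contain, using the variable names from this post (the values are placeholders):

# shared settings for every Makefile in the tree -- placeholder values
GIT_BRANCH=my-release-branch
IMAGE_VERSION=0.1
REPO_ADDRESS=registry.example.com/myproject
PATH_TO_CODE=path/to/your/code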

That config file should contain all the GIT_BRANCH info, code locations, version tags and so on -- anything that needs to stay consistent across images. From the top level, set up a directory structure that maps onto your dependencies.  Remember that make sorts out dependencies and build order internally for named targets -- but you can take total control over how make works through a directory structure simply by writing recursive targets.  Imagine we have two directories off the top: base and app.  Then in the top-level Makefile we can write:

# base and app are real directories, so the targets must be .PHONY
.PHONY: all base app

all: base app

base:
     $(MAKE) -C base base_1
     $(MAKE) -C base base_2
     $(MAKE) -C base base_3
app:
     $(MAKE) -C app app_1
     $(MAKE) -C app app_2

With this structure we can safely have base_2's Docker image depend on base_1 and know it will always be built second.  We can also have our app images depend on the base images and never worry that things are built out of order.
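
For reference, the Makefile inside the base directory can be as thin as a set of targets that recurse one level further down -- a sketch, assuming each image lives in its own subdirectory with a Makefile like the one shown next:

.PHONY: base_1 base_2 base_3
# each target descends into that image's directory and builds it
base_1:
     $(MAKE) -C base_1 docker_image
base_2:
     $(MAKE) -C base_2 docker_image
base_3:
     $(MAKE) -C base_3 docker_image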

This additional make-based machinery allows us to handle complex cross-Dockerfile builds.  Docker's own layer caching avoids a lot of unnecessary rebuilding within each "docker build" command.  All that remains is to write the Makefile targets to invoke those commands.  The file for base_1 looks like:

TOP=../../
include $(TOP)/config.mk
THIS_DOCKER_IMAGE_NAME=base_1

.PHONY: default docker_image
default: docker_image
docker_image:
     docker build -t $(THIS_DOCKER_IMAGE_NAME):$(IMAGE_VERSION) .

Make sure you keep the image name set properly in each directory and pull consistent version tags from config.mk.  With that, everything is clean.  This infrastructure gets us a consistent, reliably rebuildable set of images on whatever Docker server we are attached to.  You can also write test targets that call "docker run" in place of whatever you'd normally expect from a "make check."
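
As a rough sketch of that last point -- the test command and the path inside the container are hypothetical stand-ins for whatever your project actually uses:

.PHONY: check
# run the test suite inside a freshly-built container
check: docker_image
     docker run --rm $(THIS_DOCKER_IMAGE_NAME):$(IMAGE_VERSION) pytest /app/tests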

Step 3: Log In To Container Repository

What you need to do here depends on where your containers are ultimately going.  We can handle this with a "repo-login" target.

repo-login-aws:
     `aws ecr get-login ...`
repo-login-dockerhub:
     docker login ...
repo-login: repo-login-aws repo-login-dockerhub

If you are working with a single repo this is pretty simple.  But the beauty of this scheme is that you can hook into any number of repos at once.  Good practice dictates relying on interactive password entry, or on authentication commands that pull credentials from private files.  Either way, keep passwords out of the Makefiles and you are ok.

Step 4: Publishing Containers

All that remains is to get our set of images off the "local" Docker and into whatever repository they belong in.  Once signed in, we can push images with some simple make coding:

tag-base_1: repo-login
     docker tag base_1:$(IMAGE_VERSION) $(REPO_ADDRESS)/base_1:$(IMAGE_VERSION)
# push depends on tag so the ordering holds even under parallel make
push-base_1: tag-base_1
     docker push $(REPO_ADDRESS)/base_1:$(IMAGE_VERSION)
do: push-base_1

You can either place all of this in the top-level Makefile or build a recursive descent similar to the build process discussed above.  If you have a variety of repositories this isn't much harder: you've then got a set of REPO_ADDRESS_1, REPO_ADDRESS_2,... and need to ensure the repo-login target hooks into all of them.  The Makefiles get a bit sloppier, but for a multi-repo scheme there probably isn't a way around some mess.
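
For the multi-repo case, a minimal sketch of what the push target might become -- REPO_ADDRESS_1 and REPO_ADDRESS_2 would live in config.mk, and repo-login is assumed to sign in to both registries:

# push base_1 to two registries; the addresses come from config.mk
push-base_1: repo-login
     docker tag base_1:$(IMAGE_VERSION) $(REPO_ADDRESS_1)/base_1:$(IMAGE_VERSION)
     docker push $(REPO_ADDRESS_1)/base_1:$(IMAGE_VERSION)
     docker tag base_1:$(IMAGE_VERSION) $(REPO_ADDRESS_2)/base_1:$(IMAGE_VERSION)
     docker push $(REPO_ADDRESS_2)/base_1:$(IMAGE_VERSION)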

Testing Dependency Changes

One of the best parts of this approach is how easy it is to test against new versions of dependencies.  Let's assume you have a test target which runs your test code inside a freshly-built container.  Testing with an updated dependency is as simple as checking out a fresh copy of the Makefile repo, updating the relevant requirements.txt or Dockerfile in situ and running "make test".

This doesn't feel like a huge win when it comes to updating the version of a module in requirements.txt.  But let's say you depend on Chrome for some web scraping.  Somewhere in your setup is a Dockerfile that pulls in versions of Chrome and chromedriver and whatever they depend on in your world.  Now you can modify that Dockerfile and rerun the test safely.  It's not so easy to temporarily upgrade Chrome on your development machine.  And you may lose an entire afternoon if the test fails.  Just search for "how to downgrade chrome version" and you'll be sold.

Once the test passes you just check in the modified Dockerfile and you're done.  That's as easy as it gets.  Oh, and it goes without saying that every requirements.txt file should pin version numbers explicitly using name==A.B.C.

Version Controlling All Of This

At this point it's a good idea to go back and ensure you have a good "make clean" that traverses the whole directory structure and removes everything.  Then it's simple enough to import the whole hierarchy of Dockerfiles, Makefiles and anything else you need along the way into GitHub.
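
A minimal sketch of that top-level clean, assuming each subdirectory Makefile supplies its own clean target (as the Step 1 Makefile does):

.PHONY: clean
# recurse into each subdirectory and run its local clean target
clean:
     $(MAKE) -C base clean
     $(MAKE) -C app clean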

We've got this kind of scheme in place for a large Python project.  All the requirements.txt files are in GitHub and the make and Docker commands which use them are right next door.  The dependency management isn't perfect -- but it's not too bad either.

Other Build Systems (Gradle, Ant, etc.)

Yes, you can do this with a lot of the more "modern" build tools.  I have been through those exercises in some detail.  If I can get a minimal version of a task completed with a dozen lines of make code I have zero interest in hearing about plugins.

For particularly large or complex projects -- or life inside a project already using those tools -- they offer great power.  If you mainly want a way to version control your requirements.txt files and automate the docker build/tag/push process they require too much effort.

I'll also note that the considerations around build environment depend (ha!) heavily on whether there is a user-invoked compilation step.  With Python I can safely toss around archives containing 10k lines of code and not care about timestamps.  That's not an advantage of Python per se (it's also terrible in some ways).  But it's a fact, and this scheme would never work for C++.
