Preface to first article

I really hated the design, implementation, and usage patterns around all build systems circa 2011. As some (now fewer) people know, I still hate the design, implementation, and usage patterns around almost all build systems at the moment.

The typical build system is (deep breath) an under-provisioned, multi-node, multi-operating-system, multi-runtime, multi-protocol distributed system with a constantly growing heterogeneous workload. Big Sigh. Most teams treat it as nothing but a series of hacked-together scripts, maintenance is on-demand at best, and the engineers tasked with it are affectionately called “build monkeys”.

I’m going to describe some of these patterns at a high level. I’ll also describe our current ‘best we can do’ solutions. Finally at the end of this series, I’m going to wax futuristic, and talk about what I think is the untapped potential of build systems.

build system (noun): a software system, or set of collaborating software systems, that mates with a human-composed set of source files and, through an increasingly complicated process, births another digital artifact: usually another software system that provides value to someone.

Problem: “Works on my machine”

Local builds and local build systems are the least terrible part of what’s wrong with build systems. They’re local and so avoid the Fallacies of Distributed Computing, resulting in a simpler system. The people who need them are continually running and maintaining them, so they get a lot of real usage testing. They’re the critical path for individual contribution, avoiding the tragedy of the commons that arises around remote builds and Continuous Integration.

Most people and teams have accepted that having some kind of automation is good. Here are some I’ve used in anger: ant, automake, batch, cmake, gmake, gradle, imake, leiningen, make, maven, mk, msbuild, nant, ninja, phing, powershell, rake, scons, shell, tup, waf.

So, the first trivial task is getting all the tools, dependencies, source, and whatever else in the correct form on the correct computer and pressing a button.

Like that is ever going to happen. Everyone wants to use different tools, put in different places, running different versions, and would prefer that any version conflicts were resolved in their favour.

Current Solutions

Give up and make a Virtual Machine.
Put everything on it, perfectly configured. Voilà: a standardised snowflake. Several varieties of development environment are reduced to one. But that one is still an arbitrary mishmash maintained by no one in particular. To do this really well, use something like Vagrant and Chef / Puppet to build that snowflake.
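For illustration, here is a minimal shell provisioner sketch of the sort Vagrant can run (via its shell provisioner) and that a Chef / Puppet recipe would otherwise encode. The package list and base-box assumptions (Debian-ish, apt) are illustrative, not a recommendation:

    #!/bin/sh
    # provision.sh -- builds the "standardised snowflake" from a base box.
    # Package names below are placeholders; a real project needs its own list.
    set -e

    apt-get update
    apt-get install -y build-essential git ruby postgresql

    # The source itself stays on the host (e.g. in Vagrant's shared folder),
    # so the VM only has to carry the toolchain, not the checkout.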

A bootstrap script. (./bootstrap.sh and ./go.sh seem popular)
A bootstrap script is just an ad hoc, informally-specified, bug-ridden, slow implementation of half of autoconf.

This terrifies me.

The script is usually broken into two parts: a requirements check and environment setup. The requirements check sees if the system meets basic expectations (“a package manager? git? postgres on the path? ruby?”). Environment setup builds on top of that: install butts, bundle install, git clone, …

Most of these are written in shell, locking them into a Unix-like environment; writing them in a scripting language is only marginally better. We break the normal “agile practices” rule and comment profusely, because the voodoo of “my build doesn’t work, so I’ll re-run bootstrap” happens, so the script has to be accessible enough that contributors of all skill levels can understand what the hell each line does. Worse, though, the script needs to be idempotent. Now, try explaining that concept to every potential contributor.
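A minimal sketch of that two-part, idempotent shape. The specific tools (git, ruby, bundler, postgres) and paths are assumptions standing in for whatever a real project actually needs:

    #!/bin/sh
    # bootstrap.sh -- requirements check, then idempotent environment setup.
    set -e

    # Part 1: requirements check -- fail fast, with a message a human can act on.
    for tool in git ruby bundle psql; do
      command -v "$tool" >/dev/null 2>&1 || {
        echo "error: '$tool' not found on PATH" >&2
        exit 1
      }
    done

    # Part 2: environment setup -- every step safe to re-run, because it will be.
    bundle install --path vendor/bundle      # skips gems that are already installed
    [ -f config/database.yml ] || cp config/database.yml.example config/database.yml
    [ -d tmp ] || mkdir tmp

The guards and “skip if already done” steps are the whole point: re-running it after the voodoo strikes must never make things worse.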

In the best world, the bootstrap script creates a local, isolated, project-specific, hermetic development environment and is idempotent. Those properties are the hard 20% of the 80/20 rule. I usually give up somewhere past idempotency.

Problem: “I broke the build while fixing the build system.”

Changing the build process causes the whole build system to break for everyone. Things like:

  • Adding a new step, e.g. running tests after deploy
  • Changing a step, e.g. a directory structure change
  • Removing a step, e.g. removing “make clean”, because seriously

This happens most frequently in Continuous Integration and Continuous Delivery configurations, and it usually results in hours of downtime. Why?

  • Configurations are not usually under version control. Go (ThoughtWorks’ build server), at least, keeps its configuration in an internal git repository.
  • A given build configuration only works with a certain range of the source. Change a target name? Change an artifact name? Welcome to the ‘coordinate the config change with the source commit’ game.
  • It’s hard to set up a “development” build system. Cloning jobs / pipelines / configs isn’t well supported; neither is merging them. Our build agents might not even support this.

In short, the problem is that build process configuration isn’t treated as part of the software being built.

Current Solutions

“No configuration.”
All builds run ./ci-build.sh and nothing else. Good luck policing that convention. Also, we’re probably screwed when it comes to artifacts.
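Under this convention the script is the entire configuration. A sketch of what it might look like, where the task names and artifact directory are assumptions:

    #!/bin/sh
    # ci-build.sh -- the only "configuration" the build server knows about.
    set -e

    ./bootstrap.sh                  # the same entry point humans use locally
    make clean test package         # or rake / gradle / whatever the project uses

    # Copy anything worth keeping into one well-known directory, so the build
    # server has a fighting chance of picking up artifacts.
    mkdir -p artifacts
    cp -r build/dist/. artifacts/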

Version the build config.
Go has this built in. As far as I know, every other build server makes it somewhere between hard and impossible. Regenerating old builds and artifacts is still hard, though, so this only partially helps.

Use Travis CI – version the build config with the source.
I believe this is the only build system that versions the build configuration with the source being built. They had to do The Right Thing because of the constraints of their open source community.

Problem: “It’s running on the wrong box.”

Got a bunch of build agents? Some reserved for certain builds? They have tags like “build”, “project-a”, or, worse, “scott’s box”. Time passes, and now there are a bunch more builds and not a lot more resources. Time to tag by capability instead of by resource: “windows”, “jruby”, “selenium”.

Except this only reduces suck. Why?

  • No one actually knows what is on each machine.
  • What is on each machine changes.
  • Some builds can break the machines.

Current Solutions

Validation builds. There is a dedicated “build” (really, let’s call it what it is: a job) that runs the equivalent of a bootstrap script and ensures a machine is OK for some set of tags. Pass: the machine is ready to go into battle. Fail: there may be larger problems.
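A sketch of such a validation job, assuming a Unix-like agent; the commands checked and the capability tags they back are illustrative only:

    #!/bin/sh
    # validate-agent.sh -- run periodically as a "build" against every agent.
    # Each check corresponds to a capability tag the agent claims to have.
    fail=0

    check() {
      "$@" >/dev/null 2>&1 || { echo "FAIL: $*" >&2; fail=1; }
    }

    check git --version           # every agent needs the basics
    check java -version           # backs the "jruby" tag
    check jruby -v                # backs the "jruby" tag
    check psql --version          # backs the "postgres" tag

    exit "$fail"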

Phoenix agents. Blow away our build agents each time they’re used, periodically, or whenever our build goes red (in decreasing order of sanity). Another job adds/removes the agents appropriately.
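A sketch of the sanest variant, assuming each agent happens to be a Vagrant-managed VM; the agent name and the job that calls this are hypothetical:

    #!/bin/sh
    # recycle-agent.sh -- phoenix an agent: destroy it and rebuild it from
    # the base box plus provisioner, rather than trying to repair it in place.
    set -e

    AGENT="${1:?usage: recycle-agent.sh <agent-name>}"

    vagrant destroy -f "$AGENT"   # burn the old, possibly-polluted agent down
    vagrant up "$AGENT"           # it rises again, freshly provisioned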

Getting as close as possible to idempotent and immutable builds is what we are aiming for. And to do that, everything needs to be versioned under a single identifier and have a consistent build environment.

The way we design, implement, and use build systems sucks. But we haven’t gone over enough of the existing problems and current solutions for me to credibly present a better model. Stay tuned for the next installment of I IMMENSELY DISLIKE BUILD SYSTEMS.