What the hell is “devfrastructure”?
Well, I’m glad you asked. I’m talking about those systems that end users probably wouldn’t even know existed, but which the development team, and ops for that matter, depends on to get its work done. Things like the continuous integration server or the central git repo.
I’ve recently been having a debate with the sys admin around here over how these systems should be managed and who should be responsible for doing so. In short, I (mainly a developer, but one who has recently been writing and deploying a bit of chef code) think they’re important enough to be considered ‘production’, and hence should be treated as such (e.g. decent hardware and/or stable VMs and environments in general, regular backups, recovery plans, etc.).
The sys admin (who knows his stuff and whose opinion I respect) argues that these systems are not that important, since end users wouldn’t notice if they were down, and hence management and support of them shouldn’t be given the same priority as our true production systems.
Murphy’s Law in Action
This debate has been going on in the background for a while, but it really came to a head last week when the machine on which our jenkins server runs died with a hardware fault. Apart from CI, we use jenkins to build application artefacts (e.g. jar files etc), which are in turn pulled down and deployed by our chef runs.
So, apart from being without CI, jenkins being down meant that we could not deploy our production apps using the same process as we normally would. It also meant that chef runs on any of our production boxes that tried to access jenkins failed.
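To illustrate why the chef runs failed, here’s a rough sketch of what pulling an artefact from jenkins during a chef run might look like. This is a made-up example, not our actual recipe: the job URL, paths, and service name are all hypothetical.

```ruby
# Hypothetical sketch: fetch a jar built by jenkins during a chef run.
# The URL, paths and service name below are invented for illustration.
remote_file '/opt/myapp/myapp.jar' do
  source 'http://jenkins.internal/job/myapp/lastSuccessfulBuild/artifact/target/myapp.jar'
  owner 'myapp'
  mode '0644'
  notifies :restart, 'service[myapp]'
end
```

The catch is that if jenkins is unreachable, `remote_file` raises an error and the whole chef run on that production box fails, which is exactly what we saw.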
Several people were tied up for a day or so recovering things and moving jenkins from A to B and then to C. Several problems came up and, to cut a long story short, we didn’t really have a properly working jenkins at the end of it. My view is that if the jenkins server had been treated like a production system, it would’ve been much easier and quicker to recover from this kind of failure.
Another example: in the course of our development lately, we’ve been running up a lot of VMs with vagrant and provisioning them with chef. I wrote some chef code to set up a local apt-cacher instance and to make use of it during vagrant/chef-solo provisioning. This sped up provisioning by a factor of five in some cases (some packages, like open-jdk and postgresql, take a while to download). As Murphy’s Law would have it, I’d put the apt-cacher repo on the same server as jenkins, and as it wasn’t considered important, it was not migrated/recovered along with the jenkins app itself. I’ve spent a lot of time waiting around for VMs to be provisioned since!
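For the curious, the client side of the apt-cacher setup is tiny. This is a hedged sketch rather than my actual recipe: the cache host’s address is invented, though 3142 is apt-cacher’s default port.

```ruby
# Hypothetical sketch: point a freshly provisioned VM's apt at a local
# apt-cacher instance so package downloads hit the LAN cache.
# The proxy host below (192.168.1.10) is a made-up example address.
file '/etc/apt/apt.conf.d/01proxy' do
  content %(Acquire::http::Proxy "http://192.168.1.10:3142";\n)
  mode '0644'
end
```

With that file in place, every subsequent `apt-get install` in the chef-solo run goes through the cache, which is where the big provisioning speed-up comes from on slow packages like open-jdk and postgresql.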
Although it is correct that end users would have been oblivious to all of the above, I’d argue they were still impacted: releases have been delayed, or will be, as a result of this outage.
Could/should devfrastructure be relied upon?
What I would like us to do is treat these “devfrastructure” systems the same way as any of our other production systems, so that they can be relied upon and so that it’s easy to recover properly from a catastrophic hardware failure.
I’m interested to hear what others think about this, as it seems the sys admin and I are going to have to agree to disagree on this one, so I’d like to get some other opinions on the matter.
* thanks to @tofojo for coining the term “devfrastructure”