The DevOps Identity Crisis

Why DevOps needs a manifesto after all, but may never get one.


This article originally appeared on O’Reilly Radar.

DevOps is everywhere! The growth and mindshare of the movement are remarkable. But if you care deeply about DevOps, you might agree with me when I say that although its moment has “arrived,” DevOps is in serious trouble. The movement is fragmented and weakly defined, and is being usurped by those who care more about short-term opportunities than the long-term viability of DevOps.

» Continue Reading (about 1600 words)

Why Deployment Freezes Don't Prevent Outages

I have $10 that says you’ve experienced this before: there’s a holiday, trade show, or other important event coming up. Management is worried about the risk of an outage during this all-important time, and restricts deployments from the week prior through the end of the event.

What really happens, of course, is that the system in question becomes booby-trapped with extra risk. As a result, problems are more likely, and when there is even a slight issue, it has the potential to escalate into a major crisis.

Why does this happen? As usual, there’s no single root cause, but a variety of problems combine to create a brittle, risky situation.



When managers declare a freeze, they’re not being malicious. They’re doing something that seems to make sense. That’s why it’s important to understand the reasoning.

» Continue Reading (about 2100 words)

The Root Cause Fallacy

Wouldn’t you like to find the root cause of that downtime incident? Many people would. But experience has taught me that there is no such thing as a single root cause. Instead, there’s a chain of interrelated causes, each of which is necessary but none of which is sufficient to cause the overall problem.


» Continue Reading (about 600 words)

Amber Alert: Worse Than Nothing?

In the last few years, there’s been a lot of discussion about alerts in the circles I move in. There’s general agreement that a lot of tools don’t provide good alerting mechanisms, with problems such as unclear alerts, alerts that can’t be acted upon, and alerts that lack context. Yesterday and today at the Strata conference, my phone and lots of phones around me started blaring klaxon sounds. When I looked at my phone, I saw something like this (the screenshot is from a later update, but otherwise similar):

» Continue Reading (about 500 words)

Generating Realistic Time Series Data

I am interested in compiling a list of techniques to generate fake time-series data that looks and behaves realistically. The goal is to make a mock API for developers to work against, without needing bulky sets of real data, which are annoying to deal with, especially as things change and new types of data are needed. To achieve this, I think several specific things need to be addressed: What common classes or categories of time-series data are there?
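
As one illustration of the kind of technique in question, here is a minimal sketch that builds a series from a baseline, a linear trend, a sine-wave seasonal cycle, and Gaussian noise. The function name `fake_series` and all of its parameters are hypothetical choices made for this example, not something taken from the post, and this covers only one simple class of time-series data.

```python
import math
import random


def fake_series(n, period=60, base=100.0, trend=0.05,
                amplitude=10.0, noise=2.0, seed=0):
    """Return n fake samples: baseline + trend + seasonal cycle + noise.

    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded so a mock API returns repeatable data
    points = []
    for i in range(n):
        # Seasonal component: one full sine cycle every `period` samples.
        seasonal = amplitude * math.sin(2 * math.pi * i / period)
        value = base + trend * i + seasonal + rng.gauss(0, noise)
        # Clamp at zero, since many metrics (counts, latencies) cannot go negative.
        points.append(max(0.0, value))
    return points
```

Seeding the generator keeps the output stable between calls, which may matter when developers write tests against the mock data.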

» Continue Reading (about 500 words)

Continuous integration and deployment

I’ve been talking to some smart people about deployment. First a little background. One of my colleagues was working on a project that ultimately didn’t bear fruit. It was a system for continuous delivery, and involved reacting to git push by building and shipping to production. But it felt as if the problem shouldn’t be separated from provisioning, and from setting up a development environment, and so these things got folded in, and the effort became a boil-the-ocean project that had to be set aside.

» Continue Reading (about 1200 words)

How to send input to many terminals

Do you ever find yourself wanting to open several terminal windows and send the same commands to all of them? I’ve had this need many times, and I’ve never found a completely satisfactory solution. I’ve also known a lot of people who’ve written various sets of scripts to help them accomplish such tasks. In no particular order, here are a few ways I’ve done this in the past:

- Facebook’s pmysql client
- The dsh tool
- Several screen windows named remoteXXX, followed by a bash loop: while read cmd; do screen -X at remote# stuff "$cmd"; done
- Using many PuTTY windows and the puttycs tool
- Opening many tabs in KDE’s Kterm tool and selecting the options to send input to all tabs

Here are some I’ve heard about, but never used:

» Continue Reading (about 300 words)

Easy on the eyes: the solarized color theme

I recently set up the solarized color theme for my terminal emulator. I’ve been meaning to do this for a while, but procrastinated. However, I finally got really frustrated with the colors I get from “ls” sometimes – I use a dark terminal with light fonts, and the directory listings in particular can become invisible, with dark blue on black. Solarized is much improved. All of the colors work well together and are easy on the eyes.

» Continue Reading (about 200 words)

Disk latency versus filesystem latency

Brendan Gregg has a very good ongoing series of blog posts about the importance of measuring latency at the layer that’s appropriate for the question you are trying to answer. If you’re wondering whether I/O latency is a problem for MySQL, you need to measure I/O latency at the filesystem layer, not the disk layer. There are a lot of factors to consider. To quote from his latest post: “This isn’t really a problem with iostat(1M) – it’s a great tool for system administrators to understand the usage of their resources.”

» Continue Reading (about 200 words)

How to gather statistics at regular intervals

I gather a lot of statistics such as performance data. Sometimes I have multiple things running on a system and I want to be able to align and compare the resulting data from multiple processes later. That means they need to be aligned on time intervals. Here is a naive way to gather stats at intervals:

    while sleep 1; do gather-some-stats; done

There are two problems: each iteration will take longer than a second, so there will be drift; and the iterations will not be aligned exactly on the clock ticks, so the data isn’t as easy to correlate with other samples.
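
One way to address both problems, sketched here in Python rather than shell, is to sleep only for the time remaining until the next multiple of the interval on the wall clock, instead of sleeping for a fixed duration. This is an illustrative sketch and not necessarily the approach the full post takes; `seconds_until_next_tick` and `run_aligned` are hypothetical names, and the `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def seconds_until_next_tick(now, interval):
    """Time left until the next wall-clock multiple of `interval` seconds."""
    return interval - (now % interval)


def run_aligned(gather, interval=1.0, iterations=3,
                clock=time.time, sleep=time.sleep):
    """Call gather(timestamp) once per interval, aligned to clock ticks.

    The sleep is recomputed from the current time on every pass, so time
    spent inside gather() does not accumulate as drift (as long as
    gather() itself finishes within one interval).
    """
    for _ in range(iterations):
        sleep(seconds_until_next_tick(clock(), interval))
        gather(clock())
```

For example, `run_aligned(lambda t: print(int(t)), interval=1.0, iterations=5)` should print five consecutive whole-second timestamps. Two processes using the same interval wake at roughly the same instants, which makes their samples easier to line up later.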

» Continue Reading (about 200 words)

Version 1.1.8 of Better Cacti Templates released

I’ve released version 1.1.8 of the Better Cacti Templates project. This release includes a bunch of bug fixes and several important new graphs. There are graphs for the new response-time statistics exposed in Percona Server, and a new set of graphs for MongoDB. VividCortex is the startup I founded in 2012. It’s the easiest way to monitor what your servers are doing in production and I consider it far superior to Cacti.

» Continue Reading (about 300 words)

Time TCP traffic with tcprstat

I just realized that I didn’t publicize this in the Postgres world, or anywhere but the MySQL blogosphere for that matter. Some folks at my company have released a generic TCP-response-time tool. Very useful for monitoring, benchmarks, historical metrics, and so on. It’s kind of like iostat, but for TCP traffic, and fully focused on time, not traffic size. Performance == time and tasks, and this is a lightweight way to measure that data.

» Continue Reading (about 100 words)

Beware of svctm in Linux's iostat

I’ve been studying the source of iostat again and trying to understand whether all of its calculations I explained here are valid and correct. Two of the columns did not seem consistent to me. The await and svctm columns are supposed to measure the average time from beginning to end of requests including device queueing, and actual time to service the request on the device, respectively. But there’s really no instrumentation to support that distinction.

» Continue Reading (about 200 words)

A review of Web Operations by John Allspaw and Jesse Robbins

Web Operations. By John Allspaw and Jesse Robbins, O’Reilly 2010, with a chapter by myself. (Here’s a link to the publisher’s site). I wrote a chapter for this book, and it’s now on shelves in bookstores near you. I got my dead-tree copy today and read everyone else’s contributions to it. It’s a good book. A group effort such as this one is necessarily going to have some differences in style and even overlapping content, but overall it works very well.

» Continue Reading (about 800 words)

How I keep track of notes

This is the follow-up to my post on how I keep track of tasks. It’s important for me to have a good system for keeping notes and other files organized. The problem usually turns out to be that I want them organized several different ways simultaneously: by date, by project, by person, by subject. Alas, if I keep them in files on a hard drive, I can only choose one such organizing strategy, because filesystems are a single hierarchy.

» Continue Reading (about 800 words)

How I keep track of tasks

I use a super-simple system for keeping track of tasks that are mine personally to manage. I use issue-tracking systems for software projects and consulting work, but there is still a bunch of work-related and personal tasks that I need to make sure I don’t forget. The main point is not to ensure that I don’t forget, actually. It is to be able to put it out of my mind with confidence that I won’t lose it.

» Continue Reading (about 1100 words)

A better way to build Cacti templates

The traditional way to build Cacti templates is through the Cacti web interface. This is an enormous amount of work, and the result is generally not very consistent or good quality. The process is too error-prone. You can export the templates as XML, but they tend to have problems such as version incompatibilities with other Cacti installations, and it’s hard to adapt them for user preferences such as different graph image sizes and polling intervals.

» Continue Reading (about 300 words)

How to read Linux's /proc/diskstats easily

These days I spend more time looking at /proc/diskstats than I do at iostat. The problem with iostat is that it lumps reads and writes together, and I want to see them separately. That’s really important on a database server (e.g. MySQL performance analysis). It’s not easy to read /proc/diskstats just by looking at it, though. So I usually do the following to get a nice readable table: Grep out the device I want to examine.
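
To show what keeping reads and writes separate can look like, here is a small Python sketch that pulls the read and write counters out of one /proc/diskstats line and computes average per-operation times between two samples. It is not the procedure from the full post, which uses shell tools; the names `DiskStats`, `parse_diskstats_line`, and `avg_latency_ms` are hypothetical. The field positions follow the kernel's documented layout (major, minor, device name, then reads completed, reads merged, sectors read, milliseconds reading, writes completed, writes merged, sectors written, milliseconds writing, and so on).

```python
from typing import NamedTuple


class DiskStats(NamedTuple):
    device: str
    reads: int      # reads completed
    read_ms: int    # milliseconds spent reading
    writes: int     # writes completed
    write_ms: int   # milliseconds spent writing


def parse_diskstats_line(line):
    """Extract the read and write counters from one /proc/diskstats line."""
    f = line.split()
    return DiskStats(device=f[2], reads=int(f[3]), read_ms=int(f[6]),
                     writes=int(f[7]), write_ms=int(f[10]))


def avg_latency_ms(before, after):
    """Average ms per completed read and per completed write between samples."""
    d_reads = after.reads - before.reads
    d_writes = after.writes - before.writes
    read_lat = (after.read_ms - before.read_ms) / d_reads if d_reads else 0.0
    write_lat = (after.write_ms - before.write_ms) / d_writes if d_writes else 0.0
    return read_lat, write_lat
```

The counters are cumulative since boot, so a single sample says little on its own; the useful numbers come from the difference between two samples taken some interval apart.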

» Continue Reading (about 400 words)

New Maatkit tool to compute index usage

In a couple of recent consulting cases, I needed a tool to analyze how a log of queries accesses indexes and tables in the database, specifically, to find out which indexes are not used. I initially hacked together something similar to Daniel Nichter’s mysqlidxchk, but using the framework provided by Maatkit, which gave me a pretty good start right out of the box. This was useful in the very tight time constraints I was under, but was not a complete solution.

» Continue Reading (about 400 words)

Using Aspersa to capture diagnostic data

I frequently encounter MySQL servers with intermittent problems that don’t happen when I’m watching the server. Gathering good diagnostic data when the problem happens is a must. Aspersa includes two utilities to make this easier. The first is called ‘stalk’. It would be called ‘watch’ but that’s already a name of a standard Unix utility. It simply watches for a condition to happen and fires off the second utility. This second utility does most of the work.

» Continue Reading (about 400 words)