Schrodinger's Outage

Feb 10, 2018

A couple of months ago we had an incident in which a legacy recovery mechanism proved inadequate for our current scale. In our internal post-incident review, we asked whether we should improve this seldom-used capability. I decided not to, because the plan is to completely replace the part of the platform that it serves. My judgment was that we were not likely to need it, and that improving it would take a lot of time and effort.

Shortly thereafter, we did need it again, and again experienced the same pains. Was the decision wrong?

The DevOps Identity Crisis

Feb 7, 2015

Why DevOps needs a manifesto after all, but may never get one.

This article originally appeared on O’Reilly Radar.

DevOps is everywhere! The growth and mindshare of the movement are remarkable. But if you care deeply about DevOps, you might agree with me when I say that although its moment has “arrived,” DevOps is in serious trouble. The movement is fragmented and weakly defined, and is being usurped by those who care more about short-term opportunities than the long-term viability of DevOps.

Why Deployment Freezes Don't Prevent Outages

Nov 29, 2014

I have $10 that says you’ve experienced this before: there’s a holiday, trade show, or other important event coming up. Management is worried about the risk of an outage during this all-important time, and restricts deployments from the week prior through the end of the event.

What really happens, of course, is that the system in question becomes booby-trapped with extra risk.

The Root Cause Fallacy

Jul 21, 2014

Wouldn’t you like to find the root cause of that downtime incident? Many people would. But experience has taught me that there is no such thing as a single root cause. Instead, there’s a chain of interrelated causes, each of which is necessary but none of which is sufficient to cause the overall problem.

Amber Alert: Worse Than Nothing?

Feb 12, 2014

In the last few years, there’s been a lot of discussion about alerts in the circles I move in. There’s general agreement that a lot of tools don’t provide good alerting mechanisms: their alerts are unclear, can’t be acted upon, or lack context.

Generating Realistic Time Series Data

Jan 24, 2014

I am interested in compiling a list of techniques to generate fake time-series data that looks and behaves realistically. The goal is to make a mock API for developers to work against, without needing bulky sets of real data, which are annoying to deal with, especially as things change and new types of data are needed. To achieve this, I think several specific things need to be addressed: What common classes or categories of time-series data are there?

Continuous integration and deployment

Oct 16, 2013

I’ve been talking to some smart people about deployment. First a little background. One of my colleagues was working on a project that ultimately didn’t bear fruit. It was a system for continuous delivery, and involved reacting to git push by building and shipping to production. But it felt as if the problem shouldn’t be separated from provisioning, and from setting up a development environment, and so these things got folded in, and the effort became a boil-the-ocean project that had to be set aside.

How to send input to many terminals

Oct 16, 2012

Do you ever find yourself wanting to open several terminal windows and send the same commands to all of them? I’ve had this need many times, and I’ve never found a completely satisfactory solution. I’ve also known a lot of people who’ve written various sets of scripts to help them accomplish such tasks. In no particular order, here are a few ways I’ve done this in the past:

- Facebook’s pmysql client
- The dsh tool
- Several screen windows named remoteXXX, followed by a bash while-loop: while read cmd; do screen -X at remote# stuff "$cmd"; done
- Using many PuTTY windows and the puttycs tool
- Opening many tabs in KDE’s Kterm tool and selecting the options to send input to all tabs

Here are some I’ve heard about, but never used:

Easy on the eyes: the solarized color theme

Jul 28, 2011

I recently set up the solarized color theme for my terminal emulator. I’ve been meaning to do this for a while, but procrastinated. However, I finally got really frustrated with the colors I get from “ls” sometimes – I use a dark terminal with light fonts, and the directory listings in particular can become invisible, with dark blue on black. Solarized is much improved. All of the colors work well together and are easy on the eyes.

Disk latency versus filesystem latency

May 15, 2011

Brendan Gregg has a very good ongoing series of blog posts about the importance of measuring latency at the layer that’s appropriate for the question you are trying to answer. If you’re wondering whether I/O latency is a problem for MySQL, you need to measure I/O latency at the filesystem layer, not the disk layer. There are a lot of factors to consider. To quote from his latest post:

> This isn’t really a problem with iostat(1M) – it’s a great tool for system administrators to understand the usage of their resources.

How to gather statistics at regular intervals

Mar 18, 2011

I gather a lot of statistics such as performance data. Sometimes I have multiple things going on a system and I want to be able to align and compare the resulting data from multiple processes later. That means they need to be aligned on time intervals. Here is a naive way to gather stats at intervals:

    while sleep 1; do gather-some-stats; done

There are two problems: each iteration will take longer than a second, so there will be drift; and the iterations will not be aligned exactly on the clock ticks, so the data isn’t as easy to correlate with other samples.
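One way to address both problems is to sleep only for the time remaining until the next clock tick, rather than for a fixed second. The following is a minimal sketch of that idea, not necessarily the approach the full post takes. The function names `delay_to_next_tick` and `gather_aligned` are hypothetical, and the loop assumes GNU date (for `%N`) and a `sleep` that accepts fractional seconds.

```shell
#!/usr/bin/env bash
# Print how long to sleep so that the next wake-up lands on a multiple of
# `interval` seconds. Both arguments are in seconds; `now` may be fractional.
delay_to_next_tick() {
  local now=$1 interval=$2
  awk -v now="$now" -v interval="$interval" \
    'BEGIN { printf "%.3f\n", interval - (now % interval) }'
}

# Run a command once per interval, aligned to the clock.
# Assumption: GNU date (for %N) and a sleep that accepts fractional seconds.
gather_aligned() {
  local interval=$1
  shift
  while sleep "$(delay_to_next_tick "$(date +%s.%N)" "$interval")"; do
    "$@"
  done
}
```

For example, `gather_aligned 1 gather-some-stats` would run the command roughly once per second. Because the delay is recomputed from the clock on every iteration, time spent in the command does not accumulate as drift.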

Version 1.1.8 of Better Cacti Templates released

Jan 22, 2011

I’ve released version 1.1.8 of the Better Cacti Templates project. This release includes a bunch of bug fixes and several important new graphs. There are graphs for the new response-time statistics exposed in Percona Server, and a new set of graphs for MongoDB. VividCortex is the startup I founded in 2012. It’s the easiest way to monitor what your servers are doing in production and I consider it far superior to Cacti.

Time TCP traffic with tcprstat

Sep 9, 2010

I just realized that I didn’t publicize this in the Postgres world, or anywhere but the MySQL blogosphere for that matter. Some folks at my company have released a generic TCP-response-time tool. Very useful for monitoring, benchmarks, historical metrics, and so on. It’s kind of like iostat, but for TCP traffic, and fully focused on time, not traffic size. Performance == time and tasks, and this is a lightweight way to measure that data.

Beware of svctm in Linux's iostat

Sep 6, 2010

I’ve been studying the source of iostat again and trying to understand whether all of its calculations I explained here are valid and correct. Two of the columns did not seem consistent to me. The await and svctm columns are supposed to measure the average time from beginning to end of requests including device queueing, and actual time to service the request on the device, respectively. But there’s really no instrumentation to support that distinction.

A review of Web Operations by John Allspaw and Jesse Robbins

Jul 3, 2010

Web Operations. By John Allspaw and Jesse Robbins, O’Reilly 2010, with a chapter by myself. (Here’s a link to the publisher’s site). I wrote a chapter for this book, and it’s now on shelves in bookstores near you. I got my dead-tree copy today and read everyone else’s contributions to it. It’s a good book. A group effort such as this one is necessarily going to have some differences in style and even overlapping content, but overall it works very well.

How I keep track of notes

Jul 3, 2010

This is the follow-up to my post on how I keep track of tasks. It’s important for me to have a good system for keeping notes and other files organized. The problem usually turns out to be that I want them organized several different ways simultaneously: by date, by project, by person, by subject. Alas, if I keep them in files on a hard drive, I can only choose one such organizing strategy, because filesystems are a single hierarchy.

How I keep track of tasks

Jun 30, 2010

I use a super-simple system for keeping track of tasks that are mine personally to manage. I use issue-tracking systems for software projects and consulting work, but there is still a bunch of work-related and personal work that I need to make sure I don’t forget. The main point is not to ensure that I don’t forget, actually. It is to be able to put it out of my mind with confidence that I won’t lose it.

A better way to build Cacti templates

May 25, 2010

The traditional way to build Cacti templates is through the Cacti web interface. This is an enormous amount of work, and the result is generally not very consistent or good quality. The process is too error-prone. You can export the templates as XML, but they tend to have problems such as version incompatibilities with other Cacti installations, and it’s hard to adapt them for user preferences such as different graph image sizes and polling intervals.

How to read Linux's /proc/diskstats easily

May 14, 2010

These days I spend more time looking at /proc/diskstats than I do at iostat. The problem with iostat is that it lumps reads and writes together, and I want to see them separately. That’s really important on a database server (e.g. MySQL performance analysis). It’s not easy to read /proc/diskstats by looking at them, though. So I usually do the following to get a nice readable table: Grep out the device I want to examine.
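As a rough illustration of viewing reads and writes separately, the sketch below pulls the read and write counters for one device out of text in /proc/diskstats format. The function name `diskstats_rw` is hypothetical, and this is not necessarily the procedure the full post describes. The field positions follow the Linux kernel's documented layout: device name in field 3, reads completed in field 4, milliseconds spent reading in field 7, writes completed in field 8, and milliseconds spent writing in field 11.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print read and write counters separately for one device.
# Usage: diskstats_rw DEVICE [FILE]   (FILE defaults to /proc/diskstats)
diskstats_rw() {
  local dev=$1 file=${2:-/proc/diskstats}
  # Match the device name exactly, so "sda" does not also match "sda1".
  awk -v dev="$dev" '$3 == dev {
    printf "%s reads=%d read_ms=%d writes=%d write_ms=%d\n", $3, $4, $7, $8, $11
  }' "$file"
}
```

The counters are cumulative since boot, so to get rates you would take two samples and subtract one from the other.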

New Maatkit tool to compute index usage

May 10, 2010

In a couple of recent consulting cases, I needed a tool to analyze how a log of queries accesses indexes and tables in the database, specifically, to find out which indexes are not used. I initially hacked together something similar to Daniel Nichter’s mysqlidxchk, but using the framework provided by Maatkit, which gave me a pretty good start right out of the box. This was useful in the very tight time constraints I was under, but was not a complete solution.