Archive for the ‘Commentary’ Category
A couple of months ago I bent the ear of a friend whose opinion I really respect. She’s a totally sharp engineer who actively writes code for a living as well as managing large teams. She’s held top-level technical roles at some large and extremely respectable companies. In short, her perspective and experience are very valuable.
One of my most important questions was what technologies she saw as established or emerging winners — good technologies to use as the foundation for a startup. I had a list of requirements I needed my technologies to meet, but I wanted to know what other requirements she thought would be important to consider. For example, the ability to hire engineers to work with the technologies.
My prime candidate for a main programming language was Go, and I was also considering Java, Scala, Clojure, and C/C++. Most languages were easy to eliminate based on my requirements. I tried to summarize my reasons for Go and against others, and asked what she thought.
In the weeks that followed, I did a lot of hard thinking, and also sought the advice of many other people. In the end, I chose Go as a main language, and so far I don’t see a reason to change that decision.
Why would I choose Go when so many factors seemingly weigh against it? Partly because it’s easier to meet many of my specific requirements in Go than it is in other languages. Meeting these requirements gives a lot of business benefits (significantly lowering the barrier to customer adoption, for example).
So regardless of the negatives, the positives for Go for my specific use case are very strong.
In addition, the usual benefits discussed about Go are turning out to be very true in my experience. You can read some articles or watch some talks on golang.org to see what those benefits are. It’s not hype; Go really is that good. Check out this great talk.
The real reason I chose Go, though, is that I took it for a test drive and found out for myself. Not just reading about it or doing code tours and walk-throughs — building systems with it. I decided to reimplement my most recent Perl program in Go and see how it went. The program does adaptive fault detection on time-series data at a fine resolution. It takes 1m16s to run on a sample dataset I use a lot. After rewriting it clumsily in Go, it runs in a few seconds. Keep in mind that I’m not a novice Perl programmer, and I don’t think my Perl program could be made much faster. This is an illustration of the execution speed difference between a scripting language and a compiled language.
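My fault-detection code isn’t public, but a toy sketch in Go of the general shape, scanning a time series and flagging points that deviate sharply from a trailing moving average, might look like this. The function name, the window size, and the three-sigma threshold are all illustrative choices of mine, not details of the original program:

```go
package main

import (
	"fmt"
	"math"
)

// flagOutliers returns the indices of points that deviate from the mean
// of the preceding window by more than k standard deviations. This is a
// toy stand-in for a real adaptive fault detector.
func flagOutliers(series []float64, window int, k float64) []int {
	var faults []int
	for i := window; i < len(series); i++ {
		mean, varsum := 0.0, 0.0
		for _, v := range series[i-window : i] {
			mean += v
		}
		mean /= float64(window)
		for _, v := range series[i-window : i] {
			varsum += (v - mean) * (v - mean)
		}
		sd := math.Sqrt(varsum / float64(window))
		if sd > 0 && math.Abs(series[i]-mean) > k*sd {
			faults = append(faults, i)
		}
	}
	return faults
}

func main() {
	series := []float64{10, 11, 10, 12, 11, 10, 50, 11, 10}
	fmt.Println(flagOutliers(series, 5, 3)) // the spike at index 6 is flagged
}
```

The point isn’t the algorithm; it’s that a tight numeric loop like this, which dominates a program of this kind, is exactly where a compiled language leaves a scripting language far behind.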
That was a nice validation, but I wasn’t close to being ready to decide on Go. I spent a few weeks implementing throwaway prototypes for risky or uncertain parts of my planned system in Go, as well as writing portions of things that I knew would be humdrum turn-the-crank code. Along the way I learned a bit about designing to Go’s strengths, and started to become a little bit more productive (I am not as fast a learner as many of the people who say they’ve learned Go in a couple of weeks). I probed into things like how robust its support for MySQL client libraries is, and how easy it is to work with C or C++ libraries in case something doesn’t exist in Go but does in a C library.
I also dug a lot into the community: the mailing list, the blogs, the projects that companies build in Go. It turns out there’s a lot more adoption of Go than I thought at first. There are many major systems written with it, some of them at hot up-and-coming companies, some at older companies. It’s not just Google.
In the end, I still really appreciate the advice from my friend. It was good advice and pointed out a number of things I needed to think about more or investigate further. But you have to make your own decisions, not just follow advice. And that’s the difference between asking a friend for an opinion, and asking a friend to decide for you.
With the Big Data craze that’s sweeping the world of technology right now, I often ask myself whether we’re deficit-spending, so to speak, with our data consumption habits. I’ve seen repeated examples of being unwilling to get rid of data, even though it’s unused and nobody can think of a future use for it. At the same time, much Big Data processing I’ve seen is brute-force and costly: hitting a not-very-valuable nut with a giant, expensive sledgehammer. I think the combination of these two problems represents a giant opportunity, and I’m going to call the solution Smart Data for lack of a better term.
What’s the problem, in 25 words or less? I think it’s that we’re collecting a lot of data simply because we can. Not because we know of any good use for it, but just because it’s there.
What is the real cost of all of this data? I think we all know we’re well behind the curve in making use of it. A huge industry is springing up to try to catch up to that. It’s not a cheap industry, by and large, and I am not sure how much costs can come down. I think that we’re going to get steamrolled by this. Organizations will own a lot of data, they’ll need software and support organizations and staff to work with it, and they’ll need a lot of hardware and power to store it and compute with it. By the time the problem becomes serious and they start to backpedal, there’ll be a lot of political and psychological resistance. The costs of this resistance could be high: businesses could grow more slowly or fail, there will certainly be an environmental impact, there might be problems with security and privacy, and so forth.
That’s why I believe that if we had a crystal ball, we’d find that we don’t always need or want Big Data. We need and want what I’m going to call Smart Data.
I think most Big Data is utter garbage, collected in the hope that it might be useful, but then unused because no one can figure out whether it’s useful. I think that much of it is inherently meaningless and useless, whether we can prove it or not.
Smart Data is a recognition of this. I see it as a set of practices that I believe we need to build around the lifecycle of Big Data. Smart Data is what you get when Big Data is no longer exciting “just because.” The Smart Data lifecycle will vary, but might look like this:
- Capture and record everything, and don’t delete any of it.
- Wait for a little while until the data is ready for analysis — say, several cycles of seasonality.
- Analyze the data and determine which portions of it are meaningful, and what meaningful metrics can be distilled from it.
- Aggregate, compress, distill, extract, and otherwise winnow down the Big Data until only the meaning and knowledge remain. In most cases I believe this will constitute a tiny fraction of the original dataset.
- Discard the original dataset, or place it into offline storage if you must.
- Stop retaining the whole incoming data stream. I see a variety of options here — short-term retention, upfront winnowing, realtime streaming analysis and immediate discarding, and so on.
- Repeat. If you have a new question that you think the original data can answer, but the distilled data can’t answer, then either go to your archives and pull it out, or if you’ve discarded it, start again at step 1 for a while and accumulate enough data to answer the question.
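The winnowing step in this lifecycle can be sketched in a few lines. Here’s a hedged example in Go that collapses raw samples into one per-window summary (min, max, mean), keeping a small fraction of the original volume; the window size and the choice of summary fields are illustrative, not a prescription:

```go
package main

import "fmt"

// Summary is a distilled stand-in for one window of raw samples.
type Summary struct {
	Min, Max, Mean float64
}

// winnow collapses raw samples into one Summary per window of size n,
// discarding any trailing partial window for simplicity.
func winnow(raw []float64, n int) []Summary {
	var out []Summary
	for i := 0; i+n <= len(raw); i += n {
		s := Summary{Min: raw[i], Max: raw[i]}
		sum := 0.0
		for _, v := range raw[i : i+n] {
			if v < s.Min {
				s.Min = v
			}
			if v > s.Max {
				s.Max = v
			}
			sum += v
		}
		s.Mean = sum / float64(n)
		out = append(out, s)
	}
	return out
}

func main() {
	raw := []float64{1, 2, 3, 4, 5, 6, 7, 8}
	fmt.Println(winnow(raw, 4)) // 8 raw values become 2 summaries
}
```

With a realistic window, say one summary per hour of per-second samples, the distilled data is thousands of times smaller than the raw stream, which is the whole economic argument of Smart Data in miniature.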
Here’s a diagram that expresses some of these ideas.
I believe that there’s an opportunity to keep the most valuable data and throw away the rest, because I’ve been able to do that in my own work. Much of the research I’ve done into MySQL performance, for example, is based on ignoring the huge stream of irrelevant data, and focusing on the signal buried in the noise. My work on extracting performance and scalability metrics from TCP traffic is an example, as is my more recent adaptive fault detection work. I’ve seen alternative implementations of similar ideas that require enormous amounts of data and very expensive computation, yet don’t appear to provide any better results than my cheap-and-easy algorithms that operate efficiently on small amounts of data. I believe this approach represents a competitive advantage for businesses in particular.
What do you think? Are we drowning in data and we don’t know it yet, or am I overreacting to a problem that doesn’t exist?
Better performance is important in everything, not just MySQL. I don’t want to wait for my text editor to open a file or perform syntax highlighting. I don’t want to wait for my version control system to compute diffs or update my copy of the code with other people’s changes. I don’t want to wait for my code to compile. I don’t want to wait, period.
Two tools I have enjoyed recently are Git and the Go language. Both are fast — very fast. It’s a welcome change after suffering with bzr and Launchpad over the last couple of years. If there is a slower or less efficient revision control program than bzr, I’m not aware of it.
Go compiles fast enough that it’s even a good scripting language. If Gentoo were written in Go, it would be no fun for its regular devotees to use — who wants to use an operating system that doesn’t give you enough time to savor that satisfying feeling of watching Xorg or LibreOffice compile for hours? You can probably tell that I don’t live for the thrill of watching “> /dev/null 2>&1” scroll up my terminal all day.
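To make the scripting point concrete: a whole Go “script” compiles and runs in one step with go run, and for a small file the compile is typically too fast to notice. A minimal example:

```go
// hello.go: run with "go run hello.go [name]".
// The compiler builds and executes this in a single step.
package main

import (
	"fmt"
	"os"
)

// greet builds the message; split out so the logic is easy to test.
func greet(name string) string {
	return "hello, " + name
}

func main() {
	name := "world"
	if len(os.Args) > 1 {
		name = os.Args[1]
	}
	fmt.Println(greet(name))
}
```

On a typical machine, go run on a file like this finishes in well under a second, which is why Go can plausibly fill a niche people usually reserve for interpreted languages.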