Defining Moments in Database History
Posted in Databases on Mar 19, 2017
The rise of the LAMP stack in the early- to mid-2000s created a shift in the technology landscape, as well as the impetus for contenders to emerge. I’ve been reflecting on key factors in that phenomenon and what’s happened since then—and what it can teach us about what’s happening now.
What was it about the LAMP stack, anyway? All of the ingredients in that stack were interesting and signaled tectonic shifts (or were the result of them), but I think MySQL in particular was the bellwether for today’s database trends.
MySQL is the database that came to power much of the Internet as we know it today. MySQL was remarkable for many reasons, although it’s easy to forget them in hindsight. It wasn’t the first or perhaps even the best open source database, but it was just enough better that it became the best for the situation at hand. And ultimately it became a commercial success that even in hindsight seems improbable.
I’ve thought of many necessary conditions for MySQL to flourish in 2000-2010. The big question is which combination of those conditions was sufficient. I am not certain of the answer, but I’m certain the answer is plural.
And yet, partially because of its enormous popularity, MySQL helped spur the rise of NoSQL in 2008-2009. These databases sought to define a new moment in database history: one in which legacy relational technology would finally be replaced by an utterly new generation. The disruptor was being disrupted.
Where do we find ourselves today? Relational implementations rapidly improved (enter NewSQL), and NoSQL was backronymed to mean “not only SQL” instead of being a rejection of SQL. Many NoSQL databases today sport SQL-like languages.
Was NoSQL just a flare-up? Is there a real need for next-generation data storage and processing? Or is good old relational going to improve and obviate every next-gen data technology anyway?
I believe relational will endure, and continue to evolve to address new use cases, but it is already past its heyday of complete dominance. I see a few current trends, and I’m sure that at least some of them will become equally enduring. I think we are seeing historic shifts in database technology emerge right now.
Relational, and SQL, are painful. SQL is a Yoda language that causes a lot of problems. It obscures intent, introduces illogical constructs such as three-valued logic, has prompted entire books from the likes of Celko and Date on the small subset of ways to use it correctly, and creates endless opportunities for the server to do things you didn’t intend, causing incredibly subtle bugs and performance disasters.
Not least, SQL is practically an open sore when it’s written in a program. Think about it: you’ve got this nice strictly-typed language with all sorts of compiler guarantees, and in the middle of it is a meaningless string blob that isn’t compiled, syntax-checked, or type-checked. It is bound to a foreign source of data through an API that isn’t knowable to the program or compiler, and may change without warning. It’s “I give up, random potentially correct garbage of dubious meaning goes here.” It’s the equivalent of an ugly CDATA in an XML document.
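Both pains are easy to demonstrate with Python’s built-in sqlite3 module: the query is an unchecked string blob inside a typed program, and NULL is neither equal nor unequal to anything, including itself, so intuitive-looking predicates silently drop rows. The table and data here are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "a@example.com"), ("bob", None)])

# Three-valued logic: NULL = NULL evaluates to NULL (unknown), not true.
row = conn.execute("SELECT NULL = NULL").fetchone()
print(row[0])  # None, not 1

# This WHERE clause silently misses bob, because
# email <> 'a@example.com' is unknown when email is NULL:
rows = conn.execute(
    "SELECT name FROM users WHERE email <> 'a@example.com'"
).fetchall()
print(rows)  # []

# The correct query must handle NULL explicitly:
rows = conn.execute(
    "SELECT name FROM users WHERE email <> 'a@example.com'"
    " OR email IS NULL"
).fetchall()
print(rows)  # [('bob',)]
```

Note that none of those strings were checked by the host language: a typo in a column name, or a schema change on the server, surfaces only at runtime.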
This should present significant opportunities for improvement. One can imagine a number of sensible first steps to take: find a way for the program and the database to use the same language and toolset; design a database query language that works similarly to a programming language; memory-map the database into the program; and so on. Problems begin immediately, and indeed the relational model was created to solve many of those issues—issues that have been happily and naively reinvented ever since. Those who are ignorant of history are doomed to repeat it.
I’ve been an avid student of emerging databases and have even been seen as a champion of some of them. A while ago I predicted that MongoDB, Redis, and Riak would survive in a meaningful way. Of these, Riak seems to have been sidelined, but MongoDB and Redis are going strong.
Which other NoSQL databases have had impact on par with those two? Perhaps Cassandra, and arguably Neo4j, but both of those are less mainstream. MongoDB and Redis are ubiquitous.
Why? It’s instructive to look at the problems they solve. Redis starts with a simple conceptual foundation: label a piece of data, then use the label to fetch and manipulate the data. The data can be richly structured in ways that are familiar to programmers, and the operations you can perform on these structures are a Swiss Army knife of building blocks for applications: the types of things that otherwise force you to write boilerplate code or build frameworks. Redis focuses on doing these things well, and doesn’t try to solve a lot of other problems.
Many of the NoSQL databases that sprang up like weeds in 2009 didn’t solve these types of problems in these kinds of ways. For example, Cassandra solved the scalability problem, but gave the programmer only limited expressive power. Ultimately, a highly scalable but not very “powerful” database can be less attractive than one that acts as a force multiplier for programmer productivity. To a first approximation, high scalability is a tech ops force multiplier, and devs outnumber ops tenfold.
Perhaps this is what makes Redis and MongoDB endure. I don’t know, but I am sure it’s part of what makes them a joy to use. And for better or for worse, from where I sit they seem to be the most viable answer to the proposition “a more modern database is a practical and useful thing to create.”
Another distinct emerging category is time series databases. These databases store facts with timestamps, and treat the time as a native and essential part of the data model. They allow you to do time-based analysis and treat temporal queries as central; many of them even make time a mandatory dimension of any query.
I wrote extensively about time series databases previously. For example, I argued that the world is time series and I shared my requirements for a time series database a bit later. (That latter article is not something I agree with fully today).
InfluxDB is on a very steep growth trajectory as it seeks to define what it means for a database to be natively time oriented. It is also answering the question of whether that alone is enough for a database, or whether there’ll be a “last mile problem” that makes people want some of the capabilities of other types of databases too. Defining the boundaries of a database’s functionality is hard, but InfluxDB seems to be doing an admirable job of it.
An alternative is Elasticsearch, which offers time series functionality in some ways, but not as the sole and central concept. It’s really a distributed search engine that knows about time. This quite naturally and properly raises the question: if you’re going to use a non-time-series database that knows about time, why use a search engine? Why not a relational database that has time series functionality?
There are many, many others. Time will tell what survives, and which set of problems is worth solving completely, leaving nothing essential unsatisfied. I’d bet on InfluxDB at this point, personally. But one thing is certain: time series is important enough that first-class time series databases are necessary and worthwhile. It’s not enough to foist this use case onto another “yeah we do that too” database.
The other enduring standalone category I see today is stream-oriented, pub-sub, queueing, or messaging—choose your terminology; they’re different but related. These databases are essentially logs or buses (and some of them have names that indicate this). Instead of permanently storing the data and letting you retrieve and mutate it, the model is to insert records, store them immutably in order, and read them out again later (potentially multiple times, potentially deleting on retrieval).
Why would you want this? It’s not obvious at first glance, but this “river of data, from which everything in the enterprise can drink” architecture is at once enormously powerful and enormously virtuous. It enables data processing patterns that otherwise require contortions and great effort, but makes them clean and easy.
The typical enterprise data architecture quickly becomes a nightmare spaghetti tangle. Data flows through the architecture in weird ways that are difficult to understand and manage. And problems like performance, reliability, and guarantees about hard things such as processing order are prime motivators for a lot of complexity that you can solve or avoid with a queue or streaming database.
There are a lot of concepts related to these databases and their interplay with other types of database; too many to list here. I’ll just say that it’s a fundamental mindset shift, similar to the type of epiphany you get the first time you really understand purely functional programming. For example, you suddenly want to abolish replication forevermore, and you never want anything to poll or batch process again, ever.
Lots of technologies such as Spark are emerging around these areas. But in my view, Apache Kafka is the undisputed game-changer. It’s truly a watershed technology. Rather than try to explain why, I’ll just point you to the commercial company behind Kafka, Confluent. Read their materials. I know many of the people working there; they are genuine, smart, and it’s not marketing fluff. You can drink from their well. Deeply.
If anyone thought that NoSQL was just a flare-up and it’s died down now, they were wrong. NoSQL did flare up, and we did see a lot of bad technology emerge for a time. But the pains and many of the solutions are real. A key determinant of what’ll survive and what’ll be lost to history is going to be product-market fit. In my opinion, three important areas where markets aren’t being satisfied by relational technologies are relational and SQL backwardness, time series, and streaming data. Time will tell if I’m right.