Time-Series Database Requirements

I’ve had conversations about time-series databases with many people over the last couple of years. I wrote previously about some of the open-source technologies that people commonly use for time-series storage.

Because I have my own ideas about what constitutes a good time-series database, and because a few people have asked me to describe my requirements, I have decided to publish my thoughts here. All opinions that follow are my own, and as you read you should mentally add “in my opinion” to every sentence.

For the record, I currently have an efficient time-series database that is working well. It is built on MySQL. This is a high bar for a replacement to jump over.

Definition of Data Type

For my purposes, time-series data can be defined as follows:

Workload Characteristics

Time-series data is not general-purpose and has specific patterns in its workload. A time-series database should be optimized for the following.

For writes:

For reads, the following usually holds:

The caveat to “writes arrive in sequential order” is that measurements typically arrive ordered by {timestamp, series_id}, but reads are typically done in {series_id, timestamp} order. Reads need to be fast, even though they are rare. There are generally two approaches to dealing with this. The first is to write efficiently, so the data isn’t read-optimized per-series on disk, and deploy massive amounts of compute power in parallel for reads, scanning through all the data linearly. The second is to pay a penalty on writes, so the data is tightly packed by series and optimized for sequential reads of a series.
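The difference between the two orderings can be illustrated with a small sketch. The tuple layout, series names, and sample values below are assumptions made purely for illustration; they do not describe the storage format of any particular database.

```python
# Hypothetical measurements as (timestamp, series_id, value) tuples,
# in the order a collector would typically deliver them.
arrivals = [
    (100, "cpu.host1", 0.50),
    (100, "cpu.host2", 0.70),
    (101, "cpu.host1", 0.55),
    (101, "cpu.host2", 0.65),
    (102, "cpu.host1", 0.60),
    (102, "cpu.host2", 0.72),
]

# Write-optimized layout: keep arrival order, {timestamp, series_id}.
write_order = sorted(arrivals, key=lambda m: (m[0], m[1]))

# Read-optimized layout: cluster by {series_id, timestamp}.
read_order = sorted(arrivals, key=lambda m: (m[1], m[0]))


def positions(layout, series_id):
    """Return the indexes in the layout that hold the given series."""
    return [i for i, m in enumerate(layout) if m[1] == series_id]


def is_contiguous(indexes):
    """True when the indexes form one unbroken run (a sequential read)."""
    return indexes == list(range(indexes[0], indexes[-1] + 1))


scattered = positions(write_order, "cpu.host1")  # interleaved with other series
packed = positions(read_order, "cpu.host1")      # one contiguous run
```

In the write-optimized layout, reading one series means skipping over every other series' points, which is why that approach leans on parallel linear scans. In the read-optimized layout the series is one contiguous run, at the cost of reordering data on the write path.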

Performance and Scaling Characteristics

A time-series database should be:

Operational Requirements

Language and/or API Design

I’ve spoken to many people who have built large-scale time-series databases for big companies. Most of them have told me that the lack of a high-level way to access and query the database was the long-term millstone around their neck.

I would be happy with something that looks like SQL, as InfluxDB’s query language does. Crucially, it needs to avoid a few of the legacy limitations of SQL. The way I think about it is that SQL tables are fixed-width and grow downwards by adding rows. A natural outcome of that is that each column in SQL statements is known in advance and explicitly named, and expressions naturally work within a single row or in aggregates over groups of rows, but cannot span rows otherwise without doing a JOIN.

However, in time-series databases, rows are series identified by the “primary key.” Rows grow sideways as new measurements are added, tables grow downwards as new series are added, and columns are timestamps. Thus, tables are sparse matrices. Expressions must operate in aggregates over rectangular sections of the sparse matrix, not just rows or columns, and the language must permit a GROUP BY functionality in both directions. You could say that both rows and columns must be addressable by keys instead of by literal identifiers, and ideally by pattern matching in addition to strict equality and ranges.
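As a rough sketch of that model, the snippet below treats a table as a sparse matrix keyed by series and timestamp, selects a rectangular region by pattern and time range, and groups in either direction. The names `table`, `select`, and `group_sum` are hypothetical, and glob matching stands in for whatever pattern syntax a real query language would offer.

```python
from fnmatch import fnmatchcase

# Sparse matrix: series key -> {timestamp: value}. Missing cells are absent.
table = {
    "cpu.host1": {100: 0.5, 101: 0.75, 103: 1.0},
    "cpu.host2": {100: 0.25, 102: 0.5},
    "mem.host1": {100: 9.0, 101: 9.5},
}


def select(table, series_pattern, t_start, t_end):
    """Yield (series, timestamp, value) for cells inside a rectangular region.

    Rows are chosen by a glob pattern on the series key, columns by a
    half-open timestamp range [t_start, t_end).
    """
    for series, points in table.items():
        if not fnmatchcase(series, series_pattern):
            continue
        for ts, value in points.items():
            if t_start <= ts < t_end:
                yield series, ts, value


def group_sum(cells, axis):
    """Sum cell values grouped by row ("series") or by column ("timestamp")."""
    index = {"series": 0, "timestamp": 1}[axis]
    totals = {}
    for cell in cells:
        key = cell[index]
        totals[key] = totals.get(key, 0.0) + cell[2]
    return totals


# The same rectangle, aggregated in both directions.
by_series = group_sum(select(table, "cpu.*", 100, 103), "series")
by_time = group_sum(select(table, "cpu.*", 100, 103), "timestamp")
```

Grouping by series collapses each row to one number; grouping by timestamp collapses each column across all matching series. A SQL-like language for time series would need to express both without requiring the caller to name every column in advance.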

Ideally, the language and database should support server-side processing of at least the following, and probably much more:

Another way to say the above is that the language and database should be designed for analytics, not just for drawing strip charts. Many open-source time-series databases such as RRDTool are far too tightly coupled with their expected use case, and this is a serious limitation.

There should be an efficient binary protocol that supports bulk inserts.
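To make the idea concrete, here is a minimal sketch of a bulk-insert message using fixed-width records. The wire format (a count header followed by series id, timestamp, and value) is an assumption invented for this example, not the protocol of any existing database, and the function names are hypothetical.

```python
import struct

# Hypothetical wire format, big-endian with no padding:
#   header: uint32 record count
#   record: uint64 series_id, int64 timestamp (seconds), float64 value
HEADER = struct.Struct(">I")
RECORD = struct.Struct(">Qqd")


def encode_batch(points):
    """Pack an iterable of (series_id, timestamp, value) into one message."""
    points = list(points)
    parts = [HEADER.pack(len(points))]
    parts.extend(RECORD.pack(sid, ts, val) for sid, ts, val in points)
    return b"".join(parts)


def decode_batch(message):
    """Unpack a message produced by encode_batch back into a list of tuples."""
    (count,) = HEADER.unpack_from(message, 0)
    expected = HEADER.size + count * RECORD.size
    if len(message) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(message)}")
    return [
        RECORD.unpack_from(message, HEADER.size + i * RECORD.size)
        for i in range(count)
    ]


batch = [(1, 1700000000, 0.5), (2, 1700000000, 12.0), (1, 1700000001, 0.75)]
message = encode_batch(batch)
```

Each point costs 24 bytes here, and a whole batch travels in one message instead of one round trip per point. A real protocol would likely add compression, variable-length encodings, and versioning, none of which this sketch attempts.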

Non-Requirements

I’d like a database that does one thing well. I do not think I need any of the following, and I regard them as neutral, or in some cases even as drawbacks:

Bonus and Additional Features

The preceding sections describe a good general-purpose time-series database, from my point of view. Nice-to-have features might include:

For my particular uses, I also need support for:

Conclusion

The future of “big data” is mostly time-series. Someone who creates a good time-series database for such use cases will probably do quite well. I’m sure my requirements aren’t the most general-purpose or complete, but I hope sharing them is useful anyway.
