Archive for the ‘SQL’ Category
Oracle released a bunch of MySQL stuff they’ve been working on since the last huge release, and my blog reader filled up with a few dozen posts I’m gonna have to read through so I don’t feel ignorant. Dear MySQL Engineering Team, could you take pity on me and release these gradually over the course of a month or so next time? Especially since Google discontinued Reader, and I’m using Feedly now, and it has a bug that I can’t figure out how to report, which results in articles being marked as read when I scroll, which makes me feel super-paranoid-insecure that I’m gonna miss an article that I scrolled over without having read yet.
Woe is me?
And to make it worse, or better, Mark Callaghan wrote a bunch of blog posts about his performance tests of the new MySQL DMR release. More to read, alas, hooray.
So, dear MySQL Engineering Team, if you can’t release things gradually, would you please publish a summary blog post or article? With blue links that turn purple when I’ve visited them? Maybe I missed it — or scrolled past it.
With chocolate and beer,
It’s common wisdom that large-scale database systems require distributing the data across machines. But what seems to be missing in a lot of discussions is distributing the query processing too. By this I mean the actual computation that’s performed on the data.
I just had a conversation with Peter Zaitsev yesterday that helped make concrete some thoughts I’ve been having about Cassandra for a while. Because Cassandra doesn’t really allow you to do any computation where the data lives (aggregating, evaluating expressions, and so on), if you’re going to use it for truly big data, you’re going to fetch enormous amounts of data across the network. Sure, you’re distributing the storage and retrieval across many machines, but you’re locating your data far from your processing. You have a distant low-level key-value store, in essence, and you have to write a database wrapper on top of it if you’re going to use it for anything non-trivial.
The queries need to be sent to the data in fragments. Breaking up the query, sending fragments of it to the appropriate locations close to the data, evaluating them, perhaps sending them along with partial results to further nodes and continuing, and eventually (or incrementally) assembling final results and streaming them back to the client, is a big piece of the puzzle that’s missing in many systems with similar designs. Some other systems do offer so-called distributed processing (usually in the form of a kind of map-reduce) but I haven’t seen a smart open-source implementation of it yet. By smart I mean high-performance, efficient, and generalized/generalizable, such that it has very few bad-behavior edge cases. I’ve seen some systems that, if you have just the right data and queries, will work OK for limited use cases but fall back (with no protection) to worst-case full-cluster-scan nightmares for other types of queries.
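To make the idea of query fragments a bit more concrete, here is a minimal sketch of a filtered average computed scatter-gather style. Everything in it is made up for illustration: the shards are plain in-memory lists, and `local_fragment` and `merge` are hypothetical names, not any real system's API. The point is only that each node ships back a tiny partial result instead of its rows.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Partial aggregate returned by one node: enough to finish an AVG later."""
    total: float
    count: int


def local_fragment(rows, threshold):
    """Hypothetical query fragment that runs on the node holding `rows`.

    It filters and aggregates locally, so only two numbers cross the network.
    """
    total = 0.0
    count = 0
    for value in rows:
        if value > threshold:
            total += value
            count += 1
    return Partial(total, count)


def merge(partials):
    """Coordinator step: combine the partial results into the final average."""
    total = sum(p.total for p in partials)
    count = sum(p.count for p in partials)
    return total / count if count else None


# Three pretend shards, each standing in for data on a different machine.
shards = [[1.0, 5.0, 9.0], [2.0, 8.0], [7.0, 3.0, 4.0]]

# Roughly: SELECT AVG(value) FROM t WHERE value > 4, evaluated near the data.
partials = [local_fragment(rows, threshold=4.0) for rows in shards]
average = merge(partials)
print(average)
```

Note that an average can't be merged from per-node averages, which is why the fragment returns a sum and a count. A plain key-value store would instead have to hand every row to the client and let it do all of this work.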
Distributed hash tables with simple storage and retrieval aren’t enough, no matter how much frosting is applied, unless they can also do computation. Both data and queries need to be distributed in a distributed system. I think this is one reason why people continue to shard with technologies they know, such as MySQL. For specialized use cases it’s often not all that hard to write a sharded system that is optimized for the particular types of data access needed, and MySQL has pretty sophisticated abilities to do computations close to the data, in comparison with most open-source distributed key-value stores.
Continuing with my wishlist, I’ll add windowing functions. They’re enormously powerful. They allow you to extend relational logic beyond the strict boundaries of tuples. In MySQL at present, one must use ugly hacks to preserve state from one row to the next, such as user variables — which are not guaranteed to work if the optimizer changes the query plan.
And yeah, PostgreSQL and SQL Server have windowing functions too, and once you’ve used them it’s a little hard to go back. This is in fact one of the main things I hear from people who love PostgreSQL for what I consider to be legitimate reasons.
Windowing functions extend the uses of SQL (sometimes awkwardly, sometimes elegantly) into areas you can’t really reach without them: time-series data, for example, or more powerful graph processing. Otherwise these things must be done outside SQL, in ugly procedural logic.
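As a small illustration of the time-series case, here is a sketch that computes a row-to-row delta and a running total with `LAG` and `SUM ... OVER`. It is not MySQL: it uses Python's bundled `sqlite3` module purely so the example is self-contained, and it assumes the underlying SQLite library is version 3.25 or newer (the first with window functions). The `readings` table and its values are invented for the example.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 10), (2, 13), (3, 11), (4, 18)],
)

# LAG looks at the previous row in ts order; SUM ... OVER accumulates.
# No user variables and no dependence on the query plan.
rows = conn.execute(
    """
    SELECT ts,
           value,
           value - LAG(value) OVER (ORDER BY ts) AS delta,
           SUM(value) OVER (ORDER BY ts)         AS running_total
    FROM readings
    ORDER BY ts
    """
).fetchall()

for row in rows:
    print(row)
```

The first row's delta comes back as NULL (`None` in Python) because there is no previous row, which is exactly the kind of edge case the user-variable hacks tend to get wrong.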
Windowing functions together with CTEs (my previous post) are particularly powerful.
Anyone want to guess what my next wish will be?