Archive for the ‘indexes’ tag
Relational Database Index Design and the Optimizers. By Tapio Lahdenmaki and Mike Leach, Wiley 2005. (Here’s a link to the publisher’s site).
I picked this book up on the advice of an Oracle expert, and after one of my colleagues had read it and mentioned it to me. The focus is on how to design indexes that will produce the best performance for various types of queries. It goes into quite a bit of detail on how databases execute specific types of queries, including sort-merge joins and multiple index access, and develops a generic cost model that can be used to produce a quick upper-bound estimate (QUBE) for the execution time of a query. The book focuses on DB2, Oracle, and SQL Server, but applies equally well to MySQL and PostgreSQL.† I learned a lot from this book, and will add it to my list of essential books.
There are too many myths and rules of thumb about index design. This book debunks them pretty thoroughly. It walks the reader through the process of understanding what a database does to execute a query, and how much that costs; and then what a database does to execute a data modification, and how much that costs. Given this knowledge, you can answer questions such as “what is the ideal index for each of these two queries?” and “should the queries have separate indexes, or is it better to find a compromise that will be good for both of them?” and even “how much slower will the compromise be for each query?” In many cases, the results are non-obvious, and often don’t agree with the rules of thumb you might have been taught. Generally, the book concludes, we should use indexes much more than we often do, and we should not hold irrational fears about the cost of maintaining indexes.
After reading this book, you’ll understand what makes an index good or bad for a query (a three-star ranking system), what makes a query possible or impossible to index ideally, the quick upper-bound estimate of execution time, the Basic Question, finding the cheapest adequate index, difficult predicates, index slices, and a host of other valuable concepts. In addition, there’s an entire chapter on a method for finding queries that are not well indexed. Some of the methods in this book are things I already had notes to implement in Maatkit tools, but others are new to me. The method of finding promising culprits is something I learned in this book, and I think it’s very valuable for a tool such as mk-query-digest with the Percona enhancements to the slow query log.
There are a few things I’ll point out so it doesn’t seem like an unqualified endorsement. One, the book is not as easy to read as it could be. The editors should have removed 99% of the places where the authors italicized or otherwise emphasized words; there’s a lot of emphasis on relatively unimportant or random words. Barely a sentence is free of italics. Second, the book was written in 2005 and today’s machines have much more memory. (This generally makes the book’s points more valid, not less valid.) Finally, the cost model is based on spinning disks, and the QUBE method needs slightly different parameters to work correctly on solid-state storage, or indeed even many modern SANs. However, that’s not a big deal — just measure your storage system’s performance, plug in the correct random versus sequential access time, and the model is still valid.
† Note that although PostgreSQL does not yet support index-only queries, which is a major focus of the book, the various cost models apply equally well. One must simply account for the cost of the table access, and not assume that the index is the only thing that’s touched by the query. In general, you’re going to need to know the internals of your database server to apply this book’s wisdom.
I mentioned earlier that I’d blog about progress on the book as we go. It’s not only progress on the book itself — I want to write about the process of writing, because I think it’s very interesting and relevant to software engineering. I’m finding a lot of the work in writing a book comes from some of the same things that make software hard: coordinating work, deciding what should go where, and so on.
As I mentioned in the last article, this book is going to be much bigger than the first edition. There are places where we’re working from the first edition as a baseline, but they’re really a small part of the book. Sections have become chapters; appendices have become chapters. Topics become sections. Bulleted lists become sections too.
We (as a team) have deep expertise on a pretty broad spectrum of MySQL. Take any point in the first edition — here, I’ll open it randomly and find a page. Okay, that one was about GRANT… maybe I’ll find another one ;-) Page 68, “Index Structures”. This section in the first edition gives a couple of paragraphs to B-Tree indexes. We are probably going to write many pages and have diagrams. Not that you don’t know how B-Tree indexes work, but there are a lot of things to think about: what kinds of queries can you satisfy efficiently with them? What’s the memory cost of a B-Tree index? How can you use them to simulate hash indexes on storage engines that don’t (yet) support hash indexing? What are some useful hacks you can do? What about fragmentation, fill factor, and so on? Inserting in sorted order is a worst-case scenario in one way because it causes the most re-balancing, but does that matter overall? (As it turns out, it doesn’t — page fill factor and fragmentation trump re-balancing cost).
This kind of depth in the material is great, of course. It means you can learn about things you need to hone MySQL for a specific scenario. Though MySQL performs well as a general-purpose database server, a lot of people striving for high performance need to push the server really hard in a specific problem. Think about del.icio.us, for example. Imagine the queries they run. They’re far from general-purpose! Including specific details in such depth is very helpful for people trying to solve specific problems.
But it makes for an interesting and difficult challenge for us as authors: we have to figure out how to organize the material so you can use it. In some ways, it is a classic multiple hierarchy problem. Chapters, sections and subsections are a hierarchy. That’s the way books work, but one hierarchy can never adequately address multi-dimensional data, and MySQL is definitely a multi-dimensional topic.
Let me give you an example: we have chapters on architecture, query optimization, and schema optimization. Each of these topics has storage-engine-specific details. We can place all the details in a section titled “Engine-Specific Notes,” but then where will you go to learn about each storage engine? You’ll have to read every chapter’s notes section. We could stuff it all into a chapter called “Storage Engines,” but that chapter would hardly make sense without discussing a lot of architecture, queries, and schema optimization, would it?
Ultimately this problem is not solvable in a static book, which can only have one hierarchy. If it were a data warehouse, we could give you multiple dimensions and let you drill into the topics any way you please. In a book, the best we can do is try to arrange things where they make the most sense and seem to go with the other material the best, and then give you cross-references and a great index.
This is just one of the interesting challenges in writing that is very reminiscent of good software engineering, where code needs to be massaged into the place where it fits best. Actually, code is easier than this, because in a well-designed system, there’s usually just one best place for some bit of functionality to go. There’s usually no single best place in a book.
Working with multiple authors who have different talents and expertise also reminds me of collaborative programming, but maybe I’ll write about that another time.