Archive for the ‘indexing’ tag
Relational Database Index Design and the Optimizers. By Tapio Lahdenmaki and Mike Leach, Wiley 2005. (Here’s a link to the publisher’s site).
I picked this book up on the advice of an Oracle expert, and after one of my colleagues had read it and mentioned it to me. The focus is on how to design indexes that will produce the best performance for various types of queries. It goes into quite a bit of detail on how databases execute specific types of queries, including sort-merge joins and multiple index access, and develops a generic cost model that can be used to produce a quick upper-bound estimate (QUBE) for the execution time of a query. The book focuses on DB2, Oracle, and SQL Server, but applies equally well to MySQL and PostgreSQL.† I learned a lot from this book, and will add it to my list of essential books.
There are too many myths and rules of thumb about index design. This book debunks them pretty thoroughly. It walks the reader through the process of understanding what a database does to execute a query, and how much that costs; and then what a database does to execute a data modification, and how much that costs. Given this knowledge, you can answer questions such as “what is the ideal index for each of these two queries?” and “should the queries have separate indexes, or is it better to find a compromise that will be good for both of them?” and even “how much slower will the compromise be for each query?” In many cases, the results are non-obvious, and often don’t agree with the rules of thumb you might have been taught. Generally, the book concludes, we should use indexes much more than we often do, and we should not hold irrational fears about the cost of maintaining indexes.
After reading this book, you’ll understand what makes an index good or bad for a query (a three-star ranking system), what makes a query possible or impossible to index ideally, the quick upper-bound estimate of execution time, the Basic Question, finding the cheapest adequate index, difficult predicates, index slices, and a host of other valuable concepts. In addition, there’s an entire chapter on a method for finding queries that are not well indexed. Some of the methods in this book are things I already had notes to implement in Maatkit tools, but others are new to me. The method of finding promising culprits is something I learned in this book, and I think it’s very valuable for a tool such as mk-query-digest with the Percona enhancements to the slow query log.
There are a few things I’ll point out so it doesn’t seem like an unqualified endorsement. One, the book is not as easy to read as it could be. The editors should have removed 99% of the places where the authors italicized or otherwise emphasized words; there’s a lot of emphasis on relatively unimportant or random words. Barely a sentence is free of italics. Second, the book was written in 2005 and today’s machines have much more memory. (This generally makes the book’s points more valid, not less valid.) Finally, the cost model is based on spinning disks, and the QUBE method needs slightly different parameters to work correctly on solid-state storage, or indeed even many modern SANs. However, that’s not a big deal — just measure your storage system’s performance, plug in the correct random versus sequential access time, and the model is still valid.
† Note that although PostgreSQL does not yet support index-only queries, which is a major focus of the book, the various cost models apply equally well. One must simply account for the cost of the table access, and not assume that the index is the only thing that’s touched by the query. In general, you’re going to need to know the internals of your database server to apply this book’s wisdom.
It’s been a while since I’ve written about progress on the book. I actually stopped working on it as much at the beginning of the month, because on October 31st I managed to finish a first draft of the last big chapter (Scaling and High Availability)! Now I’m back to full-time work at my employer, and I’m working on the book in the evenings and weekends only.
This doesn’t mean the book is close to being done, though. The editor is sending out some chapters for technical review, and there’s still a lot more writing and revising to be done.
Last weekend I revised the Security chapter from the first edition, which I think will be the only chapter that we’ll just revise and update, rather than completely rewriting (well, maybe the Architecture chapter could be considered a revision instead of a rewrite, but it’s a stretch; we changed it a lot). I removed a lot of the material that repeated the MySQL manual, and added a lot of information and best practices on grants, new privileges and objects in MySQL 5, common tasks, common mistakes, and so on. The chapter ended up being nearly as long, even though I stripped out all the code listings and so on from the first edition (in fact, I reduced the first edition’s material to a few paragraphs).
Beyond that, though, there are little details to finish out in many of the chapters. Examples that need to be finished, figures that need to be re-drawn, material that doesn’t quite fit and needs to be re-arranged or even moved to another chapter; it’s a lot of work. Peter Zaitsev has been reviewing some of the core chapters on query and schema optimization etc, and I’m revising them in response to his comments. That’s what I spent today doing.
I think the biggest chunks of work that remain are going to be making chapters 3, 4, 5, 6 and 7 (benchmarking, profiling, schema, indexing, query optimization, advanced features, and server tuning) flow together well. The challenge here is how to organize the vast amount of material so it reads well, without too many forward references, and still be useful as a reference work. The detail we’ve gone into is incredible. It makes it very hard to find the single best place to mention each little bit of wisdom, because all of this material is completely inter-related. It’s tough to flatten the graph of knowledge into a one-dimensional narrative.
It’s not just these chapters that have a lot of inter-related material, of course. It’s hard to talk about tuning the server settings (chapter 7) without bringing the OS and hardware (chapter 8) into it, and whenever you do this you also need to think about measuring and monitoring status information (chapter 14). Of course, you need to do that for benchmarking and profiling, too (chapter 3). I’m sure you see the dilemma!
The good news is, if we succeed in doing this well, you will find the book enormously useful. Stay tuned!
I’ve been trying to circle back and clean up things I left for later in several chapters of High Performance MySQL, second edition. This includes a lot of material in chapter 4, Schema Optimization and Indexing. At some point I’ll write more about the process of writing this book, and what we’ve done well and what we’ve learned to do better, but for right now I wanted to complete the picture of what material we have on schema, index, and query optimization. The last two chapters I’ve written about (Query Performance Optimization and Advanced MySQL Features) have generated lots of feed back along the lines of “don’t forget X!” to which I’m obliged to reply “It’s in a different chapter.”
The truth is, it’s difficult to separate these topics sensibly. I’d like to do it in the mythical “perfect” way that serializes into a nice narrative without cross-references, but even the perfectionist in me wilts under the glare of deadlines. As a result, I don’t know if it’s really possible for us to completely avoid cross-references. (I do know there’s room for improvement in how we’ve arranged the material, but I’ve spent a lot of the day today trying to de-dupe some topics we wrote about in two places, and I’m coming to appreciate that re-organizing is an extraordinary amount of work, especially in OpenOffice.org — but more on that later).
All this is a preface to the following sentence: schema, indexing, advanced features, and query optimization are intermingled to some extent in the three chapters, even though we tried to separate the topics sensibly. I haven’t yet taken some of the suggestions I got in comments on the last chapter I posted. Like I said, reorganizing is a lot of work :-)
Here’s the outline. I have the same kinds of questions as before: what are we forgetting, do you have any questions or topics you’d like us to cover, etc? Comments are welcome.
[Update: I forgot to mention the vital statistics. So far it's about 55 pages printed.]
[Intro] Choosing Optimal Data Types General Guidelines for Data Storage Smaller is Usually Better To NULL or not to NULL? Choose Identifiers Carefully How to Choose a Good Data Type Numeric Types BIT Strings String Types [sidebar: Generosity can be Unwise] BLOB and TEXT Types [sidebar: How to Avoid On-Disk Temporary Tables] Using ENUM Instead of a String Type Date and Time Types [sidebar: Watch out for automatic migration programs] Indexing Basics Types of Indexes BTREE Indexes Types of Queries that can Use a BTREE Index Indexed Column Isolation Prefix Indexes HASH Indexes Rolling Your Own HASH Indexes RTREE Indexes FULLTEXT Indexes Clustered Indexes Covering Indexes Index Scans and Using Indexes for Sorting Packed (Prefix-Compressed) Indexes Redundant and Duplicate Indexes Indexes and Locking Indexing Strategies An Indexing Case Study Supporting Many Kinds of Filtering Avoiding Multiple Range Conditions Optimizing Sorts Index and Table Maintenance Finding and Repairing Table Corruption Updating Index Statistics Reducing Index Fragmentation Normalization and Denormalization Pros and Cons of a Normalized Schema Pros and Cons of a Denormalized Schema A Mixture of Normalized and Denormalized Cache and Summary Tables [sidebar: The Principle of Faster SELECT and Slower UPDATE] Notes on Storage Engines MyISAM Memory InnoDB
Here’s a snippet of “what it’s like to write this book” that I’ll throw out there. OpenOffice.org, at least the version I’m using, doesn’t like O’Reilly’s custom heading styles and won’t show me an outline view of the document. I’m copying and pasting into this blog post by scrolling from one heading to the next. This is always enlightening, because as you can see a lot of the material isn’t organized correctly in the hierarchy. Guess what, it’s my first look at the chapter’s real outline, too! This isn’t the outline we planned to have, but the chapter evolved because of making localized changes without any real way to zoom out and make sure the outline still made sense. So my two comments on this are a) OpenOffice.org hasn’t been the most helpful tool in some ways and b) these blog posts are, to some extent, airing the project’s dirty laundry (illogical outlining, difficult separation of material among chapters, etc). I’m not afraid of that; I think it’s healthy and will help the book be better as a result. I guess my experience with open source, combined with my employer’s open-books policy, has taught me to embrace transparency instead of fearing it. In the end this material will be organized and make a lot of sense, but that’s a process of evolution — not intelligent design.
As I said, at some point I’ll write more about the process of writing. It’s been educational, and most bloggers I know who’ve written a book don’t say much about it (they just pop their heads up every now and then to apologize for not blogging). Very briefly: if you dream of writing a book, do it. It helps that my boss and co-workers support me in this venture, but it’s worth it regardless.