Whew! I just finished a marathon of revisions. It’s been a while since I posted about our progress, so here’s an update for the curious readers.
I just finished revising the last two major chapters that Peter Zaitsev hasn’t yet reviewed. Peter has been essentially going through the chapters like a very thorough technical reviewer. He makes corrections, points out where things aren’t clear or need examples, and adds more material.
By “finished revising,” I mean finished expanding the outline into a full chapter. We’re still working at the level of “this chapter is mostly there, but we might decide to revise it more.” We will most certainly do so in many cases. There are some chunks of material that I’ve marked TODO to put into other chapters, for example. We’re not at the level of a final draft with any chapter except the chapter on MySQL’s architecture, but we’re getting close with the others now.
Most of the chapters are in tech review now, and we’ve gotten a few of them back. The comments from the reviewers have been very helpful. We expanded the Replication chapter quite a bit after tech review. (And then Peter reviewed it, and we expanded it even more.) When the tech reviewers return comments on the other chapters, we’ll revise some more.
We’re up to 529 pages in OpenOffice.org now. At my calculated ratio of 1 OpenOffice.org page to 1.1 printed pages, that’s about 582 pages in print. And that’s not counting the Replication chapter, which doesn’t have all of its illustrations yet. I predicted we’d break 500 pages; we might get close to 600. These are very, very densely written pages, too. No offense to the first edition, but the tone is quite different: much less light-hearted banter, much more compressed information. Peter is a walking encyclopedia, and never seems to run out of details we really ought to include because they’re important (and they are).
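For anyone who wants to check the arithmetic, here is a minimal sketch of the estimate. The two numbers come straight from the paragraph above; the function name is just something I made up for illustration.

```python
# Page-count estimate: manuscript pages scaled by a manuscript-to-print ratio.
# Both figures are taken from the post; nothing here is measured independently.
MANUSCRIPT_PAGES = 529
PRINT_RATIO = 1.1  # the author's own calculated ratio


def estimate_print_pages(pages: int, ratio: float) -> int:
    """Scale the manuscript page count and round to the nearest whole page."""
    return round(pages * ratio)


print(estimate_print_pages(MANUSCRIPT_PAGES, PRINT_RATIO))  # prints 582
```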
We may, or may not, go to production in the next few weeks. Regardless, I think we’re still on track to have the book on shelves by the MySQL Conference & Expo in April. Look for me there. I’ll be easy to find: I’ll be the tall guy with a permanent silly grin. (You’d grin too if you finished writing a book that’s been this much work!)
I’ve posted rough outlines for many of the other chapters. The two Peter and I just finished working on are the Scaling/HA/Load-Balancing/Failover chapter and the Application-Level Optimization chapter. The Scaling/HA chapter is pretty long and very involved, and goes into a lot of detail on scaling in particular, especially horizontal scaling via sharding. (We use “sharding” because it’s less confusing than calling it “partitioning,” which already means too many different things in databases.)
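To make “sharding” a little more concrete, here is a rough sketch of the simplest version of the idea: hash a partitioning key and use the result to pick a shard. This is only an illustration under my own assumptions, not the scheme the chapter recommends. The shard count, the key format, and the function name are all hypothetical.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration


def shard_for_key(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partitioning key to a shard number with a stable hash."""
    # zlib.crc32 gives the same value on every run, unlike Python's
    # built-in hash() for strings, which is randomized per process.
    return zlib.crc32(key.encode("utf-8")) % num_shards


# Hypothetical partitioning keys (user IDs rendered as strings).
users = ["user:1", "user:2", "user:3", "user:42"]
placement = {user: shard_for_key(user) for user in users}

for user, shard in placement.items():
    print(f"{user} -> shard {shard}")
```

The catch with a plain modulo like this is that changing the number of shards moves most keys to a different shard, which is part of why allocating and re-balancing data needs more thought than this sketch gives it.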
The Application-Level Optimization chapter is a little shorter. It’s mostly about caching strategies, how to make a web server run well, and so on. These aren’t what the book focuses on directly, but you can either help or hurt the database server a lot with your application design. Our goal here is to help people avoid the common mistakes.
For the curious, here’s the current outline for these two chapters:
Scaling and High Availability
Terminology
Scaling MySQL
Planning for Scalability
Buying Time Before Scaling
Scaling Up
Scaling Out
Functional Partitioning
Data Sharding
Choosing a Partitioning Key
Multiple Partitioning Keys
Querying Across Shards
Allocating Data, Shards, and Nodes
Arranging Shards on Nodes
Fixed Allocation
Dynamic Allocation
Mixing Dynamic and Fixed Allocation
Explicit Allocation
Sidebar: Re-Balancing Shards
Tools for Sharding
Scaling Back
Keeping Active Data Separate
Scaling by Clustering
Clustering
Federation
Load Balancing
Connecting Directly
Splitting Reads and Writes in Replication
Changing Application Configuration
Changing DNS Names
Moving IP Addresses
Introducing a Middleman
MySQL Proxy
Load Balancers
Load Balancing Algorithms
Adding and Removing Servers in the Pool
Load Balancing with a Master and Multiple Slaves
High Availability
Planning for High Availability
Adding Redundancy
Shared-Storage Architectures
Replicated-Disk Architectures
Synchronous MySQL Replication
Failover and Failback
Promoting a Slave or Switching Roles
Virtual IP Addresses or IP Takeover
MySQL Master-Master Replication Manager
Middleman Solutions
Handling Failover in the Application
And here’s the outline for the Application-Level Optimization chapter:
Application-Level Optimization
Application Performance Overview
Find the Source of the Problem
Look for Common Problems
Web Server Issues
Finding the Optimal Concurrency
Caching
Sidebar: Caching Doesn't Always Help
Caching Below the Application
Application-Level Caching
Cache Control Policies
Cache Object Hierarchies
Pre-Generating Content
Extending MySQL
Alternatives to MySQL
The thing that makes me the happiest right now is that we’re clearly going to make it. For a while, there was just so much work left to do that it was impossible to estimate how much. (Ask my wife: I was wrong many times when she asked how long it would take me to finish a chapter.) I also didn’t know how much revision would be necessary, which is very scary; revising takes about four times as long as writing a first draft, by my reckoning. At this point, the remaining work is much smaller, and much easier to estimate. And now I no longer flip-flop daily between “I think we can, I think we can” and “please don’t ask, because I don’t know and I want a vacation.”
Subversion shows me that Peter has the Security chapter locked right now. It’s not a big chapter, and Arjen Lentz has already reviewed it as well, so I don’t expect it to be a huge amount of work to revise. After that, it’s minor chapters and appendices. (We might actually convert the chapters on Server Status and Tools into appendices, since they got cannibalized when we realized their material fit better elsewhere. They also don’t have a very chapter-ish feel; they feel more like appendices.) We’ve added a few more appendices, including one on EXPLAIN and one on debugging server and storage-engine locking problems. These are all great reference material.
See you at the conference in April!
I’ve been trying to circle back and clean up things I left for later in several chapters of High Performance MySQL, second edition. This includes a lot of material in chapter 4, Schema Optimization and Indexing. At some point I’ll write more about the process of writing this book, and what we’ve done well and what we’ve learned to do better, but for right now I wanted to complete the picture of what material we have on schema, index, and query optimization. The last two chapters I’ve written about (Query Performance Optimization and Advanced MySQL Features) have generated lots of feedback along the lines of “don’t forget X!” to which I’m obliged to reply “It’s in a different chapter.”
The truth is, it’s difficult to separate these topics sensibly. I’d like to do it in the mythical “perfect” way that serializes into a nice narrative without cross-references, but even the perfectionist in me wilts under the glare of deadlines. As a result, I don’t know if it’s really possible for us to completely avoid cross-references. (I do know there’s room for improvement in how we’ve arranged the material, but I’ve spent a lot of today trying to de-dupe some topics we wrote about in two places, and I’m coming to appreciate that re-organizing is an extraordinary amount of work, especially in OpenOffice.org. More on that later.)
All this is a preface to the following sentence: schema, indexing, advanced features, and query optimization are intermingled to some extent in the three chapters, even though we tried to separate the topics sensibly. I haven’t yet taken some of the suggestions I got in comments on the last chapter I posted. Like I said, reorganizing is a lot of work :-)
Here’s the outline. I have the same kinds of questions as before: what are we forgetting, do you have any questions or topics you’d like us to cover, etc? Comments are welcome.
[Update: I forgot to mention the vital statistics. So far it's about 55 pages printed.]
[Intro]
Choosing Optimal Data Types
General Guidelines for Data Storage
Smaller is Usually Better
To NULL or not to NULL?
Choose Identifiers Carefully
How to Choose a Good Data Type
Numeric Types
BIT Strings
String Types
[sidebar: Generosity can be Unwise]
BLOB and TEXT Types
[sidebar: How to Avoid On-Disk Temporary Tables]
Using ENUM Instead of a String Type
Date and Time Types
[sidebar: Watch out for automatic migration programs]
Indexing Basics
Types of Indexes
BTREE Indexes
Types of Queries that can Use a BTREE Index
Indexed Column Isolation
Prefix Indexes
HASH Indexes
Rolling Your Own HASH Indexes
RTREE Indexes
FULLTEXT Indexes
Clustered Indexes
Covering Indexes
Index Scans and Using Indexes for Sorting
Packed (Prefix-Compressed) Indexes
Redundant and Duplicate Indexes
Indexes and Locking
Indexing Strategies
An Indexing Case Study
Supporting Many Kinds of Filtering
Avoiding Multiple Range Conditions
Optimizing Sorts
Index and Table Maintenance
Finding and Repairing Table Corruption
Updating Index Statistics
Reducing Index Fragmentation
Normalization and Denormalization
Pros and Cons of a Normalized Schema
Pros and Cons of a Denormalized Schema
A Mixture of Normalized and Denormalized
Cache and Summary Tables
[sidebar: The Principle of Faster SELECT and Slower UPDATE]
Notes on Storage Engines
MyISAM
Memory
InnoDB
Here’s a snippet of “what it’s like to write this book” that I’ll throw out there. OpenOffice.org, at least the version I’m using, doesn’t like O’Reilly’s custom heading styles and won’t show me an outline view of the document. I’m copying and pasting into this blog post by scrolling from one heading to the next. This is always enlightening, because, as you can see, a lot of the material isn’t organized correctly in the hierarchy. Guess what, it’s my first look at the chapter’s real outline, too! This isn’t the outline we planned to have, but the chapter evolved through localized changes, made without any real way to zoom out and make sure the outline still made sense. So my two comments on this are a) OpenOffice.org hasn’t been the most helpful tool in some ways and b) these blog posts are, to some extent, airing the project’s dirty laundry (illogical outlining, difficult separation of material among chapters, etc). I’m not afraid of that; I think it’s healthy and will make the book better. I guess my experience with open source, combined with my employer’s open-books policy, has taught me to embrace transparency instead of fearing it. In the end this material will be organized and make a lot of sense, but that’s a process of evolution, not intelligent design.
As I said, at some point I’ll write more about the process of writing. It’s been educational, and most bloggers I know who’ve written a book don’t say much about it (they just pop their heads up every now and then to apologize for not blogging). Very briefly: if you dream of writing a book, do it. It helps that my boss and co-workers support me in this venture, but it’s worth it regardless.
I mentioned earlier that I’d blog about progress on the book as we go. It’s not only progress on the book itself — I want to write about the process of writing, because I think it’s very interesting and relevant to software engineering. I’m finding a lot of the work in writing a book comes from some of the same things that make software hard: coordinating work, deciding what should go where, and so on.
As I mentioned in the last article, this book is going to be much bigger than the first edition. There are places where we’re working from the first edition as a baseline, but they’re really a small part of the book. Sections have become chapters; appendices have become chapters. Topics become sections. Bulleted lists become sections too.
We (as a team) have deep expertise on a pretty broad spectrum of MySQL. Take any point in the first edition — here, I’ll open it randomly and find a page. Okay, that one was about GRANT… maybe I’ll find another one ;-) Page 68, “Index Structures”. This section in the first edition gives a couple of paragraphs to B-Tree indexes. We are probably going to write many pages and have diagrams. Not that you don’t know how B-Tree indexes work, but there are a lot of things to think about: what kinds of queries can you satisfy efficiently with them? What’s the memory cost of a B-Tree index? How can you use them to simulate hash indexes on storage engines that don’t (yet) support hash indexing? What are some useful hacks you can do? What about fragmentation, fill factor, and so on? Inserting in sorted order is a worst-case scenario in one way because it causes the most re-balancing, but does that matter overall? (As it turns out, it doesn’t — page fill factor and fragmentation trump re-balancing cost).
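Since I mentioned simulating hash indexes with B-Tree indexes, here is a rough sketch of one common way to do it: store a checksum of a long string in its own indexed integer column, then look rows up by the checksum and re-check the full string to weed out collisions. To keep it self-contained I’m using Python with an in-memory SQLite database as a stand-in for MySQL (SQLite’s ordinary indexes are B-Trees too). The table, column, and function names are hypothetical, and this isn’t necessarily how the book will present it.

```python
import sqlite3
import zlib


def crc(value: str) -> int:
    """Short integer checksum of a long string (the 'hash' we index)."""
    return zlib.crc32(value.encode("utf-8"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages ("
    " id INTEGER PRIMARY KEY,"
    " url TEXT NOT NULL,"
    " url_crc INTEGER NOT NULL)"
)
# An ordinary B-Tree index on the small integer column, not on the long URL.
conn.execute("CREATE INDEX idx_url_crc ON pages (url_crc)")


def add_page(url: str) -> None:
    """Insert a row, storing the checksum alongside the full URL."""
    conn.execute(
        "INSERT INTO pages (url, url_crc) VALUES (?, ?)", (url, crc(url))
    )


def find_page(url: str):
    """Look up by checksum first, then compare the URL to rule out collisions."""
    row = conn.execute(
        "SELECT id FROM pages WHERE url_crc = ? AND url = ?", (crc(url), url)
    ).fetchone()
    return row[0] if row else None


for u in ("http://www.mysql.com/", "http://www.oreilly.com/"):
    add_page(u)

print(find_page("http://www.mysql.com/"))
print(find_page("http://example.invalid/"))
```

The second condition in the WHERE clause matters: two different URLs can share a checksum, so the checksum only narrows the search and the full comparison makes the final call.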
This kind of depth in the material is great, of course. It means you can learn what you need to hone MySQL for a specific scenario. Though MySQL performs well as a general-purpose database server, a lot of people striving for high performance need to push the server really hard on one specific problem. Think about del.icio.us, for example. Imagine the queries they run. They’re far from general-purpose! Including specific details in such depth is very helpful for people trying to solve specific problems.
But it makes for an interesting and difficult challenge for us as authors: we have to figure out how to organize the material so you can use it. In some ways, it is a classic multiple hierarchy problem. Chapters, sections and subsections are a hierarchy. That’s the way books work, but one hierarchy can never adequately address multi-dimensional data, and MySQL is definitely a multi-dimensional topic.
Let me give you an example: we have chapters on architecture, query optimization, and schema optimization. Each of these topics has storage-engine-specific details. We can place all the details in a section titled “Engine-Specific Notes,” but then where will you go to learn about each storage engine? You’ll have to read every chapter’s notes section. We could stuff it all into a chapter called “Storage Engines,” but that chapter would hardly make sense without discussing a lot of architecture, queries, and schema optimization, would it?
Ultimately this problem is not solvable in a static book, which can only have one hierarchy. If it were a data warehouse, we could give you multiple dimensions and let you drill into the topics any way you please. In a book, the best we can do is try to arrange things where they make the most sense and seem to go with the other material the best, and then give you cross-references and a great index.
This is just one of the interesting challenges in writing that is very reminiscent of good software engineering, where code needs to be massaged into the place where it fits best. Actually, code is easier than this, because in a well-designed system, there’s usually just one best place for some bit of functionality to go. There’s usually no single best place in a book.
Working with multiple authors who have different talents and expertise also reminds me of collaborative programming, but maybe I’ll write about that another time.