Xaprb

Stay curious!

Archive for the ‘NoSQL’ tag

A review of MongoDB, the Definitive Guide by Chodorow and Dirolf

without comments

MongoDB, the Definitive Guide

MongoDB, the Definitive Guide

MongoDB, the Definitive Guide, by Kristina Chodorow and Michael Dirolf, 2010. About 200 pages. (Here’s a link to the publisher’s site.)

This is a good introduction to MongoDB, mostly from the application developer’s point of view. After reading through this, I felt that I understood the concepts well, although I am not a MongoDB expert, so I can’t pretend to be a fact-checker. The topics are clearly and logically presented for the most part; there is a small amount of repetition in one of the appendixes, but I don’t mind that. The writing and editing is top-notch, as I’ve come to expect from O’Reilly.

Read this book if you want to learn what MongoDB is, what it does, and how to use it. Don’t expect that you will learn everything there is to know about topics such as administration and tuning, although it’ll be a good start. (The MongoDB documentation is an excellent reference to continue your education in those areas.)

You might be pleasantly surprised at the lack of hype in this book. It wasn’t written by wide-eyed fanboys, and it does mention the weaknesses of MongoDB, although it understandably doesn’t spend any time bashing MongoDB for having shortcomings. I think you’ll get a balanced view of the database’s strengths and weaknesses, certainly enough to make a responsible decision about whether it’s worth investigating more deeply.

To sum up, as I wrote to the authors, “Nice book. Very well written, very clear and objective.”

Written by Xaprb

December 14th, 2010 at 5:44 pm

A gentle introduction to CouchDB for relational practitioners

with 11 comments

CouchDB is a document-oriented database written in Erlang that addresses a particular “sweet spot” in data storage and retrieval needs. This blog post is an introduction to CouchDB for those of us who have a relational database background.

A CouchDB database doesn’t have tables. It has a collection of documents, stored in a B+Tree. A document is a collection of attributes and values. Values can be atomic, or complex nested structures such as arrays and sub-documents. When you add a document to a database, CouchDB stores it in the B+Tree, indexed by two attributes with special meaning: _id and _rev.

CouchDB lets you store related data together even if it isn’t all the same type of data; you can store documents representing blog posts, users, and comments — all in the same database. This is not as chaotic as it sounds. To get your data back out of CouchDB in sensible ways, you define views over the database. A view stores a subset of the database’s documents. You can think of them as materialized partial indexes. You can create a view of blog posts, and a view of comments, and so on. Each view is another B+Tree. It stays up-to-date with the changes you make to the database.

You can structure your documents any way you want. There is no fixed schema. If you decide after a while that you want to add tags to your blog posts, you can simply write new posts with a collection of tags and save them into the database. Old posts won’t have tags, but that’s OK; if your application code can read the old format and write the new format, you have an application that doesn’t need a fixed schema.

Updates are never done in-place. Everything is copy-on-write. New revisions are saved into the database as new documents, obsoleting old ones, and CouchDB increments the _rev property each time. To update a document, you fetch it, change it, and send it back, specifying the _id and the most recent _rev. If someone else changed the document in the meantime, your _rev is stale, and your update fails. You must re-fetch and re-save; you can’t lock a document.

CouchDB runs on HTTP and JSON. All of its operations, such as store and retrieve, are standard HTTP requests. The documents themselves are represented in JSON. You can talk directly to CouchDB with curl, Ajax, and anything else that can speak HTTP. There is no “protocol” other than this. CouchDB isn’t just Web-friendly, it is actually made of the same technologies that the Web is made of. You query CouchDB by specifying the database, document ID, view name, and so forth directly in the URL. For example, to fetch a blog post document from the “blog” database, you might issue a GET /blog/helloworld. Queries against views and other objects have simple clean URLs, too.

CouchDB uses special documents, called “design documents,” to store JavaScript code in the database. The code defines the views I mentioned earlier. Another thing you can store is validation functions. This is code that CouchDB executes when you save a document to the database. It accepts a document as input, and can reject it, so you do have control over the schema of documents — it doesn’t have to be a free-for-all. In the blog application, you can have a validation function that starts by enforcing “every document must have a ‘type’ property, and its content must be one of (post,user,comment).” Then you can have separate validation logic for each type of document.

Design documents can also contain something called “show functions.” CouchDB will execute the function’s code in response to HTTP requests to that URL, and send the resulting data back as an HTTP response (as usual). With show functions, you can store entire applications inside the database. Your browser might never even know that it’s talking to a database directly, instead of a web server with a database behind it.

CouchDB isn’t designed for arbitrary queries at runtime. You can only query one view, show function, or database at a time. You can’t do joins. You can’t do arbitrary GROUP BY and ORDER BY. You have to decide in advance what operations you’re going to need, and build views for them. You can then issue requests to those views, essentially the equivalent of key lookups and range scans with a few basic options such as an offset, limit, and reverse order. Now, having said that, you can define views that reduce the database down to aggregates, create a custom ordering, and so on. You can define the equivalent of the relational “project” operation inside your view code.

Here’s how: a view is a map-reduce operation. A view is defined in two parts, the map and the reduce. The map is not optional; it generates the contents of the view. It is a JavaScript function. CouchDB iterates over the database and feeds each document into the function, collects the results, and inserts them into the view’s B+Tree index. Inside the view function’s code, you emit key-value 2-tuples.

  • The key will identify the tuple in the index that’s built to store this view. It can be simple or complex, so you can create a view that’s keyed by [this,that,the_other_thing]. The view will be ordered by the same thing; that’s how B+Trees work.
  • The value you emit is whatever you want the B+Tree to store at its leaf nodes, and can also be complex (it’s a document, like any other).

The “reduce” part of the operation is optional. It computes what is stored in the non-leaf nodes of the B+Tree index. For example, you can use it to create aggregates, such as summing up counts of comments. In addition to the reduce part of the code, there is a “rereduce”. The rereduce is called as the operation is invoked on higher and higher non-leaf nodes, all the way to the root of the tree. CouchDB knows how to take advantage of the data that’s stored by these reduce and rereduce operations, so for example, it doesn’t necessarily have to descend all the way to the leaf nodes and scan in order to count how many documents match a particular query.

An important thing to know about all this code is that nothing is allowed to have side effects. You can’t modify the database in a view definition, for example. Documents are immutable; it’s all copy-on-write. You get input; you can specify output; that’s it, period. It’s a form of functional programming. Why do we care? Because it keeps things simple and elegant, and enables all kinds of nice properties and functionality, such as replication and eventual consistency and cache expiry and scaling to multiple nodes and so on.

The database file is append-only. Old versions don’t automatically get cleaned up. The database grows forever until you compact it. This process builds a new database and then does a swap-and-discard. The append-only, copy-on-write design makes backups easy, and data corruption unlikely.

CouchDB comes with a “graphical user interface” called Futon. It’s built right into the database, and surprise! — it works through HTTP and Ajax. You just fire up CouchDB, point your Web browser to /_utils, and go. It’s a fun way to explore CouchDB.

With all that in mind, why would you want to use CouchDB instead of a relational database? For most things I’m involved with, I want a relational database. But I got asked recently to help with a database that’ll store records about people. Although nobody has implemented anything yet, it’s a terrible match for a relational database, and an excellent fit for a document-oriented one. The inputs are going to be arbitrary documents with different structures, such as census records, birth records, tax records, estate and probate records, marriage records, and so on. Nobody knows what it’s going to store in the future. When people build “flexible schemas” in relational databases, they usually go for the so-called EAV or EBLOB models. In other words, they aren’t using the database relationally at all, and it simply doesn’t work well. This type of project needs a document-oriented database.

I’ve left out a lot of important details, but the point of this post is to understand the high-level CouchDB concepts and how they’re implemented, so you can reason for yourself about it. If you’ve read this far and you think that CouchDB might be a good fit for your needs, I encourage you to take a look at CouchDB, The Definitive Guide.

Update: CouchOne adapted the above post as a white paper and guest-posted it on their own blog.

Written by Xaprb

September 7th, 2010 at 5:23 pm

Posted in CouchDB,SQL

Tagged with

A review of Web Operations by John Allspaw and Jesse Robbins

with one comment

Web Operations

Web Operations

Web Operations. By John Allspaw and Jesse Robbins, O’Reilly 2010, with a chapter by myself. (Here’s a link to the publisher’s site).

I wrote a chapter for this book, and it’s now on shelves in bookstores near you. I got my dead-tree copy today and read everyone else’s contributions to it. It’s a good book. A group effort such as this one is necessarily going to have some differences in style and even overlapping content, but overall it works very well. It includes chapters from some really smart people, some of whom I was not previously familiar with. John and Jesse obviously have good connections. A lot of the folks are from Flickr.

Here are the highlights in my opinion.

  • Theo Schlossnagle, who has a place on my list of essential books, opens things with an overview of what web operations really is, and why it’s hard. Don’t skip this. Theo’s introduction is concise and thoughtful.
  • Eric Ries discusses the benefits of continuous deployment. He is right on the money. Right out of college I spent 3 years as a developer at a company with very little engineering discipline, and then left for another company built by a small ace team practicing extreme programming. Eric nails the benefits of continuous deployment — he really gets it. I hadn’t heard of Eric before, but now I’ve subscribed to his blog.
  • John Allspaw (whose book on capacity planning is also on my list of essentials) and Richard Cook discuss how complex systems fail. This chapter appeared in part as a whitepaper and blog post on John’s blog, and is expanded in this book. I have spent a lot of time examining failures for clients, and as VP of Consulting, also a lot of time examining Percona’s own mistakes. I fully agree with the conclusions in this chapter. A few key points: there is never a single root cause; our desire to find one blinds us and keeps us from learning; true failures are inherently unpredictable and happen only when a series of things fails; avoiding failure requires experience with failure. This echoes another book I’ve read recently, The Black Swan.
  • Brian Moon’s chapter on unexpected traffic spikes. If you get a chance to hear Brian speak, take it. He’s an engaging guy with interesting and relevant stories to tell. Stories are always a better experience than bullet points.
  • Jake Loomis’s chapter on postmortems. My own research into prevention of emergencies agrees almost perfectly with his list of things to do on page 225. Read this chapter carefully! Now, knowing how to put this into action is hard — very hard — but at least you’ll have a place to start. The worst compliment I ever got after fixing a system that’d run out of hard drive space (due to utter lack of basic monitoring) was that I’d “saved the day.” Baloney. Postmortems can be a great way to learn your infrastructure’s weaknesses and prevent emergencies in the future. I’m fully confident that this particular client will again deploy new servers without adding them into Nagios, and the results will be predictable.
  • Naturally, my chapter about choosing a relational database architecture for web applications (skewed towards MySQL). There is a chapter on NoSQL databases by Eric Florenzano as well, but it is more introductionary-level.

What wasn’t so good? I didn’t get a lot of value out of John’s interview with Heather Champ, on community management and web operations. I did not think the interview format worked well in a book full of essays. But that might just be me. Also, a couple of places in two or three chapters felt a bit rant-ish without a lot of clear actionable advice; I think readers won’t get so much out of this.

Overall, though, this is a great book, badly needed, on a topic that is simply not yet recognized for its true importance. As Theo writes, we’re seeing the emergence of web operations as a very large profession; it’s one whose definition is not yet formalized or agreed-upon, but that’ll change. It’s too important not to. Jesse’s introduction repeats this sentiment: the world now relies on the web, and so the world relies also on the engineers who make it run. Web operations is work that matters.