Xaprb

Stay curious!

How to unit-test code that interacts with a database

with 15 comments

I got some interesting comments on my previous article about unit testing Maatkit, including echoes of my own conversion to the unit-testing religion. One of the objections I’ve heard a lot about unit-testing is how it’s impossible to test code that talks to a database. “It’s too hard,” they say. “Oh, it’s easy to test a module that calculates a square root, but a database? Way too much work!”

Note: As commenters have pointed out, I’m not necessarily using “unit” in the agreed-upon way here. Everything I say can be applied to ultra-pure unit testing too, but I go beyond that. I will hold fast to my assertions about mocking though *grin*

Is it really impossible or even hard?

I disagree. In one of my previous articles I said The Rimm-Kaufman Group, my previous employer, has a comprehensive unit-test suite. When I say comprehensive I mean it: database interaction is fully tested, too. I know because I was heavily involved in building it. Even extremely complex things like big reports that are generated from lots of data are tested. And believe me, sharding the databases would have been much harder without complete code coverage. It’s really not that complicated to unit-test against a database, and it’s so worth it. Here are some hints about how you can do this.

There are many ways to do it, but I’ll just describe the basics of the system I helped build. There are several moving parts to the test suite (”smoke“), but one of them sets a magical environment variable. And then, all code that connects to a database server magically gets back a different database connection from the create_me_a_connection() function. This is because there is a database connection abstraction library that respects the environment variable. It’s really pretty simple for the most part; instead of doing DBI->connect(…) you just call this function, which is a thin wrapper that hands back a connection object.

This wrapper is itself unit-tested thoroughly, too. This ensures that when some code is being run from a test, it cannot (I mean cannot!) connect to a production database, and vice versa. There are some conventions about production and test servers that make sure the abstraction library can tell for sure. If there’s any confusion, of course, it will die in a non-recoverable way. Safety first.

Building a good development environment

Just as each developer has their own copy of the code from version control, each developer has their own private database server running on the dev machine. There are some simple conventions that make this possible: Unix user ID plus a constant for the port number, etc. It’s really quite easy. The private database server is a slightly modified version of Giuseppe Maxia’s MySQL Sandbox tool. It can be torn down and set up afresh as desired. It is wiped clean and re-filled at the start of every test, with a small, tightly focused dataset carefully chosen to represent the conditions the code is supposed to work with. (Each test has its own dataset).

If this sounds like a system that can’t work on a large scale, well, it does. That’s the secret sauce that I won’t reveal in this post. (It’s my past employer after all, and I can’t go revealing everything about them can I?) You just have to be smart about it. When a database is central to your business, you either figure out how to get this right, or you pay the consequences in lost time and poor code quality.

I and the other developers there (another secret: it’s a small team; small teams build great things) built several quick utilities to help develop unit tests against a database. There are utilities to get a minimal necessary dataset for testing and dump it into a file that can be loaded by the test. There are utilities that can migrate schemas and update the tests to match the schema changes. And so on, and so on. This is possible because of careful planning for testability, and really smart things like super-consistent and sensible naming conventions for database objects. (Ruby On Rails owes a lot of its success to simple things like this, too. Conventions are really powerful.) Maybe I’ll write about the database naming conventions some other time — I have to credit Alan Rimm-Kaufman a lot for designing those conventions. It was a stroke of genius.

Things to avoid

There are several things I do not recommend doing when you unit-test code that talks to a database. I’ll just mention a couple:

    Don’t mock anything! In general I think mocking is the devil. Most of the mock objects I’ve ever seen reflected a propensity to test an implementation instead of a behavior, which is also the devil. Write all your code to test a test instance of something real, and do not mock up a database to test against. It is a rabbit-hole that you will not emerge from easily.
  • Never let a test connect to a production database. Never, ever. Worlds of hurt will follow. Not only are you risking your production data, but what about the risk to your code? You’re testing against things that will almost certainly change and break your tests; and you’re possibly polluting your live data with testing data and/or changing live data from the tests.
  • I also recommend developing unit tests for your current database functionality if you’re thinking about changing it much. Don’t like MySQL’s lax error handling? Plan to set the SQL_MODE to something stricter? Dive into that database abstraction library and make your tests run in strict mode first by setting SQL_MODE on every new connection that’s created when running inside a test; fix all the breakage in the test suite; feel sure that your code isn’t going to break in production. That was easy!

Summary

Once your creative juices get flowing, you’ll see tons of places your unit test suite can help you out.

If you’re in the Oracle or SQL Server world, or any other world where you can’t just set up and discard database instances at will due to licensing problems, you’re going to have to be a little more inventive. But you can still do it. (Don’t you wish you’d chosen Freedom?) And unit tests are just as beneficial for apps based on Oracle as they are for MySQL.

Have fun! Go forth and test some more!

Written by Xaprb

August 19th, 2008 at 8:47 pm

15 Responses to 'How to unit-test code that interacts with a database'

Subscribe to comments with RSS or TrackBack to 'How to unit-test code that interacts with a database'.

  1. During my last project, I tried to write a test suite as comprehensive as I could.

    I ended up using (more or less) the solution that you mention (private database server). I also found that mocking a whole database was not worth the trouble and was hiding a lot of issues.

    Still, there are two things that I’d like to add:

    - Strictly speaking, this is not unit-testing anymore
    - The tests should let the database the way it was before running them

    http://en.wikipedia.org/wiki/Unit_testing

    > In order to test [a class that depends on a database], the tester often writes code that interacts with the database. This is a mistake, because a unit test should never go outside of its own class boundary

    http://www.theserverside.net/discussions/thread.tss?thread_id=23702

    > Design your tests in such a way that they are fully atomic. That is, when the test is done, the database should be either exactly the way it was, or in a state that will not affect the results of later tests.

    Norikazu

    20 Aug 08 at 4:38 am

  2. 1. This is not unit testing, but integration testing.
    2. If you wan’t to unit test database interaction code you have to simulate the output stream from your db. Mocking is needed here!
    3. Sql Server and Oracle don’t require licenses for databses running on one machine. I hope you don’t need multiple machines to run you db just for running your tests?
    4. MySQL has nothing to do with the free software foundation. And most people do buy lisences for MySql.

    Alexander

    20 Aug 08 at 7:31 am

  3. I agree with the other comments that this should not be considered unit testing. Unit tests are about the logic, and testing all the combinations. Having to setup all that data, then load it/scrub it with each test is contrary to getting fast quick checks of the logic of the system. Databases should be “mocked” out, and there are different designs that allow for better testing that J2EE’s default way.

    I would add that using (For Java) DBUnit to do the tests of the database is a great idea, not for its JUnit framework but the API itself. You can also write a quick utility that captures the data set into a file to allow reasonably easy generation of integration data sets.

    Paul Keeble

    20 Aug 08 at 8:34 am

  4. Now from a practical standpoint…

    Just “mocking” out the database isn’t going to do it for you… at some point in your SQL code you’re going to have a data dependent conditional or case statement. If your unit tests are thorough, you’re going to have a separate test for each path. You’ll need the data to drive down those paths or your tests aren’t going to work.

    The creation of your test data (or researching to make sure your real data set has enough data to test everything) is an important part of the design/development process. It’ll really get you to thinking about how your application and your data interact. There have been more than once instances where I’ve realized that I had an unspoken for data condition in my original design.

    The utPLSQL framework implements a simple ut_setup and ut_teardown function for just this purpose. Designing the test might not be so quick because you have to have all the supporting data for the test. But the actual running of the test is fast and efficient.

    Gregory Haase

    20 Aug 08 at 9:05 am

  5. Norikazu,

    “… found that mocking a whole database was not worth the trouble and was hiding a lot of issues.” Exactly! Every bit of code you mock is a test that is potentially WRONG. You are not testing against reality. Mocking is deeply flawed as a testing methodology in my opinion. I have never seen it cause anything but trouble.

    “Strictly speaking, this is not unit-testing anymore” OK. I’m probably being a little loose with the terminology. But one man’s unit is another man’s module is another man’s whole system. I have a bit of code that takes some input and produces some output; I write a test that creates an object and calls it with the input; I get the output. Black box; behind that box is thousands of lines of code and lots of database interaction. To me, it quacks like a unit test, so it must be one, at a high level. I’m testing the Report::Weekly::GeneratePerClient unit. Who cares if there’s more than one unit inside it? It’s all recursive unit testing in my view, and I don’t draw a distinction between testing at the bare-metal level and testing at a higher level. I want to guarantee that the code works.

    “The tests should let the database the way it was before running them” Yep. That’s why I said it’s torn down and set up before each test. Each test, not each smoke run. That means each test gets a fresh DB instance with pristine data. That’s the part that you have to be clever about.

    “In order to test [a class that depends on a database], the tester often writes code that interacts with the database. This is a mistake, because a unit test should never go outside of its own class boundary” You know, that’s a very OO-centric view of the world. Objects are not the be-all and end-all.

    Alexander,

    “If you wan’t to unit test database interaction code you have to simulate the output stream from your db. Mocking is needed here!” That sounds like an artificial constraint to me. Why mock a database when you can just, uhm, USE A DATABASE? Again: mocking is the DEVIL. Who made the rule that you have to simulate something you want to test against? You ever see a really comprehensive test suite written this way, mocking the input and output of the entire DB and covering all code paths? I tried that once. Nightmare, nowhere near comprehensive, impossible to do; the mocking ended up being more code than the code itself. The goal here is intensely practical: test the code, all the code, and don’t write any tests or code that’s unnecessary. A mock is unnecessary, and because it’s not reality, it’s WRONG. Unless it is indistinguishable from a database, and that means it IS a database. I’m not going to go reinventing the database.

    “Sql Server and Oracle don’t require licenses for databses running on one machine. I hope you don’t need multiple machines to run you db just for running your tests?” Nope. But if you have a per-instance Oracle license, you better be careful about just casually giving each developer a private instance. There are other ways to do it — you just have to make sure nobody interferes with anyone else’s work.

    “MySQL has nothing to do with the free software foundation. And most people do buy lisences for MySql.” MySQL would love it if that were true. As it is, a) it’s GPL licensed, which is why I linked it with the word Freedom; b) I’d like to see your proof about people buying licenses. I’m a consultant and I don’t see “most” people buying licenses. I see a lot of MySQL users.

    Paul Keeble:

    “Having to setup all that data, then load it/scrub it with each test is contrary to getting fast quick checks of the logic of the system.” That’s why I said you have to be smart to do this. RKG has a very smart team.

    Show of hands, who has thoroughly tested a large complex system that interacts with a large complex database? I can tell Gregory Haase has!

    Xaprb

    20 Aug 08 at 9:40 am

  6. If you plan on using a live (development) database instead of a mock, transactions can be your friend.

    Simply create a wrapping transaction at connection creation and roll it back at connection tear-down. Any database that properly supports nested transactions will allow the testing of not only “bare” data access, but the correct implementation of triggers, materialized view updates, rollbacks on “errors” and other esoterica that are being tested, directly or indirectly.

    With MySQL, this of course means you have to be “clever” if you want to test transactional database updates, as it doesn’t support nesting to my knowledge (at least it didn’t last time I looked: nesting transactions was an error).

    John Lopez

    20 Aug 08 at 12:02 pm

  7. Xaprb: You might as well drop the word “unit” from your article. What you describe is not unit testing. You can’t just take terms with an agreed upon meeting and re-use them how you like, and not expect people to call you on it.

    One of the main tenets of unit testing is that you do NOT cross class boundaries. Your class should define a contract with the outside world. Your unit tests test against this contract. This results in flexible code, where anybody that wants to step in and fulfill one end of the contract can do so. This results in flexible, loosely coupled code, and is also the reason why mocks make any sense at all. When your interface between units is defined in terms of an interface, it’s trivial to replace a “real” implementer of that interface with a pre-programmed mock. You’re not testing implementation, you’re testing your object-under-test’s ability to fulfill its end of the interface contract.

    With all that being said, in the narrower instance of integration testing code that talks to a database, the solution you suggest is quite appropriate. Mocking out a database is for masochists only, and having automated teardown/creation scripts is a good solution to the “pristine database” issue, simpler to manage and more fool-proof than trying to undo your actions after the fact.

    It’s also possible to push the ACTUAL interaction with the database as far to the back as possible, by utilizing a Repository classes that perform only simple, atomic DB operations (CRUD, along with managing parent/child relationships), and return back collections of business objects, with the app/service->DB interface defined in terms of operations on these objects. In this scenario, the repository side of the interface can be mocked easily. All the details of SQL (or whatever) vanish behind the interface. Since the mock repository is essentially creating the business objects out of thin air, this forces you to focus on YOUR object’s behavior, known inputs and expected outputs, and prevents you from becoming dependent on certain conditions in the database for your tests to behave properly.

    So again, once we properly define our terms, you make some good points. I just think that by confusing your terminology, you’re causing people to focus on those issues rather than the ones you’re trying to discuss.

    Oh, one more thing: blanket labelling a tool as “the devil”, rather than recognizing that there are appropriate and inappropriate uses for it, suggests a lack of critical thinking about the issue. A path to the dark side, this is.

    JustinB

    20 Aug 08 at 12:04 pm

  8. John Lopez, unfortunately this doesn’t work with MySQL: DDL is not transactional. Alas. JustinB: point taken, note added to top of article about loose use of “unit” testing. (Although there is a lot of code in RKG that does fit the strict definition of not crossing class boundaries, too.)

    Xaprb

    20 Aug 08 at 12:37 pm

  9. “But if you have a per-instance Oracle license, you better be careful about just casually giving each developer a private instance.”

    Oracle, SQL Server and DB2 all now have “Express” editions that are free for all purposes (even commercial production), with the limit that your database must be smaller than 4GB. Don’t just give each developer a private instance; let them run the server on their own computer.

    Bill

    20 Aug 08 at 12:44 pm

  10. Accessing a database is not actually input and output. Instead what normally happens is we rely on side effects, where calls to methods affect other calls and we defend ourselves somewhat by using transaction boundaries. Of course all OO development is based on side effects, but the difference with classes that access databases is that they have state which is not close to the data. Anything with such potentially wide spread unit tests is hard to test. Anything that requires global data is equally difficult to test, and database connections work this way also.

    Don’t get me wrong, mocks are bad idea as well. The whole design we have for enterprise systems today is normally a bad idea, but its better than the known alternatives. There are however other architectures which don’t rely so much on side effects and don’t require the call the database to happen as part of the call graph to the business logic. I would advocate not following the service interface that is so popular with J2EE and RoR because it is seriously hard to test in any large system.

    If you’ve designed your app like that then your only choice is to call onto the database because the sheer effort of doing anything else precludes other approaches. The best you can do is try to reduce the impact and use an in memory databases and transactional roll-back and keep your application database agnostic. But if you go the full hog with stored procedures etc then a world of pain is inevitable. This is why very few applications have decent unit tests.

    Paul Keeble

    20 Aug 08 at 1:32 pm

  11. Paul: One of my favorite sayings: “Any problem in computer science can be solved with one additional layer of indirection. But that usually will create another problem.” - David Wheeler

    At some point, unit testing has to break down. You will eventually hit a point where your classes are talking to some outside ecosystem that you didn’t write or don’t control, don’t have an interface for, etc. Often, this is your database.You can’t avoid this. What you can do, is add another layer of indirection between your class and this data source. The goal here is to have the interface for the “last mile” so simple that you can, in good conscience, leave it un-unit-tested.

    Unit testing is definitely hot right now, but I think part of writing effective unit tests in knowing when NOT to write them. I like to think of things in terms of responsibilities. If your class has clear cut responsibilities, IE: take in some input, process it, produce output, then it should be unit tested.

    If the logic is more like: take in some input, make calls to an external data source, process this, use the resulting data to drive more requests for data, then update a resource somewhere and return or set status, then you factor these operations out into smaller operations, and unit test those. And write an integration test that validates the overall behavior, without worrying about setting up a pristine environment and testing it in isolation.

    The second half of the quote is also true, indirection may solve the problem at hand, but will tend to introduce complexity elsewhere. In our database/repository example, you can now easily test your service layer in isolation, but you’ve now created yet another set of classes that you have to maintain alongside your database and which, technically, are not tested. It’s always a trade off, and you have to learn to find the sweet spot when breaking your code down into testable chunks, where the advantages of testability outweigh the burden of maintaining the additional indirection.

    And of course this is just one approach to the problem. Creating an in-memory database and letting tests run against it has a different set of tradeoffs.

    I think the one thing we can all agree on is that trying to use Mocks to simulate your database is just a bad, bad idea. :)

    JustinB

    20 Aug 08 at 2:04 pm

  12. justinb nailed this. i dont think it could have been written any better. nice job justin.

    ryan

    21 Aug 08 at 2:45 am

  13. I agree, unit tests accessing a development-database are possible, also with other database systems like Postgres or Oracle.

    DBUnit (http://www.dbunit.org/) is a tool that can help you to get test data into the database easily.

    To migrate or setup a database using SQL-scripts, groovy, Java, DBUnit or a mixture, there is DBMigrate and some other DB-tools (e.g. code-generation based on DDL-Scripts), that you can find at: http://code.google.com/p/agimatec-tools/

    Roman

    21 Aug 08 at 3:06 am

  14. Good article, good food for thought.

    Just to clairify and extend another’s comment; SQL Server has an express edition, which is suitable for many apps as well as many testing scenarios. But even in the case where, for example, you need the features of the enterprise edition of SQL Server, your MSDN seat license allows you to stand up and tear down as many SQL Server instances as you like for testing purposes.

    There are also some nice data tools for MS SQL Server - with sample data generation capability being one of them.

    Dino

    21 Aug 08 at 11:43 am

  15. [...] Here are a couple of concrete ideas. With the conventions I’ve shown, it’s easy to write a simple recursive program that can examine your entire database for data consistency, based only on naming conventions. And you can easily write a program to dump an account and all its related data (client, employee, and so on) for such purposes as migrating a client to a new shard or creating a dataset for a test suite. [...]

Leave a Reply