Archive for the 'XML' Category

Improved Cacti monitoring templates for MySQL

Download MySQL Cacti templates

As promised, I’ve created some improved software for monitoring MySQL via Cacti. I began using the de facto MySQL Cacti templates a while ago, but found some things I needed to improve about them. As time passed, I rewrote everything from scratch. The resulting templates are much improved.

You can grab the templates by browsing the source repository on the project’s homepage.

In no particular order, here are some things I improved:

  • Standard polling interval and graph size by default.
  • Full captions on every graph; you don’t have to guess at how big the values are. Each graph has current, max, and average values printed at the bottom for every value on it.
  • Much more data is captured. I’ve graphed almost everything I could think of.
  • The graphs are grouped better. Most graphs have only related values. There are some exceptions, but not many.
  • The templates don’t hijack your existing installation. They don’t depend on or alter anything in your default Cacti installation.
  • The script that gathers the data is totally rewritten from scratch, and much improved. For example, the math works on 32-bit systems. It has caching built-in so each poll cycle results in just one request to the server, instead of one request per graph. (This is a weakness of Cacti I’m trying to work around). It also has debugging aids and other good coding stuff.
  • By default, it assumes you have the same username and password across every server you’re monitoring, so you don’t have to fill in a username and password for every single graph you create.
  • One data template == one graph template. This helps work around another Cacti limitation.
  • Lots more. Honestly I can’t really remember everything I’ve done. I’m sure you’ll help me remember by asking me how to get X feature working the way you want, and I’ll go “oh, yeah, that’s another thing I improved…”

Cacti templates are very laborious to create if they’re complex at all; it takes a long time and is very error-prone. Instead of doing it through Cacti’s web interface and exporting a huge XML file, I eliminated the redundancies and created a small, easy-to-maintain file from which I generate the XML template with a Perl script. This gives the added benefit of letting me (or you) generate templates with different parameters such as polling interval or graph size. The README file has the full details. However, I’ve pre-generated a set of templates that matches Cacti’s defaults, so you can probably just use that.

This has taken a lot of time. In particular, I spent a lot of time working on it at my former employer, The Rimm-Kaufman Group (kudos to them for letting me open-source the work) and I just spent most of my weekend writing the scripts to convert from the compact format to XML templates, so it’s possible to maintain these beasts. Plus I had to develop the compact format, too. This took a lot of time because I had to understand the Cacti data model, which is pretty complex.

Please enter issue reports for bugs, feature requests, etc at the Google project homepage, not in the comments of this blog post. I do not look through comments on my blog when I’m trying to remember what I should be working on for a software project.

If these templates help you and you feel like visiting my Amazon.com wishlist and sending something my way, I’d appreciate it!

PS: You may also be interested in Alexey Kovyrin’s list of templates for monitoring servers.

Technorati Tags:, , , , , ,

You might also like:

  1. What’s the best way to choose graph colors?
  2. A new home for innotop in the new year

PostgreSQL Conference East 2008

PostgreSQLAs a I wrote a couple of days ago, I went to the second day of PostgreSQL Conference East 2008 last Sunday. I had a good time and really enjoyed meeting everyone, listening, learning, and occasionally talking. I asked a number of fearless-newbie questions that paid off handsomely: people were very willing to humor me. I also left with a beautiful t-shirt, mug, and bag combo thanks to EnterpriseDB. The bag has already been put to use for a grocery shopping trip.

Note to conference/website organizers: I can’t link to anything but the front page, so I assume my link above will someday point to the 2009 conference, or the 2008 West conference. It would be good to give each event a permalink right from the start…

One thing that surprised me was the distance people traveled to attend. I thought this would be an east-coast USA thing, but people came from Portland, Russia, and beyond.

Roundtable

The first event was an open discussion. At the front of the room were Bruce Momjian, Joshua Drake, Magnus Hagander, and Selena Deckelmann. The first question was about the future of Postgres: what are the goals for the 9.0 release? The answers varied, but generally the sense was that in the future Postgres should continue to add more features and not only catch up to, but surpass the “big boys.” Special mention went to recursive queries, windowing functions, point-in-time recovery, and more standards compatibility.

This was followed by a lengthy discussion on user groups, global vs. local, and so on. One interesting quote here is that no one can buy Postgres because there’s literally no one to buy it from.

After that I poked my hand up and asked what you say to people migrating from other RDBMSs, such as MySQL. I received a warm welcome, a statement that Postgres is hands-down superior to MySQL period, and a lot of interesting commentary on the differences in the communities between the two. I have been thinking a lot on the MySQL community and am not yet ready to put my thoughts into words, so I’ll just give an overview of what the panelists said: the communities are quite opposite in many respects, both organizationally and psychologically.

This was followed by a question about how to encourage development of a feature that “people need.” This also went quite deep into the open-source mindset and development methodology, with people pointing out that the Postgres community is a meritocracy and you cannot co-opt it with money. At the same time, what “the community” wants isn’t what goes into the codebase: the itches that get scratched are the hacker itches, not the community itches. Sometimes these are one and the same.

Apparently one of the community itches is in-place upgrades. I gather that an upgrade requires a dump and reload because releases are not capable of reading files written by previous releases. This sounded like a pretty severe problem, yet the “hacker itch” wasn’t there. People said that they frequently get told “that’s already solved: dump and reload.” Not a solution with large data volumes.

The discussion then turned to why more people aren’t capable of meeting their own needs. My personal belief here is that the big corporations are buying the minds of the smart people by infiltrating universities and schools, and we (we the citizens of the USA, not we the hackers) are just standing by and letting it happen as though it’s a good thing for powerful vested interests to be “giving” our schools “free” software and other things that they cannot inspect, hack, and change. The other problem is that universities aren’t teaching data. They’re teaching everything but data, yet that’s the most important part of the technology economy today. Tools are not as important: they exist only to work with data. You’re lucky to find someone who’s been university-educated in any database, much less an open-source or Free Software one.

Most of what I heard from the panel agreed with my personal views, but they didn’t focus on the problem in the university as much as I feel is important. And just as importantly, perhaps I didn’t hear enough recognition as I wished that there’s a real chance to change this: commercial/opensource companies like EnterpriseDB can really pull a long lever here by counter-infiltrating the classroom. Aside from just legislating the proprietary software right out of the classroom — which I think would be a good start — we can subvert them also.

Around this point someone in the room opined that one of the things that’s unique about Postgres is the difficulty of finding a competent DBA, and the expense of hiring them. This person said that it’s easy to find Oracle DBAs, and you can hire a good MySQL DBA a dime-a-dozen for $35,000 USD per year. I kept my mouth shut, but suffice to say this is not my experience at all. I think we’re all in the same boat here, and this is a case of the grass looking greener on the other side.

The great quote I heard in this session was “We take Oracle DBAs and try to break them.” Someone please step up and take credit for that one :-)

SQL/XML for Developers

This talk was by Lewis Cunningham of EnterpriseDB. He introduced people to XML and then showed the functions that have recently (which release?) been added to Postgres for manipulating XML documents and document fragments. There’s also a native XML datatype, which I asked a few questions about. Apparently it is TEXT under the hood, with a well-formedness check in front of it. I asked a little about the storage format, and was told TEXT is stored out-of-line for large values, lz-compressed, and not allocated a page-at-a-time as with MySQL’s InnoDB engine (so it’s not as wasteful — I wanted to get a sense of whether it would be very inefficient to store XML in Postgres from the memory/disk point of view).

The XML functionality in Postgres is built upon the excellent and ubiquitous libxml2 library. The developer who coded the XML functionality was also in the room. His name is Nikolay Samokhvalov.

I asked about indexing. Since Postgres offers functional indexing (that is, you can index the result of a function — not “its indexing works”), in theory you could index XML documents by indexing the result of an XPath expression, for example. I was looking for the “yes, but” and I got it: there are some planner (query optimizer, for MySQL folks) limitations to this approach.

The great quote from this session was the response to “what would you use instead of Hibernate?” (Hibernate is a Java ORM system). The response was “hand-code it in assembly.” Beautiful.

Big, Bad, Broken, PostgreSQL

This talk was by Robert Treat of OmniTI. He described how a data warehouse turned into a train wreck and how they recovered it. The exact cause of failure is apparently still not known. But it sounded like an interesting, sleepless time. This was a pretty technical discussion. One thing I found interesting was the definition of “large” data warehouse. To my mind, a terabyte or two isn’t exceptionally large. Is that very large in the Postgres world? I’m not trying to be a jerk… just trying to understand. I think one of the reasons it might be large goes back to what people were saying about the need to dump and reload for every upgrade: doing that for a TB of data sounds like a significant barrier to building really large systems.

Monitoring PostgreSQL with ptop

This session was given by Selena Deckelmann. ptop is a top clone that is literally derived from the Unix top utility. It has the ability to monitor current queries as well as looking at the statistics from the operating system itself.

(Tangent: This is an interesting approach, and one which an innotop user has said he’s working on adding to innotop. innotop can monitor many systems at once, but it doesn’t monitor the operating system — it talks only to the MySQL server. This user was talking about opening an SSH connection to each server and looking at /proc/vmstat and /proc/diskstats as well).

Sorry for going off on a tangent. Anyway, ptop is a C app that Selena and one other person maintain. It can show the current processes, list of locks, explain queries, and so on. One interesting limitation is that it can’t monitor a whole server: it’s constrained to a single database. I gather this is because PostgreSQL’s statistics views, which it queries, are per-database.

Afterward

After the conference ended, a few of us piled into cars and followed Bruce into DC for a tour. We visited the Lincoln Memorial, the Viet Nam Memorial, went through the World War II memorial, and up to the Washington Monument. At this point I split and went back home.

All in all a great time and great people, and I’m sorry I missed the first day. This event is so close to me (3 hours drive) that I will really try to make the entire weekend next time, unless it again conflicts with my wife’s 10-mile race schedule.

Technorati Tags:, , , , , , , , , , , , ,

You might also like:

  1. Going to PostgreSQL Conference East
  2. Like it or not, it is the MySQL Conference and Expo
  3. MySQL Conference and Expo 2008, Day Two
  4. MySQL Heartbeat supports PostgreSQL

A PHP implementation of the XML DOM

Download dom4php

Several years ago I wrote a pure PHP library for manipulating XML documents with the Document Object Model (DOM) in PHP 4, without external libraries such as libxml. This is often useful on shared hosting providers, where you can’t get C extensions installed. The library uses PHP4’s built-in SAX functions, which are enabled by default. Today I’m re-releasing this library under the LGPL.

Introduction

It’s not too hard to build a DOM implementation on top of SAX. In fact, many DOM libraries actually use this technique. You just need to know the DOM core specification really well, and understand SAX really well. Everything else is easy, haha. The truth is, I don’t know how well I knew the spec back then, and I’ve no time to check right now, so you’ll have to let me know.

Since I wrote this years ago, before I was enamored of unit testing, I don’t know how good it is. I’ve used it for several years in production systems without ever looking at the actual code again — I just use it and take for granted that it works. I may or may not have time to actually write tests for it (probably not, sorry). Maybe you can help me with that. It shouldn’t be hard, but I just don’t have the time for it.

If you do want to hack the source, I encourage you to be ready to use a debugger. Getting references right is the tricky part. There are lots of references to be built and manipulated in a structure as complex as the DOM, and handling references correctly in PHP 4 is anything but easy for most people.

Documentation

I never wrote much documentation for this library, but I might attempt to remedy that at some point (I probably don’t have time though — sorry). In the meantime, here’s a synopsis to get you started:

<?php

# Create a parser and parse a simple document.
include_once("XmlParser.php");
$parser   = new XmlParser($encoding = 'ISO-8859-1'); # encoding is optional
$document = $parser->parse('<p class="test"><strong>this is a document</strong></p>');

# Add a text node.
$text =& $document->createTextNode('foozle');
$document->childNodes[0]->appendChild($text);

# Navigate around the document a bit, starting at the new node we just added.
$strong =& $text->previousSibling;
echo "The content of the node is '" . $strong->childNodes[0]->data . "'\n";

# Serialize the XML document to a string.  Do NOT use print_r() as the cyclic
# data structures will cause problems.  Instead, create an instance of the
# XmlSerializer class.
include_once("XmlSerializer.php");
$serializer = new XmlSerializer("XML");
echo $serializer->serializeNode($document);
echo "\n";

?>

The real documentation is the DOM core specification, as I said. The object you get back from calling parse() is a Document, and you just use the DOM as normal after that.

Differences from the DOM spec

The DOM spec is pretty heavy-weight, and coding something like this in pure PHP isn’t as efficient as using a C library. I made a couple of compromises for simplicity, performance, and convenience. The result should be a nearly complete DOM implementation, with much less code and overhead than it would take to follow the spec exactly. Here are the differences from the official specification:

  1. ID attributes (refer to the XML spec if you don’t know what that means) are assumed to be named “id” and are kept in a lookup table with the document. This makes sure you can’t duplicate an ID, and provides fast access to any element by ID. If you need to change the name from “id” to something else, you can do that.
  2. Attributes aren’t object-ified. Instead, attributes are stored as a lighter-weight associative array with each Node. You can set and retrieve attributes with object methods, but they aren’t objects themselves.
  3. Node contains some convenience methods not found in the official spec. These are, for example, getElementsByAttributeValue(). Most of them are only used internally, but a few are meant for external use too.
  4. Many of the interfaces in the official spec aren’t really necessary for an 80% solution, including DOMImplementation and NamedNodeMap. I omit those.
  5. No support for namespaces or namespace methods (e.g. createAttributeNS)

There may be other differences too, but I can’t think of them right now. Write into the comments if you see anything I missed. By the way, if you need some of the missing pieces such as NamedNodeMap, I can provide skeleton classes for you; I originally coded them, but then deleted them.

License

I’m releasing this under the GNU LGPL. At one time I had licensed it under the normal GPL, but this isn’t appropriate for a library, so I’m re-licensing it.

Feedback welcome, and thanks for all the fish

Please do leave feedback in the comments. Since I wrote this years ago and haven’t really thought about it since then, I have no idea how good it is — I can only say I haven’t run into any bugs in a while. Maybe I haven’t implemented some things I should have, or maybe there are braindead things I’ve done, who knows. Regardless, I hope you find it helpful.

See you next time!

Technorati Tags:No Tags

You might also like:

  1. Why not to use CSS for columnar layouts
  2. How to write unit tests for ease of refactoring
  3. Simple and complex types in XML Schema
  4. Automatic image captions with unobtrusive JavaScript

Why you should take microformats seriously

Microformats are getting some attention from the group of people who are clueful about the semantic web, but some people are not yet convinced they’re useful. While microformats aren’t as robust and complex as some more fully-featured technologies, some apparent limitations are actually possible to overcome, but with different methods than might be expected. In this article I’ll address some common concerns about microformats, and explain how they are either solvable or already solved.

The problem in a nutshell

The problem many see with microformats is they apparently don’t identify themselves as such; they rely on conventions, which are not yet agreed-upon. To recognize a microformat, you might think that you have to look for some markup such as <div class="bio vcard">. That’s not as formal as some standards, like XML, which require a DTD, a schema, or something equivalent (for the rest of this article, I’ll just call it a doctype).

A doctype is important for several reasons. Lack of a doctype means microformats can’t:

  • be discovered automatically
  • be validated automatically
  • be versioned

These could be show-stopping reasons not to use microformats for some applications. Fortunately, as I’ll explain, microformats can have the equivalent of a doctype — the mechanism is just a bit different from XML.

But wait, there are specs, right?

Yes, there are microformat specifications. The people at microformats.org have done lots of hard work, but these specs are not doctypes. They lack

  • a formal grammar
  • a formal means of identifying and locating the grammar given an instance document
  • a formal means of identifying the very presence of a microformat in the instance document

I’m a purist. Maybe it’s silly, but I’ve seen a lot of data, parsed a lot of data, done a lot of work with interoperability and interchange, and I’ve seen and used a lot of both good and bad data and programs to work with data. People like me are the reason many of the WC3 specs are so complicated. We want data that machines can know how to handle, automatically, without preconceived notions. The specifications at microformats.org explain things to humans, but that doesn’t automatically make them usable as doctypes.

What does “automatic” mean, and why do we need it?

It’s relatively easy to write, for example, a parser for hCards, because they’re pretty simple. It’s pretty easy to take such a parser and give it an XHTML document, and have it figure out whether there are any hCards in the document. So far, so good — one goal of a doctype is achieved, without even needing a doctype.

However, this isn’t everything a doctype gives you. In the example I just gave, the hCard parser has some things built into it. It knows what an hCard looks like, and it has the smarts to find one. In fact, it is built with a pre-conceived notion: “I am an hCard parser, and I look for hCards.” Notice, “hCard parser,” not “microformat parser.” That’s important. This is a “type one” parser (that’s my own naming convention, which I’m inventing just for this article).

Imagine for a moment that I upgrade the hCard parser to a microformat parser. I can give it an instance document, and it can detect the presence of any microformat in the document, parse it, and deliver it to me — suppose it’s a C library and it delivers it to me as a struct of the appropriate type. Still not too hard to do, right? There are only a few different types of microformats as I write this — fewer than twenty. The C library can certainly check for the presence of each. No big deal. Now I have a type two parser.

Now imagine that I have a microformat parser, and it knows nothing about any specific type of microformat, yet it can accept an instance document, find all microformats in it, and deliver them all to me in an appropriate struct. How can it do this? With a formal grammar, of course. That’s what a doctype is for. My new microformat parser is not so much a microformat parser as it is a doctype interpreter. It reads a doctype, generates some way to recognize microformats, and then does it. This is a type three parser.

This third parser is infinitely more powerful than the second and first. From type one to two is really not much of a difference; it’s just incremental enhancement. From two to three, though, is a quantum leap. Type three is a totally different animal. It’s like the difference between writing functions in a procedural language, and writing lambda functions in a functional language (if you’re curious about this, you may want to read about my Javascript date functions).

You cannot do this without a doctype.

Isn’t that too complicated?

That’s a good question. I just said I’m a purist, and many W3C specs are written by purists too, but I’ll be the first to say SOAP is way over-engineered (and don’t even mention all the WS-* specs). On the other hand, the microformats folks are actually creating highly complex specs, so maybe they’re not really trying to avoid complex specs like SOAP and WS-*. In that case, the lack of a doctype could be seen as a serious imbalance — all that complexity, and no doctype to show for it?

In my opinion, the microformat work is going in the right direction — it just isn’t at the stage where uber-formalism has set in, partly because microformats are not ubiquitously used yet. To create microformats, then create uses for them and demonstrate their effectiveness and usefulness, is a very good idea, even if it can’t all be as automated as I said above. I do not think microformat work should start with a committee, and produce a standard before anything else happens. That is an approach some have taken in the past, and it’s flawed. Tim Bray says it well:

I’m deeply suspicious of “standards” built by committees in advance of industry experience…

And Bruce Eckel:

A standards body should formalize existing practice, rather than inventing new practice without experience.

In any case, you could easily argue that doctypes for microformats are not the right thing to do, at least not yet. And I’d agree with you, but I’d also point out that many things doctypes enable are already possible with microformats — you just have to do them a bit differently than you’d think, depending on your background.

Who wants doctypes?

Lots of people, like Steve Farrell, have raised these issues in conversations, on the web, and in mailing lists. In fact, a lot of the discussion has started since I started drafting this article about seven months ago. Unfortunately, there’s some PR work to be done. Some discussions I see are sometimes phrased as a bolt-on solution, or a means of appeasing the purists, which defeats the point. And the existing solutions can be hard for experts in data interoperability to understand — probably because of their expertise — so they sometimes don’t take them seriously enough, thinking the solutions are bolt-on, appease-the-purist jokes.

I hope people don’t continue to be pessimistic about the future of microformats. Many standards have failed because of some show-stopping problem or other, sure, but the world hasn’t ended. One excellent example is the IPv6 mess. Even if there are some shortcomings to microformats, whatever they are will be addressed somehow.

To a man with a hammer…

Microformats are built on XHTML, and therefore XML, so there’s a huge pre-existing toolset for addressing the requirement for a doctype. If you come from the XML world, as I do, your first thought might be “we need to use XML tools to solve this XML problem.” And the first places you’d probably look are XML Schema and good old-fashioned DTDs. These have been proven over and over again, especially DTDs. Let me explain these a bit, and show you why they’re not the solution for microformats.

XML itself, and DTDs, upon which XML is built, also provide methods for extending any document type. One is simply to redefine the DTD through extensions or referring to external DTDs. I’ll give an example of this done brilliantly: TEI, the Text Encoding Initiative. Most web programmers, even those who are hip to the latest and greatest, have never heard of TEI, but behind the scenes, it’s widely used in text encoding, and has been hugely successful; vast quantities of information have been marked up with MEI. For example, historians, librarians, and other academic types use it lots. And I do mean vast quantities, even on the scale of things you consider vast, whoever you are.

DTDs and XML Schema serve similar purposes, so I won’t go into Schema, especially since XHTML is defined by a DTD, not an XML Schema instance. Both tools really do what I’m talking about — doctypes. But we’re talking about XHTML here. Yes, by its nature the X part of it is meant to be extensible, but in practice, web browsers do not read the XHTML doctype to tell them how to parse and render web pages. Web browsers are specialized parsers that know XHTML only. You can’t just feed a browser some arbitrary XML, give it a doctype, and expect it to somehow render this. You also can’t extend XHTML and expect browsers to continue parsing and rendering it correctly, because they are not “type three” parsers in my informal taxonomy. You need to work within the constraints of the technology’s end users, and the place you have some degree of freedom is in the meta-data allowed in XHTML documents. I refer to the class attribute, of course — and remember, one of its official purposes is “general purpose processing by user agents.” This is exactly where microformats are targeted.

Another tool is XML Namespaces, which are partially a formal means of doing what many web programmers try to do informally (using “semantic class names”). I suspect most web programmers don’t know what XML Namespaces are, and probably many who do consider them black magic. There’s definitely a steep learning curve, but they do make sense and are absolutely necessary, so I hope more web programmers will take the time to understand them. (By the way, XML Namespaces and namespaces for a variable in a programming language, for example JavaScript, are not the same thing; there’s been lots of thought lately on scoping and namespaces in JavaScript libraries, but I don’t want you to read this article and think you already know what XML Namespaces are just because you’re a Prototype expert. It’s totally different.)

Back to namespaces — you might be inclined to define an XML Namespace for microformats, define them with that namespace, and think that somehow this provides what you need. Unfortunately, this will break things even worse than extending the XHTML DTD, because a <div> with a custom namespace is, to an XML parser, totally unrelated to an XHTML <div>. The fact that they happen to have the same element name is a coincidence, and an unhappy coincidence at that — hence the need for a namespace to disambiguate and prevent name clashes. That is the problem XML Namespaces are designed to solve, not “semantic extensions” or some such. XML Namespaces gain you nothing here.

If you’re an XML expert, I hope by this point I’ve convinced you that you need to put down your XML tools. Just because you have a hammer doesn’t mean everything is a nail. By the way, I originally started drafting this article because I thought XML tools would be the solution, but I’ve changed my mind. It has been difficult for me to see and understand new ways of defining doctypes for microformats. Now you know why I’ve been drafting this for so long!

So, what’s the solution?

I think, but I’m not sure, the solution is XHTML Meta Data Profiles (XMDP).

It might look like a “hey, would this work?” idea got an acronym, and maybe that’s true. I’m not really clear on where the GMPG work is going to end up, but at this point it looks more promising to me than anything else (digression: Their homepage sounds like an artist’s statement. “…GMPG efforts provoke optimism and empowerment, nevertheless reawakening criticism of complexity…” Go read the whole thing. I still don’t know whether it’s a joke. If it’s a joke, it’s wicked funny. If they’re serious, it’s wicked funny).

Anyway, I want to mention some of the background for this idea of meta-data profiles. This is not something made up out of whole cloth. HTML has many un-explored possibilities. A famous one is the class attribute, which didn’t get recognition as a meta-data container for a long time, and was just relegated to the role of CSS beast of burden. Similarly, HTML 4 defines the profile attribute, which is — like the mechanisms built into DTD and XML Schema — a means of extending (X)HTML, without changing or extending the doctype. This is exactly what’s needed for extensions like microformats, which embed documents within documents. The profile extension mechanism is a generic hook. When an HTML document defines a meta-data profile, it is really saying “hey, there’s more you can find out about me. Go to such-and-such URI to learn more.” And at the referenced URI can be a meta-data definition, which is completely outside the scope of the HTML specification. This is important, because it means the HTML document’s meaning can be extended arbitrarily, not just in the ways the HTML spec’s authors might have foreseen.

This is the furthest thing from a bolt-on solution. If and when the microformat work progresses to the point it’s needed, that’s exactly how doctypes for microformats could be defined.

How does this solve the problems?

Recall my three problems. I said doctypes are needed so microformats can:

  • be discovered automatically
  • be validated automatically
  • be versioned

If you haven’t already, please go read the parts of the HTML spec I linked in the previous section. Then you will understand a great deal more.

Meta-data profiles already solve the problem of automatic discoverability, so the first point is taken care of. If someone creates doctypes for microformats, the second two can be addressed then. That’s not to say they will be, or will be addressed adequately — for instance, someone could create doctypes without a versioning mechanism, which would be an unfortunate oversight. But it can be done.

Conclusion

There’s no need to be pessimistic about microformats. Sure, not everything is formalized yet, but that’s the way it should be! People are proving that microformats are useful before writing formal specs, which is a great thing. And when the formalities are needed, nothing stands in the way. It’s just a question of when and how it happens, and who does it, and whether they do it well or not.

Technorati Tags:No Tags

You might also like:

  1. A PHP implementation of the XML DOM
  2. How to use meta-data to sort itself
  3. How to use extended properties as documentation with sp_showdoc

Simple and complex types in XML Schema

If you’ve worked much with XML Schema, or tried to read a schema, you’ve probably run into markup that refers to simple and complex types. These terms can be confusing. In this article I’ll explain what they mean in simple terminology, and point you to resources that can help you learn more.

Motivation

Suppose I’m writing code to talk to a web service, and I’ve been getting error messages complaining about something called “email” being malformed when I try to invoke the getPreferences operation. I want to figure out exactly what the service expects me to send it. I open up the WSDL and search until I find the relevant definitions:

<element name="email">
  <simpleType>
    <restriction base="xsd:string" />
  </simpleType>
</element>

<element name="getPreferences">
  <complexType>
    <sequence>
      <element name="email" type="email" />
    </sequence>
  </complexType>
</element>

WSDL is written with XML Schema, so I’m looking at an XML Schema document. But I’ve forgotten exactly what it means. What are simpleType and complexType again? Unless I work with schemas fairly frequently, I get confused about this (even though I’ve worked with schemas for many years). I always have to refresh my memory.

The simple answer

The most succinct answer is as follows:

In XML Schema, there is a basic difference between complex types which allow elements in their content and may carry attributes, and simple types which cannot have element content and cannot carry attributes.

That’s from the XML Schema Primer, which I highly recommend. I think it’s probably the best introduction to XML Schema.

To summarize,

  • simple types can only have content directly contained between the element’s opening and closing tags. They cannot have attributes or child elements.
  • complex types can have attributes, can contain other elements, can contain a mixture of elements and text, etc etc.

Now I know the web service is expecting an element that looks like the following:

<getPreferences>
  <email>john@nowhere.com</email>
</getPreferences>

So why am I confused?

This is pretty simple. Why do I have to refresh my memory whenever I haven’t worked with schemas for a few months? The answer is, the above refers to element type. There’s also something called content type, and some of the names are similar (and therefore confusing). Each element’s content is one of several content types:

  1. A simple type’s content can be one of:
    1. atomic types, which have indivisible values, such as #000 and #AACCDD
    2. list types, which have whitespace-separated lists of indivisible values, such as blue green red
    3. union types, which have either atomic or list values, but they can be the union of other types, such as blue #000 red for a set of colors
  2. Complex types have a “content model,” which refers to how the content (the data between the element’s opening and closing tags) is arranged:
    1. simple content is only character data, no child elements allowed
    2. element-only content is only children, no data allowed
    3. mixed content means character data and child elements can be intermingled
    4. empty content means the element is empty (<foo/>) and either conveys information by just existing, or has attributes but no content.

Just to clarify: elements have an element type, and their content has a content type. By the way, attributes can only have simple types, because they cannot themselves have attributes or children.

For further reading, I again heartily recommend the Primer linked above. Another good resource is Priscilla Walmsley’s Definitive XML Schema. She not only knows her stuff (she’s part of the W3C XML Schema Working Group), but she writes very well.

Technorati Tags:No Tags

You might also like:

  1. Why not to use CSS for columnar layouts
  2. How to display an HTML table as a folder tree
  3. Advanced HTML table features, Part 1
  4. Seldom-used HTML form elements
  5. Don’t change a constant variable

GnuCash to MySQL export script

Download gnucash2mysql and queries

It’s tax season, and this year I decided to export my GnuCash data into a MySQL database for analysis. This article is about that process, including source code for the export script and a couple of simple queries.

The script

GnuCash is a powerful double-entry accounting system for managing personal and/or business finances. I’ve used it since about 2002, and find it adequate for my needs. However, the reporting is widely acknowledged to be one of its weakest points. Reporting is implemented in the Scheme programming language, and if the included reports aren’t enough, most people probably won’t be able to write their own. Personally, I find it much easier (for certain reports) to write SQL queries instead. With that in mind, I set out to write a little script that will insert my GnuCash data into a MySQL database.

It’s actually a fairly simple task; the GnuCash developers chose XML as the file format, so the data is easily accessible from other programs. I decided to export only the account structure and transactions, leaving alone the business features such as customers and invoices.

Before I started, though, I spent some time investigating the built-in PostgreSQL backend. The default data format is XML, but the GnuCash developers also have support for storing the data in a PostgreSQL database. Unfortunately, I couldn’t get it to work. At least in version 1.8.11, it seems to be broken. In fact, the GnuCash website and developer mailing lists indicate this is a low priority for right now, as they are working towards a new version which uses an updated graphical toolkit (GTK+). Apparently the source is high-quality, but a bit out of date since the developers plan to revisit that once the new version is ready.

Satisfied that I’m not reinventing the wheel, I went ahead with a script to export the data myself. Since the XML format is really straightforward, I decided to do the simplest thing I could get to work. Perl seemed like a good choice. I am familiar with expat, and Perl has an expat module, so that was also a natural choice for the XML parsing. The basic idea is to push the data through the script like drinking from a firehose, and when it detects certain things — the start of an element, some character data, the end of an element — call a function to handle the data. Depending on what the data is (what element I’m currently examining, etc) I either store it for later reference, or push it into the database. I also automated creating the necessary table structure in the database.

Access to a MySQL database is required. The script uses tables called account, transaction and split. The DDL used to create the tables is at the end of the Perl script file. I used InnoDB tables so I’d have transaction support.

The script requires these Perl modules: DBI, Term::ProgressBar, and XML::Parser::Expat. You can run it with the --help option to see how to run it.

The queries

These queries assume precision decimal math. Versions of MySQL less than 5.0 use imprecise math. I wrote an article on MySQL and decimal math. If you have any issues with these queries, it may help to consult that article.

This query finds all unbalanced non-equity transactions by summing the splits:

select
    s.amount,
    a.name,
    t.description,
    t.posted
from account as a
    inner join (
        select transaction, sum(amount) as amount, max(account) as account
        from split
        group by transaction
        having sum(amount) <> 0
    ) as s on s.account = a.id
    inner join transaction as t on t.id = s.transaction
where a.type <> 'EQUITY'

This query sums all expenses for 2005 by month and account:

select date_format(posted, '%Y-%m') as month, name, sum(amount) as amount
from transaction as t
    inner join split as s on s.transaction = t.id
    inner join (
        select id, name from account
            where type='EXPENSE'
    ) as a on a.id = s.account
where year(posted) = 2005
group by date_format(posted, '%Y-%m'), name
order by date_format(posted, '%Y-%m'), name;

This query finds average monthly expenditures by account since January 2005:

select @num_months := count(distinct date_format(posted, '%Y-%m'))
    from transaction
    where posted >= '2005-01-01';

select cast(sum(amount) / @num_months as decimal(8,2)) as 'Average monthly amount',
    concat(coalesce(grandparent_name, ''),
        if(grandparent_name is null, '', ' > '),
        coalesce(parent_name, ''),
        if(parent_name is null, '', ' > '),
        name) as name
from (
    select date_format(posted, '%Y-%m') as month,
        a.name,
        aa.name as parent_name,
        aaa.name as grandparent_name,
        sum(amount) as amount
    from transaction as t
        inner join split as s on s.transaction = t.id
        inner join (
            select id, name, parent from account
            where type='EXPENSE'
        ) as a on a.id = s.account
        left outer join account as aa on aa.id = a.parent
        left outer join account as aaa on aaa.id = aa.parent
    where posted >= '2005-01-01'
    group by date_format(posted, '%Y-%m'), a.name
) as x
group by name
order by name;
Technorati Tags:No Tags

You might also like:

  1. More GnuCash to MySQL tools and queries
  2. How to use foreign key cascades in MySQL

More alternatives to openxml

Following up on my earlier post, here are some alternate usage scenarios for the UDF I wrote to avoid openxml in SQL Server 2000.

The most obvious variation is to create different versions to return various data types, such as INTEGER or DATETIME or whatnot. I started to do this, but then realized it’s silly. If I duplicate the code for every datatype, what have I accomplished besides duplicating code? The input is VARCHAR, so it’s getting implicitly converted at some point no matter what. It’s better for me to let the end user do the conversion, and not duplicate any code.

My next thought was a function to split apart name-value pairs, such as width=100 height=200 color=blue. Such a function should return three columns: name, value and ident. Again, once I wrote the code, I felt silly. There are at least two fine ways to do it without a new function. One is to pass the names in one input and the values in another. The other is to use two delimiters and just use SUBSTRING to split them apart. In either case, it’s pretty simple; the UDF is doing the looping, and the rest can be done with standard SQL. Here are two ways to do this:

declare @Names varchar(8000),
    @Values varchar(8000),
    @NameValues varchar(8000),
    @Delim1 char(1),
    @Delim2 char(1)
select @Names = 'width height color',
    @Values = '100 200 blue',
    @NameValues = 'width=100 height=200 color=blue',
    @Delim1 = ' ',
    @Delim2 = '='

select l.word as name, r.word as value, l.ident
from dbo.fn_SplitWords(@Names, @Delim1) as l
    inner join dbo.fn_SplitWords(@Values, @Delim1) as r
        on l.ident = r.ident

select 
    substring(word, 1, charindex(@Delim2, word) - 1) as name,
    substring(word,
        charindex(@Delim2, word) + 1,
        len(word) - charindex(@Delim2, word)) as value,
    ident
from dbo.fn_SplitWords(@NameValues, @Delim1)
where charindex(@Delim2, word) > 0

Both queries yield the same results:

These methods both have a shortcoming: it’s not possible to pass missing or zero-length values for a given name. Here is a query that does:

declare @NameValues varchar(8000),
    @Delim1 char(1),
    @Delim2 char(1)
select @NameValues = 'width=100 height=200 color=blue weight= length',
    @Delim1 = ' ',
    @Delim2 = '='

select 
    case when charindex(@Delim2, word) > 0
        then substring(word, 1, charindex(@Delim2, word) - 1)
        else word end
    as name,
    case when charindex(@Delim2, word) > 0
        then substring(word,
        charindex(@Delim2, word) + 1,
        len(word) - charindex(@Delim2, word)) end
    as value,
    ident
from dbo.fn_SplitWords(@NameValues, @Delim1)

From first to last, each of these queries is more flexible and complex than the preceding one. Therefore I prefer them in that order.

Technorati Tags:No Tags

You might also like:

  1. Efficient alternatives to Microsoft SQL Server’s openxml
  2. A bug in Microsoft SQL Server’s replace() function
  3. How to use meta-data to sort itself

Efficient alternatives to Microsoft SQL Server’s openxml

Microsoft SQL Server’s openxml function is highly CPU-intensive. This article demonstrates more efficient alternatives to openxml.

Why people use openxml

A common usage for openxml is to use XML to pass a “list” or “table” of values into a stored procedure. Here is a scenario I’ve seen a lot at my current job:

  1. An ASP accepts a string as input, such as “ipod car adapter”
  2. The ASP parses the input into space-separated values, then concatenates them together into XML
  3. The ASP passes the XML to a stored procedure
  4. The sproc uses openxml to populate a temporary table from the XML
  5. The sproc does some work (such as searching a product catalog)

This is fine, except openxml can cause a huge CPU spike, which is bad news in a Web-facing search server during holiday shopping season.

I have rewritten such a search sproc to remove openxml, and found it to account for upwards of three-fourths of the total cost, even in an otherwise very expensive sproc. I have anecdotal wisdom from other DBAs about people who over-use openxml and end up with systems that run hot all the time.

A list of simple delimited values doesn’t need XML — it’s overkill. A better alternative is to pass the string directly into the sproc, and write a UDF to return a table with the elements of the string. It’s still string parsing, but it’s better than using XML.

Solution 1: a WHILE loop in a user-defined function

Here is a user-defined function that will split a delimited set of words into a table and return the table:

CREATE function dbo.fn_SplitWords (
    @Words varchar(8000),
    @Delim char(1))
    returns @Words_table table(word varchar(255), ident int identity not null)
as begin

    declare @Wordstart int, @WordEnd int, @DoubleDelim varchar(2)
    
    set @Wordstart = 1
    set @WordEnd = 1
    set @DoubleDelim = replicate(@Delim, 2)

    -- Prepare the data
    while charindex(@DoubleDelim, @Words) > 0
        set @Words = replace(@Words, @DoubleDelim, @Delim)
    -- Left-trim it
    if left(@Words, 1) = @Delim
    begin
        if @Delim = ' '
            set @Words = ltrim(@Words)
        else
            set @Words = substring(
                @Words,
                patindex('%[^' + @Delim + ']%', @Words),
                len(@Words) - patindex('%[^' + @Delim + ']%', @Words) + 1)
    end
    -- Right-pad it
    if right(@Words, 1) <> @Delim
        set @Words = @Words + @Delim

    while @Wordstart > 0
    begin
        -- Extract the next word
        set @WordEnd = charindex(@Delim, @Words, @Wordstart)
        if @WordEnd > @WordStart
        begin
            insert into @Words_table select substring(@Words, @Wordstart, @WordEnd - @Wordstart)
            set @Wordstart = @WordEnd + 1
        end
        else set @Wordstart = 0 -- Terminate the loop
    end
    return
end

Please note the bug I’m avoiding in the while loop above. I explain the SQL Server 2000 replace bug in another post.

Here are some test calls for the UDF:

select * from dbo..fn_SplitWords('this is a test call', ' ')
select * from dbo..fn_SplitWords(' this is a test call', ' ')
select * from dbo..fn_SplitWords('this is a test call ', ' ')
select * from dbo..fn_SplitWords('this-is-a-test-call', '-')
select * from dbo..fn_SplitWords('this is a test      call', ' ')
select * from dbo..fn_SplitWords(' ', ' ')
select * from dbo..fn_SplitWords('', ' ')

After I wrote this, I saw someone else did the same thing elsewhere, though in my opinion very poorly implemented and explained. In any case, here’s a link for the sake of completeness: Treat Yourself to Fn_Split()

A more efficient approach

Instead of using a loop as I did above, it’s actually much more efficient to use an integers table and a JOIN to parse the tokens apart. This approach is slightly less flexible, and doesn’t handle all the special cases I handled above with my UDF, such as tokens being separated by several delimiters instead of just one. Regardless, it is absolutely a better way to go, as long as the input is well-formed. It doesn’t use any nonstandard SQL, either — it’s a relational solution to the problem. Here are three resources where you can learn more about this extremely elegant technique:

Other advantages to a UDF

This approach has other advantages over openxml, too:

  1. it’s easy to string-ify a table for input to another sproc from within SQL (see my article about concatenating strings in SQL for more)
  2. it is much simpler, needs much less code, and is easier to understand and maintain
  3. the UDF can be used directly in a FROM clause; there is no need to create a temporary table or table variable unless you want to store and re-use the values
  4. you can use table variables instead of temporary tables, saving disk I/O and locks in the tempdb database
  5. you can pass more data without running into upper limits on the size of varchar, because space-delimited is much less verbose than XML

I would be remiss if I didn’t mention the downsides:

  1. string-parsing is never efficient, and can be error-prone
  2. you will need to create and maintain UDFs (in my case, I need at least two — one for strings and one for integers)
  3. you have less flexibility about types and schemas; this technique is only convenient for simple cases

More about efficiency

SQL server seems to be smart enough to reuse resources within a query batch, so the high cost of using openxml only seems to happen on the first invocation in a batch. When I benchmarked it with a thousand iterations, the string-parsing solution’s constant cost appeared to be about half the constant cost of openxml — not a significant improvement. However, in the common case where it’s used only once, the string-parsing is much more efficient because there is no startup cost.

All in all, I think string-parsing is the lesser of the evils.

Technorati Tags:No Tags

You might also like:

  1. More alternatives to openxml

How to quote and encode XML attribute values

Attribute values in XML are usually double-quoted, but single-quotes can be used as well, according to the relevant part of the XML Spec. Here is the production:

[10] AttValue ::= '"' ([^<&"] | Reference)* '"'
                  | "'" ([^<&'] | Reference)* "'"

In plain English: an attribute consists of

  • a double or single quote
  • any number of the following:
    • any character but <, & or " OR
    • an entity reference
  • the same character that was used to begin the attribute (double or single quote)

What’s most interesting about this to me is that a < is forbidden inside attribute values, but a > is not. I always assumed both were illegal.

This is why I love reading specs. The XML spec is a great example of clear and terse writing. There is no chance for confusion when reading the productions themselves! Any secondary source can only obscure the matter, in my opinion.

Technorati Tags:No Tags

You might also like:

  1. Advanced HTML table features, Part 2