I’ve recently been asked by several people to summarize the state of native XML databases for those interested in exploring this space. IMHO, native XML databases are now roughly where relational databases were circa 1994: solid, proven technology that gets the job done but only if you pay big bucks to do it. However, there’s some promising open source activity on the horizon. To be brief, there are roughly four (maybe five) choices to consider:
- Mark Logic
- eXist
- DB2 9
- Berkeley DB XML
Mark Logic is the Oracle of native XML databases. Like Oracle, Mark Logic is a network database server that can handle volumes of data no other product can handle and do so reliably. Also like Oracle, Mark Logic will charge you through the nose for this ability since they’re the only ones who can do it.
Mark Logic has achieved some success in the publishing industry. For instance, Safari runs on top of Mark Logic. Projects that can stomach a six-figure price tag should seriously consider it. (Also like Oracle, Mark Logic doesn’t reveal exact pricing until the salespeople play golf with your C-level execs, so I have to guess at the price here. From what I’ve been able to glean, it’s somewhat too expensive for a startup run off of credit card debt and somewhat less expensive than Oracle.)
eXist is the best current open source native XML database network server, which unfortunately isn’t saying a lot. It’s reasonable for experimentation and small, low-volume, low-throughput projects. You might be able to run a low-to-medium volume blog like this one off it. However it doesn’t really scale into the gigabytes, and is not stable enough for my tastes. Back up frequently to raw XML if you use it. Performance needs some work.
However, eXist is open source and improving. It may be about where MySQL was circa 1997: no real threat to the big boys, yet.
DB2 9 is IBM’s classic relational database, tricked out with XML extensions. This is actually what’s called a hybrid relational-XML database, rather than a pure XML database like eXist or Mark Logic. In DB2 9 you still have SQL and tables. However, there’s a new XML type for fields, and data stored in XML fields can be searched with XQuery. This proves to be very useful for applications like blogs where there’s both a lot of narrative content and traditional record-like content. I don’t happen to know of any open source hybrid databases.
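To make the hybrid idea concrete, here is a rough sketch in Python. It is emphatically not DB2: SQLite has no XML column type and no XQuery, so the XML sits in a plain TEXT column and the path query runs in Python via ElementTree. The table, column names, and sample posts are all made up for illustration. The point is only the division of labor: record-like fields get filtered relationally, and the narrative part gets queried as XML.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Stand-in for a hybrid store (assumption: SQLite + ElementTree, not DB2).
# Record-like data goes in ordinary columns; narrative content is XML text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, published TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [
        (1, "alice", "2007-01-10",
         "<post><title>XML databases</title><p>Notes on <term>XQuery</term>.</p></post>"),
        (2, "bob", "2007-02-03",
         "<post><title>Tables</title><p>Plain <term>SQL</term> only.</p></post>"),
        (3, "alice", "2007-03-15",
         "<post><title>Hybrids</title><p>Mixing <term>SQL</term> and <term>XQuery</term>.</p></post>"),
    ],
)


def titles_mentioning(conn, author, term):
    """Filter relationally on author, then look inside each XML body."""
    rows = conn.execute(
        "SELECT body FROM posts WHERE author = ? ORDER BY published", (author,)
    )
    titles = []
    for (body,) in rows:
        doc = ET.fromstring(body)
        # Path query over the narrative content: every <term> at any depth.
        terms = [t.text for t in doc.findall(".//term")]
        if term in terms:
            titles.append(doc.findtext("title"))
    return titles


print(titles_mentioning(conn, "alice", "XQuery"))
```

In a real hybrid database both halves would presumably be expressed in one query and executed by the server, with indexes on the XML; here the second half is done by hand in application code.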
Despite IBM’s growing affection for open source, they have not yet released DB2 as open source, nor am I aware of any plans to do so. There is a demo version suitable for experimentation. For production, it’s as expensive and reliable as DB2 has always been. If you’ve got the budget, it’s definitely worth a serious look. If you’re already a DB2 shop, it’s a no-brainer.
Berkeley DB XML
Sleepycat’s Berkeley DB XML (not to be confused with the old Apache Project dbXML) is an XML layer that sits on top of the proven Berkeley DB database. Early versions were a little weak in both performance and standards support, but the current 2.3 release is much improved in these regards. There’s an active team developing it, and Oracle sells support for this product, so it is probably the most reliable and mature of the open source options. Ironically, it is also the one with which I have the least personal experience.
Sleepycat claims Berkeley DB XML will support up to 256 terabytes, which I suspect is optimistic, but it will go well past the point where eXist keels over in exhaustion.
Unlike eXist, DB2, and Mark Logic, Berkeley DB XML is designed as an embedded database, not a full-fledged network server. You’ll need to write code in PHP, C, Java, or several other languages to access its functionality. Whether this is a bug or a feature depends on your needs.
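If the embedded-versus-server distinction is unfamiliar, the following Python sketch may help. It does not use Berkeley DB XML or its API at all; it uses `dbm.dumb` from the Python standard library as a stand-in embedded key-value store, with hypothetical keys and documents. What it shows is the shape of the thing: the database is a library call inside your own process, with no server to start and no socket to connect to.

```python
import dbm.dumb
import os
import tempfile
import xml.etree.ElementTree as ET


def store_and_query(directory):
    """Write two XML documents to an embedded store, then read them back."""
    path = os.path.join(directory, "docs")

    # Opening the store is an ordinary in-process call: no server, no network.
    with dbm.dumb.open(path, "c") as db:
        db["post-1"] = "<post><title>First</title></post>"
        db["post-2"] = "<post><title>Second</title></post>"

    # Reopen read-only, as a later run of the same application might.
    with dbm.dumb.open(path, "r") as db:
        # Values come back as bytes; ElementTree parses bytes directly.
        return sorted(ET.fromstring(db[key]).findtext("title") for key in db.keys())


with tempfile.TemporaryDirectory() as tmp:
    titles = store_and_query(tmp)

print(titles)
```

The flip side is visible too: anything beyond get and put, such as the query over titles, is code you write yourself in the host language.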
There are a few other products out there I haven’t had the opportunity to look at. Like DB2, Oracle has also added some XML extensions to their product line in the last couple of years. I don’t recommend Oracle to anyone (for non-XML reasons), but if you’re already an Oracle shop, it wouldn’t hurt to check these out.
MySQL 5.1 has added some basic XML functionality for the first time. However, XQuery is not supported, and the performance and reliability of these extensions are unproven. I expect it will need a few releases to shake out the bugs. Still, it might be adequate for adding some basic XML support to existing non-mission-critical MySQL applications.
Software AG’s Tamino was one of the first native XML databases. It’s a mature product, and adequate for many applications. Unfortunately it’s been unofficially abandoned by its parent company. No further development work is expected. I recommend you stay away.
DataDirect XQuery is not itself a database. Rather, it is an adapter layer that sits on top of your existing payware database such as SQL Server or Oracle and provides an XQuery interface. Why you’d want to use XQuery instead of SQL when talking to a relational database, I’ve never quite been able to fathom. DataDirect XQuery also has adapters for XML files, EDI, and other flat files.
There are a few other open source products out there that are somewhat less mature and reliable than eXist. The Apache Project has released Xindice, but it does not seem to be making rapid progress. As of this writing, it doesn’t even support XQuery, only XPath.
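For readers wondering why XPath-only support is a real limitation, here is a small Python sketch of the difference. It runs no XQuery engine; ElementTree supplies a limited XPath subset, and ordinary Python stands in for the parts of an XQuery FLWOR expression that XPath alone lacks. The catalog, element names, and the `cheap_books_by_year` function are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog, used only for illustration.
catalog = ET.fromstring("""
<catalog>
  <book year="2004"><title>Book A</title><price>45</price></book>
  <book year="2002"><title>Book B</title><price>55</price></book>
  <book year="2007"><title>Book C</title><price>40</price></book>
</catalog>
""")

# XPath-style selection: pick existing nodes out of the tree, in document order.
all_titles = [t.text for t in catalog.findall("./book/title")]


def cheap_books_by_year(root, max_price):
    """Rough analogue of a FLWOR expression: for / where / order by / return."""
    # for + where: iterate over the books and keep the affordable ones.
    selected = [b for b in root.findall("book") if int(b.findtext("price")) <= max_price]
    # order by: sorting is something plain XPath selection does not offer.
    selected.sort(key=lambda b: b.get("year"))
    # return: construct brand-new XML rather than returning existing nodes.
    result = ET.Element("cheap")
    for b in selected:
        item = ET.SubElement(result, "item", year=b.get("year"))
        item.text = b.findtext("title")
    return result


print(all_titles)
print(ET.tostring(cheap_books_by_year(catalog, 45), encoding="unicode"))
```

With an XPath-only database, everything inside `cheap_books_by_year` after the initial selection has to live in your application code instead of in the query.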
Longer term, I expect big things from the FLWOR Foundation and MXQuery. There are already some alpha quality releases. This could well play the role of Postgres to eXist’s MySQL. However a production quality release is still at least a year away.
The XML database space is not nearly as mature as the relational database space. The players are still marching onto the field. The game has not even begun. However it promises to be very exciting when it does.
It’s early, but it is now possible to build real applications on top of these products, and several companies have profitably done so. Mark Logic, in particular, has some real customer success stories. Outside of the big publishing houses, though, few developers have yet discovered the power of XML databases and XQuery. The lack of quality open source products has certainly contributed to that, but it’s a minor issue and one that’s being fixed as I type.
The real issue is developer familiarity. Most developers know SQL and tables, at least a little; and few know XQuery and XML. You still hear otherwise intelligent developers make blatantly false statements like “XML is just a file format” or “XML is just a database dump”. It’s actually quite a bit more than that, especially when you go beyond simple XML into the realm of XQuery and the XQuery data model.
The simple truth is that while much data and many applications fit very neatly into tables, even more data doesn’t. Books, encyclopedias, web pages, legal briefs, poetry, and more are not practically normalizable. SQL will continue to reign supreme for accounting, human resources, taxes, inventory management, banking, and other traditional systems where it’s done well for the last twenty years. However, many other applications in fields like publishing have not even had a database backend. It’s not that they didn’t need one. It’s just that the databases of the day couldn’t handle their needs, so content was simply stored in Word files in a file system. It is these applications that are going to be revolutionized by XQuery and XML.
If you’re working in publishing, including web publishing, you owe it to yourself to take a serious look at the available XML databases. If they already meet your needs, use them. If not, check back again in a year or two when there’ll be more and better choices. The relational revolution didn’t happen overnight, and the XQuery revolution isn’t going to happen overnight either. However, it will happen because for many applications the benefits are just too compelling to ignore.
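A tiny example may show why narrative content resists normalization. The Python sketch below parses one made-up paragraph from a hypothetical legal brief (the element names and case names are invented). The paragraph is an ordered interleaving of running text and inline markup, and both the order and the nesting carry meaning, which is awkward to capture as rows and columns.

```python
import xml.etree.ElementTree as ET

# One paragraph of mixed content: text and inline elements interleaved.
brief = ET.fromstring(
    "<brief><para>The court in <cite>Case A</cite> held otherwise, "
    "but <cite>Case B</cite> later <em>narrowed</em> that ruling.</para></brief>"
)
para = brief.find("para")

# The readable sentence is spread across text nodes and child elements.
plain_text = "".join(para.itertext())

# The same tree also answers structural questions about the prose.
citations = [c.text for c in para.iter("cite")]

# Each child carries its own text plus the text that follows it (its tail);
# shuffle or flatten these pieces and the sentence falls apart.
pieces = [(child.tag, child.text, child.tail) for child in para]

print(plain_text)
print(citations)
print(pieces)
```

You could shred this into a table of fragments with sequence numbers, but every query would then have to reassemble the document first, which is more or less the job an XML database does for you.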
If you’d like to learn more, here are a few good places to start: