The State of Native XML Databases

I’ve recently been asked by several people to summarize the state of native XML databases for those interested in exploring this space. IMHO, native XML databases are now roughly where relational databases were circa 1994: solid, proven technology that gets the job done but only if you pay big bucks to do it. However, there’s some promising open source activity on the horizon. To be brief, there are roughly four (maybe five) choices to consider:

  • Mark Logic
  • eXist
  • DB2 9
  • Berkeley DB XML

Mark Logic

Mark Logic is the Oracle of native XML databases. Like Oracle, Mark Logic is a network database server that can handle volumes of data no other product can handle and do so reliably. Also like Oracle, Mark Logic will charge you through the nose for this ability since they’re the only ones who can do it.

Mark Logic has achieved some success in the publishing industry. For instance, Safari runs on top of Mark Logic. Projects that can stomach a six-figure price tag should seriously consider it. (Also like Oracle, Mark Logic doesn’t reveal exact pricing until the salespeople play golf with your C-level execs, so I have to guess at the price here. From what I’ve been able to glean, it’s somewhat too expensive for a startup run off of credit card debt and somewhat less expensive than Oracle.)

eXist

eXist is the best current open source native XML database network server, which unfortunately isn’t saying a lot. It’s reasonable for experimentation and small, low-volume, low-throughput projects. You might be able to run a low-to-medium volume blog like this one off it. However it doesn’t really scale into the gigabytes, and is not stable enough for my tastes. Back up frequently to raw XML if you use it. Performance needs some work.

However, eXist is open source and improving. It may be about where MySQL was circa 1997: no real threat to the big boys, yet.

DB2 9

DB2 9 is IBM’s classic relational database, tricked out with XML extensions. This is actually what’s called a hybrid relational-XML database, rather than a pure XML database like eXist or Mark Logic. In DB2 9 you still have SQL and tables. However there’s a new XML type for fields. Data stored in XML fields can be searched with XQuery. This actually proves to be very useful for applications like blogs where there’s both a lot of narrative content and traditional record-like content. I don’t happen to know of any open source hybrid databases.
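To make the hybrid idea concrete, here is a minimal sketch in Python. It is not DB2 and it does not use XQuery: it imitates the layout with SQLite (which has no XML type) and queries the XML column with the limited XPath subset in the standard library's ElementTree. The posts table, its columns, and the sample rows are all invented for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical blog schema: ordinary relational columns plus one column
# holding the post body as XML text. SQLite stores it as plain TEXT;
# a real hybrid database would store and index it as XML.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO posts (author, body) VALUES (?, ?)",
    [
        ("alice", "<post><title>XQuery basics</title>"
                  "<p>Try <code>count()</code>.</p></post>"),
        ("bob", "<post><title>Weekend notes</title>"
                "<p>No code today.</p></post>"),
        ("alice", "<post><title>Indexing XML</title>"
                  "<p>Try <code>distinct-values()</code>.</p></post>"),
    ],
)


def titles_with_code(connection, author):
    """Filter relationally on author, then look inside each XML body."""
    rows = connection.execute(
        "SELECT body FROM posts WHERE author = ? ORDER BY id", (author,)
    )
    titles = []
    for (body,) in rows:
        doc = ET.fromstring(body)
        # './/code' matches a <code> element at any depth below the root.
        if doc.find(".//code") is not None:
            titles.append(doc.findtext("title"))
    return titles


print(titles_with_code(conn, "alice"))
```

The record-like part of the question (which author?) is answered by SQL, and the narrative part (does the body contain a code element?) is answered by a path expression over the document. In a real hybrid database both halves would run inside the server rather than in application code.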

Despite IBM’s growing affection for open source, they have not yet released DB2 as open source, nor am I aware of any plans to do so. There is a demo version suitable for experimentation. For production, it’s as expensive and reliable as DB2 has always been. If you’ve got the budget, it’s definitely worth a serious look. If you’re already a DB2 shop, it’s a no-brainer.

Berkeley DB XML

Sleepycat’s Berkeley DB XML (not to be confused with the old Apache Project dbXML) is an XML layer that sits on top of the proven Berkeley DB database. Early versions were a little weak in both performance and standards support, but the current 2.3 release is much improved in these regards. There’s an active team developing it, and Oracle sells support for this product, so it is probably the most reliable and mature of the open source options. Ironically, it is also the one with which I have the least personal experience.

Sleepycat claims Berkeley DB XML will support up to 256 terabytes, which I suspect is optimistic, but it will go well past the point where eXist keels over in exhaustion.

Unlike eXist, DB2, and Mark Logic, Berkeley DB XML is designed as an embedded database, not a full-fledged network server. You’ll need to write code in PHP, C, Java, or several other languages to access its functionality. Whether this is a bug or a feature depends on your needs.

Other Options

There are a few other products out there I haven’t had the opportunity to look at. Like DB2, Oracle has also added some XML extensions to their product line in the last couple of years. I don’t recommend Oracle to anyone (for non-XML reasons), but if you’re already an Oracle shop, it wouldn’t hurt to check these out.

MySQL 5.1 has added some basic XML functionality for the first time. However XQuery is not supported, and the performance and reliability of these extensions is unproven. I expect it will need a few releases to shake out the bugs. Still, it might be adequate for adding some basic XML support to existing non-mission-critical MySQL applications.

Software AG’s Tamino was one of the first native XML databases. It’s a mature product, and adequate for many applications. Unfortunately it’s been unofficially abandoned by its parent company. No further development work is expected. I recommend you stay away.

DataDirect XQuery is not itself a database. Rather it is an adapter layer that sits on top of your existing payware database such as SQL Server or Oracle and provides an XQuery interface. Why you’d want to use XQuery instead of SQL when talking to a relational database, I’ve never quite been able to fathom. Data Direct XQuery also has adapters for XML files, EDI, and other flat files.

There are a few other open source products out there that are somewhat less mature and reliable than eXist. The Apache Project has released Xindice, but it does not seem to be making rapid progress. As of this writing, it doesn’t even support XQuery, only XPath.

Longer term, I expect big things from the FLWOR Foundation and MXQuery. There are already some alpha quality releases. This could well play the role of Postgres to eXist’s MySQL. However a production quality release is still at least a year away.

Bottom Line

The XML database space is not nearly as mature as the relational database space. The players are still marching onto the field. The game has not even begun. However it promises to be very exciting when it does.

It’s early, but it is now possible to build real applications on top of these products, and several companies have profitably done so. Mark Logic, in particular, has some real customer success stories. Outside of the big publishing houses, though, few developers have yet discovered the power of XML databases and XQuery. The lack of quality open source products has certainly contributed to that, but it’s a minor issue and one that’s being fixed as I type.

The real issue is developer familiarity. Most developers know SQL and tables, at least a little; and few know XQuery and XML. You still hear otherwise intelligent developers make blatantly false statements like “XML is just a file format” or “XML is just a database dump”. It’s actually quite a bit more than that, especially when you go beyond simple XML into the realm of XQuery and the XQuery data model.

The simple truth is that while much data and many applications fit very neatly into tables, even more data doesn’t. Books, encyclopedias, web pages, legal briefs, poetry, and more are not practically normalizable. SQL will continue to rule supreme for accounting, human resources, taxes, inventory management, banking, and other traditional systems where it’s done well for the last twenty years. However, many other applications in fields like publishing have not even had a database backend. It’s not that they didn’t need one. It’s just that the databases of the day couldn’t handle their needs, so content was simply stored in Word files in a file system. It is these applications that are going to be revolutionized by XQuery and XML.
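A small sketch of why narrative content resists tables, again in Python with the standard library's ElementTree rather than a real XML database or XQuery. The chapter fragment and its element names are invented for illustration. The point is that text and markup are interleaved, and both the nesting and the order carry meaning.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a book chapter. Terms and citations sit inside
# running prose, so there is no natural row-and-column shape for it.
chapter = ET.fromstring(
    "<chapter>"
    "<title>Storage</title>"
    "<para>Early systems kept <term>content</term> in files; "
    "see <cite>the manual</cite> for the <term>relational</term> view.</para>"
    "<para>Later, <term>XQuery</term> arrived.</para>"
    "</chapter>"
)


def terms(doc):
    """Return every <term> in document order, however deeply it is nested."""
    return [t.text for t in doc.iter("term")]


def para_text(para):
    """Flatten one paragraph to plain text, keeping the inline content."""
    return "".join(para.itertext())


print(terms(chapter))
print(para_text(chapter.findall("para")[1]))
```

Both questions (which terms appear, and what a paragraph says once the markup is ignored) are one-liners against the tree. Answering them from normalized tables would mean first deciding how to split a sentence into rows.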

If you’re working in publishing, including web publishing, you owe it to yourself to take a serious look at the available XML databases. If they already meet your needs, use them. If not, check back again in a year or two when there’ll be more and better choices. The relational revolution didn’t happen overnight, and the XQuery revolution isn’t going to happen overnight either. However it will happen because for many applications the benefits are just too compelling to ignore.

Recommended Reading

If you’d like to learn more, here are a few good places to start:

43 Responses to “The State of Native XML Databases”

  1. Liam Quin Says:

    The XML Query Home Page at http://www.w3.org/XML/Query/ has a list of XQuery implementations that I try to keep up to date, although it doesn’t say anything much about individual implementations right now. It will say more in a month or so.

    Liam

  2. Howard Katz Says:

    Eliotte,
    To my knowledge (and I actually knew at one time but have since forgotten the exact numbers), MarkLogic is +/- around $60K a seat, so it’s a bit under the 6-figure number you were discussing. There’s a full-featured freebie available w/ a built-in data size limit of something like a meg, so it might be useful if you don’t mind being limited to serving small-sized apps. It’s not open source tho (I got the impression from your opening paragraph that you were going to be talking about open source solutions.)

    Also, while BDB XML is open-source, it’s of the GPL viral type, so that if you plan to do more than serve a single website, or have a desktop-based version of your BDB-XML-based app only on your own machine, you’ll also end up spending the big bucks like MarkLogic (and tho again I’ve forgotten the exact numbers, it’s on the order +/- of half of what ML charges).
    Ta,
    Howard

  3. Martin Probst Says:

    I think you’re missing one option, which is X-Hive/DB (X-Hive is my former employer). It’s a stable product, built in 100% platform independent Java, and it runs and performs very well in really large installations – X-Hive has customers in the aerospace industry that store terabytes of data in it. There is a free trial version available from the website.

  4. Kevin S. Clarke Says:

    How about X-Hive?

  5. Asd Says:

    “MarkLogic is +/- around $60K a seat”
    That is considerably more than what you will actually pay for standard Oracle.

    “I don’t recommend Oracle to anyone (for non-XML reasons)”
    Oh, come on, you can’t just say that and not give us your reasons 🙂

  6. Elliotte Rusty Harold Says:

    Well one reason I don’t recommend Oracle to anyone is the non-transparent pricing. Of course, that affects Mark Logic and DB2 as well. I don’t mind paying for quality, supported software, but I don’t trust software whose price varies depending on how much the salesperson thinks you can pay and how close it is to the end of the quarter.

    Sleepycat is open source, and is perfectly fine for open source applications. For instance, you could build and distribute a GPL’d application such as WordPress on top of Berkeley DB XML without paying Oracle a penny. If you want to horde your own software, then you have to pay Oracle something, but that seems fair to me.

  7. Adrian Says:

    2c on Xindice: spent 6 months trying to love it 2 years ago, in my experience it does NOT scale, and it needs to be embedded in Tomcat. Development community somewhat dead back then, unknown how it is now. Was a shame as until you hit the capacity it was kinda cool to use.

  8. Howard Katz Says:

    Eliotte,
    Has Oracle changed the distribution policy for BDB XML that Sleepycat used to use? It used to be it was free as long as you didn’t distribute it. As soon as you shipped even a single copy, you had to make a choice between (1) open-sourcing all and any of your own code that “touched” the database (hence the viral nature I mentioned), or (2) paying their not-insubstantial licensing fee. Has that changed then?

  9. Josef Meckstroth Says:

    Enterprise Software companies usually don’t publish their price lists, but you can get pricing for those who want to sell to the US Government at https://www.gsaadvantage.gov/ and searching on the product. So the government pays about $25,000 per CPU for the MarkLogic Standard Edition and twice that for Enterprise Edition. Commercial costs would be a bit higher. There is no licensing for clients, which makes sense since the applications can be built on the MarkLogic server and clients may just be using a browser. Their website says, “The Community License is available free of charge, is not time limited and may be used in production for personal projects. It is limited to a single- or dual-CPU server and imposes a data set limit of 100 megabytes. The Community License is restricted to two copies per company.” 100MB seems to be small stuff to them.

  10. Eric J. Bowman Says:

    I have spent the past year trying with no success to get Oracle to do more than just acknowledge that BDBXML does not install 64-bit on Solaris, and actually do something about it. My company spent dozens of man-hours tracking down various errors in the install scripts, but eventually ran into a problem that only the developers can fix. Unfortunately, Oracle just doesn’t give a damn, leading me to believe that their intention in purchasing Sleepycat was less than honorable. BDBXML is no longer under development, this acquisition had more to do with restraint of trade than embracing an open-source business model, IMO, as they have yet to integrate our year-old fixes into the codebase.

  11. George Feinberg Says:

    Sorry for the product-specific response here, but I need to answer this one.

    Eric,

    First, BDB XML is very much under active development; otherwise, I’d have to explain what I do all day and why they pay me, along with the rest of the team. Have you visited the OTN forums lately? It’s very much active. The same is true for the other products from Sleepycat. The teams are intact, nobody left, we’re growing and we’re busy.

    As for your Solaris 64-bit problem, there have definitely been build issues in the past, but without access to that specific machine type, it’s hard to test. I don’t have a specific record of correspondence about this, and don’t know who you’ve been trying to get to adopt fixes, but if you have any that can be applied to BDB XML 2.3.10 (latest release), please point me to them. The best route is to post in the BDB XML OTN forum (i.e. not in this reply trail).

  12. Leon Katsnelson Says:

    regarding DB2 and the comment “For production, it’s as expensive and reliable as DB2 has always been.” I completely agree on reliable but I could not disagree more on “expensive” comment. DB2 Express-C is available for FREE for development, production and even redistribution and the free version does include pureXML. What it does not offer is support. If you do need support then you can purchase DB2 Express-C 12-month License and Subscription from IBM (and if you are building PHP applications you can also purchase it from Zend.com). The price is US$2,995 for a year for a single server. This price is identical to MySQL Enterprise price and can hardly be called “expensive”. If you consider that you also get built in replication and high availability clustering and remote disaster recovery it is an unbelievable bargain. See http://www.ibm.com/db2/express for more info.

  13. Eric J. Bowman Says:

    Sorry to continue in this reply trail, but in our defense the OTN forum was our first stop:

    http://forums.oracle.com/forums/thread.jspa?messageID=1466565

    The lack of any response led me to try other channels through Oracle, to no avail.

  14. Darin McBeath Says:

    Something else which might be helpful when selecting an XQuery DB implementation is understanding what version of the specification has been implemented as well as what vendor proprietary extensions have been added. For example, not all implementations support full-text search (it is currently not part of the W3C Recommendation). Those which do have done this with proprietary vendor extensions. Similarly, not all implementations support the concept of ‘library modules’. Without the concept of a ‘library module’, it becomes fairly difficult to develop serious applications with reusable code.

  15. Eliot Kimber Says:

    We at Really Strategies also wondered why X-Hive was not on this list. ?

  16. Rav Ahuja Says:

    The part about “DB2 9 is IBM’s classic relational database, tricked out with XML extensions” isn’t accurate. The XML support in DB2 is anything but tricked out. In versions prior to 9, DB2 did have XML extensions (called XML Extender) but DB2 9 is architected to be “XML to the core”. Yes, it’s a hybrid data server that supports both relational and XML data, but the XML support is very much native. And IBM even calls it “pureXML” technology. That is, when you store XML in DB2 9, it isn’t converted to relational data structures under the covers. Rather DB2 9 stores it in a parsed hierarchical format. Which means accessing sub-documents or nodes of an XML document using XQuery or XPath can be really fast as there is no run-time parsing or conversion from relational structures to XML format. In addition to the pureXML datatype, DB2 9 offers XML indexing, optional schema validation, support for the common programming interfaces, etc. The free edition of DB2 is called DB2 Express-C – ibm.com/db2/express. If you want to use a database for Web 2.0 IBM has just put out a free package called “Web 2.0 Starter Toolkit for DB2” which includes DB2 Express-C and uses PHP to create atom feeds, web services, etc. with just a few clicks – http://www.alphaworks.ibm.com/tech/web2db2

  17. Sixten Says:

    Since you’re including closed-source, commercial products on this list, it’s probably worth mentioning that Microsoft SQL Server 2005 continues to add to their XML hybrid functionality. It seems that they have XML as a field type, support for XQuery and their own “XML Data Manipulation Language”, and functionality to marshal/unmarshal relational data to XML.

    I haven’t played with it much, so I couldn’t say much about performance and such, but I expect to in the near future.

  18. ReFactor.it » DBMS per XML: lo stato dell’arte Says:

    […] a proposito dei limiti dei database relazionali, ecco un post interessante di Elliotte Rusty Harold in cui, facendo una carrellata sull’offerta dei database […]

  19. Guido Says:

    Did you already have the chance to take a look at TigerLogic XML database?

  20. Robert Walpole Says:

    I think you are a little harsh on eXist there Elliote. We have been using eXist to power the Devon Community Directory (http://www.devonline.gov.uk/community) for almost a year now but I am not familiar with the stability or performance problems you refer to. Ok, I accept it’s not a massive database (just over 8500 XML docs at the last count) but this kind of volume would certainly cover a lot of users’ requirements. We do back up our data frequently, as you suggest, but isn’t that just common sense? To say that eXist “might be able to run a low-to-medium volume blog” well, there is no question in my mind that it could do that and a whole lot more besides!

  21. Unilever Centre for Molecular Informatics, Cambridge - Jim Downing » Blog Archive » XML Databases link Says:

    […] Rusty Harold rounds-up the state of the art on XML databases, concluding: – The XML database space is not nearly as mature as the relational database space. The […]

  22. eds.activemath.org Says:

    Why ActiveMath still has its own content-storage?

    The question of storage of OMDocs keeps coming… why the hell do you, in ActivMath, use your own storage solution for OMDoc fragments and not one of the classical SQL or XML databases? Below is a short answer, helped by Eliotte Rusty Harold.

  23. Pierre Says:

    Why focus on xml “native” db? Isn’t that a bit “academic” to focus on such a narrow market? You want xml presentation, storage, searching through XQuery… why not an add-on on proven relational db? Oracle XML Database has sql performance under the hood and more of xml presentation requirement are there. In addition, it has all of Oracle enterprise features like backup, recovery, integration, monitoring, reliability… Okay, this has a price, but since it is integrated in a much larger and broader (and useful) category of ‘database’, isn’t it a (non-native) xml database to consider?

  24. simply24 » Blog Archive » The State of Native XML Databases Says:

    […] read more here […]

  25. Ray Kelm Says:

    I noticed that the 8.3 beta version of PostgreSQL has xml hybrid support, in the form of an xml datatype, and an xpath function which can be used for searching.

  26. Mark Says:

    MS SQL 2005 includes an XML data type (as noted above) and supports a subset of XQuery 1.0.
    It is able to handle rather complex modular schemas with multiple namespaces. Currently developing intranet app with ColdFusion front end, although asp, jsp etc. would work. Recommend developers check out SQL 2005. Basically our app converts XML docs into a searchable database and can provide management statistics on productivity, subject coverage, since the XML docs include this info. [ColdFusion provides a nifty tag which creates dynamically produced charts with little coding. It also allows non-programmers to work on the look and feel and do prototyping without knowing typical coding syntax].

  27. Charles Foster Says:

    Excellent article, although I think you should have investigated the Sedna XML Database, I believe the database is mature and stable enough to be recognised as a serious option for commercial software projects.

  28. Cláudio Maia Says:

    Hi All,

    I’ve done a recent study about this subject in the past days.
    In our case we need support for .NET and our study was somehow related with open-source DBMS.
    Our research revealed that Sedna XML database was probably the right choice.

    Now, my question is: why hasn’t Sedna been considered in this state of the art? Is there any particular reason?

    Cheers

  29. Andy Vodorez Says:

    I don’t know how popular Sedna is, but I guess Sedna is worth noting.

    Half a year ago we did research to choose open source XML database, and Sedna appeared to be the only working with very large amounts of data. We use Sedna in our project now with up to 30 Gb data.

  30. Glen Pepicelli Says:

    The newest versions of Postgres have an XML datatype. There is an XPath function to select nodes from your XML document. In addition, there are several functions to help construct new XML docs.

    It’s better than nothing– and I’m sure there will be more in the next year or two.

  31. Michael Gesmann Says:

    Software AG continues to enhance Tamino and Tamino 8 is due for release in the coming months.
    In contrast to what is said above, it is neither officially nor unofficially abandoned by its parent company.

  32. Rob Tweed Says:

    Add M/DB:X to the list. This is a new concept: an Open Source REST-interfaced lightweight Native XML Database that is designed for use in the Cloud. See http://www.mgateway.com/mdbx.html for details, and http://www.outoftheslipstream.com/node/168 for some other background.

  33. Elliotte Rusty Harold Says:

    Interesting. However M/DB:X is definitely not REST. M/DB:X uses HTTP, but it follows essentially none of the REST principles.

  34. Rob Tweed Says:

    As you’ll have realised from our documentation, we modelled the HTTP interface on Amazon’s SimpleDB and followed its principles but we applied that style of interface to an XML database, not the spreadsheet-like database of SimpleDB.

    Amazon refer to their interface as REST (http://docs.amazonwebservices.com/AmazonSimpleDB/2009-04-15/DeveloperGuide/index.html?REST_RESTAuth.html), so who am I to argue? 🙂

    Whatever you call it, it’s a simple and effective way to deliver an XML database as a set of web services in the cloud.

  35. Elliotte Rusty Harold Says:

    Amazon is wrong and for the same reasons. Learn from other people’s mistakes, but don’t copy them. HTTP != REST.

  36. Rob Tweed Says:

    I have to say that as someone who builds “back-end” stuff, the “Is it REST or isn’t it REST” arguments seem pretty academic and pedantic. Most web server gateways are entirely dumb and simply pass the HTTP verb through along with the name/value pairs that were delivered to the web server in the HTTP request: there’s no particularly magic meaning applied to the HTTP verb. It therefore really doesn’t matter what verb you use in most circumstances. In the case of M/DB:X and SimpleDB, delivering the name/value pairs to the back end is the important thing and the interpretation of what the request means in terms of a transaction is dealt with by analysing the name/value pairs, not the HTTP verb.

    So the fact that you might be using a GET or POST to perform an edit or delete at the back end is pretty irrelevant except from a purist, religious view of REST. It’s a bit like saying you must only drink wine from cut glass goblets, not plastic cups. If all you’re interested in doing is swigging some wine, who really cares what you drink it from? I really think there are more important things to worry about than “does he use DELETE for a delete or PUT for an edit?” “Does it work?” and “Is it simple to use?” are more important in my book.

    I really don’t want to get in a flame war about this, but for the life of me I can’t see what the fuss is about. OK for the sake of agreement here let’s call it “HTTP-interfaced” rather than REST, but personally I’m going to keep swigging that wine from the same plastic cup as the 800lb Amazon gorilla 🙂

  37. Elliotte Rusty Harold Says:

    There are a lot of reasons to care about the difference between GET and POST (and PUT and DELETE) including the proper behavior of proxy and cache servers, security, repeated client requests, and more. It doesn’t look like it matters until it does, and then it matters a great deal.

  38. Rob Tweed Says:

    Well I suppose if it really was a serious problem then the many high-volume users of SimpleDB would be complaining strongly on the Amazon AWS forums but not a dicky-bird so far to my knowledge. Indeed I suspect that much of Amazon’s logic for sticking to the common GET and POST of the broader web for their “REST interfaces” is that they’re more likely to be handled by intermediate proxies, firewalls, gateways etc in a standard way: if it works for the broader web, it will probably work for them.

  39. Rob Tweed Says:

    The next step, by the way, for M/DB:X is to add an optional JSON interface as an alternative. This makes it pretty interesting (and puts it up against CouchDB): persistent XML DOM-based storage of Javascript Objects, and an ability to cross-convert between XML or JSON in and XML or JSON out. And of course an ability to modify/construct and query the XML DOM realisation of a corresponding Javascript/JSON object.

  40. John Says:

    Currently at this time the answer is simple.

    If you’ve got money, MarkLogic.
    If you haven’t got money, Sedna.

    Job done.

  41. Gene Thomas Says:

    All of this talk about Oracle (not BDB XML), IBM DB2 9 (including express-C), and SQL Server (2005 & 2008) supporting XML is bogus since none of them support the full set of XPath Axes and do not fully support xQuery.

    None of them support following or preceding sibling axes. For our project we considered MarkLogic, X-Hive (now XDb), eXist, and Sedna. We wound up choosing eXist and currently have almost 11,000 xml documents averaging 200K in size.

  42. Peter Gibbon Says:

    That’s incredible Gene, We’ve been working on a problem recently where we couldn’t scale eXist past 300 megabytes of data on a state of the art Sun Blade Server, let alone 2 gigabytes!?! How on earth did you do that?

    Solutions to problems with XML Databases and XQuery generally either store thousands of smaller “record” like XML documents, ranging between 1 to 20k in size, or a few (like 3 or 4) very large XML documents (like 500 meg to 2 gigabytes in size).

    11,000 xml documents averaging 200K in size sounds very much like a number made up out of thin air to me, especially given our experience with eXist.

    When we tried MarkLogic and Sedna, both could handle the data fine. We are still in the decision making process but eXist is kind of an office joke here.

  43. Malcolm Davis Says:

    We have eXist running dozens of simultaneous databases with some databases as large as 1 GB, and a single xml file as large as 500 MB. (Release 1.4.0-rev10440, on Linux)

    eXist seems to perform and scale well (this might be due to the configuration of Java and the OS)

    eXist has some configuration and tuning items that need to be tweaked for scalability/performance. (Some require changes to a build file, and rebuild with Ant)