Most current XML APIs fall into one of two broad classes: event-based APIs like SAX and XNI or tree-based APIs like DOM, XOM, and JDOM. Most programmers find the tree-based APIs to be easier to use; but such APIs are less efficient, especially with respect to memory usage. An in-memory tree tends to be several times larger than the document it models. Thus tree APIs are normally not practical for documents larger than a few megabytes in size or in memory constrained environments such as J2ME. In these situations, a streaming API such as SAX or XNI is normally preferred. A streaming API uses much less memory than a tree API since it doesn’t have to hold the entire document in memory at the same time. It can process the document in small pieces. Furthermore streaming APIs are fast. They can start generating output from the input almost immediately without waiting for the entire document to be read. They don’t have to build excessively complicated tree- data structures they’ll just pull apart again into smaller pieces. However, the common streaming APIs like SAX are all push APIs. They feed the content of the document to the application as soon as they see it, whether the application is ready to receive that data or not. SAX and XNI are fast and efficient, but the patterns they require programmers to adopt are unfamiliar and uncomfortable to many developers.
Pull APIs are a more comfortable alternative for streaming processing of XML. A pull API is based around the more familiar iterator design pattern rather than the less well-known observer design pattern. In a pull API, the client program asks the parser for the next piece of information rather than the parser telling the client program when the next datum is available. In a pull API the client program drives the parser. In a push API, the parser drives the client.
Just about two years ago, I wrote an article for xml.com discussing what until now has been the preeminent pull API, XMLPULL. This article identified a number of problems with XMLPULL. The last two paragraphs of that article summed up:
These problems are not casual bugs. They are deliberate design decisions, based on a desire to reduce the footprint of XMLPULL to the minimum possible for J2ME environments. None of these problems are likely to be fixed in the future. The trade-offs made in the name of size may be acceptable if you’re working in J2ME. They are completely unacceptable in a desktop or server environment. Thus XMLPULL seems destined to remain a niche API for developers seeking efficiency at all costs.
Nonetheless, there are some interesting ideas here. Most importantly, the problems I’ve identified stem from implementation issues, not from anything fundamental to a pull-based model for XML processing. A future pull-API that learned from XMLPULL’s mistakes could easily become a real alternative to SAX and DOM.
Now it’s two years later, and I am very pleased to report that the next generation API is here. BEA Systems, working in conjunction with Sun, XMLPULL developers Stefan Haustein and Aleksandr Slominski, XML heavyweight James Clark, and others in the Java Community Process are on the verge of releasing StAX, the Streaming API for XML. StAX is a pull parsing API for XML that avoids most of the pitfalls I noted in XMLPULL. XMLPULL was a nice proof of concept. StAX is suitable for real work.
(more…)