Must Ignore vs. Microformats

Wednesday, July 12th, 2006

I tend to assume most people know what they’re talking about, especially if they’re talking about something I don’t really understand. Sometimes it takes a really blatant example of just what it is they’re saying before I realize they’re talking out of their posteriors.

For instance, I used to think homeopathy was a vaguely reasonable practice based on traditional herbal medicine. Then one day I was stuck at the pharmacist for fifteen minutes waiting for a prescription. Since I had nothing better to do, I picked up a pamphlet about the principles of homeopathy and started to read. Almost immediately it became clear that there was nothing in the little glass vials except plain water, that there was no possible way any of these “remedies” could do anything except through the placebo effect, and that the whole field was complete and utter bunk.

It’s important to note here that I didn’t read some detailed scientific study about homeopathy. I didn’t read an article in the Skeptical Inquirer debunking homeopathy. I read a really well-written piece by an advocate of homeopathy that explained exactly what homeopathy was and why they thought it worked; and that clear explanation showed me (or anyone with a layperson’s understanding of chemistry) that homeopathy was completely bogus. I have recently had the same experience with microformats.

Please Sir. Can I have some more XML?

Friday, December 9th, 2005

Here’s some code I had to write this morning. This isn’t all of it, and it isn’t done yet:


A Brief Introduction to XInclude

Wednesday, December 8th, 2004

It’s often convenient to divide long XML documents into multiple files. The classic example is a book, customarily divided into chapters. Each chapter may be further subdivided into sections. Traditionally this has been implemented via external entity references. For example,

<?xml version="1.0"?>
<!DOCTYPE book SYSTEM "book.dtd" [
  <!ENTITY chapter1 SYSTEM "malapropisms.xml">
  <!ENTITY chapter2 SYSTEM "mispronunciations.xml">
  <!ENTITY chapter3 SYSTEM "madeupwords.xml">
]>
<book>
  <title>The Wit and Wisdom of George W. Bush</title>
  &chapter1;
  &chapter2;
  &chapter3;
</book>

However, external entity references have a number of limitations. Among them:

  • The individual component files cannot be treated in isolation. They often aren’t themselves full, well-formed XML documents. They cannot have document type declarations.

  • The document must have a DTD, and the parser must read the DTD. Not all parsers do.

  • If any of the pieces are missing, then the entire document is malformed. There’s no option for error recovery.

  • Only entire files can be included. You can’t include just one paragraph from a document.

  • There’s no way to include unparsed text such as an example Java program or XML document in a technical book. Only well-formed XML can be included, and all such XML is parsed. (SGML actually had this ability, but it was one of the features XML removed in the process of simplification.)

XInclude is an emerging specification from the W3C that endeavors to create a mechanism for building large XML documents out of their component parts which does not have these limitations. XInclude can combine multiple documents and parts thereof independently of validation. Each piece can be a complete XML document, a part of an XML document, or a non-XML text document like a Java program or an e-mail message.
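
As a rough illustration of the three kinds of inclusion just described, here is a minimal Java sketch that uses the XInclude support built into the JDK's JAXP parser (`DocumentBuilderFactory.setXIncludeAware`, available since Java 5). Several things here are assumptions rather than anything from the text above: the namespace URI is the one from the final XInclude 1.0 Recommendation and may differ from earlier drafts, and the files `book.xml`, `Example.java`, and `missing.xml`, along with the `chapter` and `listing` element names, are hypothetical. The sketch includes a whole XML document, a plain-text file that is not parsed as markup, and a missing file that is replaced by fallback content instead of failing the parse.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XIncludeSketch {

    // Writes a small book and its parts to a temp directory, then parses
    // the book with XInclude processing switched on.
    static Document loadBook() throws Exception {
        Path dir = Files.createTempDirectory("xinclude-sketch");

        // A complete, well-formed XML document that can also stand alone.
        write(dir.resolve("malapropisms.xml"),
            "<?xml version=\"1.0\"?>\n"
          + "<chapter><title>Malapropisms</title></chapter>");

        // A non-XML file (hypothetical); its < and & must not be parsed as markup.
        write(dir.resolve("Example.java"), "if (a < b && b < c) { }");

        // The master document. missing.xml is deliberately never created.
        write(dir.resolve("book.xml"),
            "<?xml version=\"1.0\"?>\n"
          + "<book xmlns:xi=\"http://www.w3.org/2001/XInclude\">\n"
          + "  <title>The Wit and Wisdom of George W. Bush</title>\n"
          + "  <xi:include href=\"malapropisms.xml\"/>\n"
          + "  <listing><xi:include href=\"Example.java\" parse=\"text\"/></listing>\n"
          + "  <xi:include href=\"missing.xml\">\n"
          + "    <xi:fallback><chapter><title>Not yet written</title></chapter></xi:fallback>\n"
          + "  </xi:include>\n"
          + "</book>");

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);  // xi:include is recognized by its namespace
        factory.setXIncludeAware(true);   // inclusion is off by default
        DocumentBuilder builder = factory.newDocumentBuilder();

        // Parsing from a File gives the document a base URI, so the
        // relative href values resolve against the temp directory.
        return builder.parse(dir.resolve("book.xml").toFile());
    }

    private static void write(Path file, String content) throws Exception {
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        Document book = loadBook();
        System.out.println("chapters: "
            + book.getElementsByTagName("chapter").getLength());
        System.out.println("listing:  "
            + book.getElementsByTagName("listing").item(0).getTextContent());
    }
}
```

No DTD is involved, and the parser may print a warning about the missing file before it uses the fallback. The merged tree should contain two `chapter` elements (one included, one from the fallback) and no remaining `xi:include` elements.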



Monday, November 29th, 2004

Most current XML APIs fall into one of two broad classes: event-based APIs like SAX and XNI or tree-based APIs like DOM, XOM, and JDOM. Most programmers find the tree-based APIs to be easier to use; but such APIs are less efficient, especially with respect to memory usage. An in-memory tree tends to be several times larger than the document it models. Thus tree APIs are normally not practical for documents larger than a few megabytes in size or in memory-constrained environments such as J2ME. In these situations, a streaming API such as SAX or XNI is normally preferred. A streaming API uses much less memory than a tree API since it doesn’t have to hold the entire document in memory at the same time. It can process the document in small pieces. Furthermore, streaming APIs are fast. They can start generating output from the input almost immediately without waiting for the entire document to be read. They don’t have to build excessively complicated tree data structures they’ll just pull apart again into smaller pieces. However, the common streaming APIs like SAX are all push APIs. They feed the content of the document to the application as soon as they see it, whether the application is ready to receive that data or not. SAX and XNI are fast and efficient, but the patterns they require programmers to adopt are unfamiliar and uncomfortable to many developers.
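
To make the push style concrete, here is a small Java sketch that uses the SAX classes shipped with the JDK to collect the text of every `title` element. The class name `PushSketch`, the `TitleHandler` helper, and the sample document are assumptions made up for illustration. The point to notice is that the parser decides when each callback fires, so the handler has to carry its own state (the `current` field) from one call to the next.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PushSketch {

    // Observer-style handler: the parser calls these methods as it reads.
    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder current;  // non-null only while inside a <title>

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("title".equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one run of text in several calls.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(current.toString());
                current = null;
            }
        }
    }

    static List<String> titles(String xml) throws Exception {
        TitleHandler handler = new TitleHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<book><title>The Wit and Wisdom</title>"
                   + "<chapter><title>Malapropisms</title></chapter></book>";
        System.out.println(titles(xml));
    }
}
```

Nothing is held in memory beyond the titles themselves, which is the efficiency argument made above. The cost is that the logic is spread across three callbacks.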

Pull APIs are a more comfortable alternative for streaming processing of XML. A pull API is based around the more familiar iterator design pattern rather than the less well-known observer design pattern. In a pull API, the client program asks the parser for the next piece of information rather than the parser telling the client program when the next datum is available. In a pull API the client program drives the parser. In a push API, the parser drives the client.
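
For comparison, the following sketch performs the same kind of task (collecting `title` text) in the pull style. It uses the `javax.xml.stream` package that has shipped with the JDK since Java 6, which is the form StAX eventually took; the class name `PullSketch` and the sample document are assumptions for illustration. Here the client owns the loop and asks for the next event when it is ready, much as it would with an iterator.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class PullSketch {

    static List<String> titles(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        List<String> titles = new ArrayList<>();
        try {
            // The client drives the parser: nothing is read until next() is called.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag.
                    // It throws if the element contains child elements,
                    // so this sketch assumes text-only titles.
                    titles.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<book><title>The Wit and Wisdom</title>"
                   + "<chapter><title>Malapropisms</title></chapter></book>";
        System.out.println(titles(xml));
    }
}
```

Because the loop belongs to the client, stopping early is just a `break` or `return`, and no state has to be threaded through callbacks.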

Just about two years ago, I wrote an article discussing what until now has been the preeminent pull API, XMLPULL. This article identified a number of problems with XMLPULL. The last two paragraphs of that article summed up:

These problems are not casual bugs. They are deliberate design decisions, based on a desire to reduce the footprint of XMLPULL to the minimum possible for J2ME environments. None of these problems are likely to be fixed in the future. The trade-offs made in the name of size may be acceptable if you’re working in J2ME. They are completely unacceptable in a desktop or server environment. Thus XMLPULL seems destined to remain a niche API for developers seeking efficiency at all costs.

Nonetheless, there are some interesting ideas here. Most importantly, the problems I’ve identified stem from implementation issues, not from anything fundamental to a pull-based model for XML processing. A future pull-API that learned from XMLPULL’s mistakes could easily become a real alternative to SAX and DOM.

Now it’s two years later, and I am very pleased to report that the next generation API is here. BEA Systems, working in conjunction with Sun, XMLPULL developers Stefan Haustein and Aleksandr Slominski, XML heavyweight James Clark, and others in the Java Community Process are on the verge of releasing StAX, the Streaming API for XML. StAX is a pull parsing API for XML that avoids most of the pitfalls I noted in XMLPULL. XMLPULL was a nice proof of concept. StAX is suitable for real work.