XML 2.0

Saturday, December 4th, 2010

First, for the record, I’m speaking only for myself, not my employer, the W3C, Apple, Google, Microsoft, WWWAC, the DNRC, the NFL, etc.

XML 1.1 failed. Why? It broke compatibility with XML 1.0 while not offering anyone any features they needed or wanted. It was not synchronous with tools, parsers, or other specs like XML Schemas. This may not have been crippling had anyone actually wanted XML 1.1, but no one did. There was simply no reason for anyone to upgrade.

By contrast, XML did succeed in replacing SGML because:

  1. It was compatible. It was a subset of SGML, not a superset or an incompatible intersection (aside from a couple of very minor technical points no one cared about in practice).
  2. It offered new features people actually wanted.
  3. It was simpler than what it replaced, not more complex.
  4. It put more information into the documents themselves. Documents were more self-contained. You no longer needed to parse a DTD before parsing a document.

To do better, we have to fix these flaws. XML 2.0 should be to XML 1.0 what XML 1.0 was to SGML, not what XML 1.1 was to XML 1.0. That is, it should:

  1. Be compatible with XML 1.0 without upgrading tools.
  2. Add new features lots of folks want (but without breaking backwards compatibility).
  3. Be simpler and more efficient.
  4. Put more information into the documents themselves. You should no longer need to parse a schema to find the types of elements.

These goals feel contradictory, but I plan to show they’re not, and to map out a path forward.
(more…)

In Praise of Draconian Error Handling, Part 2

Friday, June 5th, 2009

The fundamental reason to prefer draconian error handling is that it helps find bugs. I was recently reminded of this when Peter Murray-Rust thought he had found a bug in XOM. In brief, it was refusing to parse some files other tools let slip right through. In fact, XOM’s strict namespace handling had uncovered a cascading series of bugs that had been missed by various other parsers including Xerces-2j and libxml.
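
For readers who have not seen draconian handling in action, here is a minimal sketch using the JAXP parser bundled with the JDK rather than XOM. The class name DraconianDemo and the helper firstError are made up for this illustration, and the sample documents are not the ones from the story above. The point is only that a conforming parser stops at the first well-formedness error instead of guessing what the author meant:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class DraconianDemo {

    // Hypothetical helper: returns null if the text parses cleanly,
    // otherwise the message of the fatal error that stopped the parser.
    static String firstError(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // No partial document is handed back; the parse simply fails.
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed: prints null.
        System.out.println(firstError("<a><b/></a>"));
        // Mismatched tags: prints the parser's complaint.
        // (The default error handler may also echo the error to stderr.)
        System.out.println(firstError("<a><b></a>"));
    }
}
```

A lenient parser would have to invent a closing tag for b somewhere, and whatever it invented would hide the mistake from the person best placed to fix it.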

But before I describe what happened, let’s see if you can eyeball this bug. I’ll make it easier by cutting out the irrelevant parts so you know you’re looking right at the bug. Here’s the instance document we start with:

<!DOCTYPE svg SYSTEM 
"http://www.w3.org/TR/2000/03/WD-SVG-20000303/DTD/svg-20000303-stylable.dtd">
<svg/>

And the referenced DTD is:

<!ENTITY % StylableSVG "INCLUDE" >
<!ENTITY % ExchangeSVG "IGNORE" >
<!ENTITY % SVGNamespace "http://www.w3.org/2000/svg-20000303-stylable" >
<!ENTITY % Shared PUBLIC "-//W3C//DTD SVG 20000303 Shared//EN" "svg-20000303-shared.dtd" >
%Shared;

Then in svg-20000303-shared.dtd we find this:

<!ATTLIST svg
  xmlns CDATA #FIXED "%SVGNamespace;"
  %stdAttrs; >

Not obvious, is it? In fact, I looked at this one for quite a while and consulted several spec documents before Tatu Saloranta figured out what was actually wrong here. If it helps, the relevant part of the XML specification is Section 4.4, XML Processor Treatment of Entities and References.

Give up? OK. Here’s what’s happening:
(more…)

What Version of Xerces are you Using?

Monday, March 31st, 2008

XML developers often find themselves struggling with multiple versions of the Xerces parser for Java which support different, slightly incompatible versions of SAX, DOM, Schemas, and even XML itself. Xerces can be hiding in a number of different places including the classpath, the jre/lib/endorsed directory, and even the JDK itself. Here’s how you can find out which version you actually have.
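
As a rough sketch of one way to check (not necessarily the approach the full post takes), you can ask JAXP which parser class it actually hands you and where that class was loaded from. The class name WhichParser and its helper methods are made up for this illustration; only standard JDK calls are used. The version string comes from the jar manifest, so it may well be missing:

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParserFactory;

public class WhichParser {

    // Hypothetical helper: the concrete XMLReader class JAXP selects at runtime.
    static Class<?> currentSaxReaderClass() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader().getClass();
    }

    // Hypothetical helper: reports a class's name, the version its package
    // declares in the jar manifest (if any), and where it was loaded from.
    static String describe(Class<?> c) {
        Package p = c.getPackage();
        String version = (p == null) ? null : p.getImplementationVersion();
        CodeSource src = c.getProtectionDomain().getCodeSource();
        String location = (src == null || src.getLocation() == null)
                ? "the JDK itself"
                : src.getLocation().toString();
        return c.getName()
                + ", version " + (version == null ? "unknown" : version)
                + ", loaded from " + location;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe(currentSaxReaderClass()));
    }
}
```

A class name beginning with com.sun.org.apache.xerces.internal indicates the copy bundled with the JDK, while org.apache.xerces indicates a standalone Xerces jar, in which case the reported location tells you which jar won.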
(more…)

The State of Native XML Databases

Monday, August 13th, 2007

I’ve recently been asked by several people to summarize the state of native XML databases for those interested in exploring this space. IMHO, native XML databases are now roughly where relational databases were circa 1994: solid, proven technology that gets the job done but only if you pay big bucks to do it. However, there’s some promising open source activity on the horizon. To be brief, there are roughly four (maybe five) choices to consider:

  • Mark Logic
  • eXist
  • DB2 9
  • Berkeley DB XML

(more…)

North and South

Friday, July 6th, 2007

David Chappell writes that

To anybody who’s paying attention and who’s not a hopeless partisan, the war between REST and WS-* is over. The war ended in a truce rather than crushing victory for one side–it’s Korea, not World War II. The now-obvious truth is that both technologies have value, and both will be used going forward.

That’s a nice analogy. Take it one step further though. WS-* is North Korea and REST is South Korea. While REST will go on to become an economic powerhouse with steadily increasing standards of living for all its citizens, WS-* is doomed to sixty+ years of starvation, poverty, tyranny, and defections until it eventually collapses from its own fundamental inadequacies and is absorbed into the more sensible policies of its neighbor to the South.
(more…)