XML – The Cafes

XML 2.0

Elliotte Rusty Harold — Sat, 04 Dec 2010 12:06:10 +0000

First for the record, I’m speaking only for myself, not my employer, the W3C, Apple, Google, Microsoft, WWWAC, the DNRC, the NFL, etc.

XML 1.1 failed. Why? It broke compatibility with XML 1.0 while not offering anyone any features they needed or wanted. It was not synchronous with tools, parsers, or other specs like XML Schemas. This may not have been crippling had anyone actually wanted XML 1.1, but no one did. There was simply no reason for anyone to upgrade.

By contrast XML did succeed in replacing SGML because:

It was compatible. It was a subset of SGML, not a superset or an incompatible intersection (aside from a couple of very minor technical points no one cared about in practice).
It offered new features people actually wanted.
It was simpler than what it replaced, not more complex.
It put more information into the documents themselves. Documents were more self-contained. You no longer needed to parse a DTD before parsing a document.

To do better we have to fix these flaws. That is, XML 2.0 should be like XML 1.0 was to SGML, not like XML 1.1 was to XML 1.0. That is, it should be:

Compatible with XML 1.0 without upgrading tools.
Add new features lots of folks want (but without breaking backwards compatibility).
Simpler and more efficient.
Put more information into the documents themselves. You no longer need to parse a schema to find the types of elements.

These goals feel contradictory, but I plan to show they’re not and map out a path forward.

The technical requirement is to maintain compatibility. Existing parsers/tools still work. On a few edge cases, an XML 2.0 parser may report slightly different content than an XML 1.0 parser. E.g. more or less white space. However all XML 2.0 documents are well-formed XML 1.0 documents. It will be possible to write an XML 2.0 parser that is faster, simpler, and easier to use than an XML 1.0 parser. However XML 1.0 parsers will still be able to parse XML 2.0 documents. XML 2.0 parsers will not, however, be able to parse all XML 1.0 documents, just as XML parsers can’t parse arbitrary SGML.

The attitudinal requirement is that XML 2.0 be useful. It has to solve real problems for real users. And it has to solve all the problems XML 1.0 solves too.
Tim Bray’s Skunkworks met goals #1, #3, and #4 but didn’t offer #2, so there was never a strong desire to implement it. Languages don’t get replaced simply because there’s something new that does the same thing a little more cleanly or a little better. They get replaced when there’s something new that does the old things better and does new things too.

What we take out

Internal DTD subsets

The internal DTD subset is responsible for much of the complexity and security issues with XML. Get rid of it. XML 2.0 documents cannot contain internal DTD subsets.

Validity

Validity is defined separately. The DOCTYPE declaration, if present, can point to schemas in any language with any validation rules. Perhaps we can use the public identifier to name the type of the schema and the system identifier to point to it. However validation is optional and outside the scope of the spec.

Namespace well-formedness is required

Build in namespaces. Require namespace well-formedness. All namespaces must be absolute URIs.

Neurotic and psychotic documents

All namespace prefixes must be declared on the root element. No prefix may have two different URIs in two different parts of the document. This may require rewriting of namespace prefixes when combining documents, e.g. with XInclude. This is uncommon, but if we have to do it we’ll do it.

Default namespaces may still be declared on any element.
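The root-element rule is easy to check with an ordinary XML 1.0 parser. Here is a minimal SAX sketch, not part of any spec: the class name RootPrefixChecker and its check method are made up for illustration. It lists every non-default prefix declared below the root element. If that list is empty, no prefix can be bound to two different URIs, since one element cannot declare the same prefix twice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports prefixes declared anywhere except the root element.
final class RootPrefixChecker extends DefaultHandler {
    private int depth = 0;
    private final List<String> violations = new ArrayList<>();

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        // SAX reports a mapping just before the startElement it belongs to,
        // so depth == 0 here means the declaration sits on the root element.
        // The empty prefix is the default namespace, which is allowed anywhere.
        if (!prefix.isEmpty() && depth > 0) {
            violations.add(prefix);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    /** Returns the prefixes declared below the root; an empty list means the rule holds. */
    static List<String> check(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            RootPrefixChecker handler = new RootPrefixChecker();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.violations;
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalArgumentException("Could not parse document", ex);
        }
    }
}
```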

CDATA sections

Eliminate CDATA sections. They’re unnecessary syntactic sugar and just lead to confusion among users and extra work for parser implementers. Ideally I’d also like to eliminate the special treatment of the three character sequence ]]>. However that might break existing parsers.

C1 controls

The only reason C1 controls show up in an XML document is because someone has mislabeled a Windows text file. By forbidding these characters we will catch this problem much earlier.

There are likely some other Unicode compatibility characters we should forbid as well.
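To make the C1 check concrete, here is a small sketch. The class and method names are invented for illustration, and I'm assuming "C1 controls" means the range U+0080 through U+009F.

```java
// Hypothetical helper: finds C1 control characters (U+0080 to U+009F) in decoded text.
final class C1Check {
    private C1Check() {}

    /** Returns the index of the first C1 control character, or -1 if there is none. */
    static int firstC1Control(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                return i;
            }
        }
        return -1;
    }
}
```

The typical trigger is a Windows-1252 file labeled as ISO-8859-1. The Windows-1252 byte 0x93, a left curly quote, decodes under ISO-8859-1 to the C1 control U+0093, and a check like this one would flag it immediately.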

DOM and the Infoset

Abandon DOM. Abandon the Infoset. They’re confusing and not what users want or need. Folks can still use them — XML 2.0 is a subset of XML 1.0, after all — but they have no normative standing.

Encourage a variety of different APIs and data models appropriate to their respective domains and languages. However be very clear that the actual text of the document is the normative form. The data model is a representation of the text. The text is not a serialization of the data model.

White space preservation

White space is significant inside an element if and only if xml:space="preserve". Otherwise all consecutive white space is collapsed to a single space.
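For a single run of text the collapsing rule is simple. This is a minimal sketch with made-up names, assuming "white space" means the four XML white space characters (space, tab, carriage return, line feed). It says nothing about the harder question of white space at element boundaries.

```java
// Hypothetical helper: applies the proposed default white space rule to one text run.
final class WhiteSpace {
    private WhiteSpace() {}

    /**
     * Returns the text unchanged when preserve is true (xml:space="preserve" in effect).
     * Otherwise replaces each run of XML white space with a single space.
     * Leading and trailing runs become one space each; nothing is trimmed.
     */
    static String normalize(String text, boolean preserve) {
        if (preserve) {
            return text;
        }
        return text.replaceAll("[ \\t\\r\\n]+", " ");
    }
}
```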

Alternatively, provide a means of identifying elements in which white space should be preserved in the prolog, e.g. through a processing instruction.

There are counter-examples to this–e.g. the HTML pre element–but I think this is what most people want most of the time so it makes sense to make it the default and make white space preservation the one that requires special casing. Some thought is needed to figure out the algorithm though, especially for white space like this

bar

and this

bar/foo>

and this

The quick brown fox jumped over the lazy dog.

Exactly which white space is retained, and to which element is it assigned?

Whatever way we go here, use the same rules for all attribute values. Attribute value normalization was a mistake in XML 1.0 anyway. Drop it.

Most character encodings

Use XML 1.0 name rules but base on Unicode properties for all non-ASCII characters. Provide a means of identifying the Unicode version in use. Default to Unicode 2.0, unless the document declares otherwise. This is the change I’m least interested in, because it may break compatibility, and has no known actual use cases. That is, no one has ever been able to present a document that any actual user wants to produce that cannot be encoded with XML 1.0 name rules.

Require UTF-8 or UTF-16 exclusively. In fact, maybe just require UTF-8. No other encodings are permitted. Use the encoding declaration to identify the Unicode version and recognize XML 2.0? e.g.

Fallback to an XML 1.0 processor if this doesn’t work.

It does feel a little ugly to specify version="1.0" for what I’m calling XML 2.0. However, as long as the document adheres to the XML 1.0 grammar and constraints, this is completely legal. Maybe we shouldn’t even call this a new version of XML but give it a new name. Perhaps YML (because Y comes after X)?
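A UTF-8-only rule is cheap to enforce before parsing even starts. Here is a sketch with invented names that rejects any byte sequence that is not well-formed UTF-8, using the JDK's strict decoder instead of the default one that silently substitutes replacement characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 check for a whole document held in memory.
final class StrictUtf8 {
    private StrictUtf8() {}

    /** Returns true only if every byte sequence in the input is well-formed UTF-8. */
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException ex) {
            return false;
        }
    }
}
```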

standalone declaration

This means nothing in practice. No one relies on it. Get rid of it.

What we add

More entities

Predefine the HTML 4 character entity set. Otherwise eliminate all general entity references. We can make this work with a required system ID that points to a DTD containing the definitions. Of course XML 2.0 parsers will not actually load this DTD, only XML 1.0 parsers will need to load it.

Links

Build in xml:base and xml:id.

Build in some form of simple links sufficient for use of HTML. Perhaps just xml:link or xml:href, nothing fancier. This contains a URL, and is normally considered semantically equivalent to an HTML <a> element. I.e. it’s a blue underlined thing you click on.

Possibly we should even ditch the namespace prefixes. E.g. base, id, and link/href attributes will simply be defined with these semantics.

Data Structures and Types

The biggest lack of XML 1.0 is a standard means of encoding basic data structures and types used in programming: lists, sets, maps, structs, ints, floats, etc. This is why JSON is so popular. It’s not that these things can’t be encoded in XML 1.0, just that there are so many ways they can be encoded, and libraries provide no support for decoding them.

To support this use case, we will allow an optional xml:type or perhaps just type attribute on all elements. The value is a name from a type library such as XSD primitive types. Predefine a basic type library with the minimal types: integer, decimal, string, boolean, date, time. For example,

The default simple type is string. I.e. we can instead write

H647345

We do not want the full set of XSD types. In particular, we do not want float, double, short, int, and long. Integers have arbitrary size, and real numbers are expressed in base 10. Parsers may express these with more or less precision as they choose.
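Here is a rough sketch of how a library might cook the six minimal types into Java objects. Nothing here is settled: the class name and method are invented, the type names are the ones listed above, and I'm assuming XSD-style lexical forms (true/false for boolean, ISO 8601 for date and time). Integers map to BigInteger and decimals to BigDecimal, so neither size nor base 10 precision is lost.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.time.LocalDate;
import java.time.LocalTime;

// Hypothetical helper: converts element text to a Java value based on a type name.
final class SimpleTypes {
    private SimpleTypes() {}

    /** A null type means the attribute was absent, so the default type, string, applies. */
    static Object parse(String type, String text) {
        String name = (type == null) ? "string" : type;
        switch (name) {
            case "string":
                return text;
            case "integer":
                return new BigInteger(text.trim());   // arbitrary size
            case "decimal":
                return new BigDecimal(text.trim());   // exact base 10 value
            case "boolean":
                String value = text.trim();
                if (value.equals("true")) return Boolean.TRUE;
                if (value.equals("false")) return Boolean.FALSE;
                throw new IllegalArgumentException("Not a boolean: " + text);
            case "date":
                return LocalDate.parse(text.trim());  // assumes yyyy-MM-dd
            case "time":
                return LocalTime.parse(text.trim());  // assumes HH:mm[:ss]
            default:
                throw new IllegalArgumentException("Unknown type: " + name);
        }
    }
}
```

Because the input is just an attribute value and element text, code like this can sit on top of any XML 1.0 parser.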

We also want list and map types and maybe set:


  Fred
  Jane
  Bob 



  34.3
  120.0
  3.10

Here I’ve made the keys simply the element names. Possibly with maps we want to allow or require that the keys also be attributes or even elements, which would enable a broader range of key types.
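With element names as keys, decoding a map takes only a few lines of DOM code. This is a sketch, not a proposal for an API: the class name, the prices element, and the type='map' attribute in the usage comment are all hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: reads a map whose keys are the child element names.
final class MapElements {
    private MapElements() {}

    /** Maps each child element's name to its trimmed text. A repeated name overwrites the earlier entry. */
    static Map<String, String> toMap(Element parent) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                result.put(n.getNodeName(), n.getTextContent().trim());
            }
        }
        return result;
    }

    /** Parses a document from a string and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse document", ex);
        }
    }
}

// Usage, with made-up markup:
//   Map<String, String> prices = MapElements.toMap(MapElements.parseRoot(
//       "<prices type='map'><apples>34.3</apples><pears>120.0</pears></prices>"));
```

The overwrite on a repeated name shows the weakness of element-name keys: names must be unique and must be legal XML names.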

We probably want to define some sort of simple type definition that can be used by parsers rather than explicitly specifying the type of each element. E.g.


  34.3
  120.0
  3.10

Parsers and APIs are encouraged to make this content available to clients in a more cooked form appropriate to the programming language rather than as raw strings and nodes. However these types are all advisory, not compulsory. Further note that these types could be further parsed by a library that sits on top of an XML 1.0 parser. An XML 2.0 parser is not required.

TBD: should XML 2.0 parsers treat violations of the constraints on the semantic types (e.g. Bob) as a fatal well-formedness error or a non-fatal validity error? If we just use the type attribute it would have to be the latter to avoid breaking compatibility with existing documents.

The details of the type system remain to be worked out. How do we name and define new types? What does the syntax of a type definition look like? Do we need collections other than list and map? Exactly which primitive types do we predefine? Will the world really let us get away with integers and decimals or will they scream for int and float? Much work here remains to be done. But the basic idea is sound. We don’t need to reinvent the same type annotations for every vocabulary. Just as XML 1.0 improved on SGML by eliminating the freedom to use different syntax to delimit elements and attributes, XML 2.0 will improve on XML 1.0 by eliminating the freedom to use different syntax to denote types. We will not limit which types one can express. We will simply specify a standard form for denoting type information.

What we can’t do

There are a few minor changes I haven’t been able to figure out how to make while maintaining backward compatibility. These include:

Allowing -- to appear in comments
Not making ]]> special
version="2.0"

If you can figure out how to make these compatible, please do let me know. I’m almost willing to compromise on these minor points of backwards compatibility to simplify the parsing process, but I’m not sure that’s wise.

version="2.0" is the trickiest one. XML 1.0 parsers are not required to error out on this, but in practice many do. Perhaps we should drop the XML declaration completely? I.e. any document with an XML declaration is ipso facto not an XML 2.0 document. Instead all XML 2.0 documents will be identified with a specific DOCTYPE:

As mentioned above, the DTD mentioned by the system identifier is a legal XML 1.0 DTD that defines all the HTML entities. XML 2.0 aware processors will ignore this. The public identifier (which may be empty) contains a MIME media type followed by the URL to the actual schema for the document.

The Way Forward

There’s one other reason XML 1.0 succeeded where XML 1.1 failed: XML 1.0 was designed by a small committee of like-minded folks with a common goal. They didn’t always agree on the route, but they were all driving to the same destination. The W3C pretty much ignored them until they were done. By contrast, XML 1.1 was hobbled by a W3C process that took far longer to accomplish much less. If XML 2.0 is to succeed, it needs to follow XML 1.0, not XML 1.1.

A small group of interested folks should convene outside the W3C and write the spec. One month to agree on goals and requirements; one month to write the first draft. Run it up a flagpole and see who salutes.

Step 2 is to write parsers and APIs for the new draft, and gain some implementation experience. Develop a test suite of sample documents. That will take longer, but is necessary. Work the bugs out of the spec. At the point where the goals seem to be satisfied and the spec is reasonably implemented, present it to the world as a fait accompli. If some organization feels like adopting it as a formal standard, that’s fine, though it’s hardly necessary.

The real goal is to take the lessons of the last 12 years of XML, and apply them to create something even better. Who’s with me?

P.S.

If you want to comment, please be aware that you need to escape < as &lt; and > as &gt;. The comments allow basic HTML but aren’t smart enough to distinguish between plain text and real HTML comments.

In Praise of Draconian Error Handling, Part 2

Elliotte Rusty Harold — Fri, 05 Jun 2009 14:21:15 +0000

The fundamental reason to prefer draconian error handling is because it helps find bugs. I was recently reminded of this when Peter Murray-Rust thought he had found a bug in XOM. In brief, it was refusing to parse some files other tools let slip right through. In fact, XOM’s strict namespace handling had uncovered a cascading series of bugs that had been missed by various other parsers including Xerces-2j and libxml.

But before I describe what happened, let’s see if you can eyeball this bug. I’ll make it easier by cutting out the irrelevant parts so you know you’re looking right at the bug. Here’s the instance document we start with:

And the referenced DTD is:





%Shared;

Then in svg-20000303-shared.dtd we find this:

Not obvious, is it? In fact, I looked at this one for quite a while, and consulted several spec documents before Tatu Saloranta figured out what was actually wrong here. If it helps, the relevant part of the XML specification is Section 4.4, XML Processor Treatment of Entities and References.

Give up? OK. Here’s what’s happening:

Although the parameter entity reference %SVGNamespace; appears in a DTD, it appears inside a default attribute value. The parameter entity reference is therefore not resolved until the attribute actually appears in the document. However, parameter entity references are not resolved in document content (including attribute values). There it’s just literal text. In essence this document is:

Although this bug was nigh-on impossible to find by eye, XOM picked it right up. It noticed that Xerces was setting a namespace URI to %SVGNamespace;. XOM didn’t know what the namespace was supposed to be, but it knew that %SVGNamespace; was not a legal absolute URL, and it refused to allow that. Hence the bug was flagged, and we eventually figured out where and how in the DTD the bug really was. And then we figured out that this was an old DTD from a working draft spec, and the tool that was having trouble should be upgraded to the new DTD anyway.

However no other XML library I’m aware of would have caught this because none of them make the check that a namespace URL should be absolute. Technically that’s only good practice, not a strict requirement of the spec. However it is a strict requirement for Infoset based specs such as XInclude and XProc. A document that uses non-absolute URLs does not have an Infoset, and therefore one cannot use XInclude or XProc on them, at least not with any confidence that the results will be consistent from tool to tool because the spec simply does not define how to handle this case.
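For readers who want to add a similar check to their own code, here is a minimal sketch using java.net.URI. This is not XOM's actual implementation, and the class name is invented. It only shows the idea: a namespace name must parse as a URI and must have a scheme.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: tests whether a namespace name is an absolute URI.
final class NamespaceUriCheck {
    private NamespaceUriCheck() {}

    /** Returns true if the string parses as a URI and has a scheme such as http: or urn:. */
    static boolean isAbsoluteUri(String namespace) {
        try {
            return new URI(namespace).isAbsolute();
        } catch (URISyntaxException ex) {
            // Not a URI at all, e.g. a stray % that does not start a hex escape.
            return false;
        }
    }
}
```

The unresolved literal %SVGNamespace; fails this test, while http://www.w3.org/2000/svg passes.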

In practice, I have yet to encounter an XML document that did use a relative namespace URI for any good reason. I have had people request an option not to reject documents that use non-absolute URIs for namespaces, and merely allow any string to be passed in. However, if I accepted that request, then XOM would no longer catch real bugs like this. Furthermore, document producers would have less incentive to generate correct documents. The more error correction tools perform, the more error correction they have to perform. Documents gradually deviate further and further from the spec, and tools try to keep up. However different tools implement different forms and amounts of error correction, until eventually we might as well not have a spec at all.

Instead of participating in this race to the bottom, XOM refuses to play. It will parse namespace well-formed XML documents only. If you want to parse something else, you can’t. Fix the documents or fix the process generating the malformed documents. XOM isn’t changing.

What Version of Xerces are you Using?

Elliotte Rusty Harold — Mon, 31 Mar 2008 14:52:38 +0000

XML developers often find themselves struggling with multiple versions of the Xerces parser for Java which support different, slightly incompatible versions of SAX, DOM, Schemas, and even XML itself. Xerces can be hiding in a number of different places including the classpath, the jre/lib/endorsed directory, and even the JDK itself. Here’s how you can find out which version you actually have.

First, from the command line, try this:

$ java org.apache.xerces.impl.Version
Xerces-J 2.8.0

If that fails:

$ java  org.apache.xerces.impl.Version
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/xerces/impl/Version

Then point the classpath at the jar file you’re loading:

$ java -classpath dtd-xercesImpl.jar  org.apache.xerces.impl.Version
Xerces-J 2.8.0

If you don’t have a Xerces jar in your classpath, then if you’re using Java 5 or later, you likely have the version of Xerces bundled with the JDK. Check its version thusly:

$ java com.sun.org.apache.xerces.internal.impl.Version
Xerces-J 2.6.2

Finally, if you need to find out from within your source code, you invoke the static Version.getVersion() method like so:

String xercesVersion = Version.getVersion();

For comparisons–e.g. to see if you have Xerces version 2.6.1 or later–this simple bit of code will convert most version strings, including Xerces’s, to a decimal value:

    public static double toDouble(String s) {
        try {
            return Double.parseDouble(s);
        }
        catch (NumberFormatException ex) {
            s = removeExtraPeriods(s);
            return Double.parseDouble(s);
        }
    }

    private static String removeExtraPeriods(String s) {
        boolean decimalPointSeen = false;
        StringBuffer result = new StringBuffer(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= '0' && c <= '9') result.append(c);
            else if (c == '.' && !decimalPointSeen) {
                result.append(c);
                decimalPointSeen = true;
            }
        }
        return result.toString();
    }

This covers most version string formats. I’ll leave it as an exercise for the reader to extend it to handle cases like comparing 2.6.2 to 2.6.2b.
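One caveat: squeezing a version into a single decimal misorders two-digit components. With the code above, 2.10.0 becomes 2.1 and 2.9.1 becomes 2.91, so 2.10.0 looks older. Here is one possible sketch of a component-by-component comparison. The class name is invented, and treating a letter suffix such as the b in 2.6.2b as later than no suffix is my assumption, not a Xerces convention.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: compares dotted version numbers found inside version strings.
final class VersionCompare {
    // The first dotted run of digits, plus any letters directly after it (e.g. "2.6.2b").
    private static final Pattern VERSION = Pattern.compile("(\\d+(?:\\.\\d+)+)([A-Za-z]*)");

    private VersionCompare() {}

    /** Returns a negative number, zero, or a positive number, like Comparable.compareTo. */
    static int compare(String a, String b) {
        Matcher ma = match(a);
        Matcher mb = match(b);
        String[] xa = ma.group(1).split("\\.");
        String[] xb = mb.group(1).split("\\.");
        int n = Math.max(xa.length, xb.length);
        for (int i = 0; i < n; i++) {
            // Missing components count as zero, so 2.6 equals 2.6.0.
            int va = i < xa.length ? Integer.parseInt(xa[i]) : 0;
            int vb = i < xb.length ? Integer.parseInt(xb[i]) : 0;
            if (va != vb) {
                return Integer.compare(va, vb);
            }
        }
        // Same numbers: fall back to the letter suffix, where no suffix sorts first.
        return ma.group(2).compareTo(mb.group(2));
    }

    private static Matcher match(String s) {
        Matcher m = VERSION.matcher(s);
        if (!m.find()) {
            throw new IllegalArgumentException("No dotted version number in: " + s);
        }
        return m;
    }
}
```

For example, VersionCompare.compare(Version.getVersion(), "2.6.1") >= 0 tests for Xerces 2.6.1 or later.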

Incidentally, the same tricks work for Xalan. You just have to point to org.apache.xalan.Version instead of org.apache.xerces.impl.Version:

$ java -classpath xalan.jar org.apache.xalan.Version
Xalan Java 2.7.0

The State of Native XML Databases

Elliotte Rusty Harold — Mon, 13 Aug 2007 23:22:50 +0000

I’ve recently been asked by several people to summarize the state of native XML databases for those interested in exploring this space. IMHO, native XML databases are now roughly where relational databases were circa 1994: solid, proven technology that gets the job done but only if you pay big bucks to do it. However, there’s some promising open source activity on the horizon. To be brief, there are roughly four (maybe five) choices to consider:

Mark Logic
eXist
DB2 9
Berkeley DB XML

Mark Logic

Mark Logic is the Oracle of native XML databases. Like Oracle, Mark Logic is a network database server that can handle volumes of data no other product can handle and do so reliably. Also like Oracle, Mark Logic will charge you through the nose for this ability since they’re the only ones who can do it.

Mark Logic has achieved some success in the publishing industry. For instance, Safari runs on top of Mark Logic. Projects that can stomach a six-figure price tag should seriously consider it. (Also like Oracle, Mark Logic doesn’t reveal exact pricing until the salespeople play golf with your C-level execs so I have to guess at the price here. From what I’ve been able to glean, it’s somewhat too expensive for a startup run off of credit card debt and somewhat less expensive than Oracle. )

eXist

eXist is the best current open source native XML database network server, which unfortunately isn’t saying a lot. It’s reasonable for experimentation and small, low-volume, low-throughput projects. You might be able to run a low-to-medium volume blog like this one off it. However it doesn’t really scale into the gigabytes, and is not stable enough for my tastes. Back up frequently to raw XML if you use it. Performance needs some work.

However, eXist is open source and improving. It may be about where MySQL was circa 1997: no real threat to the big boys, yet.

DB2 9

DB2 9 is IBM’s classic relational database, tricked out with XML extensions. This is actually what’s called a hybrid relational-XML database, rather than a pure XML database like eXist or Mark Logic. In DB2 9 you still have SQL and tables. However there’s a new XML type for fields. Data stored in XML fields can be searched with XQuery. This actually proves to be very useful for applications like blogs where there’s both a lot of narrative content and traditional record-like content. I don’t happen to know of any open source hybrid databases.

Despite IBM’s growing affection for open source, they have not yet released DB2 as open source, nor am I aware of any plans to do so. There is a demo version suitable for experimentation. For production, it’s as expensive and reliable as DB2 has always been. If you’ve got the budget, it’s definitely worth a serious look. If you’re already a DB2 shop, it’s a no-brainer.

Berkeley DB XML

Sleepycat’s Berkeley DB XML (not to be confused with the old Apache Project dbXML) is an XML layer that sits on top of the proven Berkeley DB database. Early versions were a little weak in both performance and standards support, but the current 2.3 release is much improved in these regards. There’s an active team developing it, and Oracle sells support for this product, so it is probably the most reliable and mature of the open source options. Ironically, it is also the one with which I have the least personal experience.

Sleepycat claims Berkeley DB XML will support up to 256 terabytes, which I suspect is optimistic, but it will go well past the point where eXist keels over in exhaustion.

Unlike eXist, DB2, and Mark Logic, Berkeley DB XML is designed as an embedded database, not a full-fledged network server. You’ll need to write code in PHP, C, Java, or several other languages to access its functionality. Whether this is a bug or a feature depends on your needs.

Other Options

There are a few other products out there I haven’t had the opportunity to look at. Like DB2, Oracle has also added some XML extensions to their product line in the last couple of years. I don’t recommend Oracle to anyone (for non-XML reasons), but if you’re already an Oracle shop, it wouldn’t hurt to check these out.

MySQL 5.1 has added some basic XML functionality for the first time. However XQuery is not supported, and the performance and reliability of these extensions is unproven. I expect it will need a few releases to shake out the bugs. Still, it might be adequate for adding some basic XML support to existing non-mission-critical MySQL applications.

Software AG’s Tamino was one of the first native XML databases. It’s a mature product, and adequate for many applications. Unfortunately it’s been unofficially abandoned by its parent company. No further development work is expected. I recommend you stay away.

DataDirect XQuery is not itself a database. Rather it is an adapter layer that sits on top of your existing payware database such as SQL Server or Oracle and provides an XQuery interface. Why you’d want to use XQuery instead of SQL when talking to a relational database, I’ve never quite been able to fathom. DataDirect XQuery also has adapters for XML files, EDI, and other flat files.

There are a few other open source products out there that are somewhat less mature and reliable than eXist. The Apache Project has released Xindice, but it does not seem to be making rapid progress. As of this writing, it doesn’t even support XQuery, only XPath.

Longer term, I expect big things from the FLWOR Foundation and MXQuery. There are already some alpha quality releases. This could well play the role of Postgres to eXist’s MySQL. However a production quality release is still at least a year away.

Bottom Line

The XML database space is not nearly as mature as the relational database space. The players are still marching onto the field. The game has not even begun. However it promises to be very exciting when it does.

It’s early, but it is now possible to build real applications on top of these products, and several companies have profitably done so. Mark Logic, in particular, has some real customer success stories. Outside of the big publishing houses, though, few developers have yet discovered the power of XML databases and XQuery. The lack of quality open source products has certainly contributed to that, but it’s a minor issue and one that’s being fixed as I type.

The real issue is developer familiarity. Most developers know SQL and tables, at least a little; and few know XQuery and XML. You still hear otherwise intelligent developers make blatantly false statements like “XML is just a file format” or “XML is just a database dump”. It’s actually quite a bit more than that, especially when you go beyond simple XML into the realm of XQuery and the XQuery data model.

The simple truth is that while much data and many applications fit very neatly into tables, even more data doesn’t. Books, encyclopedias, web pages, legal briefs, poetry, and more are not practically normalizable. SQL will continue to rule supreme for accounting, human resources, taxes, inventory management, banking, and other traditional systems where it’s done well for the last twenty years. However, many other applications in fields like publishing have not even had a database backend. It’s not that they didn’t need one. It’s just that the databases of the day couldn’t handle their needs, so content was simply stored in Word files in a file system. It is these applications that are going to be revolutionized by XQuery and XML.

If you’re working in publishing, including web publishing, you owe it to yourself to take a serious look at the available XML databases. If they already meet your needs, use them. If not, check back again in a year or two when there’ll be more and better choices. The relational revolution didn’t happen overnight, and the XQuery revolution isn’t going to happen overnight either. However it will happen because for many applications the benefits are just too compelling to ignore.

North and South

Elliotte Rusty Harold — Fri, 06 Jul 2007 10:08:24 +0000

David Chappell writes that

To anybody who’s paying attention and who’s not a hopeless partisan, the war between REST and WS-* is over. The war ended in a truce rather than crushing victory for one side–it’s Korea, not World War II. The now-obvious truth is that both technologies have value, and both will be used going forward.

That’s a nice analogy. Take it one step further though. WS-* is North Korea and REST is South Korea. While REST will go on to become an economic powerhouse with steadily increasing standards of living for all its citizens, WS-* is doomed to sixty+ years of starvation, poverty, tyranny, and defections until it eventually collapses from its own fundamental inadequacies and is absorbed into the more sensible policies of its neighbor to the South.

The analogy isn’t as silly as it sounds either. North Korean/Soviet style “communism” fails because it believes massive central planning works better than the individual decisions of millions of economic actors. WS-* fails because it believes massive central planning works better than the individual decisions of millions of web sites. It’s no coincidence that the WS-* community constantly churns out volume after volume of specification and one tool after another. The WS-* community really believes that developers are too stupid to be allowed to manage themselves. Developers have to be told what to do and kept from getting their grubby little hands all over the network protocols because they can’t be trusted to make the right choices.

By contrast you don’t see a lot of complicated REST frameworks or specifications. You could read all the relevant REST specifications in a slow afternoon (mostly the HTTP spec and a couple of subsidiary RFCs, plus XML and Namespaces in XML. Maybe Atom syntax and Atom-pub too if you feel ambitious.). REST/HTTP sets up a simple economic system based on a few clear rules, and then pretty much gets out of the way to let people do their own thing. It doesn’t even get too upset when people break the rules, and start tunneling everything through POST or deleting items with GET. The main people harmed by such bad decisions will be the sites themselves, and they will be dealt with by the RESTful market.

Plain Text Config Files are Confusing

Elliotte Rusty Harold — Mon, 26 Feb 2007 13:39:15 +0000

There’s a large rebellion over XML config files from programmers who don’t like to type XML and don’t want to learn APIs for processing it. They’d rather limp along with the same scanf code they’ve been using for the last 20 years.

The problem is there really isn’t such a thing as a plain text config file. What exists instead are specially formatted text files that are easily as complex as the XML equivalent but inconsistent, poorly documented, and easily broken. For instance, consider this extract from LogValidator’s “plain text” config file:

##  MailFrom : From: address for e-mail output  ##
##
## Unless the relevant option is specified when running the LogValidator,
## the mail output will use ServerAdmin (see above) as From: and To:
## This option allows you to override the From: parameter
## DEFAULT  = ServerAdmin
# MailFrom: logvalidator@example.org

## Title : a more useful Subject: for the Mail output and  for HTML Output ##
##
## Tell the mail/HTML output what this config is all about
## and make them use a better subject than the vanilla "LogValidator results"
## DEFAULT = Logvalidator results
# Title = Logvalidator results

##  [apache] DocumentRoot : where the files are located  ## 
##
## For some log formats, it is necessary to know where the actual files
## reside on the server
DocumentRoot /var/www/
</code></pre>
<p>In particular look at the three fields and their format. In the space of three items, we have</p>
<pre><code>Name: Value
Name = Value
Name Value</code></pre>
<p>This is inconsistent and confusing and seems likely to lead to bugs. Perhaps LogValidator should pick one syntax and stick to it? Better yet, just make this all XML. This really is a classic example of why plain text is not simpler than XML. </p>
<p>This is of course just one format for one Perl program. The next program will be a little different still. You’ll have to write a custom parser to handle it, learn a new syntax to write it, and then remember which one you’re using when.  With XML all your fields are clearly delimited by tags, so the boundaries are obvious. With XML, you get to use the same parser every time.</p>
<p>XML may not be simpler for <em>any</em> one config format. XML is, however, much simpler for <em>all</em> config formats.</p>
<hr />
<p>P.S. In the Ruby community, there’s a movement afoot to use Ruby code as the config format. The JavaScript folks are likewise advocating JavaScript config files. That’s a different issue. In both cases, they’re still advocating single, well-defined syntaxes with standard parsers. They are not advocating plain text config files. To the extent that their data formats are only for Ruby or JavaScript programs respectively, this makes some sense. However, for config formats that need to cross language boundaries, XML is still the best choice.  </p>

</article>
<article>
<h1>Murphy’s Law of Co-occurrence Constraints</h1>
<p>Elliotte Rusty Harold — Wed, 20 Dec 2006 12:35:25 +0000</p>
<p>Co-occurrence constraints are a perennial topic at XML conferences because the usual schema languages (DTDs, W3C Schemas, RELAX NG) can’t handle them. Consequently they’re a fertile source of papers like XML 2006’s keynote from Paolo Marinelli on Co-constraint Validation in a Streaming Context. </p>
<p>However, I mentioned in hallway conversation that I wasn’t sure how common or necessary co-occurrence constraints really were. In fact, I didn’t think I’d ever found one in the real world. Naturally two days later I stumbled across several of them in a very common, very frequent real world example.<br />
<span id="more-162"></span></p>
<p>I was putting together a schema for order information for an online store. I’m sure you’ve seen dozens, probably hundreds, of these. One piece of this is the credit card information, for which a typical element looks like this:</p>
<pre><code><CreditCard>
  <Name>Elliotte Rusty Harold</Name>
  <Number>5123 4567 8901 2345</Number>
  <Type>Mastercard</Type>
  <CVV2>314</CVV2>
  <Expiration>2007-01</Expiration>
  <Address1>6 Metrotech Center</Address1>
  <Address2>Dept. of Computer Science</Address2>
  <City>Brooklyn</City>
  <State>NY</State>
  <Zip>11201</Zip>
</CreditCard></code></pre>
<p>Now imagine we want to validate that. There are actually several co-occurrence constraints just in those ten fields:</p>
<ul>
<li>If the card type is American Express, then the first digit of the card is 3</li>
<li>If the card type is Visa, then the first digit of the card is 4</li>
<li>If the card type is Mastercard, then the first digit of the card is 5</li>
<li>If the card type is American Express, the security code (CVV2) is four digits; otherwise it’s three digits.</li>
</ul>
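<p>In imperative code these rules are only a few lines. The following is a minimal sketch, not part of any schema language: <code>CARD_RULES</code> and <code>checkCoOccurrence</code> are hypothetical names, and <code>card</code> is assumed to be a plain object already holding the text of the <code>Type</code>, <code>Number</code>, and <code>CVV2</code> elements.</p>

```javascript
// Expected leading digit and CVV2 length per card type, copied from the
// rules listed above. A Map avoids accidental matches on inherited keys.
const CARD_RULES = new Map([
  ["American Express", { firstDigit: "3", cvvLength: 4 }],
  ["Visa", { firstDigit: "4", cvvLength: 3 }],
  ["Mastercard", { firstDigit: "5", cvvLength: 3 }],
]);

// Returns a list of violated constraints; an empty list means the card passes.
function checkCoOccurrence(card) {
  const rule = CARD_RULES.get(card.type);
  if (!rule) return ["Unknown card type: " + card.type];

  const errors = [];
  const digits = card.number.replace(/\s+/g, ""); // ignore spacing in the number
  if (digits[0] !== rule.firstDigit) {
    errors.push(card.type + " numbers must start with " + rule.firstDigit);
  }
  const cvvPattern = new RegExp("^\\d{" + rule.cvvLength + "}$");
  if (!cvvPattern.test(card.cvv2)) {
    errors.push(card.type + " security codes must have " + rule.cvvLength + " digits");
  }
  return errors;
}
```

<p>For the sample element above, <code>checkCoOccurrence({ type: "Mastercard", number: "5123 4567 8901 2345", cvv2: "314" })</code> returns an empty list.</p>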
<p>Of course there are also quite a few other things besides co-occurrence constraints we can’t validate in the schema:</p>
<ul>
<li>The credit card is authorized for the purchase.</li>
<li>The expiration date is in the future.</li>
<li>The credit card checksum is correct.</li>
<li>The zip code matches the city and state.</li>
</ul>
<p>I could actually write <a href="http://www-128.ibm.com/developerworks/xml/library/x-custyp/">RELAX NG extension functions</a> to handle the first three, though that might not be the best architecture for the problem. The rule that the zip code must match the city and state is actually a co-occurrence constraint that requires access to external data, and RELAX NG custom type libraries can’t handle that. </p>
<p>Declarative schemas may be a useful tool, and are easier to write than imperative validation code. However, they’re rarely able to check everything you need to check. Schema validation can be the first step in deciding whether to accept a document. It usually shouldn’t be the last.</p>
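<p>As a sketch of what that later step might look like, here are two of the checks from the second list written as ordinary functions. The names <code>passesLuhn</code> and <code>isUnexpired</code> are hypothetical. I am assuming the checksum in question is the usual Luhn (mod 10) check and that a card remains valid through the end of its expiration month.</p>

```javascript
// Luhn (mod 10) checksum: starting from the rightmost digit, double every
// second digit, subtract 9 from any result above 9, and sum everything.
function passesLuhn(cardNumber) {
  const digits = cardNumber.replace(/\s+/g, "");
  if (!/^\d+$/.test(digits)) return false;
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// Expiration is "YYYY-MM" as in the sample element. Assumption: the card is
// still good during the whole of its expiration month.
function isUnexpired(expiration, now = new Date()) {
  const match = /^(\d{4})-(\d{2})$/.exec(expiration);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  if (month < 1 || month > 12) return false;
  const currentYear = now.getFullYear();
  const currentMonth = now.getMonth() + 1; // getMonth() is zero-based
  return year > currentYear || (year === currentYear && month >= currentMonth);
}
```

<p>Neither function needs anything outside the document, but both depend on state or arithmetic that a grammar-based schema has no way to express. The authorization and zip code checks would additionally need a call out to an external service or database.</p>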

</article>
<article>
<h1>RELAX Wins</h1>
<p>Elliotte Rusty Harold — Sun, 26 Nov 2006 15:15:35 +0000</p>
<p>Among the XML cognoscenti, the debate is effectively over. Everyone is choosing RELAX NG as their schema language, and compiling to DTDs or W3C XML Schemas as necessary. I don’t know of a single project in the last couple of years that considered both RELAX NG and W3C Schemas and chose to go with the latter. Certainly, there’ve been a lot of W3C Schema adoptions. However those seem to have been made mostly by people who didn’t know they had a choice. In particular, the W3C imprimatur seems very appealing to larger, more bureaucratic organizations such as government agencies.   </p>
<p>With that in mind, I thought it might be useful to list some of the groups (including some of the W3C’s own working groups) who have chosen to do their work in RELAX NG:<br />
<span id="more-153"></span></p>
<ul>
<li>
<p><a href="http://www.w3.org/TR/xhtml2">XHTML 2.0</a></p>
</li>
<li>
<p><a href="http://www.docbook.org/rng/4.4b2/">DocBook 4.4/5.0</a></p>
</li>
<li>
<p><a href="http://www.w3.org/TR/SVG12/schema.html">SVG 1.2</a></p>
</li>
<li>
<p><a href="http://www.w3.org/TR/sXBL/">XBL</a></p>
</li>
<li>
<p><a href="http://www.oasis-open.org/committees/download.php/6037/office-spec-1.0-cd-1.pdf">OpenOffice</a></p>
</li>
<li>
<p>and many more…</p>
</li>
</ul>
<p>Finally libxml, Linux’s standard XML parser, includes full support for  RELAX NG, but only partial and incomplete support for W3C schemas.</p>
<p>That’s a pretty impressive list, but if the fact that all the cool kids are trying it isn’t enough to get you to take a hit (government bureaucrats aren’t known for being all that hip to the cool kids anyway) then maybe this will. RELAX NG is now an official ISO Standard,  ISO/IEC 19757, Part 2. For people and governments who care about such things, ISO documents are real standards. W3C “recommendations” are also-rans.</p>
<p>Try RELAX NG. I promise it will relieve the stress caused by schemas.  </p>

</article>
<article>
<h1>Why Tim Berners-Lee is Wrong</h1>
<p>Elliotte Rusty Harold — Sun, 29 Oct 2006 13:44:45 +0000</p>
<p>The W3C is finally <a href="http://dig.csail.mit.edu/breadcrumbs/node/166">waking up</a> and realizing they’ve got a problem with HTML. The browser vendors are once again abandoning them and <a href="http://www.whatwg.org/specs/web-apps/current-work/">going their own way</a> (except for Microsoft, which is going in a different direction entirely). The W3C has wisely decided to start listening to Mozilla, Opera, and Apple and revisit classic HTML. Unfortunately, though they realize they have a problem, they haven’t yet realized what the problem is. Berners-Lee seems to think it’s about “quotes around attribute values and slashes in empty tags and namespaces”, and it’s not. </p>
<p>XHTML is <strong>not</strong> the problem. Well-formedness is certainly not the problem. Hell, even namespaces aren’t really the problem although they’re clunky and ugly and everyone hates them. The problem is that the W3C has abandoned HTML for years. HTML hasn’t moved forward since <a href="http://www.w3.org/TR/1999/REC-html401-19991224">1999</a>. No wonder browser vendors are getting antsy.<br />
<span id="more-151"></span></p>
<p>XHTML (1.0 and 1.1) is nothing but a reformulation of HTML. It is a very good reformulation that offers real benefits to developers and authors. However it doesn’t add any significant new functionality. It makes many tasks easier (especially ones that involve machine processing of HTML) but it doesn’t make anything new possible. Nonetheless it’s an unalloyed good thing, and we should keep it. Berners-Lee complains that:</p>
<blockquote><p>The attempt to get the world to switch to XML, including quotes around attribute values and slashes in empty tags and namespaces all at once didn’t work. The large HTML-generating public did not move, largely because the browsers didn’t complain. Some large communities did shift and are enjoying the fruits of well-formed systems, but not all.</p></blockquote>
<p>The simple fact is that it’s hard to change direction on a moving train. It’s even harder to change direction when that train is made up of millions of independent authors and software vendors. It takes years, but guess what? The train is moving. XHTML is winning. More and more pages are being served in valid XHTML, and more and more tools are generating it. We may never get rid of classic HTML in my lifetime, but there’s no reason to give up on XHTML now. </p>
<p>The problem is not now and has never been XHTML or well-formedness. The problem is that the W3C lost interest in improvements to HTML and XHTML. Instead they’ve run off and started work on huge, complicated, monolithic plugin technologies like XForms, MathML, and <acronym>SVG</acronym>, but even these aren’t the problem themselves. Considered individually they’re each useful and practical. The problem is that the W3C stopped worrying about the smaller problems, like how to DELETE a URL with a web form, how to identify a date in a document, or how to log out of a site that uses HTTP authentication. There’s still a lot of room for improvement in classic HTML and XHTML. There are still elements and attributes and attribute values that are simply missing and glaring by their absence.  </p>
<p>The W3C’s mistake was ignoring these little things while it worked on big problems like MathML and <acronym>SVG</acronym>. What’s needed now is not an abandonment of the good work the W3C has done in XForms, SVG, MathML and most especially XHTML. Instead what we need to do is tie up the loose ends. Finish what Tim Berners-Lee started way back in 1989, and make HTML a really solid language for the writing and reading of narrative content. </p>
<p>Then we can make it even more powerful by mixing in XForms, SVG, MathML, MusicXML, and other pieces. However, we can only do this if we keep well-formedness, keep XHTML, and keep namespaces. These are all critical to enabling HTML to expand beyond the narrow confines of newspapers, blogs, personal home pages, and online stores. Otherwise we’ll be condemned to a hell of tag soup and  JavaScript for all eternity; and that is not a fate I wish to experience.</p>

</article>
<article>
<h1>Flipping Slides with JavaScript</h1>
<p>Elliotte Rusty Harold — Fri, 20 Oct 2006 16:35:53 +0000</p>
<p>I’ve been writing my talk notes in XML and delivering them in HTML for years. These days I rarely if ever use PowerPoint. Especially since my talks tend to be quite code heavy, HTML works much better. It’s much easier to put a decent amount of (still legible) source code on an HTML page than a PowerPoint slide, plus I can scroll if I need to. </p>
<p>One of the most common questions I get when I give one of these talks is how I make the slide advance from one to the next by just hitting one key. It’s actually not that hard, but it does surprise people, so I thought I’d show you.<br />
<span id="more-142"></span></p>
<p>First, look at this technique in action. Read <a href="http://www.cafeconleche.org/slides/albany/syndication/03.html" target="_blank">this slide from a recent talk</a>, and then hit the space bar. Notice how you can keep hitting the space bar to move from one slide to the next. </p>
<p>View Source reveals the trick. It’s just one little JavaScript in the <code>head</code>:</p>
<pre><code><script language="JAVASCRIPT"><!-- 
    var isNav = false
        
    if (parseInt(navigator.appVersion) >= 4) {
      if (navigator.appName == "Netscape") {
        isNav = true
        document.captureEvents(Event.KEYPRESS)
      }
      else {
        isNav = false
      }
    }
    document.onkeypress = flipslide

    function flipslide(evt) {
      var key = isNav ? evt.which : window.event.keyCode
      if (key == 32 || key == 29 || key == 30 || key == 11) {
      
        location.href="04.html"
      
      }
      else if (key == 37 || key == 31 || key == 12) {
      
        location.href="02.html"
      
      }
      else if (key == 1) {
        location.href="index.html";
      }
    } //  --></script></code></pre>
<p>Yes, the JavaScript is sort of old-fashioned. I first wrote this in 1999 or so and haven’t really looked at it since. Back then, I tried to also capture the arrow keys to allow me to move backward in the deck with one key too. At the time this was possible with the browser I was using (maybe Netscape 4.0?) but it doesn’t seem to work anymore. I should really figure out how to fix that one of these days.</p>
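<p>For what it’s worth, here is a rough sketch of how the same trick might be written against the standard DOM event model, including the arrow keys. It is not the code my slides use. It assumes a browser that supports <code>addEventListener</code>, the <code>keydown</code> event (which, unlike <code>keypress</code>, generally fires for arrow keys), and <code>KeyboardEvent.key</code>. The <code>targetForKey</code> function and the <code>links</code> object are hypothetical names; the file names mirror the example above.</p>

```javascript
// Map a KeyboardEvent.key value to the URL of the slide to jump to.
// Returns null (or undefined for a missing link) when nothing should happen.
function targetForKey(key, links) {
  switch (key) {
    case " ":
    case "ArrowRight":
    case "PageDown":
      return links.next;
    case "ArrowLeft":
    case "PageUp":
      return links.previous;
    case "Home":
      return links.top;
    default:
      return null;
  }
}

// Browser wiring; skipped when the script runs outside a browser.
if (typeof document !== "undefined") {
  // Hypothetical per-slide values, as the stylesheet would generate them.
  const links = { next: "04.html", previous: "02.html", top: "index.html" };
  document.addEventListener("keydown", (event) => {
    const target = targetForKey(event.key, links);
    if (target) {
      event.preventDefault(); // stop the space bar from also scrolling
      location.href = target;
    }
  });
}
```

<p>Keeping the key-to-URL mapping in its own function means the only per-slide data is the small <code>links</code> object, which is the part the stylesheet has to generate.</p>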
<p>The downside to this code fragment is that each slide does need to know which slide comes next so it can jump to it. If I were manually editing the HTML, that would be painful. Every time I moved a slide I’d have to update all the JavaScript and all the links, so I don’t do that. Instead the raw source is <a href="http://www.cafeconleche.org/slides/albany/syndication/syndication.xml">one XML document</a>, and the <a href="http://www.cafeconleche.org/slides/albany/syndication/slides.xsl">XSLT stylesheet</a> splits that file into individual slides (using a Saxon extension function). As it does so it automatically renumbers all the slides and all the links.</p>
<pre><code> <xsl:document method="html" encoding="ISO-8859-1" 
               href="{format-number(position(),'00')}.html">
   <html>
     <head>
       <meta name="description">
         <xsl:attribute name="content">
           <xsl:apply-templates select="description" mode="meta"/>
         </xsl:attribute>
       </meta>
       <title><xsl:value-of select="title"/>
       ...
       <!-- navigation link sets generated for the slides:
            Next | Top | Cafe con Leche | Cafe au Lait
            Start | Previous | Cafe con Leche | Cafe au Lait
            Previous | Next | Top | Cafe con Leche | Cafe au Lait -->
...</code></pre>

The downside to this approach is that if I add, move, or remove a slide from the deck, the numbers in the URLs of all the other slides change; and cool URLs don’t change. I should probably revise my stylesheet so it dynamically generates the slide names from the content of each slide rather than an autogenerated number.^* The XSLT could still generate the correct next and previous links.

As a side benefit, this would also optimize for search engines since search engines are ridiculously sensitive to the words in the URLs.
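
The naming logic itself is simple. Here is a minimal sketch of it in JavaScript rather than XSLT, purely to show the idea; it is not what my stylesheet does today. The names slugify and buildDeck are hypothetical, and the sketch assumes every slide has a distinct, non-empty title.

```javascript
// Turn a slide title into a URL-friendly name:
// lowercase, runs of other characters become "-", no leading or trailing "-".
function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Given the slide titles in deck order, compute each slide's file name
// along with its previous and next links (null at either end of the deck).
function buildDeck(titles) {
  const names = titles.map((title) => slugify(title) + ".html");
  return names.map((name, i) => ({
    name: name,
    previous: i > 0 ? names[i - 1] : null,
    next: i < names.length - 1 ? names[i + 1] : null,
  }));
}
```

With names derived this way, inserting a slide changes only the links on its two neighbors; every other URL stays put. Two slides with the same title would collide, so a real stylesheet would need some rule for breaking ties.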