XML 2.0

First for the record, I’m speaking only for myself, not my employer, the W3C, Apple, Google, Microsoft, WWWAC, the DNRC, the NFL, etc.

XML 1.1 failed. Why? It broke compatibility with XML 1.0 while not offering anyone any features they needed or wanted. It was not synchronous with tools, parsers, or other specs like XML Schemas. This may not have been crippling had anyone actually wanted XML 1.1, but no one did. There was simply no reason for anyone to upgrade.

By contrast XML did succeed in replacing SGML because:

It was compatible. It was a subset of SGML, not a superset or an incompatible intersection (aside from a couple of very minor technical points no one cared about in practice)
It offered new features people actually wanted.
It was simpler than what it replaced, not more complex.
It put more information into the documents themselves. Documents were more self-contained. You no longer needed to parse a DTD before parsing a document.

To do better we have to fix these flaws. That is, XML 2.0 should be like XML 1.0 was to SGML, not like XML 1.1 was to XML 1.0. That is, it should be:

1. Compatible with XML 1.0 without upgrading tools.
2. Richer: it adds new features lots of folks want (but without breaking backwards compatibility).
3. Simpler and more efficient.
4. More self-contained: it puts more information into the documents themselves. You no longer need to parse a schema to find the types of elements.

These goals feel contradictory, but I plan to show they’re not and map out a path forward.

The technical requirement is to maintain compatibility. Existing parsers/tools still work. On a few edge cases, an XML 2.0 parser may report slightly different content than an XML 1.0 parser. E.g. more or less white space. However all XML 2.0 documents are well-formed XML 1.0 documents. It will be possible to write an XML 2.0 parser that is faster, simpler, and easier to use than an XML 1.0 parser. However XML 1.0 parsers will still be able to parse XML 2.0 documents. XML 2.0 parsers will not, however, be able to parse all XML 1.0 documents, just as XML parsers can’t parse arbitrary SGML.

The attitudinal requirement is that XML 2.0 be useful. It has to solve real problems for real users. And it has to solve all the problems XML 1.0 solves too.

Tim Bray’s Skunkworks met goals #1, #3, and #4 but it didn’t offer #2, so there was never a strong desire to implement it. Languages don’t get replaced simply because there’s something new that does the same thing a little more cleanly or a little better. They get replaced when there’s something new that does the old things better and does new things too.

What we take out

Internal DTD subsets

The internal DTD subset is responsible for much of the complexity and security issues with XML. Get rid of it. XML 2.0 documents cannot contain internal DTD subsets.
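
As a rough illustration of how cheaply this rule could be enforced even on top of today's tools, here is a minimal Python sketch built on the standard library's expat binding. The function name `reject_internal_subset` and the exception class are hypothetical, not part of any spec or library.

```python
import xml.parsers.expat


class InternalSubsetError(ValueError):
    """Raised when a document carries an internal DTD subset."""


def reject_internal_subset(document: bytes) -> None:
    """Parse `document` and raise if its DOCTYPE has an internal subset."""
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Expat passes a true flag when the DOCTYPE contains [ ... ].
        if has_internal_subset:
            raise InternalSubsetError(
                f"internal DTD subset in DOCTYPE {name!r}"
            )

    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)
```

A document with no DOCTYPE, or with only an external identifier, passes through untouched. Anything that declares entities inline is stopped as soon as the parser reaches the opening bracket of the subset.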

Validity

Validity is defined separately. The DOCTYPE declaration, if present, can point to schemas in any language with any validation rules. Perhaps we can use the public identifier to name the type of the schema and the system identifier to point to it. However validation is optional and outside the scope of the spec.

Namespace well-formedness is required

Build in namespaces. Require namespace well-formedness. All namespaces must be absolute URIs.

Neurotic and psychotic documents

All namespace prefixes must be declared on the root element. No prefix may have two different URIs in two different parts of the document. This may require rewriting of namespace prefixes when combining documents, e.g. with XInclude. This is uncommon, but if we have to do it we’ll do it.

Default namespaces may still be declared on any element.
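
A sketch of how a checker for this rule might look, using Python's `xml.etree.ElementTree.iterparse`. The function name `prefix_violations` is an assumption made for illustration. It relies on the fact that `start-ns` events for an element are delivered before that element's `start` event.

```python
import io
import xml.etree.ElementTree as ET


def prefix_violations(document: bytes) -> list[str]:
    """Return prefixes that are declared anywhere below the root element."""
    violations = []
    root_seen = False
    events = ET.iterparse(io.BytesIO(document), events=("start", "start-ns"))
    for event, payload in events:
        if event == "start":
            root_seen = True
        else:
            prefix, _uri = payload
            # A start-ns event seen after the root has started belongs to a
            # descendant. Default namespaces (empty prefix) stay legal.
            if root_seen and prefix != "":
                violations.append(prefix)
    return violations
```

Because two declarations of the same prefix on one element are already a well-formedness error, a document with no violations also cannot bind one prefix to two different URIs.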

CDATA sections

Eliminate CDATA sections. They’re unnecessary syntactic sugar and just lead to confusion among users and extra work for parser implementers. Ideally I’d also like to eliminate the special treatment of the three character sequence ]]>. However that might break existing parsers.
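
To see why CDATA sections add nothing, here is a small Python check using the standard `xml.etree.ElementTree` module. The element name `code` and its content are made up for the example.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<code><![CDATA[if (a < b && b > c)]]></code>")
with_escapes = ET.fromstring("<code>if (a &lt; b &amp;&amp; b &gt; c)</code>")

# Both documents carry exactly the same character data.
assert with_cdata.text == with_escapes.text == "if (a < b && b > c)"
```

The parser hands the application the same string either way, so the only thing a CDATA section changes is how the author typed it.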

C1 controls

The only reason C1 controls show up in an XML document is because someone has mislabeled a Windows text file. By forbidding these characters we will catch this problem much earlier.

There are likely some other Unicode compatibility characters we should forbid as well.
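
The mislabeling problem is easy to reproduce. The Python sketch below (the helper name `c1_controls` is hypothetical) saves curly quotes as Windows-1252, reads them back as ISO-8859-1, and reports the C1 controls that result.

```python
def c1_controls(text: str) -> list[tuple[int, str]]:
    """Return (offset, code point) pairs for every C1 control in `text`."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0x80 <= ord(ch) <= 0x9F
    ]


# Curly quotes saved as Windows-1252 but read back as ISO-8859-1:
mislabeled = "\u201cquoted\u201d".encode("cp1252").decode("iso-8859-1")
print(c1_controls(mislabeled))  # [(0, 'U+0093'), (7, 'U+0094')]
```

A parser that rejected U+0080 through U+009F would flag this file at the first character instead of silently passing along garbage.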

DOM and the Infoset

Abandon DOM. Abandon the Infoset. They’re confusing and not what users want or need. Folks can still use them — XML 2.0 is a subset of XML 1.0, after all — but they have no normative standing.

Encourage a variety of different APIs and data models appropriate to their respective domains and languages. However be very clear that the actual text of the document is the normative form. The data model is a representation of the text. The text is not a serialization of the data model.

White space preservation

White space is significant inside an element if and only if xml:space="preserve". Otherwise all consecutive white space is collapsed to a single space.

Alternatively, provide a means of identifying elements in which white space should be preserved in the prolog, e.g. through a processing instruction.

There are counter-examples to this (e.g. the HTML pre element), but I think this is what most people want most of the time so it makes sense to make it the default and make white space preservation the one that requires special casing. Some thought is needed to figure out the algorithm though, especially for white space like this:

<foo> bar </foo>

and this

<foo>bar</foo>

and this

The quick brown fox jumped over the lazy dog.

Exactly which white space is retained, and to which element is it assigned?

Whatever way we go here, use the same rules for all attribute values. Attribute value normalization was a mistake in XML 1.0 anyway. Drop it.
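
To make the open questions concrete, here is one literal reading of the collapsing rule as a Python sketch over an `xml.etree.ElementTree` tree. It is only an assumption about how the algorithm might go: it collapses every run to a single space, keeps that space at element boundaries, and lets xml:space="preserve" be inherited until overridden. The function name `collapse` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")


def collapse(element: ET.Element, preserve: bool = False) -> None:
    """Collapse white space runs in place unless preservation applies."""
    setting = element.get(XML_SPACE)
    if setting is not None:
        preserve = setting == "preserve"  # inherited until overridden
    if not preserve and element.text:
        element.text = WHITESPACE_RUN.sub(" ", element.text)
    for child in element:
        collapse(child, preserve)
        # The tail is text that follows the child inside *this* element,
        # so this element's setting governs it.
        if not preserve and child.tail:
            child.tail = WHITESPACE_RUN.sub(" ", child.tail)
```

Under this reading `<foo> bar </foo>` keeps one leading and one trailing space. Whether those boundary spaces should instead be dropped, or assigned to the parent, is exactly the part that still needs a decision.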

Most character encodings

Use XML 1.0 name rules but base on Unicode properties for all non-ASCII characters. Provide a means of identifying the Unicode version in use. Default to Unicode 2.0, unless the document declares otherwise. This is the change I’m least interested in, because it may break compatibility, and has no known actual use cases. That is, no one has ever been able to present a document that any actual user wants to produce that cannot be encoded with XML 1.0 name rules.

Require UTF-8 or UTF-16 exclusively. In fact, maybe just require UTF-8. No other encodings are permitted. Use the encoding declaration to identify the Unicode version and recognize XML 2.0? e.g.

<?xml version="1.0" encoding="Unicode_5"?>

Fallback to an XML 1.0 processor if this doesn’t work.

It does feel a little ugly to specify version="1.0" for what I’m calling XML 2.0. However, as long as the document adheres to the XML 1.0 grammar and constraints, this is completely legal. Maybe we shouldn’t even call this a new version of XML but give it a new name. Perhaps YML (because Y comes after X)?

standalone declaration

This means nothing in practice. No one relies on it. Get rid of it.

What we add

More entities

Predefine the HTML 4 character entity set. Otherwise eliminate all general entity references. We can make this work with a required system ID that points to a DTD containing the definitions. Of course XML 2.0 parsers will not actually load this DTD, only XML 1.0 parsers will need to load it.
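
The compatibility DTD could be generated mechanically. As a sketch, Python's standard `html.entities.name2codepoint` table holds the HTML 4 entity names, so a few lines produce the declarations an XML 1.0 parser would need. The function name `html4_entity_dtd` is hypothetical.

```python
from html.entities import name2codepoint

# Already predefined by XML 1.0 itself, so they need no declaration.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def html4_entity_dtd() -> str:
    """Build ENTITY declarations for the HTML 4 character entity set."""
    lines = [
        f'<!ENTITY {name} "&#{codepoint};">'
        for name, codepoint in sorted(name2codepoint.items())
        if name not in XML_PREDEFINED
    ]
    return "\n".join(lines)
```

The output contains lines such as `<!ENTITY nbsp "&#160;">`. An XML 2.0 parser would have the same table compiled in and never fetch the file.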

Links

Build in xml:base and xml:id.

Build in some form of simple links sufficient for use of HTML. Perhaps just xml:link or xml:href, nothing fancier. This contains a URL, and is normally considered semantically equivalent to an HTML <a href=''>. I.e. it’s a blue underlined thing you click on.

Possibly we should even ditch the namespace prefixes. E.g. base, id, and link/href attributes will simply be defined with these semantics.

Data Structures and Types

The biggest lack of XML 1.0 is a standard means of encoding basic data structures and types used in programming: lists, sets, maps, structs, ints, floats, etc. This is why JSON is so popular. It’s not that these things can’t be encoded in XML 1.0, just that there are so many ways they can be encoded, and libraries provide no support for decoding them.

To support this use case, we will allow an optional xml:type or perhaps just type attribute on all elements. The value is a name from a type library such as XSD primitive types. Predefine a basic type library with the minimal types: integer, decimal, string, boolean, date, time. For example,

<sku type="string">H647345</sku>
<date type="date">2010-10-12</date>
<quantity type="integer">12</quantity>
<price type="decimal" units="dollars">3.45</price>
<price type="decimal" units="percent">7.25</price>

The default simple type is string. I.e. we can instead write

<sku>H647345</sku>

We do not want the full set of XSD types. In particular, we do not want float, double, short, int, and long. Integers have arbitrary size, and real numbers are expressed in base 10. Parsers may express these with more or less precision as they choose.
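
A minimal sketch of what "cooked" simple types could look like to a Python client, layered on an ordinary XML 1.0 parser. The `DECODERS` table and the `typed_value` function are assumptions for illustration, as is the choice to accept only `true` and `false` as booleans.

```python
import datetime
import xml.etree.ElementTree as ET
from decimal import Decimal


def parse_boolean(text: str) -> bool:
    if text not in ("true", "false"):
        raise ValueError(f"not a boolean: {text!r}")
    return text == "true"


# Hypothetical mapping from the six proposed type names to Python types.
DECODERS = {
    "string": str,
    "integer": int,        # Python ints have arbitrary size
    "decimal": Decimal,    # base 10, no binary rounding
    "boolean": parse_boolean,
    "date": datetime.date.fromisoformat,
    "time": datetime.time.fromisoformat,
}


def typed_value(element: ET.Element):
    """Decode an element's text according to its type attribute."""
    type_name = element.get("type", "string")  # string is the default
    return DECODERS[type_name](element.text or "")
```

With this, `<quantity type="integer">12</quantity>` comes back as the integer 12 and `<price type="decimal" units="dollars">3.45</price>` as an exact decimal, while an element with no type attribute stays a string.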

We also want list and map types and maybe set:

<crew type="list">
  <name>Fred</name>
  <name>Jane</name>
  <name>Bob</name> 
</crew>

<dimensions type="map">
  <width type="decimal" units="cm">34.3</width>
  <length type="decimal" units="cm">120.0</length>
  <height type="decimal" units="cm">3.10</height>
</dimensions>

Here I’ve made the keys simply the element names. Possibly with maps we want to allow or require that the keys also be attributes or even elements, which would enable a broader range of key types.
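
The two collection examples above could be decoded along these lines. This is a sketch only: the `decode` function is hypothetical, it handles just three scalar types, and it takes map keys from the child element names as in the dimensions example.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

SCALARS = {"string": str, "integer": int, "decimal": Decimal}


def decode(element: ET.Element):
    """Turn a typed element into a list, a dict, or a scalar."""
    type_name = element.get("type", "string")
    if type_name == "list":
        return [decode(child) for child in element]
    if type_name == "map":
        # Keys are the child element names.
        return {child.tag: decode(child) for child in element}
    return SCALARS[type_name](element.text or "")
```

The crew element becomes a plain list of three strings and the dimensions element a dictionary of decimals, which is the same shape a JSON library would hand back.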

We probably want to define some sort of simple type definition that can be used by parsers rather than explicitly specifying the type of each element. E.g.

<dimensions type="map" valuetype="decimal" keytype="string">
  <dimension key="width" units="cm">34.3</dimension>
  <dimension key="length" units="cm">120.0</dimension>
  <dimension key="height" units="cm">3.10</dimension>
</dimensions>

Parsers and APIs are encouraged to make this content available to clients in a more cooked form appropriate to the programming language rather than as raw strings and nodes. However these types are all advisory, not compulsory. Further note that these types could be further parsed by a library that sits on top of an XML 1.0 parser. An XML 2.0 parser is not required.
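
As an example of such a library sitting on top of an XML 1.0 parser, the Python sketch below decodes the keytype/valuetype form of the map. The function name `decode_keyed_map` and the small `SCALARS` table are assumptions, not proposed API.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

SCALARS = {"string": str, "integer": int, "decimal": Decimal}


def decode_keyed_map(element: ET.Element) -> dict:
    """Decode a map whose entries carry their key in a key attribute."""
    key_type = SCALARS[element.get("keytype", "string")]
    value_type = SCALARS[element.get("valuetype", "string")]
    return {
        key_type(entry.get("key")): value_type(entry.text or "")
        for entry in element
    }
```

Because the types are declared once on the container, the individual dimension elements need no type attribute of their own.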

TBD: should XML 2.0 parsers treat violations of the constraints on the semantic types (e.g. <amount xml:type='int'>Bob</amount>) as a fatal well-formedness error or a non-fatal validity error? If we just use the type attribute it would have to be the latter to avoid breaking compatibility with existing documents.

The details of the type system remain to be worked out. How do we name and define new types? What does the syntax of a type definition look like? Do we need collections other than list and map? Exactly which primitive types do we predefine? Will the world really let us get away with integers and decimals or will they scream for int and float? Much work here remains to be done. But the basic idea is sound. We don’t need to reinvent the same type annotations for every vocabulary. Just as XML 1.0 improved on SGML by eliminating the freedom to use different syntax to delimit elements and attributes, XML 2.0 will improve on XML 1.0 by eliminating the freedom to use different syntax to denote types. We will not limit which types one can express. We will simply specify a standard form for denoting type information.

What we can’t do

There are a few minor changes I haven’t been able to figure out how to make while maintaining backward compatibility. These include:

Allowing -- to appear in comments
Not making ]]> special
version="2.0"

If you can figure out how to make these compatible, please do let me know. I’m almost willing to compromise on these minor points of backwards compatibility to simplify the parsing process, but I’m not sure that’s wise.

version="2.0" is the trickiest one. XML 1.0 parsers are not required to error out on this, but in practice many do. Perhaps we should drop the XML declaration completely? I.e. any document with an XML declaration is ipso facto not an XML 2.0 document. Instead all XML 2.0 documents will be identified with a specific DOCTYPE:

<!DOCTYPE XML2 PUBLIC "application/xml+xsd http://example.com/optional/actualschema.xsl" "http://www.example.com/xml20.dtd">

As mentioned above, the DTD mentioned by the system identifier is a legal XML 1.0 DTD that defines all the HTML entities. XML 2.0 aware processors will ignore this. The public identifier (which may be empty) contains a MIME media type followed by the URL to the actual schema for the document.
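
A sketch of how an aware processor might read this DOCTYPE, using Python's expat binding. The function name `schema_hint` is hypothetical, and the split on the first space is an assumption about how the media type and URL would be separated.

```python
import xml.parsers.expat


def schema_hint(document: bytes):
    """Return (media_type, schema_url) from an XML2 DOCTYPE, else None."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        if name == "XML2" and public_id:
            # Assumed layout: media type, one space, schema URL.
            media_type, _, url = public_id.partition(" ")
            found.append((media_type, url))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)
    return found[0] if found else None
```

The system identifier is received but never dereferenced here, which matches the idea that only XML 1.0 parsers need to load the entity DTD.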

The Way Forward

There’s one other reason XML 1.0 succeeded where XML 1.1 failed: XML 1.0 was designed by a small committee of like-minded folks with a common goal. They didn’t always agree on the route, but they were all driving to the same destination. The W3C pretty much ignored them until they were done. By contrast, XML 1.1 was hobbled by a W3C process that took far longer to accomplish much less. If XML 2.0 is to succeed, it needs to follow XML 1.0, not XML 1.1.

A small group of interested folks should convene outside the W3C and write the spec. One month to agree on goals and requirements; one month to write the first draft. Run it up a flagpole and see who salutes.

Step 2 is to write parsers and APIs for the new draft, and gain some implementation experience. Develop a test suite of sample documents. That will take longer, but is necessary. Work the bugs out of the spec. At the point where the goals seem to be satisfied and the spec is reasonably implemented, present it to the world as a fait accompli. If some organization feels like adopting it as a formal standard, that’s fine, though it’s hardly necessary.

The real goal is to take the lessons of the last 12 years of XML, and apply them to create something even better. Who’s with me?

P.S.

If you want to comment, please be aware that you need to escape < as &lt; and > as &gt;. The comments allow basic HTML but aren’t smart enough to distinguish between plain text and real HTML comments.

This entry was posted on Saturday, December 4th, 2010 at 7:06 am and is filed under XML.

Pete Cordell Says:
December 4th, 2010 at 8:50 am

This is a good start. Here are my 2 cents…

Neurotic and psychotic documents
I’m not sure that limiting namespace declarations to the root element is that important, but my experience is that it would help if all namespace prefixes were declared before any other attributes in a start tag. This reduces the amount you have to look ahead in event driven code.

White space preservation
I think the default should be to preserve white space, even in attributes. The application can decide to discard it with methods like getTextCollapse(), getTextReplace() and getTextPreserve(). By removing white space you’re modifying the data and I don’t see that what is effectively a transit layer should have the authority to do that.

More entities
I disagree with including more built-in entities. Getting the characters that you want in a document should be an editor issue and not an XML issue.

In the area of entities, the choice of ‘&’ as the escape character seems very unfortunate. It’s too late to change this, but I think the character sequences ‘&amp;’, ‘&gt;’, ‘&lt;’, ‘&apos;’, and ‘&quot;’ should cause the replacement they currently do, but any other sequence following an ‘&’ character should have no special meaning. Thus, if you type ‘& then’, your parser returns ‘& then’ rather than an error.

Data Structures and Types
I like the idea of xml:type, but I think it’s only required for complex types when mirroring the functionality of polymorphism. It’s not required for simple types. If an application doesn’t know what an ‘X’ is, then knowing that it is an ‘int’ is not really going to help it.

Comments
I would allow -- (two dashes) to appear in comments. There’s no benefit to not allowing -- and it reduces the surprises that a novice user might encounter. (In fact ‘A minimum of surprises’ should be an XML 2.0 axiom!)

I think the XML 2.0 spec should have a ‘Compatibility’ section in it that says if you want your XML 2.0 document to be valid XML 1.0, then take note of the following … . One of the things it would mention is comments.

Elliotte Rusty Harold Says:
December 4th, 2010 at 9:22 am

Your comments fall into two categories, things that don’t break compatibility and things that do. Before we can consider whether it’s better to allow unescaped ampersands and -- in comments, we’d need to decide that it was worth breaking backwards compatibility. One of the goals here is to preserve the existing toolchain and not require a big bang upgrade of all parsers, XSLT processors, schema validators, etc. I’m afraid that if we break syntax level compatibility, then we’ll founder on the same shoals where XHTML 2.0 ran aground.

John Cowan Says:
December 4th, 2010 at 12:28 pm

You might be surprised to hear that the loudest and most frequent single complaint about XML 1.1 was the elimination of C1 characters, which we took out because Rick Jelliffe pointed out that they almost always indicate an error in the encoding. People are okay, really, with extending the set of well-formed XML documents by adding new name characters or whatever. They are absolutely NOT okay with diminishing that set by so much as a single document, no matter how unlikely or useless the feature is. All they can think is, Some documents might have a C1 control, maybe from bad conversion of database data (databases notoriously don’t have proper text encoding information), and now they won’t work. You’ll notice that XML 1.0 Fifth Edition still allows C1 controls, and that’s why.

The XML Core WG just discussed this point last Wednesday, because we are coming up for rechartering. The consensus was that *no* XML 2.0 will ever fly unless and until it has *compelling new features*. (Why did MS-DOS 2.0 replace MS-DOS 1.0? Because it had subdirectories and pathnames, even though all Windows systems even today have to have both long and short filename support solely in order to keep MS-DOS 1.0 system calls, and programs that believe in the MS-DOS 1.0 style of filenames, working.) And you aren’t offering a single new feature, never mind a compelling one.

Entities? Big deal, as Pete Cordell says above. (Anyway, there’s a better list than HTML 4’s now.) If you want xlink:href (you no longer need xlink:type, as of XLink 1.1) and xsi:type, you know where to find them. But the real point is that the price for XML 1.0 has already been paid. The parsers are written. The DOMs and XOMs are written. The books are written. Most XML documents are never even seen by a human being; they are created by data-binding libraries and consumed by them. The XML 1.0 infrastructure exists, and nobody’s going to dismantle it in the name of “simpler”, which really means “new bugs”. Likewise, the JSON infrastructure exists, and nobody’s going to replace that either.

New features, new features, new features! Find serious new features and nobody will be more ready to listen than me.

Elliotte Rusty Harold Says:
December 4th, 2010 at 1:43 pm

The biggest new feature promised here is precisely what JSON offers and XML doesn’t: a standard way of representing typed data and basic data structures. I think that’s enough to justify the rest of it, but I could be wrong about that. If you’re right and that’s not enough, then I agree that it follows that this effort will fail.

Arguably we could simply graft the type-annotated data onto XML 1.0 and new APIs. That might be enough to declare victory and go home, though I’d regret the opportunity to clean up some of the uglier points of XML. And it would be difficult to do that without wrapping it in so much namespace cruft that few would accept it and no one would love it. (That, by the way, is a big reason XLink, even in version 1.1, was DOA. The namespaces made it too ugly to stomach.)

Two things JSON offers that this proposal still doesn’t are easy evaluation in JavaScript and a tunnel through browser security restrictions. The first is fixable with a good library. The second may be a problem.

Jesse Wilson Says:
December 4th, 2010 at 3:03 pm

XML was designed to handle markup and to be SGML compatible. Though our industry has tried, XML is not particularly good at modeling structured data with a predefined schema.

XML 1.0-compatible parsers cannot handle structured data better than JSON. Consider your list example. In XML, we have this:
<crew type="list">
<name>Fred</name>
<name>Jane</name>
<name>Bob</name>
</crew>
In JSON, equivalent data is represented like this:
[ "Fred", "Jane", "Bob" ]
This is easier to read, consumes fewer bytes, and it parses faster. XML 1.0 can’t do this!

XML has its strengths. It is great for marking up text. It’s self-descriptive. Should XML 2.0 play to these strengths and yield structured data to JSON? It’s not even unreasonable to mix the two:
<html>
Today the <crew members="['Fred', 'Jane', 'Bob']">backup crew</crew> will be repairing
the <inventory dimensions="{'width': 34.3, 'length': 120.0, 'height': 3.10 }">countertop</inventory>.
</html>

Michael Kay Says:
December 4th, 2010 at 5:56 pm

It’s not clear to me what your compatibility objectives are. They need to be very clearly stated. You seem to be more interested in having existing software continue to work than existing data continue to work, which seems odd – there is far more investment in data than in software. Clearly if both existing software and existing data have to continue to work then we can’t get rid of any of the existing crud, which rather spoils the fun. So I’d be inclined to break both. Or perhaps to have an objective like this: 99% of real-life well-formed XML 1.0 documents should also be well-formed XML 2.0 documents, and should be processable by either an XML 1.0 or XML 2.0 parser.

Tony Marston Says:
December 5th, 2010 at 5:21 am

Removing the CDATA section is not something I would support. How else would I add content to an XML document, such as an HTML fragment, without having ‘<>’ automatically converted to ‘&lt;’ and ‘&gt;’? Don’t tell me to use ‘disable-output-escaping="yes"’ as this doesn’t work in Firefox – bug 98168 has been outstanding since 2001.

David Carlisle Says:
December 5th, 2010 at 6:29 am

I agree very much with your comment re xhtml2. the primary reason why that failed is that they didn’t keep the browser makers on board, and an “xhtml” not supported on the browser, even though it had some good points, didn’t have enough distinguishing features over other xml vocabularies such as docbook/tei/dita.

Of your original comment, wouldn’t the <!DOCTYPE XML 2 PUBLIC suggestion make every xml2 file non-well formed as xml 1?
Or did you not intend a space before the 2?

On entities, if we do anything I think we should standardise on the same set as html5, which is
http://www.w3.org/2003/entities/2007/htmlmathml-f.ent

Since this whole push was instigated by an observation that xml could do better “on the web” I suspect that we should also look at allowing an explicit html-like parsing mode (implied end tags, implied attribute quoting, no fatal errors) Also suggested for example here
http://annevankesteren.nl/2007/10/xml5

Julian Reschke Says:
December 5th, 2010 at 7:08 am

Tony: having CDATA doesn’t help you with your XSLT-without-d-o-e problem. You can produce the CDATA section, but you aren’t able to fill it with unescaped markup. FWIW, I believe that FF not allowing d-o-e is a good thing.

Elliotte Rusty Harold Says:
December 5th, 2010 at 7:10 am

David, you’re right. I didn’t intend the space between the XML and the 2. That was just a typo. Fixed.

Tony, you seem to be confused about what CDATA sections are, what they mean, and what they do. There is *no* difference between <![CDATA[<>]]> and &lt;&gt;. If you’ve somehow hacked together a processing chain in which this somehow matters, then your system is buggy. That this continues to confuse intelligent people is exactly why CDATA sections should not have been included in XML in the first place.

Lorenzo Gatti Says:
December 6th, 2010 at 4:50 am

Most of the proposed changes don’t seem well motivated.

– Why put links in the core XML spec? Anything beyond id and refid within a document cannot be relevant for the document itself; applications that want to follow URIs in attributes and text to fetch document fragments can do so according to extra conventions like HTML links, xml:base etc. or in whatever other way they please.

– Why commit to a specific set of data types? What’s the usefulness of labeling “3.14” as a “double” in the document itself, since any application is expected to know that attribute and text nodes are formatted numbers, dates etc. either implicitly or with the help of other markup? Like links, such mechanisms don’t belong in the general XML specification.

– Why define a truckload of new character entities when we can use any Unicode character directly? Except for necessary escaping of special characters, HTML-style character entities are a pre-Unicode legacy technology that should be gently discouraged and phased out, not adopted in XML with a significant burden to parser implementers.

– Why restrict namespace prefix declarations to one URI per prefix and to the root element only? Apart from wanton disruption of backward compatibility, the current rules are simple and consistent, already implemented without problems, and useful in practice: to include document portions verbatim without needlessly swizzling prefixes around, to have some room for steganography, to attach meaning to prefixes rather than URIs, and (real-world experience) to let DTDs expect prefixes and inject them through entities however they damn please.

– Why give up perfect round-tripping of whitespace? As others have noted, applications can normalize whitespace as they please.

Gratuitous incompatibilities, like entity hate, mandating UTF-8 or UTF-16 and forbidding C1 control characters or CDATA sections, are beneath discussion unless time travel technology allows us to carry our hindsight a few decades into the past to benefit the ASCII, SGML, HTML and XML 1.0 standardization processes.

The only suggestions I agree with are rejecting XML Schema, DOM, XML Infoset etc. as excess baggage and the principle that the document is authoritative.

Jirka Kosek Says:
December 6th, 2010 at 7:01 am

I think it is crucial that XML 2.0 is a subset of XML 1.0. Otherwise an extensive change in tooling would be required and this will hardly happen.

I think that we need simplified XML:

– no !DOCTYPE at all (just allow <!DOCTYPE xxx> for compatibility with polyglot HTML5)
– no other encoding than UTF-8
– no CDATA

It would be nice to have some new features you are suggesting. Using a bunch of new xml: attributes could solve this in a backward compatible way. But I think that we need to give up on entities. They would be nice, but it’s a breaking change.

I think that using !DOCTYPE for pointing to schemas other than DTDs is a really bad idea. There is already a superior alternative to this: the <?xml-model processing instruction, http://www.w3.org/TR/xml-model/

The question is whether creating such XML profile would have any benefit and would make XML more accepted in typical web development (other niches are quite happy with current XML)

Paul Prescod Says:
December 6th, 2010 at 12:34 pm

It’s a myth that XML was backwards compatible with SGML. Actually, SGML changed to allow it to still be a superset of XML. The new version of SGML was entitled “WebSGML” and it was a really big change.

http://www1.y12.doe.gov/capabilities/sgml/wg8/document/1929.htm

The biggest changes were the ability to have documents without a DTD and the empty tag syntax. Virtually no SGML tools could work with XML out of the box and virtually no SGML documents would work with XML tools out of the box. The claim that XML was an SGML “subset” as opposed to a variant was mostly made for rhetorical reasons, not because it was accurate or realistic.

Etienne Maheu Says:
December 6th, 2010 at 1:42 pm

I agree with you. I think the CDATA syntax must be obliterated. Trigraphs are just one more way to slow down a parser and the CDATA syntax isn’t particularly friendly either. Unfortunately, sometimes, you need to store raw data. Here is what I propose. Add type="raw" to the list which would prevent the parser to go any further. I don’t really think people use the replacement capability of the CDATA element but, we could always implement them using a find="" and replace="" attribute.

mario Says:
December 6th, 2010 at 2:06 pm

I wish someone would have the wisdom to do away with the URI craze. Many consumers rely on explicitly named XML namespaces anyway (see any PHP application), so it makes little sense to allow arbitrary prefixes. Get a damn registry. It’s not like there are really a quadrizillion different namespaces. And having between predefined xml: and xmlns: handles on one hand and user-defined URI associations on the other is a pretty obvious anomaly.

Onne Says:
December 6th, 2010 at 3:23 pm

many comments already, I’ll try to be short

Why not remove namespaces altogether, and simply allow ‘:’ (colon) to be part of a valid name.

Then I would move to “recommend” certain attributes to be interpreted in certain ways:
1. xmlns would be the recommended way to add namespaces to xml
2. type="number" would be the recommended way to signal the text might be interpreted as a number
3. id="…" is the recommended way to name and id elements in a xml doc and should be unique
4. href="…" is the recommended way to create links
5. the map/list idea is good and just “recommend” it

but never enforce any of these

leave whitespace alone, but “recommend” that apps normalize all whitespace and “recommend” a way to signal the intended whitespace removal

no schemas, no doctypes, only utf8 (and maybe “recommend” using unicode="…" if this appears before any usage of non ascii ranges)

all this “recommending” should map to what the current XML specs and proposals “require”

why? because it is simpler and what people expect, all other semantics can now be left up to higher level APIs to implement, if wanted.

Notice two nice properties:

1. A xml 1.0 compliant API is an example of “a higher level API”, and can be written on top of a xml 2.0 parser (API). However, it does require the xml document to have more strict structure than xml 2.0 sees as valid. (Minus the doctype declarations … bummer, just special case those for “compatibility reasons”.)

2. html5 minus the “how to correct non nested tags” can be written on top of a xml 2.0 parser (API).

Leif Halvard Silli Says:
December 6th, 2010 at 5:16 pm

@Paul Prescod I think I disagree that lack of «out of the box-compatibility» means that it is just rhetorics.

Btw, the relationship of WebSGML to SGML and XML reminds of the relationship of Polyglot Markup to XHTML and HTML5: It has no DTD — something which not all tools are out of the box able to handle (despite that it is within the XML spec), and neither are all tools able to out of the box hold them back from writing <div/> instead of <div></div> (despite that it is within XML/XHTML to do so). In addition, parsers are not fully HTML5-compatible yet either — though they are in some respects more fully compatible with Polyglot HTML5 than with “plain” HTML5).

When I think about it: WebXHTML could have been a good alternative name for Polyglot Markup …

Andy Says:
December 6th, 2010 at 7:22 pm

I’d just drop the ugly syntax of XML and use YAML (the real “YML”) instead. It’s just as flexible(?) as XML but *much* cleaner. Just a dream.

Elliotte Rusty Harold Says:
December 6th, 2010 at 8:49 pm

Like JSON, YAML is demonstrably weaker than XML and cannot handle some of the most important document types; e.g. HTML or anything like it. Mixed content is not a mistake.

Tony Marston Says:
December 7th, 2010 at 6:08 am

The reason that I use CDATA sections and disable-output-escaping is because I generate all my HTML pages via XML and XSL transformations, and sometimes I want to add a fragment of HTML code to my XML document and have it processed as HTML.
– I have to specify CDATA when I add the content to the XML document otherwise ‘<‘ and ‘>’ will be converted to ‘&lt;’ and ‘&gt;’.

– I have to specify ‘disable-output-escaping’ during the XSL transformation otherwise ‘<‘ and ‘>’ will be converted to ‘&lt;’ and ‘&gt;’.

If you can explain how I can have the tags in a data element *NOT* converted to entities, and thus have it processed as HTML instead of plain text *AND* have it work in *ALL* browsers then I would be glad to hear it. Saying that I *shouldn’t* need to do it is not an acceptable answer.

Elliotte Rusty Harold Says:
December 7th, 2010 at 6:44 pm

I’m pretty sure I know what you’re doing, Tony, though I can’t be 100% sure without seeing the actual code in question; and yes, you’re doing it wrong. You’re effectively escaping twice when you should be escaping zero times. Remember, an XSLT stylesheet only contains XML.

Tony Marston Says:
December 8th, 2010 at 5:47 am

The fact that an XSLT stylesheet contains only XML is irrelevant. I have a piece of data that I am inserting into the XML document which is subsequently transformed into HTML, but this piece of data contains some HTML tags which I want displayed as HTML, not data. For example, I may have a piece of text such as “This is a <b>good</b> idea”, and when the HTML document is displayed I want to see the word “good” in bold instead of seeing the <b> and </b> tags.

The problem is not that *I* am escaping the data twice, but that it is being escaped by the software. I create my XML document using the DOM extension in PHP, and if I use the createTextNode() function it automatically escapes any HTML tags, so I have to use createCDATASection() instead. But then the XSLT processor will automatically escape any HTML tags, including those in a CDATA section, when it creates the output, which is why I have to use the “disable-output-escaping” option. This works in every browser except Firefox.

If both CDATA sections and disable-output-escaping are removed then how on earth am I supposed to achieve the result I want?

Elliotte Rusty Harold Says:
December 8th, 2010 at 7:49 am

Tony, I’m sorry but you are doing it wrong. You still aren’t giving us full details so the problem could be in the process that generates the XML, or in the process that consumes it; but there is a problem here. Stop using CDATA sections. Stop using disable-output-escaping. You don’t need either one. You likely do need to change your stylesheet to treat the HTML as nodes rather than as text.

XSLT is not intended or designed to treat markup as text. It needs to treat markup as markup. It generates nodes, not text that contains markup. There is no such thing as innerHTML in XSLT (and there shouldn’t be in DOM either, but that’s another discussion).

Chris Arthur Says:
December 8th, 2010 at 8:55 am

You touched on the idea of using XSD defined types for data structures in the XML, but you didn’t quite hit on something I’ve mulled around for a while now. If we treat XML as the instantiation of the data then XSD fits the role of the data model itself. If we hook XSLT and XQuery up to that data model we have the pieces I for one would be looking for and then you could set up generic data structures in XSD that could be used programmatically by XSLT, XQuery, or any outside programming language like Java, C#, or JavaScript. At that point binding XML gets much more interesting because the class equivalent is already present and can be used to simply create a class on the fly by a programming language. Using the newly defined standard library would make representations in XML or data structures uniform and using XML to trade data would be that much easier. I’m sure there are issues with this idea, but it’s always been a slight sticking point with the way that my company uses XML and related technologies.

Tony Marston Says:
December 9th, 2010 at 1:43 pm

@ Elliotte Rusty Harold

No, I am not doing it wrong, I’m doing what I have to in order to make it work. I have some HTML markup that I want to insert into my XML document so that when it is transformed into HTML the markup is displayed as HTML – i.e. the tags are not changed into entities. The only way I can do this is to use createCDATASection() when I put the data into the XML document and “disable-output-escaping” in the XSL stylesheet otherwise the tags are automagically changed into entities.

Your comment that “XSLT is not intended or designed to treat markup as text” just shows that the design of XSLT was wrong in the first place. There are plenty of users who want to put HTML markup in their XML document and have it displayed as markup and not converted into plain text, and it’s what the users want that matters, not the notion of “purity” that exists in the minds of some designers.

Elliotte Rusty Harold Says:
December 10th, 2010 at 7:12 am

Tony, you still don’t get it. You can do what you’re trying to do without all the hassles you’re going through, but you have to do it the right way. Your stylesheet and/or input document are not written correctly to produce the outputs you’re trying to get from the inputs you have. Your stylesheet is buggy. XSLT does not change HTML tags into entities. That you think it does just shows that you don’t understand what’s in your document/stylesheet in the first place. It is absolutely possible to put HTML markup into an XML document and have it displayed as markup and not converted into plain text when transformed by XSLT. You just haven’t done that. 50% chance you’ve put plain text into the XML document that is transformed into plain text when converted by XSLT. 50% chance you’ve put plain text into the XSLT stylesheet that is output as plain text. Either way, that’s your problem, not any perceived design flaw in XSLT or browsers. (And bringing this back on topic if you didn’t have CDATA sections you wouldn’t have been able to make this mistake in the first place. The language would have forced you to find the correct solution instead.)

Tony Marston Says:
December 10th, 2010 at 11:49 am

You obviously haven’t read what I wrote. When I use PHP’s createTextNode() function to add text to my XML document it automatically converts/escapes any HTML tags into entities. The only way I can avoid this is to use the createCDATASection() function instead. This is in accordance with the XML specification.

Even if my XML document does contain any unescaped HTML the XSLT processor, whether it is server-side or client-side, will STILL escape it UNLESS I use the “disable-output-escaping” option. This again is in accordance with the XSLT specification.

Unless you can supply me with code which does NOT cause html to be escaped, either on input to the XML document, or during the XSL transformation process, then I’m afraid that it is you who is wrong, not me.

Elliotte Rusty Harold Says:
December 10th, 2010 at 5:46 pm

Post complete details including sample code on xsl-list or a PHP list and 20 people can show you exactly what you’re doing wrong. But it is behaving as intended. Why are you calling createTextNode() when you want non-text? You told the API the content was text, not markup, and that’s what you got. If you want elements, call createElement(). Don’t call createTextNode().

Tony Marston Says:
December 11th, 2010 at 9:43 am

This is how I add an element into my XML document:

$occ = $xml_doc->createElement('footer');
$occ = $root->appendChild($occ);

To add data to this element I use one of the following:

either: $child = $xml_doc->createTextNode($data);
or : $child = $xml_doc->createCDATASection($data);

followed by: $child = $occ->appendChild($child);

If I use createTextNode() the data is escaped. If I use createCDATASection() it is not escaped.

Within my XSL stylesheet I output the element data as follows:

Even if the element data exists in a CDATA section it is still escaped unless I do this:

If you can tell me how to get the XSL transformation to output an element’s data without escaping the HTML tags and without using disable-output-escaping then I will listen to what you have to say. Until you can furnish me with the “right” method I cannot accept that what I am doing is “wrong”. It is the only method in either the XML or XSL documentation which works. If there is another method, then WHERE IS IT DOCUMENTED?

Tony Marston Says:
December 11th, 2010 at 9:47 am

As usual this blog software cannot deal with ‘<‘ and ‘>’ characters. The XSL commands I attempted to add to my previous post were:

<xsl:value-of select="/root/footer" />

and

<xsl:value-of select="/root/footer" disable-output-escaping="yes"/>

Elliotte Rusty Harold Says:
December 11th, 2010 at 10:14 am

That is why I suggest you use a mailing list instead.

I already told you: don’t use createTextNode to create markup. It doesn’t do that. It creates text, and that’s what you get. You still haven’t given me a complete working example, but the short version is, if you want to create the HTML text then you have to use createElement or equivalent. E.g.

$span = $xml_doc->createElementNode("span");
$text = $xml_doc->createText("text");
$span->appendChild($text);

I haven’t done PHP in a while so I may be off on a detail or two, but that should be close. What you’re doing wrong, and what you seem to have a hard time believing is wrong, is this:

 $span = $xml_doc->createTextNode("<span>text</span>");

That does not tell DOM to create an element named span. It tells it to create a text node whose first character is < and that’s exactly what you get. Now you could take that text and output it using a pure text-to-text process if that text is all you need, and no escaping would occur. You can treat HTML as pure text. But you seem to be piping that into an XSLT process that also generates some of your HTML, so you want to mix HTML created by the node-based XSLT with text fragments of HTML, and that’s just not going to work. You’re mixing two different models: one based on nodes and one based on strings, and trying to concatenate strings to nodes and put nodes inside strings, and it’s just a mess. The processor has no way of knowing which is which at this point. No wonder you’re having problems.

(And I still don’t see how CDATA sections fit in here. Based on what you’ve said, they should make no difference at all. In fact, maybe they don’t. Does your pipeline actually break if you use createTextNode instead of createCDATAsection or are you just bothered by seeing escaped markup? If you’re passing the output of PHP straight into XSLT, I don’t see how that could make any difference at all because XSLT does not treat CDATA sections differently than escaped text.)

As to where all this is documented, that would be the DOM specification, as implemented by the PHP API you’re using which explains what a text node is and what the createTextNode method does.
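The difference between a text node, a CDATA section, and a real element can be seen in a minimal sketch. It uses Python's xml.dom.minidom instead of the PHP DOM extension, on the assumption that both follow the W3C DOM method names (createElement, createTextNode, createCDATASection); the footer element and the sample string are illustrative only.

```python
from xml.dom.minidom import Document


def footer_with(make_child):
    """Build a <footer> holding one child made by make_child(doc); return its XML."""
    doc = Document()
    footer = doc.createElement("footer")
    footer.appendChild(make_child(doc))
    return footer.toxml()


def bold_element(doc):
    """Create a real <b> element node containing the text 'good'."""
    b = doc.createElement("b")
    b.appendChild(doc.createTextNode("good"))
    return b


data = "<b>good</b>"  # a string that merely looks like markup

# Text node: the serializer escapes the angle brackets.
as_text = footer_with(lambda d: d.createTextNode(data))
# CDATA section: same characters, wrapped instead of escaped one by one.
as_cdata = footer_with(lambda d: d.createCDATASection(data))
# Element node: actual markup in the tree, nothing to escape.
as_element = footer_with(bold_element)

print(as_text)
print(as_cdata)
print(as_element)
```

Only the third form puts a b element into the tree; the first two store the same characters as text, which is why a downstream XSLT step sees text in both cases.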

Tony Marston Says:
December 12th, 2010 at 6:29 am

The code sample I provided earlier showed that I was creating an element called ‘footer’ and giving it content from a variable called $data. What could be simpler than that?

There is no createElementNode() function, it is called createElement(), and this creates an element name without any data, attributes or child elements.

There is no createText() function, it is called createTextNode(), and this adds text to the specified element. The data is automatically escaped, but if I use createCDATASection() instead of createTextNode() it is not escaped.

Regardless of how I insert the data into my XML document it will be escaped during the XSL transformation unless I use the “disable-output-escaping” option.

I do not want my data to be escaped either when it is inserted into the XML document or during the XSL transformation, which is why I am using both CDATA and d-o-e.

You still haven’t told me how I can prevent my data from being escaped during the XSL transformation without using “disable-output-escaping”.

Lorenzo Gatti Says:
December 13th, 2010 at 8:52 am

@Tony
What you are missing is that the CDATA section itself is a form of escaping, wholesale instead of character by character: the result, except for silly parsers insisting that you have a “CDATA node” instead of a “text node”, is identical for all practical (and theoretical) purposes.

You are still improperly treating markup as text that needs escaping rather than as markup that can be left as is; if you build DOM trees you necessarily introduce escaping when you serialize text or CDATA nodes.
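The claim that a CDATA section and character-by-character escaping are the same thing to a parser can be checked with a short sketch. Python's xml.etree.ElementTree is used here as a stand-in parser (an assumption; the thread itself is about PHP and XSLT), and the footer element is illustrative.

```python
import xml.etree.ElementTree as ET

# The same characters, escaped two different ways in the source document.
escaped = ET.fromstring("<footer>&lt;b&gt;good&lt;/b&gt;</footer>")
cdata = ET.fromstring("<footer><![CDATA[<b>good</b>]]></footer>")

# Both parse to identical text content...
same_text = escaped.text == cdata.text
# ...and neither contains a child element named b.
no_child_elements = len(escaped) == 0 and len(cdata) == 0

print(repr(escaped.text), same_text, no_child_elements)
```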

On the other hand, disable-output-escaping is an useful hack offered by XSLT in order to support your special, minority need (islands of raw string concatenation in XML document trees): why don’t you want to use it?

If you know your $data string is an appropriate fragment of XML (I wouldn’t be particularly sure), you should concatenate it with other strings, without involving XSLT or DOM functions, obtaining a complete XML document. The right way to process $data=”Mixed content and proud of it, man”

(in practice you would retrieve $data from a database) is the simplest one: a template like <p>${data}</p>

You might want to parse the result to verify that it’s well-formed.

Tony Marston Says:
December 14th, 2010 at 5:15 am

@ Lorenzo Gatti

> On the other hand, disable-output-escaping is an useful hack offered by XSLT …
> why don’t you want to use it?

I *AM* already using it, but this blog post is all about *NOT* using it, so I want to know what the alternative is. I am putting an HTML fragment into an XML document and transforming it into an HTML document via an XSL stylesheet, but I do not want the HTML tags escaped into ‘&lt;’ and ‘&gt;’, I want them left as ‘<‘ and ‘>’.

Elliotte Rusty Harold Says:
December 14th, 2010 at 7:58 am

Contrary to your claims, you are *not* putting an HTML fragment into an XML document. You are putting plain text that happens to contain some < and > characters that superficially look like tags into an XML document. That is why it comes out the other end escaped. The only way to put HTML into an XML document using standard DOM is to insert the actual elements. You are not doing that.
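One hedged way to "insert the actual elements" when the markup arrives as a string is to parse the string first and import the resulting nodes. The sketch below uses Python's xml.dom.minidom as a stand-in for the PHP DOM extension; append_fragment and the wrapper element are hypothetical names, and it assumes the fragment is well-formed XML (a parse error is raised otherwise).

```python
from xml.dom.minidom import Document, parseString


def append_fragment(doc, parent, fragment):
    """Parse a well-formed XML fragment string and append its nodes under parent."""
    # Wrap the fragment so it parses as a single-rooted document;
    # "wrapper" is a throwaway name that never reaches the output.
    parsed = parseString("<wrapper>" + fragment + "</wrapper>")
    for node in list(parsed.documentElement.childNodes):
        # importNode makes a deep copy owned by the target document.
        parent.appendChild(doc.importNode(node, True))


doc = Document()
root = doc.createElement("root")
doc.appendChild(root)
footer = doc.createElement("footer")
root.appendChild(footer)

append_fragment(doc, footer, "This is a <b>good</b> idea")

xml_out = root.toxml()
print(xml_out)
```

With the b element present as a node, a stylesheet can copy the footer's children with xsl:copy-of instead of xsl:value-of, and no escaping question arises.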

Tony Marston Says:
December 15th, 2010 at 10:21 am

A piece of text which is not a complete HTML document but which contains a fragment of an HTML document (insofar as it contains a number of HTML tags) is, IMHO, an HTML fragment.

You still have failed to tell me how I can insert a string containing HTML tags into an XML document without having those tags escaped during the XSL transformation without the use of the “disable-output-escaping” option.

Until you can identify the “correct” way to achieve this it is pretty pointless telling me that my current methods are incorrect.

Lorenzo Gatti Says:
December 16th, 2010 at 3:39 am

Besides XSL with disable-output-escaping, which looks like a perfect fit for your needs and doesn’t seem inconvenient or undesirable at all, the correct methods are treating everything as XML, which you are unwilling or unable to do, or treating everything as a string, which I tried to suggest as a somewhat unsafe (for the risk of creating malformed or malicious documents) but natural and very simple approach in PHP:

<html>

<body>

<?php $data='Mixed content from a database';?>

<?php echo $data;?>

</body>

</html>

Tony Marston Says:
December 18th, 2010 at 5:25 am

That approach will not work when the XSL transformations are performed by the client.

All the data is supposed to be contained in the XML document, and the XSL transformation is supposed to produce an HTML document from that data. Whether you like it or not there are instances where I *DO NOT* want the data to be escaped – I want any HTML tags treated as HTML tags and not escaped into ‘&lt;’ and ‘&gt;’.

Ramayya Says:
January 8th, 2011 at 5:13 pm

May I suggest developing a Concept Map of XML 2.0 to help all the people involved in developing XML 2.0 to have the same Mental Map of XML 2.0?

Here is a freeware tool that can be used to collaboratively develop the Concept Map —

Ramayya Says:
January 8th, 2011 at 5:16 pm

For some reason the URL to the tool did not show up in my previous post. The tool is called IHMC Cmap Tool —

http://cmap.ihmc.us/.

I am enclosing the graphic from IHMC Website for your reference —

http://cmap.ihmc.us/CmapTools%20-%20Home%20Page%20Cmap.jpg

Kenneth Bernholm Says:
January 16th, 2011 at 2:55 am

@Tony: I think I have the solution to your situation. It may be a nasty little hack, but it works for me.

I have an XML/XSL-based site where users can enter text in html textarea forms. This text may contain html tags and is stored verbatim in the database. When I use PHP to build my XML I fetch the text/html from the database and place it in my XML DOM using createCDATAsection. I write my XML DOM to the browser using saveXML.
In the XSL I use output method=”html” and xsl:copy-of to place the element containing the text/html. So far everything is proper but since my XML has the CDATA-tags around the text/html, I’ll get html entities (same as you I believe). My trick for solving this is:
$xml=$dom->saveXML();
$xml = str_replace( '<![CDATA[', '', $xml );
$xml = str_replace( ']]>', '', $xml );
echo $xml;
By simply removing the CDATA delimiters I get XML with non-escaped html embedded in the XML which the XSL parses nicely into html. Ugly but functional.

Warning: The user has to submit proper xhtml because html with unterminated tags makes the process go boom! So you need to check the user’s input for xhtml-errors. The easiest way to do this is to use one of the free web editors for embedding in a web page. Several of these clean up the input automatically and store it as either html or xhtml. (If your data comes from somewhere else this is irrelevant to you.)

Hope this helps a bit 🙂

Tony Marston Says:
January 17th, 2011 at 1:33 am

I don’t like hacking the code to make it work. If the new standard removes functionality that I need then the solution is quite simple – I shan’t use the new standard.

Kenneth Bernholm Says:
January 17th, 2011 at 2:01 am

@Tony: I agree with you about hacking the code. I don’t like that either. The problem in PHP is that createTextNode insists on converting bracket characters to HTML entities. This is fine in itself but not when you don’t want your brackets converted. A simple flag (i.e. NO_BRACKET_CONVERT) on the createTextNode function would solve the problem.
Using the createCDATAsection function and then removing the CDATA delimiters is quite a simple hack and it does the job. Personally I find hacks of this size acceptable (albeit unwanted) as long as I know exactly why and where I’m using them. Also I’ll keep looking for a better solution. If you find one please post it.

Tony Marston Says:
January 25th, 2011 at 1:26 am

If I have a CDATA element in my XML file where the ‘<‘ and ‘>’ characters have *NOT* been converted into HTML entities then why can’t the XSLT processor leave the data alone and output it “as-is” without escaping it? If I wanted the data escaped then I would ensure that it was escaped when I put it into the XML file. Only one of XML/XSLT should perform the escaping, not both. Not being able to turn escaping off is a totally bad idea.

Bill Goggin Says:
February 13th, 2011 at 2:31 pm

I have a use case that I think makes CDATA desirable for XSLT transformations. I’m generating HTML email bodies from an XML file containing data that varies per email. However, the headers, footers and some other content need to vary by brand. Brands are discriminated by an element in the input XML. Much of the HTML that varies by brand contains unbalanced HTML tags that are not well formed XML. The brand info needs to be edited fairly often by HTML coders, so character escaping each angle bracket won’t work very well. CDATA at least lets the fragment look like HTML to humans. I do not have editorial control of the HTML so I cannot force the brand specific pieces of HTML into balanced well formed XML fragments. I think I don’t have a choice but to treat the fragments as text instead of markup. Well, I could theoretically use a different templating technology, but in reality I don’t get to make that decision. Given my constraints, am I doing it wrong?

I think I’m understanding ERH’s argument that it is better to build up HTML output node by node, but I don’t see how I can do that given my inability to control the HTML, but maybe I’m really not getting it. I suspect problems like this are not rare.

Elliotte Rusty Harold Says:
February 13th, 2011 at 5:57 pm

Yes, Bill, you’re doing it wrong. Why does the HTML that varies by brand contain unbalanced HTML tags that are not well formed XML? I see no reason that has to be, or should be so. Make that markup well-formed and your problems with this vanish.

Bill Goggin Says:
February 21st, 2011 at 12:09 am

@ERH I need to insert text within paragraph tags. I have to open the paragraph in one template and close in another. I won’t try to insert tags here with all the escaping, but suppose the double quotes here are a set of paragraph tags, “This document is issued to [firstName] [lastName] and is intended only for them.” Except for the first name and last name, the entire sentence would be part of the brand information. Other brands might use something completely different. I think I need to first insert “This document is issued to” with the opening paragraph tag, my per email first name and last name, then “and is intended only for them” with the closing paragraph tag. CSS to mimic paragraphs doesn’t seem like a good solution because email clients have uneven support.

I hope and suspect I’m missing some really obvious way to interrupt a paragraph in HTML and I’m going to say “of course” and curse my mental block when I see your reply. While CDATA sort of solves my problem, I’d like to have a better solution.

Thanks to you and the other commenters for an interesting thread of discussion.

Bill Goggin Says:
February 21st, 2011 at 8:51 pm

@ERH I see it now. I can just create the paragraph, then 2 text nodes inside. I don’t know why it was so hard to see this earlier, but I knew it had to be something I was just looking at the wrong way.
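The shape of that solution can be sketched with a DOM API: one paragraph element, brand text on either side, and the per-email name in between, with no unbalanced tags anywhere. This uses Python's xml.dom.minidom purely for illustration; issued_to_paragraph and the sample name are hypothetical, and in an XSLT stylesheet the same idea would be a literal p element containing text and xsl:value-of instructions.

```python
from xml.dom.minidom import Document


def issued_to_paragraph(before, first_name, last_name, after):
    """Build one <p> from brand text plus the recipient's name and serialize it."""
    doc = Document()
    p = doc.createElement("p")
    p.appendChild(doc.createTextNode(before))                        # brand text
    p.appendChild(doc.createTextNode(first_name + " " + last_name))  # per-email data
    p.appendChild(doc.createTextNode(after))                         # brand text
    return p.toxml()


html = issued_to_paragraph(
    "This document is issued to ",
    "Ada",
    "Lovelace",
    " and is intended only for them.",
)
print(html)
```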