It’s often convenient to divide long XML documents into multiple files. The classic example is a book, customarily divided into chapters. Each chapter may be further subdivided into sections. Traditionally this has been implemented via external entity references. For example:
    <?xml version="1.0"?>
    <!DOCTYPE book SYSTEM "book.dtd" [
      <!ENTITY chapter1 SYSTEM "malapropisms.xml">
      <!ENTITY chapter2 SYSTEM "mispronunciations.xml">
      <!ENTITY chapter3 SYSTEM "madeupwords.xml">
    ]>
    <book>
      <title>The Wit and Wisdom of George W. Bush</title>
      &chapter1;
      &chapter2;
      &chapter3;
    </book>
However, external entity references have a number of limitations. Among them:
The individual component files cannot be treated in isolation. They often aren’t themselves full, well-formed XML documents. They cannot have document type declarations.
The document must have a DTD, and the parser must read the DTD. Not all parsers do.
If any of the pieces are missing, then the entire document is malformed. There’s no option for error recovery.
Only entire files can be included. You can’t include just one paragraph from a document.
There’s no way to include unparsed text such as an example Java program or XML document in a technical book. Only well-formed XML can be included, and all such XML is parsed. (SGML actually had this ability, but it was one of the features XML removed in the process of simplification.)
XInclude is an emerging specification from the W3C that endeavors to create a mechanism for building large XML documents out of their component parts which does not have these limitations. XInclude can combine multiple documents and parts thereof independently of validation. Each piece can be a complete XML document, a part of an XML document, or a non-XML text document like a Java program or an e-mail message.
XInclude defines an include element in the http://www.w3.org/2001/XInclude namespace. This namespace can be mapped to any prefix, though xi is customary. (In the remainder of this article, I will simply assume the xi prefix has been bound to the correct namespace URI without further comment.) Each xi:include element has an href attribute that contains a URL pointing to the file to include. For example, using XIncludes instead of external entity references, the previous book example can be rewritten like this:
    <?xml version="1.0"?>
    <book xmlns:xi="http://www.w3.org/2001/XInclude">
      <title>The Wit and Wisdom of George W. Bush</title>
      <xi:include href="malapropisms.xml"/>
      <xi:include href="mispronunciations.xml"/>
      <xi:include href="madeupwords.xml"/>
    </book>
Of course you can also use absolute URLs where appropriate:
    <?xml version="1.0"?>
    <book xmlns:xi="http://www.w3.org/2001/XInclude">
      <title>The Wit and Wisdom of George W. Bush</title>
      <xi:include href="http://www.whitehouse.gov/malapropisms.xml"/>
      <xi:include href="http://www.whitehouse.gov/mispronunciations.xml"/>
      <xi:include href="http://www.whitehouse.gov/madeupwords.xml"/>
    </book>
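To give a rough feel for what a processor does with a document like this, here is a minimal sketch using Python's standard xml.etree.ElementInclude module. The in-memory FILES dictionary, the chapter contents, and the load function are assumptions invented for this example so that it runs without touching the disk or the network.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI = "{http://www.w3.org/2001/XInclude}"

# Hypothetical chapter files, kept in memory so the sketch is self-contained.
FILES = {
    "malapropisms.xml": "<chapter><title>Malapropisms</title></chapter>",
    "mispronunciations.xml": "<chapter><title>Mispronunciations</title></chapter>",
    "madeupwords.xml": "<chapter><title>Made-up Words</title></chapter>",
}

BOOK = """<book xmlns:xi="http://www.w3.org/2001/XInclude">
  <title>The Wit and Wisdom of George W. Bush</title>
  <xi:include href="malapropisms.xml"/>
  <xi:include href="mispronunciations.xml"/>
  <xi:include href="madeupwords.xml"/>
</book>"""


def load(href, parse, encoding=None):
    """Loader callback: return a parsed element for XML, a string for text."""
    if parse == "xml":
        return ET.fromstring(FILES[href])
    return FILES[href]


book = ET.fromstring(BOOK)
ElementInclude.include(book, loader=load)  # replaces each xi:include in place

chapter_titles = [chapter.findtext("title") for chapter in book.findall("chapter")]
print(chapter_titles)
```

After the call, the book element holds three chapter children where the xi:include elements used to be. A real loader would fetch each href from the file system or the network instead of a dictionary.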
XInclude processing is recursive. That is, an included document can itself include another document. For example, a book might be divided into front matter, back matter, and several parts:
    <?xml version="1.0"?>
    <book xmlns:xi="http://www.w3.org/2001/XInclude">
      <xi:include href="frontmatter.xml"/>
      <xi:include href="part1.xml"/>
      <xi:include href="part2.xml"/>
      <xi:include href="part3.xml"/>
      <xi:include href="backmatter.xml"/>
    </book>
Each part might be further divided into a part intro and several chapters:
    <?xml version="1.0"?>
    <part xmlns:xi="http://www.w3.org/2001/XInclude">
      <xi:include href="intro1.xml"/>
      <xi:include href="chapter_1.xml"/>
      <xi:include href="chapter_2.xml"/>
      <xi:include href="chapter_3.xml"/>
      <xi:include href="chapter_4.xml"/>
    </part>
There’s no limit to how deep this can go. Only circular inclusion (Document A includes Document B which includes, directly or indirectly, Document A) is forbidden. When an XInclude processor reads an XML document, it resolves all references and returns a document that contains no XInclude elements.
XInclusion is not part of XML 1.0 or the XML Information Set (Infoset). Thus to actually understand such a document, you’ll normally need to pass it through an XInclude processor that replaces the xi:include elements with the documents they point to. This may be done by a server-side process, or it might be done on the client side by an XInclude-aware browser. It may be hooked into a custom SAX program using a SAX filter that resolves the XIncludes. It may even be offered as an option by the parser itself. However, it does not happen by default. If you want it, install the necessary software and then explicitly tell that software to resolve the XInclude elements. The Gnome Project’s libxml and my own XOM both support XInclude. For example, if you’re using the xmllint tool bundled with libxml, you specify the --xinclude flag to resolve the include elements like this:
    $ xmllint --xinclude book.xml
    <?xml version="1.0"?>
    <book xmlns:xi="http://www.w3.org/2001/XInclude">
      <title>The Wit and Wisdom of George W. Bush</title>
      <preface>
      …
Of course, there are APIs you can call from your own code, as well as programs to run from the command line. For instance, this code fragment resolves all the include elements in a XOM Document object and returns a new document that contains all the included content:
    // resolve() returns a merged copy; resolveInPlace() modifies its argument and returns nothing
    Document resolvedDocument = XIncluder.resolve(inputDocument);
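If you're curious what such an API does internally, the recursive resolution and the ban on circular inclusion described earlier can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how XOM or libxml is implemented: the FILES dictionary stands in for real files, hrefs are treated as plain dictionary keys rather than resolved as relative URLs, and the names resolve, _expand, and CircularInclusion are made up for this example.

```python
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"

# Hypothetical documents standing in for files on disk.
FILES = {
    "book.xml": '<book xmlns:xi="http://www.w3.org/2001/XInclude">'
                '<xi:include href="part1.xml"/></book>',
    "part1.xml": '<part xmlns:xi="http://www.w3.org/2001/XInclude">'
                 '<xi:include href="chapter_1.xml"/></part>',
    "chapter_1.xml": "<chapter>Chapter 1</chapter>",
    # loop.xml includes itself, which must be rejected.
    "loop.xml": '<a xmlns:xi="http://www.w3.org/2001/XInclude">'
                '<xi:include href="loop.xml"/></a>',
}


class CircularInclusion(Exception):
    """Raised when a document includes itself, directly or indirectly."""


def resolve(href, ancestors=()):
    """Return the element tree for href with every xi:include replaced."""
    if href in ancestors:
        raise CircularInclusion(" -> ".join(ancestors + (href,)))
    root = ET.fromstring(FILES[href])
    _expand(root, ancestors + (href,))
    return root


def _expand(element, ancestors):
    """Walk the children of element, swapping in resolved documents."""
    for index, child in enumerate(list(element)):
        if child.tag == XI_INCLUDE:
            replacement = resolve(child.get("href"), ancestors)
            replacement.tail = child.tail  # keep text that followed the include
            element[index] = replacement
        else:
            _expand(child, ancestors)


merged = resolve("book.xml")
print(ET.tostring(merged, encoding="unicode"))
```

The ancestors tuple records the chain of documents currently being expanded, so a document that shows up twice in its own chain is reported instead of being expanded forever. Including the same document twice from unrelated places remains legal.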
Technical articles like this one often include example code: Java and C programs, XML and HTML documents, e-mail messages and text files, and so forth. Within these examples characters like < and & should be understood as raw text rather than parsed as markup. You can indicate that you want a particular included document to be treated as text by adding a parse="text" attribute to the xi:include element. For example, this fragment loads the source code for the Java program SpellChecker.java from the examples directory into a code element:
    <code>
      <xi:include parse="text" href="examples/SpellChecker.java" />
    </code>
Processes that are downstream from the XInclusion will see the complete text of the file SpellChecker.java just as they would any other text. For instance, such data would be passed to a SAX characters() method. This is essentially how a parser would treat the content if it had been typed in a CDATA section.
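Here is a small sketch of text inclusion, again using Python's standard xml.etree.ElementInclude module. The one-line JAVA_SOURCE string is a made-up stand-in for the real contents of SpellChecker.java, and the load function is a hypothetical loader that returns it for any href.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical stand-in for the contents of examples/SpellChecker.java.
JAVA_SOURCE = "if (a < b && c > d) { check(word); }"

DOC = ('<code xmlns:xi="http://www.w3.org/2001/XInclude">'
       '<xi:include parse="text" href="examples/SpellChecker.java"/>'
       '</code>')


def load(href, parse, encoding=None):
    """Loader callback; for parse="text" it must return a plain string."""
    if parse != "text":
        raise ValueError("this sketch only serves text inclusions")
    return JAVA_SOURCE


code = ET.fromstring(DOC)
ElementInclude.include(code, loader=load)

# In the tree, the included file is ordinary character data ...
print(code.text)
# ... and < and & are escaped only when the document is serialized again.
serialized = ET.tostring(code, encoding="unicode")
print(serialized)
```

The < and & characters in the Java source never get a chance to be read as markup. They arrive as text, and reparsing the serialized form gives back exactly the same string.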
The XInclude processor will attempt to determine the character encoding of the text file from any available metadata, such as a charset parameter in the included document’s MIME type. If the document is an XML document, then the processor will next try to use the byte order mark, the encoding declaration and the other customary heuristics for determining the character encoding of an XML document. If neither of these is suitable, the character set can be specified explicitly by an encoding attribute using the same names used for the encoding declaration in an XML document. For example, this element includes a file that’s written in Latin-1:
<xi:include parse="text" encoding="ISO-8859-1" href="examples/SpellChecker.java" />
If none of these options are available, then the processor assumes the document is written in UTF-8.
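The order of precedence just described can be written down as a tiny function. This is a sketch of the decision order only, not of any real processor: the function name choose_encoding and its three parameters are hypothetical, and actually detecting the MIME charset or applying the XML heuristics is left out.

```python
def choose_encoding(mime_charset=None, xml_detected=None, encoding_attr=None):
    """Pick the encoding for a parse="text" inclusion in the order described.

    mime_charset:  charset from external metadata such as the MIME type
    xml_detected:  encoding found by the XML heuristics, if the resource is XML
    encoding_attr: value of the encoding attribute on the xi:include element
    """
    for candidate in (mime_charset, xml_detected, encoding_attr):
        if candidate:
            return candidate
    return "UTF-8"  # last resort when nothing else is known


# Bytes as they would be stored in a Latin-1 file.
raw = "na\u00efvet\u00e9".encode("iso-8859-1")
text = raw.decode(choose_encoding(encoding_attr="ISO-8859-1"))
print(text)
```

Note that the encoding attribute is consulted only after the metadata and the XML heuristics have come up empty, so it acts as a hint rather than an override.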
Servers crash. Network connections fail. The DNS system gets congested. For all these reasons and more, documents included from remote servers may be temporarily unavailable. The default action for an XInclude processor in such a case is simply to give up and report a fatal error. However, the xi:include element may contain an xi:fallback element that contains alternate content to be used if the requested resource cannot be found. For example, this xi:include element tries to load the file at http://www.whitehouse.gov/malapropisms.xml. However, if somebody deletes that file, then it provides some literal content instead:
    <xi:include href="http://www.whitehouse.gov/malapropisms.xml">
      <xi:fallback>
        <para>
          Our enemies are innovative and resourceful, and so are we.
          They never stop thinking about new ways to harm our country
          and our people, and neither do we.
        </para>
      </xi:fallback>
    </xi:include>
The xi:fallback element can even include another xi:include element. For example, this xi:include element begins by attempting to include the document at http://www.whitehouse.gov/malapropisms.xml. However, if somebody deletes that file, then it will try http://politics.slate.msn.com/default.aspx?id=76886 instead:
    <xi:include href="http://www.whitehouse.gov/malapropisms.xml">
      <xi:fallback>
        <xi:include href="http://politics.slate.msn.com/default.aspx?id=76886" />
      </xi:fallback>
    </xi:include>
The xi:fallback element is not used if the document can be located but is malformed. That is always a fatal error.
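The fallback rules can be sketched as a small hand-written resolver in Python. Everything here is an assumption for illustration: the RESOURCES dictionary stands in for the network (anything not listed counts as not found), the function name resolve_includes is invented, included documents are not themselves expanded further, and text placed directly inside xi:fallback is ignored.

```python
import xml.etree.ElementTree as ET

XI = "{http://www.w3.org/2001/XInclude}"
NS_DECL = 'xmlns:xi="http://www.w3.org/2001/XInclude"'

# Hypothetical resources; anything not listed here counts as "not found".
RESOURCES = {
    "backup.xml": "<para>Backup copy</para>",
    "broken.xml": "<para>never closed",
}


def resolve_includes(element):
    """Replace xi:include children in place, honoring xi:fallback."""
    index = 0
    while index < len(element):
        child = element[index]
        if child.tag != XI + "include":
            resolve_includes(child)
            index += 1
            continue
        href = child.get("href")
        if href in RESOURCES:
            # A malformed resource raises ET.ParseError: still a fatal error.
            replacement = [ET.fromstring(RESOURCES[href])]
        else:
            fallback = child.find(XI + "fallback")
            if fallback is None:
                raise LookupError("cannot find " + href)
            resolve_includes(fallback)  # the fallback may hold more includes
            replacement = list(fallback)
        element[index:index + 1] = replacement
        index += len(replacement)


doc = ET.fromstring(
    f'<book {NS_DECL}><xi:include href="missing.xml"><xi:fallback>'
    '<xi:include href="backup.xml"/></xi:fallback></xi:include></book>'
)
resolve_includes(doc)
print(ET.tostring(doc, encoding="unicode"))
```

The fallback is consulted only when the resource is missing. When the resource exists but does not parse, as with broken.xml in this sketch, the parse error propagates to the caller and the fallback is never looked at.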
Include elements can contain other content besides the single xi:fallback element. For example, this xi:include element contains an xi:fallback and a para element:
    <xi:include href="http://www.whitehouse.gov/malapropisms.xml">
      <para>
        Well, I think if you say you're going to do something and
        don't do it, that's trustworthiness.
      </para>
      <xi:fallback>
        <xi:include href="http://politics.slate.msn.com/default.aspx?id=76886"/>
      </xi:fallback>
    </xi:include>
However, the processor will ignore all such content. When the xi:include element is replaced, the para element will silently vanish.
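You can watch the extra content disappear with Python's standard xml.etree.ElementInclude module. In this sketch the load function is a hypothetical loader that always succeeds, so the fallback is never needed. (As far as I know this particular module does not implement xi:fallback at all, but that doesn't matter when the inclusion succeeds.)

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

DOC = """<book xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="malapropisms.xml">
    <para>This paragraph is not part of the result.</para>
    <xi:fallback><para>Fallback text</para></xi:fallback>
  </xi:include>
</book>"""


def load(href, parse, encoding=None):
    """Hypothetical loader that always succeeds with a one-element chapter."""
    return ET.fromstring("<chapter>Malapropisms</chapter>")


book = ET.fromstring(DOC)
ElementInclude.include(book, loader=load)

# The whole xi:include element, children and all, has been swapped out.
remaining_tags = [element.tag for element in book.iter()]
print(remaining_tags)
```

Only the book and chapter elements remain. Neither para element survives, because the xi:include element is replaced as a unit.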
The URLs used in XInclude href attributes can have XPointer fragment identifiers. If so, only those parts of the external document selected by the XPointer are included. For example, this xi:include element includes only the malapropism elements from the document bushisms.xml:
<xi:include href="bushisms.xml#xpointer(//malapropism)" />
Since XPointers can point up, down, and sideways in an XML document, and do not necessarily select a contiguous region of a document, they present significant problems for streaming applications and APIs like SAX, XNI, and StAX. Full XInclude with XPointer support really requires a tree-based API such as DOM or XOM, and can be expected to use at least as much memory as the sum of all the documents combined together.
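To make the tree-based approach concrete, here is a rough Python sketch. It is emphatically not an XPointer implementation: it recognizes only hrefs of the form file#xpointer(//name) and approximates them with ElementTree's limited path support. The BUSHISMS string and the functions select and include_selected are assumptions invented for this example.

```python
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"

# Hypothetical contents of bushisms.xml.
BUSHISMS = """<bushisms>
  <malapropism>first</malapropism>
  <mispronunciation>skipped</mispronunciation>
  <section><malapropism>second</malapropism></section>
</bushisms>"""

DOC = ('<book xmlns:xi="http://www.w3.org/2001/XInclude">'
       '<xi:include href="bushisms.xml#xpointer(//malapropism)"/>'
       '</book>')


def select(href):
    """Return the elements picked out by an href like file#xpointer(//name)."""
    _, _, fragment = href.partition("#")
    if not (fragment.startswith("xpointer(//") and fragment.endswith(")")):
        raise ValueError("this sketch only handles xpointer(//name)")
    name = fragment[len("xpointer(//"):-1]
    source = ET.fromstring(BUSHISMS)  # the whole source tree is held in memory
    return source.findall(".//" + name)


def include_selected(parent):
    """Replace each xi:include child with every element its XPointer selects."""
    index = 0
    while index < len(parent):
        child = parent[index]
        if child.tag == XI_INCLUDE:
            selected = select(child.get("href"))
            parent[index:index + 1] = selected
            index += len(selected)
        else:
            include_selected(child)
            index += 1


book = ET.fromstring(DOC)
include_selected(book)
included = [element.text for element in book]
print(included)
```

The two selected elements come from different depths of the source document, which is why the whole source tree has to be parsed before anything can be included.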
Validation and other processes
One of the most common questions about XInclude is how inclusion interacts with validation, XSL transformation, and other processes that may be applied to an XML document. The short answer is that it doesn’t. XInclusion is not part of any other XML process. It is a separate step which you may or may not perform when and where it is useful to you.
For example, consider validation against a schema. A document can be validated before or after inclusion, or both, or neither. If you validate the document before the xi:include elements are replaced, then the schema has to declare the xi:include elements just like it would declare any other element. If you validate the document after the xi:include elements are replaced, then the schema has to declare the replacement elements. You can even write a single schema that covers both cases, by using a choice to permit an element to contain either an xi:include element or its replacement elements.
For another example, consider XSL transformation. XSLT was defined several years before XInclude. The XSLT algorithm operates on well-formed XML documents. An XSLT processor acts on xi:include elements exactly like it acts on any other element; that is, it finds a template rule that matches elements with the local name include in the http://www.w3.org/2001/XInclude namespace and instantiates that rule’s template. It does not automatically replace the xi:include elements. Of course, if you want the xi:include elements to be replaced before the stylesheet is applied, you can first use an XInclude processor to resolve the includes and generate a new XML document, then pass the new document to the XSLT processor along with the stylesheet. You can even resolve the includes, pass the merged document to the XSLT processor for transformation, and then resolve includes again on the output of the transformation in case the stylesheet inserted any new xi:include elements. Inclusion and transformation are separate and orthogonal processes that can be performed in whichever order is convenient in the local environment. There is no canonical processing model for XML.
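The same independence is easy to see in code. In this Python sketch, child_names is a made-up stand-in for any downstream process such as a transformation or validator, and load is a hypothetical loader that returns a fixed chapter. The downstream process simply sees whatever tree it is handed.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"
DOC = ('<book xmlns:xi="http://www.w3.org/2001/XInclude">'
       '<xi:include href="chapter_1.xml"/></book>')


def load(href, parse, encoding=None):
    """Hypothetical loader returning a fixed chapter for any href."""
    return ET.fromstring("<chapter>Chapter 1</chapter>")


def child_names(root):
    """Stand-in for a downstream process: report the element names it sees."""
    return [child.tag for child in root]


# No inclusion step has run: the include is just another element.
before = child_names(ET.fromstring(DOC))

# The explicit extra step, then the same downstream process.
resolved = ET.fromstring(DOC)
ElementInclude.include(resolved, loader=load)
after = child_names(resolved)

print(before, after)
```

Run before inclusion, the process sees an ordinary element named include in the XInclude namespace. Run after, it sees the chapter. Neither order is wrong; you pick the one your pipeline needs.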
You cannot simply place include elements in a document and expect them to be resolved automatically. There’s always an extra step where you tell some piece of software somewhere to resolve the XIncludes. Depending on the environment, this may be a command line flag, an option in a config file, or a separate program you run manually. However, assuming you can do that, XInclude is a very useful technique for authoring large documents in multiple, smaller, more manageable pieces.
To Learn More
The canonical definition of XInclude is of course the XInclude specification itself. The current version is a proposed recommendation, but I don’t expect the final version to be hugely different. XInclude is covered in a little more depth in Chapter 12 of XML in a Nutshell (3rd Edition) and Chapter 19 of the XML 1.1 Bible.