John Cowan’s TagSoup (http://home.ccil.org/~cowan/XML/tagsoup/) is an open source HTML parser written in Java that implements the Simple API for XML, or SAX. Cowan describes TagSoup as “a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.”
TagSoup is not intended as an end-user tool, but it does have a basic command-line interface. It’s also straightforward to hook it up to any number of XML tools that accept input from SAX. Once you’ve done that, feed in HTML, and out will come well-formed XHTML. For example:
$ java -jar tagsoup.jar index.html
<?xml version="1.0" standalone="yes"?>
<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml"><head><title>Java Virtual Machines</title><meta name="description" content="A Growing list of Java virtual machines and their capabilities">
</meta></head><body bgcolor="#ffffff" text="#000000">
<h1 align="center">Java Virtual Machines</h1>
…
You can improve its output a little bit by adding the --omit-xml-declaration and --nodefaults command-line options:
$ java -jar tagsoup.jar --omit-xml-declaration --nodefaults index.html
<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml"><head><title>Java Virtual Machines</title><meta name="description" content="A Growing list of Java virtual machines and their capabilities"></meta>
</head><body bgcolor="#ffffff" text="#000000">
<h1 align="center">Java Virtual Machines</h1>
…
This removes the XML declaration and the defaulted attribute values, pieces that are likely to confuse one browser or another.
You can use the --encoding option to specify the character encoding of the input document. For example, if you know the document is written in Latin-1 (ISO-8859-1), you could run it like so:
$ java -jar tagsoup.jar --encoding=ISO-8859-1 index.html
TagSoup’s output is always UTF-8.
Finally, you can use the --files option to write new copies of the input files with the extension .xhtml. Otherwise, TagSoup prints the output on stdout, from where you can redirect it to any convenient location. Unlike Tidy, TagSoup cannot change a file in place.
However, TagSoup is primarily designed for use as a library. Its output from command-line mode leaves something to be desired compared to Tidy. In particular:
- It does not convert presentational markup to CSS.
- It does not include a DOCTYPE declaration, which is needed before some browsers will recognize XHTML.
- It does include an XML declaration, which needlessly confuses older browsers.
- It uses start-tag and end-tag pairs for empty elements such as br and hr, which may confuse some older browsers.
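When TagSoup is used as a library rather than from the command line, several of these serialization choices are yours to make. The following is a minimal sketch, not TagSoup's own serializer: it feeds any SAX XMLReader into the JDK's identity Transformer, omits the XML declaration, and adds a DOCTYPE. The class name XhtmlSerializer and the choice of the XHTML 1.0 Transitional DOCTYPE are assumptions made for illustration. So that the sketch runs with the JDK alone, main uses the JDK's built-in parser on well-formed input; with tagsoup.jar on the classpath you would pass new org.ccil.cowan.tagsoup.Parser() instead. It does nothing about presentational markup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class; the name is not part of TagSoup.
final class XhtmlSerializer {

    /** Pulls SAX events from the reader and serializes them with our own output settings. */
    static String serialize(XMLReader reader, InputSource input) throws Exception {
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.METHOD, "xml");
        // Leave out the XML declaration that confuses older browsers.
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        // Assumed DOCTYPE: XHTML 1.0 Transitional. Pick whichever one fits your pages.
        identity.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC,
                "-//W3C//DTD XHTML 1.0 Transitional//EN");
        identity.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM,
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd");

        StringWriter out = new StringWriter();
        identity.transform(new SAXSource(reader, input), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader so the sketch runs without tagsoup.jar.
        // With TagSoup available: XMLReader reader = new org.ccil.cowan.tagsoup.Parser();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        String page = "<html xmlns='http://www.w3.org/1999/xhtml'>"
                + "<head><title>Java Virtual Machines</title></head>"
                + "<body><p>One<br/>Two</p></body></html>";
        System.out.println(serialize(reader, new InputSource(new StringReader(page))));
    }
}
```

Because the serializer is separate from the parser, the same method works with any XMLReader. How empty elements such as br are written depends on the Transformer implementation, so check the result in the browsers you care about.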
TagSoup does not guarantee absolutely valid XHTML (though it does guarantee well-formedness). There are a few things it cannot handle. Most important, XHTML requires all img elements to have an alt attribute. If the alt attribute is empty, the image is purely presentational and should be ignored by screen readers. If the attribute is not empty, it is used in place of the image by screen readers. TagSoup has no way of knowing whether any given img with an omitted alt attribute is presentational or not, so it does not insert any such attributes. Similarly, TagSoup does not add summaries to tables. You’ll have to do that by hand, and you’ll want to validate after using TagSoup to make sure you catch all these instances.
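One way to find the elements that still need attention is a small SAX ContentHandler that reports every img without an alt attribute and every table without a summary. This is a sketch under stated assumptions rather than a replacement for a validator: the class name MissingAttributeAudit and the message format are invented for illustration, and main uses the JDK's built-in parser on well-formed input so that it runs without tagsoup.jar. With TagSoup on the classpath you would pass new org.ccil.cowan.tagsoup.Parser() as the reader instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical audit handler; the name is not part of TagSoup or SAX.
final class MissingAttributeAudit extends DefaultHandler {
    private final List<String> problems = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the parser is not namespace aware.
        String name = localName.isEmpty() ? qName : localName;
        // getValue returns null only when the attribute is absent, so an
        // intentionally empty alt="" (a purely presentational image) is not reported.
        if (name.equalsIgnoreCase("img") && atts.getValue("alt") == null) {
            String src = atts.getValue("src");
            problems.add("img without alt: " + (src == null ? "(no src)" : src));
        } else if (name.equalsIgnoreCase("table") && atts.getValue("summary") == null) {
            problems.add("table without summary");
        }
    }

    List<String> problems() {
        return problems;
    }

    /** Parses the input with the given reader and returns the problems in document order. */
    static List<String> audit(XMLReader reader, InputSource input) throws Exception {
        MissingAttributeAudit handler = new MissingAttributeAudit();
        reader.setContentHandler(handler);
        reader.parse(input);
        return handler.problems();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader; swap in new org.ccil.cowan.tagsoup.Parser() for real-world HTML.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        String page = "<html><body><img src='logo.png'/><img src='spacer.gif' alt=''/>"
                + "<table><tr><td>x</td></tr></table></body></html>";
        for (String problem : audit(reader, new InputSource(new StringReader(page)))) {
            System.out.println(problem);
        }
    }
}
```

The handler only locates candidates. Deciding whether an image is presentational, or what a table summary should say, still has to be done by hand.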
Despite these limits, TagSoup does a huge amount of work for you at very little cost.