TagSoup

Here’s part 13 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

John Cowan’s TagSoup (http://home.ccil.org/~cowan/XML/tagsoup/) is an open source HTML parser written in Java that implements the Simple API for XML, or SAX. Cowan describes TagSoup as “a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.”

TagSoup is not intended as an end-user tool, but it does have a basic command-line interface. It’s also straightforward to hook it up to any number of XML tools that accept input from SAX. Once you’ve done that, feed in HTML, and out will come well-formed XHTML. For example:

$ java -jar tagsoup.jar index.html
<?xml version="1.0" standalone="yes"?>
<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml"><head><title>Java Virtual Machines</title><meta name="description" content="A Growing
list of Java virtual machines and their capabilities">
</meta></head><body bgcolor="#ffffff" text="#000000">

<h1 align="center">Java Virtual Machines</h1>
…

You can improve its output a little bit by adding the --omit-xml-declaration and --nodefaults command-line options:

$ java -jar tagsoup.jar --omit-xml-declaration --nodefaults index.html
<html lang="en-US" xmlns="http://www.w3.org/1999/xhtml"><head><title>Java Virtual Machines</title><meta name="description" content="A Growing
list of Java virtual machines and their capabilities"></meta>
</head><body bgcolor="#ffffff" text="#000000">

<h1 align="center">Java Virtual Machines</h1>
…

This removes the XML declaration and any defaulted attribute values, pieces that are likely to confuse one browser or another.

You can use the --encoding option to specify the character encoding of the input document. For example, if you know the document is written in Latin-1 (ISO 8859-1), you could run it like so:

$ java -jar tagsoup.jar --encoding=ISO-8859-1 index.html

TagSoup’s output is always UTF-8.

Finally, you can use the --files option to write new copies of the input files with the extension .xhtml. Otherwise, TagSoup prints the output on stdout, from where you can redirect it to any convenient location. Unlike Tidy, TagSoup cannot change a file in place.

However, TagSoup is primarily designed for use as a library. Its output from command-line mode leaves something to be desired compared to Tidy. In particular:

  • It does not convert presentational markup to CSS.
  • It does not include a DOCTYPE declaration, which is needed before some browsers will recognize XHTML.
  • It includes an XML declaration by default, which needlessly confuses older browsers.
  • It uses start-tag and end-tag pairs for empty elements such as br and hr, which may confuse some older browsers.
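Several of these points (though not the conversion of presentational markup to CSS) can be handled at serialization time if you write the output yourself instead of relying on the command-line mode. The following is a minimal sketch using only the JDK's standard javax.xml.transform API, not anything specific to TagSoup. The class name XhtmlSerializer and the choice of the XHTML 1.0 Transitional DOCTYPE are assumptions made for illustration, and a string of already well-formed markup stands in for TagSoup's output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: re-serializes well-formed XHTML with a DOCTYPE
// and without an XML declaration, using an identity transform.
class XhtmlSerializer {

    static String serialize(String wellFormedXhtml) {
        try {
            // A Transformer created with no stylesheet copies input to output.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, "xml");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            // Assumption: Transitional is used because the sample page
            // contains presentational attributes such as bgcolor and align.
            identity.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC,
                    "-//W3C//DTD XHTML 1.0 Transitional//EN");
            identity.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM,
                    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd");

            StringWriter out = new StringWriter();
            identity.transform(
                    new StreamSource(new StringReader(wellFormedXhtml)),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("input could not be serialized", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Java Virtual Machines</title></head>"
                + "<body><p>one<br></br>two</p></body></html>";
        System.out.println(serialize(sample));
    }
}
```

In a real pipeline the StreamSource would presumably be replaced by a SAXSource wrapping TagSoup's parser (org.ccil.cowan.tagsoup.Parser, which implements XMLReader), so that messy HTML goes in directly. One caveat: the XML output method collapses every empty element to a self-closing tag, not just br and hr, so elements such as an empty script or textarea may need separate handling before the result is served as text/html.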

TagSoup does not guarantee absolutely valid XHTML (though it does guarantee well-formedness). There are a few things it cannot handle. Most important, XHTML requires all img elements to have an alt attribute. If the alt attribute is empty, the image is purely presentational and should be ignored by screen readers. If the attribute is not empty, it is used in place of the image by screen readers. TagSoup has no way of knowing whether any given img with an omitted alt attribute is presentational or not, so it does not insert any such attributes. Similarly, TagSoup does not add summaries to tables. You’ll have to do that by hand, and you’ll want to validate after using TagSoup to make sure you catch all these instances.
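Because TagSoup delivers its results as SAX events, the missing alt and summary attributes can at least be located automatically, even though their values still have to be written by hand. Here is a minimal sketch of such a check as a SAX handler. The class name AccessibilityAudit and its audit method are hypothetical, and the JDK's built-in XML parser stands in for TagSoup so that the example runs without extra libraries.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX handler that records img elements with no alt
// attribute and table elements with no summary attribute.
class AccessibilityAudit extends DefaultHandler {

    private final List<String> problems = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // An empty alt="" is deliberate (a purely presentational image),
        // so only a completely absent attribute is reported.
        if ("img".equals(localName) && atts.getValue("", "alt") == null) {
            problems.add("img without alt: src=" + atts.getValue("", "src"));
        } else if ("table".equals(localName)
                && atts.getValue("", "summary") == null) {
            problems.add("table without summary");
        }
    }

    static List<String> audit(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed for localName to be set
            // Assumption: with tagsoup.jar on the classpath, an instance of
            // org.ccil.cowan.tagsoup.Parser could be used as the XMLReader here.
            XMLReader reader = factory.newSAXParser().getXMLReader();
            AccessibilityAudit handler = new AccessibilityAudit();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
            return handler.problems;
        } catch (Exception e) {
            throw new IllegalArgumentException("input could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<img src=\"logo.png\"></img>"
                + "<img src=\"spacer.gif\" alt=\"\"></img>"
                + "<table><tr><td>JVM</td></tr></table>"
                + "</body></html>";
        for (String problem : audit(sample)) {
            System.out.println(problem);
        }
    }
}
```

A report like this does not replace validation; it only narrows down which elements need a human decision about whether an image is presentational or what a table summarizes.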

Despite these limits, TagSoup does a huge amount of work for you at very little cost.

5 Responses to “TagSoup”

  1. John Cowan Says:

    I’d also point out the existence of TSaxon (TagSoup + XSLT) and the fact that, unlike Tidy, TagSoup can’t loop no matter how bad the input is — you always get output. This makes it extremely robust in a pipeline that needs to process billyuns and billyuns of documents.

  2. Ewa Says:

    You can also use TagSoup in XMLSpy. Here is a blog entry which describes how to integrate TagSoup as external tool in the XML editor: http://www.spycomponents.com/sphpblog/index.php?m=07&y=08&entry=entry080702-211603

    Regards

  3. SpudTater Says:

    > It uses start-tag and end-tag pairs for empty elements such as br and hr, which may confuse some older browsers

    Not only that, but this is invalid according to the W3C validator. Does anybody know if this behaviour can be changed when using TagSoup as a library, and if so, how?

  4. naveen Says:

    hi, am unable to access the file tagsoup-1.2.jar

    the command that i gave is
    c:\>java -jar tagsoup-1.2.jar --html test.html

    i also tried with
    c:\>java -jar tagsoup-1.2 --html test.html

    the jar file is inside java /lib/tagsoup-1.2.jar

    how do i start working with tagsoup??

  5. Timothy (TRiG) Says:

    How does it compare with HTML Purifier?

    TRiG.