Tidy

Here’s part 13 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

Regular expressions are well and good for individual, custom changes, but they can be tedious and difficult to use for large quantities of changes. In particular, they are designed more to work with plain text than with semistructured HTML text. For batch changes and automated corrections of common mistakes, it helps to have tools that take advantage of the markup in HTML. The first such tool is Dave Raggett’s Tidy (www.w3.org/People/Raggett/tidy/), the original HTML fixer-upper. It’s a simple, multiplatform command-line program that can correct most HTML mistakes.

-asxhtml

For purposes of this book, you want to use Tidy with the -asxhtml command-line option. For example, this command converts the file index.html to well-formed XHTML and, because of the -m option, stores the result back into the same file:

$ tidy -asxhtml -m index.html

Frankly, you could do worse than just running Tidy across all your HTML files and calling it a day; but please don’t stop reading just yet. Tidy has a few more options that can improve your code further, and there are some problems it can’t handle or handles incorrectly. For example, when I used this command on one of my older pages that I hadn’t looked at in at least five years, Tidy generated the following error messages:

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 7 column 1 - Warning: <body> attribute "bgcolor" has invalid value "#fffffff"
line 16 column 2 - Warning: <table> lacks "summary" attribute
line 230 column 1 - Warning: <table> lacks "summary" attribute
line 14 column 91 - Warning: trimming empty <p>
Info: Document content looks like XHTML 1.0 Transitional
5 warnings, 0 errors were found!

These are problems that Tidy mostly didn’t know how to fix. It actually was able to supply a DOCTYPE because I specified XHTML mode, which has a known DOCTYPE. However, it doesn’t know what to do with bgcolor="#fffffff". The problem here is an extra f which should be removed, or perhaps the entire bgcolor attribute should be removed and replaced with CSS.

Tip: Once you’ve identified a problem such as this, it’s entirely possible that the same problem crops up in multiple documents. Having noticed this in one file, it’s worth doing a search and replace across the entire directory tree to catch any other occurrences. Given the prevalence of copy and paste coding, few mistakes occur only once.

The next two warnings concern tables that lack a summary attribute. This is an accessibility problem, and you should correct it. Tidy actually prints some further details about this:

The table summary attribute should be used to describe the table structure. It is very helpful for people using non-visual browsers. The scope and headers attributes for table cells are useful for specifying which headers apply to each table cell, enabling non-visual browsers to provide a meaningful context for each cell. For further advice on how to make your pages accessible see http://www.w3.org/WAI/GL. You may also want to try "http://www.cast.org/bobby/" which is a free Web-based service for checking URLs for accessibility.

Table summaries are a good thing, and you should certainly add one. However, Tidy can’t summarize your table itself. You’ll have to do that.

The final message warns you that Tidy found an empty paragraph element and threw it away. This common message is probably misleading and definitely worth a second look. What it meant in this case (and in almost every other one you’ll see) is that the <P> tag was being used as an end-tag rather than a start-tag. That is, the markup looked like this:

Blah blah blah<P>
Blah blah blah<P>
Blah blah blah<P>

Tidy reads it as this:

Blah blah blah
<P>Blah blah blah</P>
<P>Blah blah blah</P>
<P></P>

Thus, it throws away that last empty paragraph. However, what you almost certainly wanted was this:

<P>Blah blah blah</P>
<P>Blah blah blah</P>
<P>Blah blah blah</P>

There’s no easy search and replace for this flaw, though XHTML strict validation will at least alert you to the files in which the problem lies. You can use XSLT (discussed shortly) to fix some of these problems; but if there aren’t too many of them, it’s safer and not hugely onerous to manually edit these files.

If you specify the --enclose-text yes option, Tidy will wrap any such unparented text in a p element. For example:

$ tidy -asxhtml --enclose-text yes example.html
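The tip about searching the whole directory tree for a mistake found in one file can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes that a bgcolor value starting with # is valid only when exactly six hex digits follow, which is the form HTML 4 documents.

```python
import re
from pathlib import Path

# Matches bgcolor="..." or bgcolor='...' in any tag, case-insensitively.
BGCOLOR = re.compile(r"""\bbgcolor\s*=\s*["']([^"']*)["']""", re.IGNORECASE)
# Assumption: a '#' color is valid only with exactly six hex digits.
VALID_HEX = re.compile(r"#[0-9a-fA-F]{6}\Z")


def bad_bgcolors(text):
    """Return the '#...' bgcolor values in text that are not six hex digits."""
    return [value for value in BGCOLOR.findall(text)
            if value.startswith("#") and not VALID_HEX.match(value)]


def scan_tree(root):
    """Map each .html file under root to its malformed bgcolor values."""
    report = {}
    for path in sorted(Path(root).rglob("*.html")):
        found = bad_bgcolors(path.read_text(encoding="utf-8", errors="replace"))
        if found:
            report[str(path)] = found
    return report
```

Calling scan_tree(".") from the site root lists every page that repeats the mistake. Color names such as white are deliberately skipped, so this catches only malformed hex values.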

Tidy may alert you to a few other serious problems that you’ll still have to fix manually. These include:

• A missing end quote for an attribute value; for example, <p id="c1>
• A missing > to close a tag; for example, <p or </p
• Misspelled element and attribute names; for example, <tabel> instead of <table>
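When many files are involved, it can help to tally Tidy's messages before deciding what to fix by hand. The following Python sketch assumes the report lines have the "line N column M - Warning: ..." shape shown in the sample output earlier. The function names are hypothetical, and other Tidy versions may format their reports differently.

```python
import re
from collections import Counter

# Shape of the report lines quoted earlier, for example:
#   line 16 column 2 - Warning: <table> lacks "summary" attribute
DIAGNOSTIC = re.compile(r"line (\d+) column (\d+) - (Warning|Error): (.*)")


def parse_report(report):
    """Turn Tidy's report text into (line, column, level, message) tuples."""
    records = []
    for raw in report.splitlines():
        match = DIAGNOSTIC.match(raw.strip())
        if match:
            line, column, level, message = match.groups()
            records.append((int(line), int(column), level, message))
    return records


def tally(records):
    """Count how often each distinct message occurs."""
    return Counter(message for _, _, _, message in records)
```

Summary lines such as "Info: ..." and the final warning count do not match the pattern and are simply ignored.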

-clean

The next important option is -clean. This replaces deprecated presentational elements such as i and font with CSS markup. For example, when I used -clean on the same document, Tidy added these CSS rules to the head of my document:

<style type="text/css">
/*<![CDATA[*/
body {
background-color: #fffffff;
color: #000000;
}
p.c2 {text-align: center}
h1.c1 {text-align: center}
/*]]>*/
</style>


It also added the necessary class attributes to the elements that had previously used a center element. For example:

<h1 class="c1">Java Virtual Machines</h1>

That’s as far as Tidy goes with CSS, but you might want to consider revisiting all these files to see if you can extract a common external stylesheet that can be shared among the documents on your site, rather than including the rules in each separate page. Furthermore, you should probably consider more semantic class names rather than Tidy’s somewhat prosaic default.
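As a starting point for finding candidates for a shared stylesheet, the embedded rules can be pulled out of each page and counted. The Python sketch below is a rough illustration with hypothetical function names. It relies on simple regular expressions rather than a real CSS parser, so it assumes flat rules like the ones Tidy generated above.

```python
import re
from collections import Counter

STYLE_BLOCK = re.compile(r"<style\b[^>]*>(.*?)</style>", re.IGNORECASE | re.DOTALL)
CSS_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
CSS_RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")


def embedded_rules(html):
    """Return (selector, declarations) pairs from a page's <style> blocks."""
    rules = []
    for css in STYLE_BLOCK.findall(html):
        css = CSS_COMMENT.sub("", css)  # also drops the /*<![CDATA[*/ wrappers
        for selector, body in CSS_RULE.findall(css):
            declarations = "; ".join(
                part.strip() for part in body.split(";") if part.strip())
            rules.append((selector.strip(), declarations))
    return rules


def shared_rules(pages):
    """Count how many pages contain each rule; pages maps a name to HTML text."""
    counts = Counter()
    for html in pages.values():
        counts.update(set(embedded_rules(html)))
    return counts
```

Because Tidy numbers its classes separately in each document, c1 on one page may not mean the same thing as c1 on another. Treat the counts as hints to review, not as a list of rules that can be merged blindly.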

For example, I’ve sometimes used the i element to indicate that I’m talking about a word rather than using the word:

<p><i>May</i> can be very ambiguous in English, meaning might,
can, or allowed, depending on context.</p>

Here the italics are not used for emphasis, and replacing them with em is not appropriate. Instead, they should be indicated with a span and a class, like so:

<p><span class='wordasword'>May</span> can be very ambiguous
in English, meaning might, can, or allowed, depending on
context.</p>


Then, add a CSS rule to the stylesheet to format this as italic:

span.wordasword { font-style: italic; } 

Tidy, however, is not smart enough to figure this out, and will need your help. Still, Tidy is a good first step.

Encodings

Tidy is surprisingly bad at detecting the character encoding of HTML documents, despite the relatively rich metadata in most HTML documents for specifying exactly this. If your content is anything other than ASCII or ISO-8859-1 (Latin-1), you’d best tell Tidy that with the --input-encoding option. For example, if you’ve saved your documents in UTF-8, invoke Tidy thusly:

$ tidy -asxhtml --input-encoding utf8 index.html

Tidy generates ASCII text as output unless you tell it otherwise. It will escape non-ASCII characters using named entities when available, and numeric character references when not. However, Tidy supports several other common encodings. The only other one I recommend is UTF-8. To get it use the --output-encoding option:

$ tidy -asxhtml --output-encoding utf8 index.html

The input encoding does not have to be the same as the output encoding. However, if both are UTF-8, you can just specify -utf8 instead:

$ tidy -asxhtml -utf8 index.html
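Since the encoding has to be supplied by hand, a script can read the charset a page declares and build the matching command line. The Python sketch below uses hypothetical helper names. The mapping covers only a few charsets, using Tidy's own encoding names, and the sniffing looks only for a charset= declaration near the top of the file, so an XML declaration would be missed.

```python
import re

# Tidy's own names for a few common charsets (Tidy spells them without hyphens).
TIDY_ENCODINGS = {
    "utf-8": "utf8",
    "iso-8859-1": "latin1",
    "us-ascii": "ascii",
    "windows-1252": "win1252",
}
META_CHARSET = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE)


def declared_encoding(raw_html):
    """Return Tidy's name for the charset a page declares, or None if unknown."""
    match = META_CHARSET.search(raw_html[:2048])
    if not match:
        return None
    return TIDY_ENCODINGS.get(match.group(1).decode("ascii").lower())


def tidy_command(path, input_encoding, output_encoding="utf8"):
    """Build the argument list for one Tidy run with explicit encodings."""
    if input_encoding == output_encoding == "utf8":
        return ["tidy", "-asxhtml", "-utf8", path]
    return ["tidy", "-asxhtml",
            "--input-encoding", input_encoding,
            "--output-encoding", output_encoding, path]
```

The returned list can be handed to subprocess.run. Building it separately makes it easy to print and check the command before anything is executed.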

Generated Code

Tidy has limited support for working on PHP, JSP, and ASP pages. Basically, it will ignore the content inside the PHP, ASP, or JSP sections and try to work on the rest of the HTML markup. However, that is very tricky to do. In particular, most templating languages do not respect element boundaries. If code is generating half of an element or a start-tag for an element that is later closed by a literal end-tag, it is very easy for Tidy to get confused. I do not recommend using Tidy directly on these sorts of pages.

Instead, download some fully rendered pages from your web site after processing by the template engine. Run Tidy on a representative sample of these pages, and then compare the results to the original pages. By looking at the differences, you can usually figure out what needs to be changed in your templates; then make the changes manually.

Although this does require more manual work and human intelligence, if each template is generating multiple static pages, this process can finish sooner than semiautomated processing of large numbers of static HTML pages.
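The comparison step can be made less tedious with a unified diff of each rendered page against its tidied version. This Python sketch uses only the standard difflib module. The function name is hypothetical, and it assumes you have already saved both versions of the page as text.

```python
import difflib


def tidy_diff(original, tidied, name="page.html"):
    """Return a unified diff between a rendered page and Tidy's version of it."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        tidied.splitlines(keepends=True),
        fromfile=name + " (rendered)",
        tofile=name + " (tidied)",
    )
    return "".join(diff)
```

An empty result means Tidy changed nothing. Otherwise the lines starting with - and + show what the template needs to generate differently.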

Use As a Library

TidyLib is a C library version of Tidy that you can integrate into your own programs. This might be useful for writing scripts that tidy an entire site, for example. Personally, though, I do not find C to be the most conducive language for simple scripting. I usually prefer to write a shell or Perl script that invokes the Tidy command line directly.
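The same approach of invoking the command line from a script works in other scripting languages too. Here is a minimal Python sketch, assuming a tidy executable is on the PATH. The function names are hypothetical, and because -m overwrites files you should run it only on a copy or under version control.

```python
import subprocess
from pathlib import Path


def html_files(root):
    """List the .html and .htm files under root, in sorted order."""
    return sorted(p for p in Path(root).rglob("*")
                  if p.is_file() and p.suffix.lower() in {".html", ".htm"})


def tidy_in_place(path, tidy="tidy"):
    """Run `tidy -asxhtml -m` on one file; return (exit status, report text)."""
    result = subprocess.run([tidy, "-asxhtml", "-m", str(path)],
                            capture_output=True, text=True)
    return result.returncode, result.stderr


def tidy_site(root):
    """Tidy every page under root; return the files that produced errors."""
    failures = []
    for path in html_files(root):
        status, report = tidy_in_place(path)
        if status >= 2:  # Tidy exits 0 when clean, 1 on warnings, 2 on errors
            failures.append((path, report))
    return failures
```

Files that come back from tidy_site are the ones with errors Tidy could not repair, and they need manual attention.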

One Response to “Tidy”

1. Santosh Patnaik Says:

htmLawed is a small, standalone HTML Tidy alternative that PHP web application developers might be interested in.