Chapter 3: Well-formedness

Monday, July 14th, 2008

Here’s part 15 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

The very first step in moving markup into modern form is to make it well-formed. Well-formedness is the basis of the huge and incredibly powerful XML tool chain, and it guarantees a single, unique tree structure for the document that the DOM can operate on, which makes it the basis of reliable, cross-browser JavaScript. Before you do anything else, make your pages well-formed.

Validity, although important, is not nearly as crucial as well-formedness. There are often good reasons to compromise on validity. In fact, I often deliberately publish invalid pages. If I need an element the DTD doesn’t allow, I put it in. It won’t hurt anything because browsers ignore elements they don’t understand. If I have a blockquote that contains raw text but no elements, no great harm is done. If I use an HTML 5 element such as m that Opera recognizes and other browsers don’t, those other browsers will just ignore it. However, if the page is malformed, the consequences are much more severe.
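
For instance, this snippet is invalid against the HTML 4.01 Strict DTD, which expects blockquote to contain block-level elements rather than raw text, yet every browser renders it sensibly:

    <blockquote>
      Raw text directly inside a blockquote: invalid against
      the strict DTD, but harmless in practice.
    </blockquote>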

First, I won’t be able to use any XML tools, such as XSLT or SAX, to process the page. Indeed, almost the only thing I can do with it is view it in a browser. It is very hard to do any reliable automated processing or testing with a malformed page.

Second, browser display becomes much more unpredictable. Different browsers fill in the missing pieces and correct the mistakes of malformed pages in different ways. Writing cross-platform JavaScript or CSS is hard enough without worrying about what tree each browser will construct from ambiguous HTML. Making the page well-formed makes it a lot more likely that I can make it behave as I like across a wide range of browsers.
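
To see the problem, consider a classic case of mis-nested tags:

    <p><b>bold <i>bold italic</b> italic?</i></p>

There is no single tree that corresponds to this markup, so every browser has to guess: one may close the i element when the b element closes; another may carry it past the end of the paragraph. The well-formed version nests the elements unambiguously, and every parser builds the same tree from it:

    <p><b>bold <i>bold italic</i></b><i> italic?</i></p>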

XSLT

Thursday, July 3rd, 2008

Here’s part 14 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

XSLT (Extensible Stylesheet Language Transformations) is one of many XML tools that work well on HTML documents once they have first been converted into well-formed XHTML. In fact, it is one of my favorite such tools, and the first thing I turn to for many tasks. For instance, I use it to automatically generate a lot of content, such as RSS and Atom feeds, by screen-scraping my HTML pages. Indeed, the possibility of using XSLT on my documents is one of my main reasons for refactoring documents into well-formed XHTML. XSLT can query documents for things you need to fix and automate some of the fixes.
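
For instance, a minimal XSLT 1.0 stylesheet along these lines queries a well-formed XHTML document for one common problem, img elements that are missing their alt attributes, and reports each offender's src:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:html="http://www.w3.org/1999/xhtml">
      <xsl:output method="text"/>
      <!-- Report each img element that has no alt attribute -->
      <xsl:template match="html:img[not(@alt)]">
        <xsl:text>Missing alt: </xsl:text>
        <xsl:value-of select="@src"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:template>
      <!-- Suppress ordinary character data so only the report appears -->
      <xsl:template match="text()"/>
    </xsl:stylesheet>

Run against a page with an XSLT processor such as xsltproc or Xalan, this prints one line per offending image; the same pattern extends to anything XPath can express.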

TagSoup

Friday, June 27th, 2008

Here’s part 13 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

John Cowan’s TagSoup (http://home.ccil.org/~cowan/XML/tagsoup/) is an open source HTML parser written in Java that implements the Simple API for XML, or SAX. Cowan describes TagSoup as “a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.”
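
Concretely, TagSoup slots in anywhere an org.xml.sax.XMLReader is expected. Here is a small sketch, assuming tagsoup.jar is on the classpath, that prints every link in a messy real-world page:

    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.DefaultHandler;

    public class LinkLister {
        public static void main(String[] args) throws Exception {
            XMLReader reader = new Parser();  // TagSoup's SAX parser
            reader.setContentHandler(new DefaultHandler() {
                public void startElement(String uri, String localName,
                        String qName, Attributes atts) {
                    // TagSoup normalizes element names to lowercase
                    if ("a".equals(localName) && atts.getValue("href") != null) {
                        System.out.println(atts.getValue("href"));
                    }
                }
            });
            reader.parse(new InputSource(args[0]));  // a URL or system ID
        }
    }

Because TagSoup fills in missing end-tags and repairs bad nesting before the events ever reach your handler, the same code works whether the input is pristine XHTML or the worst tag soup.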

Tidy

Thursday, June 26th, 2008

Here’s part 13 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

Regular expressions are all well and good for individual, custom changes, but they can be tedious and difficult to apply across large numbers of changes. In particular, they are designed more to work with plain text than with semistructured HTML text. For batch changes and automated corrections of common mistakes, it helps to have tools that take advantage of the markup in HTML. The first such tool is Dave Raggett’s Tidy (www.w3.org/People/Raggett/tidy/), the original HTML fixer-upper. It’s a simple, multiplatform command-line program that can correct most HTML mistakes.
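
A typical invocation converts a messy file to XHTML, encoded as UTF-8, and writes the result to a new file so that the original is left untouched:

    tidy -asxhtml -utf8 -o fixed.html original.html

Substituting -m for the -o option modifies the file in place, and adding -i indents the output for readability. Tidy also reports the problems it found, and any it could not fix, on the console.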

Regular Expressions

Sunday, June 22nd, 2008

Here’s part 12 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

Manually inspecting and changing each file on even a small site is tedious and often cost-prohibitive. It is much more effective to let the computer do the work by searching for mistakes and, when possible, automatically fixing them. A number of tools support this, including command-line tools such as grep, egrep, and sed; text editors such as jEdit, BBEdit, TextPad, and PSPad; and programming languages such as Java, Perl, and PHP. All these tools provide a specialized search syntax known as regular expressions. Although there are small differences from one tool to the next, the basic regular expression syntax is much the same.

For purposes of illustration, I’m going to use the jEdit text editor as my search-and-replace tool in this section. I chose it because it provides pretty much all the features you need, it has a reasonable GUI, it’s open source, and it’s written in Java, so it runs on essentially any platform you’re likely to use. You can download a copy from http://jedit.org/.

However, the techniques I’m showing here are by no means limited to that one editor. In my own work I normally use BBEdit instead because it has a slightly nicer interface, but it’s payware and runs only on the Mac. There are numerous other choices, and if you prefer a different program, by all means use it. Whatever you choose, you’ll need:

  • Full regular expression search and replace
  • The ability to recursively search a directory
  • The ability to filter the files you search
  • The ability to show you what it has changed without requiring you to approve each change manually
  • Automatic recognition of different character encodings and line-ending conventions

Any tool that meets these criteria should be sufficient.
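
As a small example of the technique, one of the most common well-formedness fixes, closing empty br tags, can be done in a single pass. In jEdit, turn on regular expressions in the search-and-replace dialog and search for:

    <br\s*>

replacing every match with:

    <br />

The \s* allows for stray whitespace inside the tag, and turning on case-insensitive search catches <BR> as well.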
