Refactoring HTML – The Cafes

Chapter 3: Well-formedness

Elliotte Rusty Harold — Mon, 14 Jul 2008 13:36:50 +0000

Here’s part 15 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

The very first step in moving markup into modern form is to make it well-formed. Well-formedness is the basis of the huge and incredibly powerful XML tool chain. Well-formedness guarantees a single unique tree structure for the document that can be operated on by the DOM, thus making it the basis of reliable, cross-browser JavaScript. The very first thing you need to do is make your pages well-formed.

Validity, although important, is not nearly as crucial as well-formedness. There are often good reasons to compromise on validity. In fact, I often deliberately publish invalid pages. If I need an element the DTD doesn’t allow, I put it in. It won’t hurt anything because browsers ignore elements they don’t understand. If I have a blockquote that contains raw text but no elements, no great harm is done. If I use an HTML 5 element such as m that Opera recognizes and other browsers don’t, those other browsers will just ignore it. However, if the page is malformed, the consequences are much more severe.

First, I won’t be able to use any XML tools, such as XSLT or SAX, to process the page. Indeed, almost the only thing I can do with it is view it in a browser. It is very hard to do any reliable automated processing or testing with a malformed page.

Second, browser display becomes much more unpredictable. Different browsers fill in the missing pieces and correct the mistakes of malformed pages in different ways. Writing cross-platform JavaScript or CSS is hard enough without worrying about what tree each browser will construct from ambiguous HTML. Making the page well-formed makes it a lot more likely that I can make it behave as I like across a wide range of browsers.

What Is Well-formedness?

Well-formedness is a concept that comes from XML. Technically, it means that a document adheres to certain rigid constraints, for example that every start-tag has a matching end-tag, that elements begin and end in the same parent element, and that every entity reference is defined.

Classic HTML is based on SGML, which allows a lot more leeway than does XML. For example, in HTML and SGML, it’s perfectly OK to have a <br> or <p> tag with no corresponding </br> and </p> tags. However, this is no longer allowed in a well-formed document.

Well-formedness ensures that every conforming processor treats the document in the same way at a low level. For example, consider this malformed fragment:

The quick brown fox

jumped over the

lazy dog.

The strong element begins in one paragraph and ends in the next. Different browsers can and do build different internal representations of this text. For example, Firefox and Safari fill in the missing start- and end-tags (including those between the paragraphs). In essence, they treat the preceding fragment as equivalent to this markup:

The quick brown fox

jumped over the

lazy dog.

This creates the tree shown in Figure 3.1.

Figure 3.1: An overlapping tree as interpreted by Firefox and Safari

By contrast, Opera places the second p element inside the strong element which is inside the first p element. In essence the Opera DOM treats the fragment as equivalent to this markup:

The quick brown fox jumped over the

lazy dog.

This builds the tree shown in Figure 3.2.

Figure 3.2: An overlapping tree as interpreted by Opera

If you’ve ever struggled with writing JavaScript code that works the same across browsers, you know how annoying these cross-browser idiosyncrasies can be.

By contrast, a well-formed document removes the ambiguity by requiring all the end-tags to be filled in and all the elements to have a single unique parent. Here is the well-formed markup corresponding to the preceding code:

…foo…

…bar

This leaves no room for browser interpretation. All modern browsers build the same tree structure from this well-formed markup. They may still differ in which methods they provide in their respective DOMs, and in other aspects of behavior, but at least they can agree on what’s in the HTML document. That’s a huge step forward.

Anything that operates on an HTML document, be it a browser, a CSS stylesheet, an XSL transformation, a JavaScript program, or something else, will have an easier time working with a well-formed document than the malformed alternative. For many use cases such as XSLT, this may be critical. An XSLT processor will simply refuse to operate on malformed input. You must make the document well-formed before you can apply an XSLT stylesheet to it.
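
To see what "refuse to operate" looks like in practice, here is a minimal sketch in Python (used only for brevity) that hands two fragments to the standard library's XML parser. The fragments and the is_well_formed helper are illustrative assumptions, not taken from the book. The first fragment overlaps strong and p; the second nests them properly.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return (True, None) if markup parses as XML, else (False, parser message)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Hypothetical fragments: strong starts in one paragraph and ends in the next.
overlapping = "<div><p>one <strong>two</p><p>three</strong> four</p></div>"
# The same text with every element closed inside its own parent.
nested = (
    "<div><p>one <strong>two</strong></p>"
    "<p><strong>three</strong> four</p></div>"
)

print(is_well_formed(overlapping))  # rejected, with the parser's error message
print(is_well_formed(nested))       # accepted
```

An XSLT processor sits on top of a parser like this one, which is why it never gets as far as applying the stylesheet to malformed input.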

Most web sites will need to make at least some and possibly all of the following fixes to become well-formed.

Every start-tag must have a matching end-tag.
Empty elements should use the empty-element tag syntax.
Every attribute must have a value.
Every attribute value must be quoted.
Every raw ampersand must be escaped as &amp;.
Every raw less-than sign must be escaped as &lt;.
There must be a single root element.
Every nonpredefined entity reference must be declared in the DTD.

In addition, namespace well-formedness requires that you add an xmlns="http://www.w3.org/1999/xhtml" attribute to the root html element.
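
Two of the fixes in this list, escaping raw ampersands and less-than signs and quoting attribute values, are easy to get wrong by hand. As a rough sketch, Python's standard xml.sax.saxutils module can do both. The attribute helper and the sample strings are made up for illustration. This assumes the input is raw text: running it over text that already contains entity references would escape them a second time.

```python
from xml.sax.saxutils import escape, quoteattr


def attribute(name, raw_value):
    """Format name=value with the value escaped and wrapped in quotes."""
    return f"{name}={quoteattr(raw_value)}"


# Raw text content: & and < must not appear unescaped in well-formed markup.
print(escape("AT&T < Sprint"))

# Raw attribute value: quoteattr escapes the ampersand and adds the quotes.
print(attribute("href", "a.html?x=1&y=2"))
```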

Although it’s easy to find and fix some of these problems manually, you’re unlikely to catch all of them without help. As discussed in the preceding chapter, you can use xmllint or other validators to check for well-formedness. For example:

$ xmllint --noout --loaddtd http://www.aw.com
http://www.aw-bc.com/:118: parser error : Specification
mandate value for attribute nowrap

^
http://www.aw-bc.com/:118: parser error : attributes construct error

^
http://www.aw-bc.com/:118: parser error : Couldn't find end
of Start-tag TD line 118

^
…

TagSoup or Tidy can handle many of the necessary fixes automatically. However, they don’t always guess right, so it pays to at least spot-check some of the problems manually before fixing them. Usually it’s simplest to fix as many broad classes of errors as possible. Then run xmllint again to see what you’ve missed.
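
If xmllint is not at hand, the "fix, then check again" loop can be approximated with a short script. The sketch below is an assumption-laden stand-in, not a replacement for xmllint: the find_malformed name and the file extensions are invented here, and because it does not load the DTD, expect named entity references that only the XHTML DTD defines to be flagged as well.

```python
import os
import xml.etree.ElementTree as ET


def find_malformed(root_dir, extensions=(".html", ".xhtml")):
    """Return sorted (path, parser message) pairs for files an XML parser rejects."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            try:
                ET.parse(path)
            except ET.ParseError as err:
                # The message includes the line and column of the first error.
                problems.append((path, str(err)))
    return sorted(problems)


# Typical use, with a hypothetical directory name:
# for path, message in find_malformed("site-copy"):
#     print(f"{path}: {message}")
```

Like xmllint, the parser stops at the first error in each file, so a file can reappear in the report after one problem is fixed.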

The following sections discuss the mechanics and trade-offs of each of these changes, as they usually apply in HTML.

XSLT

Elliotte Rusty Harold — Thu, 03 Jul 2008 12:42:57 +0000

Here’s part 14 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

XSLT (Extensible Stylesheet Language Transformations) is one of many XML tools that work well on HTML documents once they have first been converted into well-formed XHTML. In fact, it is one of my favorite such tools, and the first thing I turn to for many tasks. For instance, I use it to automatically generate a lot of content, such as RSS and Atom feeds, by screen-scraping my HTML pages. Indeed, the possibility of using XSLT on my documents is one of my main reasons for refactoring documents into well-formed XHTML. XSLT can query documents for things you need to fix and automate some of the fixes.

When refactoring XHTML with XSLT, you usually leave more alone than you change. Thus, most refactoring stylesheets start with the identity transformation shown in Listing 2.9.

Listing 2.9: The Identity Transformation in XSLT


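<xsl:stylesheet version='1.0'
  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
  xmlns:html='http://www.w3.org/1999/xhtml'
  xmlns='http://www.w3.org/1999/xhtml'
  exclude-result-prefixes='html'>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>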

This merely copies the entire document from the input to the output. You then modify this basic stylesheet with a few extra rules to make the changes you desire. For example, suppose you want to change all the deprecated i elements to em elements. You would add this rule to the stylesheet:

<xsl:template match='html:i'>
  <em>
    <xsl:apply-templates select="@*|node()"/>
  </em>
</xsl:template>

Notice that the XPath expression in the match attribute must use a namespace prefix, even though the element it’s matching uses the default namespace. This is a common source of confusion when transforming XHTML documents. You always have to assign the XHTML namespace a prefix when you’re using it in an XPath expression.
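
The same namespace trap shows up outside XSLT. As an illustration only, the sketch below uses the limited XPath subset in Python's standard ElementTree module; the sample document and the html prefix are arbitrary choices made for this example. The unprefixed path finds nothing, because the i element lives in the XHTML namespace even though the document never writes a prefix.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# A tiny XHTML document that uses the default namespace, as real pages do.
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>A <i>word</i> in italics</p></body></html>"
)

# No prefix: this looks for i in no namespace and matches nothing.
unprefixed = doc.findall(".//i")

# Prefix bound to the XHTML namespace: this matches the i element.
prefixed = doc.findall(".//html:i", {"html": XHTML})

print(len(unprefixed), len(prefixed))  # prints: 0 1
```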

Note

Several good introductions to XSLT are available in print and on the Web. First, I’ll recommend two I’ve written myself. Chapter 15 of The XML 1.1 Bible (Wiley, 2003) covers XSLT in depth, and is available on the Web at http://www.cafeconleche.org/books/bible3/chapters/ch15.html. XML in a Nutshell, 3rd Edition, by Elliotte Harold and W. Scott Means (O’Reilly, 2004), provides a somewhat more concise introduction. Finally, if you want the most comprehensive coverage available, I recommend Michael Kay’s XSLT: Programmer’s Reference (Wrox, 2001) and XSLT 2.0: Programmer’s Reference (Wrox, 2004).

This concludes Chapter 2. I’ll probably post a couple more sections from Chapter 3. Then if you want to see what comes next, you’ll have to buy the book.

TagSoup

Elliotte Rusty Harold — Fri, 27 Jun 2008 13:19:17 +0000

Here’s part 13 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

John Cowan’s TagSoup (http://home.ccil.org/~cowan/XML/tagsoup/) is an open source HTML parser written in Java that implements the Simple API for XML, or SAX. Cowan describes TagSoup as “a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.”

TagSoup is not intended as an end-user tool, but it does have a basic command-line interface. It’s also straightforward to hook it up to any number of XML tools that accept input from SAX. Once you’ve done that, feed in HTML, and out will come well-formed XHTML. For example:

$ java -jar tagsoup.jar index.html
Java Virtual Machines
Java Virtual Machines
…

You can improve its output a little bit by adding the --omit-xml-declaration and --nodefaults command-line options:

$ java -jar tagsoup.jar --omit-xml-declaration --nodefaults index.html
Java Virtual Machines
Java Virtual Machines
…

This will remove a few pieces that are likely to confuse one browser or another.

You can use the --encoding option to specify the character encoding of the input document. For example, if you know the document is written in Latin-1, ISO 8859-1, you could run it like so:

$ java -jar tagsoup.jar --encoding=ISO-8859-1 index.html

TagSoup’s output is always UTF-8.

Finally, you can use the --files option to write new copies of the input files with the extension .xhtml. Otherwise, TagSoup prints the output on stdout, from where you can redirect it to any convenient location. TagSoup cannot change a file in place, like Tidy can.

However, TagSoup is primarily designed for use as a library. Its output from command-line mode leaves something to be desired compared to Tidy. In particular:

It does not convert presentational markup to CSS.

It does not include a DOCTYPE declaration, which is needed before some browsers will recognize XHTML.

It does include an XML declaration, which needlessly confuses older browsers.

It uses start-tag and end-tag pairs for empty elements such as br and hr, which may confuse some older browsers.

TagSoup does not guarantee absolutely valid XHTML (though it does guarantee well-formedness). There are a few things it cannot handle. Most important, XHTML requires all img elements to have an alt attribute. If the alt attribute is empty, the image is purely presentational and should be ignored by screen readers. If the attribute is not empty, it is used in place of the image by screen readers. TagSoup has no way of knowing whether any given img with an omitted alt attribute is presentational or not, so it does not insert any such attributes. Similarly, TagSoup does not add summaries to tables. You’ll have to do that by hand, and you’ll want to validate after using TagSoup to make sure you catch all these instances.
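
Validation will catch these omissions, but a quick script can also list them once the page is well-formed. The following is a sketch under assumptions: the missing_attributes helper, the CHECKS table, and the sample page are invented for illustration. It treats an empty alt attribute as present, matching the distinction drawn above between presentational images and images with no alt at all.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# (element name, attribute it should carry); both pairs come from the text above.
CHECKS = (("img", "alt"), ("table", "summary"))


def missing_attributes(xhtml_text):
    """Return (element, missing attribute, src hint) for each offending element."""
    root = ET.fromstring(xhtml_text)
    findings = []
    for name, required in CHECKS:
        for element in root.iter("{%s}%s" % (XHTML, name)):
            if required not in element.attrib:
                findings.append((name, required, element.get("src", "")))
    return findings


# Hypothetical well-formed page: one image with empty alt, one with none.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<img src="logo.png" alt=""/>'
    '<img src="chart.png"/>'
    "<table><tr><td>1</td></tr></table>"
    "</body></html>"
)

for finding in missing_attributes(sample):
    print(finding)
```

The script only finds the gaps. Deciding what each alt text or table summary should say is still a manual job.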

However, despite these limits, TagSoup does do a huge amount of work for you at very little cost.

Tidy

Elliotte Rusty Harold — Thu, 26 Jun 2008 15:34:44 +0000

Here’s part 13 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

Regular expressions are well and good for individual, custom changes, but they can be tedious and difficult to use for large quantities of changes. In particular, they are designed more to work with plain text than with semistructured HTML text. For batch changes and automated corrections of common mistakes, it helps to have tools that take advantage of the markup in HTML. The first such tool is Dave Raggett’s Tidy (www.w3.org/People/Raggett/tidy/), the original HTML fixer-upper. It’s a simple, multiplatform command-line program that can correct most HTML mistakes.

-asxhtml

For purposes of this book, you want to use Tidy with the -asxhtml command-line option. For example, this command converts the file index.html to well-formed XHTML and stores the result back into the same file (-m):

$ tidy -asxhtml -m index.html

Frankly, you could do worse than just running Tidy across all your HTML files and calling it a day; but please don’t stop reading just yet. Tidy has a few more options that can improve your code further, and there are some problems it can’t handle or handles incorrectly. For example, when I used this command on one of my older pages that I hadn’t looked at in at least five years, Tidy generated the following error messages:

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 7 column 1 - Warning: attribute "bgcolor" has invalid value "#fffffff"
line 16 column 2 - Warning: <table> lacks "summary" attribute
line 230 column 1 - Warning: <table> lacks "summary" attribute
line 14 column 91 - Warning: trimming empty <p>
Info: Document content looks like XHTML 1.0 Transitional
5 warnings, 0 errors were found!

These are problems that Tidy mostly didn’t know how to fix. It actually was able to supply a DOCTYPE because I specified XHTML mode, which has a known DOCTYPE. However, it doesn’t know what to do with bgcolor="#fffffff". The problem here is an extra f which should be removed, or perhaps the entire bgcolor attribute should be removed and replaced with CSS.

Tip

Once you’ve identified a problem such as this, it’s entirely possible that the same problem crops up in multiple documents. Having noticed this in one file, it’s worth doing a search and replace across the entire directory tree to catch any other occurrences. Given the prevalence of copy and paste coding, few mistakes occur only once.

The next two problems are tables that lack a summary attribute. This is an accessibility problem, and you should correct it. Tidy actually prints some further details about this:

The table summary attribute should be used to describe the table structure. It is very helpful for people using non-visual browsers. The scope and headers attributes for table cells are useful for specifying which headers apply to each table cell, enabling non-visual browsers to provide a meaningful context for each cell. For further advice on how to make your pages accessible see http://www.w3.org/WAI/GL. You may also want to try "http://www.cast.org/bobby/" which is a free Web-based service for checking URLs for accessibility.

Table summaries are certainly a good thing, and you should certainly add one. However, Tidy can’t summarize your table itself. You’ll have to do that.

The final message warns you that Tidy found an empty paragraph element and threw it away. This common message is probably misleading and definitely worth a second look. What it meant in this case (and in almost every other one you’ll see) is that the <p> tag was being used as an end-tag rather than a start-tag. That is, the markup looked like this:

Blah blah blah<p>
Blah blah blah<p>
Blah blah blah<p>

Tidy reads it as this:

Blah blah blah
<p>Blah blah blah</p>
<p>Blah blah blah</p>
<p></p>

Thus, it throws away that last empty paragraph. However, what you almost certainly wanted was this:

<p>Blah blah blah</p>
<p>Blah blah blah</p>
<p>Blah blah blah</p>

There’s no easy search and replace for this flaw, though XHTML strict validation will at least alert you to the files in which the problem lies. You can use XSLT (discussed shortly) to fix some of these problems; but if there aren’t too many of them, it’s safer and not hugely onerous to manually edit these files.

If you specify the --enclose-text yes option, Tidy will wrap any such unparented text in a p element. For example:

$ tidy -asxhtml --enclose-text yes example.html

Tidy may alert you to a few other serious problems that you’ll still have to fix manually. These include

A missing end quote for an attribute value; for example,

It also added the necessary class attributes to the elements that had previously used a center element. For example:

Java Virtual Machines

That’s as far as Tidy goes with CSS, but you might want to consider revisiting all these files to see if you can extract a common external stylesheet that can be shared among the documents on your site, rather than including the rules in each separate page. Furthermore, you should probably consider more semantic class names rather than Tidy’s somewhat prosaic default.

For example, I’ve sometimes used the i element to indicate that I’m talking about a word rather than using the word. For example:

May can be very ambiguous in English, meaning might, can, or allowed, depending on context.

Here the italics are not used for emphasis, and replacing them with em is not appropriate. Instead, they should be indicated with a span and a class, like so:

May can be very ambiguous in English, meaning might, can, or allowed, depending on context.

Then, add a CSS rule to the stylesheet to format this as italic:

span.wordasword { font-style: italic; }

Tidy, however, is not smart enough to figure this out, and will need your help. Still, Tidy is a good first step.

Encodings

Tidy is surprisingly bad at detecting the character encoding of HTML documents, despite the relatively rich metadata in most HTML documents for specifying exactly this. If your content is anything other than ASCII or ISO-8859-1 (Latin-1), you’d best tell Tidy that with the --input-encoding option. For example, if you’ve saved your documents in UTF-8, invoke Tidy thusly:

$ tidy -asxhtml --input-encoding utf8 index.html

Tidy generates ASCII text as output unless you tell it otherwise. It will escape non-ASCII characters using named entities when available, and numeric character references when not. However, Tidy supports several other common encodings. The only other one I recommend is UTF-8.
To get it, use the --output-encoding option:

$ tidy -asxhtml --output-encoding utf8 index.html

The input encoding does not have to be the same as the output encoding. However, if it is you can just specify -utf8 instead:

$ tidy -asxhtml -utf8 index.html

For various reasons, I strongly recommend that you stick to either ASCII or UTF-8. Other encodings do not transfer as reliably when documents are exchanged across different operating systems and locales.

Pretty Printing

Tidy also has a couple of options that don’t have a lot to do with the HTML itself, but do make documents prettier to look at and thus easier to work with when you open them in a text editor. The -i option indents the text so that it’s easier to see how the elements nest. Tidy is smart enough not to indent whitespace-significant elements such as pre. The -wrap option wraps text at a specified column. Usually about 80 columns are nice.

$ tidy -asxhtml -utf8 -i -wrap 80 index.html

Generated Code

Tidy has limited support for working on PHP, JSP, and ASP pages. Basically, it will ignore the content inside the PHP, ASP, or JSP sections and try to work on the rest of the HTML markup. However, that is very tricky to do. In particular, most templating languages do not respect element boundaries. If code is generating half of an element or a start-tag for an element that is later closed by a literal end-tag, it is very easy for Tidy to get confused. I do not recommend using Tidy directly on these sorts of pages. Instead, download some fully rendered pages from your web site after processing by the template engine. Run Tidy on a representative sample of these pages, and then compare the results to the original pages. By looking at the differences, you can usually figure out what needs to be changed in your templates; then make the changes manually.
Although this does require more manual work and human intelligence, if each template is generating multiple static pages, this process can finish sooner than semiautomated processing of large numbers of static HTML pages.

Use As a Library

TidyLib is a C library version of Tidy that you can integrate into your own programs. This might be useful for writing scripts that tidy an entire site, for example. Personally, though, I do not find C to be the most conducive language for simple scripting. I usually prefer to write a shell or Perl script that invokes the Tidy command line directly.

Regular Expressions

Elliotte Rusty Harold — Sun, 22 Jun 2008 17:49:21 +0000

Here’s part 12 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

Manually inspecting and changing each file on even a small site is tedious and often cost-prohibitive. It is much more effective to let the computer do the work by searching for mistakes and, when possible, automatically fixing them. A number of tools support this, including command-line tools such as grep, egrep, and sed; text editors such as jEdit, BBEdit, TextPad, and PSPad; and programming languages such as Java, Perl, and PHP. All these tools provide a specialized search syntax known as regular expressions. Although there are small differences from one tool to the next, the basic regular expression syntax is much the same.

For purposes of illustration, I’m going to use the jEdit text editor as my search and replace tool in this section. I chose it because it provides pretty much all the features you need, it has a reasonable GUI, it’s open source, and it’s written in Java, so it runs on essentially any platform you’re likely to want. You can download a copy from http://jedit.org/. However, the techniques I’m showing here are by no means limited to that one editor. In my work, I normally use BBEdit instead because it has a slightly nicer interface. However, it’s payware and only runs on the Mac.
There are numerous other choices. If you prefer a different program, by all means use it. What you’ll need are:

Full regular expression search and replace
The ability to recursively search a directory
The ability to filter the files you search
A tool that shows you what it has changed, but does not require you to manually approve each change
Automatic recognition of different character encodings and line-ending conventions

Any tool that meets these criteria should be sufficient.

Searching

The first goal of a regular expression is to find things that may be wrong. For example, I recently noticed that I had mistyped some dates as 20066 instead of 2006 in one of my files. That’s an error that’s likely to have happened more than once, so I checked for it by searching for that string.

In jEdit, you perform a multifile search using the Search/Search in Directory menu item. Selecting this menu item brings up the dialog shown in Figure 2.6. This is normally configured more or less as shown here.

The string you’re searching for (the target string) goes in the first text field.
The string that will replace the target string goes in the second text field. Here I’m just going to find, not replace, so I haven’t entered a replacement string.
The Directory radio button is checked to indicate that you’re going to search multiple files. You can also search just in the current file, or even the current selection.
The filter is set to *.html to search only those files that end in .html. You can modify this to search different kinds of or subsets of files. For instance, I often want to search only my old news files, which are named news2000.html, news2001.html, news2002.html, and so on. In that case, I would set the filter to news2.*html. I could search even older files including news1999.html by rewriting the filter regular expression in a form such as news\d\d\d\d.html.
I specify the directory where I’ve stored my local copy of the files I’m searching.
In my case, this is /Users/elharo/Cafe au Lait/javafaq.
“Search subdirectories” is checked. If it weren’t, jEdit would search only the javafaq directory, but not any directories that directory contains.
“Keep dialog” is checked. This keeps the dialog box open after the search is completed.
“Ignore case” is checked. This will allow the regular expression to match regardless of case. This isn’t always what you want, but more often than not it is.
“Regular expressions” is checked. You don’t need to check this when you’re only searching for a constant string, as here. However, most searches are more complex than that.
HyperSearch is checked. This will bring up a window showing all matches, rather than just finding the next match.

Figure 2.6: jEdit multifile search

Fortunately, that particular problem seems to have been isolated. However, I also recently noticed another, more serious problem. For some unknown reason, I somehow had managed to write links with double equals signs, as shown here, throughout one of my sites:

Cafe au Lait

Consequently, links were breaking all over the place. The first step was to find out how broad the problem was. In this case, the mistaken string was constant, and was unlikely to appear in correct text, so it was easy to search for. This problem turned up 4,475 times in 476 files, as shown in the HyperSearch results in Figure 2.7.

Figure 2.7: jEdit search results

When there aren’t a lot of mistakes, you can click on each one to open the document and fix it manually. Sometimes this is needed. Sometimes this is even the easiest solution. However, when there are thousands of mistakes, you have to fix them with a tool. In this case, the solution is straightforward. Put href= in the “Replace with” field; then click the “Replace all” button.

Do be careful when performing this sort of operation, though. A small mistake can cause bigger problems. A bad search and replace likely caused this problem in the first place.
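
For readers without jEdit, the same constant-string fix can be sketched as a small script. Everything here is an assumption made for illustration: the replace_in_tree name, the UTF-8 encoding, and the .html suffix. It defaults to a dry run that only counts matches, which mirrors looking at the HyperSearch results before pressing "Replace all".

```python
import os


def replace_in_tree(root_dir, target, replacement, suffix=".html", dry_run=True):
    """Count (and optionally replace) target in matching files; return {path: count}."""
    changed = {}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            # newline="" keeps each file's existing line endings intact.
            with open(path, encoding="utf-8", newline="") as handle:
                text = handle.read()
            count = text.count(target)
            if count == 0:
                continue
            changed[path] = count
            if not dry_run:
                with open(path, "w", encoding="utf-8", newline="") as handle:
                    handle.write(text.replace(target, replacement))
    return changed


# Typical use on a backup copy (hypothetical directory name):
# report = replace_in_tree("site-copy", "href==", "href=")
# print(sum(report.values()), "matches in", len(report), "files")
```

Pass dry_run=False only after the counts look plausible, and only on a working copy you can throw away.
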
You should test your regular expression search and replace on a few files first before trying it on an entire site. Most important, always work on a backup copy of the site; always run your test suite after each change; and always spot-check at least some of the files that have been changed to make sure nothing went wrong.

If something does go wrong, an editor with undo capability can be very useful. Not all editors support multifile undo with a buffer that’s large enough to handle thousands of changes. If yours doesn’t, be ready to delete your working copy and replace it with the original in case the search goes wrong. As with any other complex bit of code, you sometimes have to try several times to fully debug a regular expression.

Search Patterns

Often, you don’t know exactly what you’re searching for, but you do know its general pattern. For example, if you’re searching for years in the recent past, you might want to find any four-digit number beginning with 200. You may want to search for attribute name=value pairs, but you’re not sure whether they’re in the format name=value, name='value', or name="value". You may want to search for all start-tags, whether they have attributes or not. These are all good candidates for regular expressions.

In a regular expression, certain characters and patterns stand in for a set of other characters. For example, \d means any digit. Thus, to search for any year from 2000 to 2009, one could use the regular expression 200\d. This would match 2000, 2001, 2002, and so on through 2009. However, the regular expression 200\d also matches 12000, 200032, 12320056, and other strings that are probably not years at all. (To be precise, it matches the substrings in the form 200\d, not the entire string.) Thus, you might want to indicate that the string you’re matching must be preceded and trailed by whitespace of some kind.
The metacharacter \s matches whitespace, so we can now rewrite the expression as \s200\d\s to match only those strings that look like years in this decade.

Of course, there’s still no guarantee that every string you match in this form is a year. It could be a price, a population, a score, a movie title, or something else. You’ll want to scan the list of matches to verify that it is what you expect. False positives are a real concern, especially for simple cases such as this. However, it’s normally possible to either further refine the regular expression to avoid any false positives or manually remove the accidental matches.

There usually are other ways to do many things. For instance, we could write this search as \b200\d\b. The metacharacter \b matches the beginning or end of a word, without actually selecting any characters. This would avoid the whitespace at the beginning and end of words. This would also allow us to recognize a year that came at the end of a sentence right before a period, as in “This is 2008”. However, it can’t distinguish periods from decimal points and would also match the 2005 in 2005.3124.

You could even simply list the years separated by the OR operator, |, like so:

2000|2001|2002|2003|2004|2005|2006|2007|2008|2009

However, this still has the word boundary problems of the previous matches.

Sometimes you stop with a search. In particular, if the content is generated automatically from a CMS, template page, or other program, the search is used merely to find bugs: places where the program is generating incorrect markup. You then must change the program to generate correct markup. If this is the case, false positives don’t worry you nearly so much because all changes will be performed manually anyway. The search only identifies the bug. It doesn’t fix it.

If you don’t stop with a search, and you go on to a replacement, you need to be cautious.
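
The behavior of the three year patterns above is easy to check outside an editor. The sketch below uses Python's re module, whose syntax for \d, \s, and \b matches what is described here. The sample strings and the find_years helper are assumptions chosen to exercise each case. It prints what each pattern finds in each sample.

```python
import re

# The three patterns discussed in the text.
PATTERNS = [r"200\d", r"\s200\d\s", r"\b200\d\b"]

# Hypothetical samples: embedded digits, a year between spaces,
# a year before a period, and a decimal number.
SAMPLES = ["12000", "200032", "back in 2006 we", "This is 2008.", "2005.3124"]


def find_years(pattern, text):
    """Return every substring of text that the pattern matches."""
    return re.findall(pattern, text)


for pattern in PATTERNS:
    for sample in SAMPLES:
        print(f"{pattern!r:14} {sample!r:20} {find_years(pattern, sample)}")
```

Note that the whitespace version returns the surrounding spaces as part of each match, which matters once you move from searching to replacing.
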
Regular expressions can be tricky, and ones involving HTML are often much trickier than the textbook examples. Nonetheless, they are invaluable tools in cleaning up HTML.

Note

If you don’t have a lot of experience with regular expressions, please refer to Appendix 1 for many more examples. I also recommend Mastering Regular Expressions, 3rd Edition, by Jeffrey E. F. Friedl (O’Reilly, 2006).

Testing

Elliotte Rusty Harold — Thu, 19 Jun 2008 03:17:55 +0000

Here’s part 11 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari. This part’s a little funny because really it deserves an entire book on its own, and that book has yet to be written. I didn’t have space or time to write a complete second book about test driven development of web sites and web applications, but perhaps this small piece will inspire someone else to do it. If not, maybe I’ll get to it one of these days.

In theory, refactoring should not break anything that isn’t already broken. In practice, it isn’t always so reliable. To some extent, the catalog later in this book shows you what changes you can safely make. However, both people and tools do make mistakes; and it’s always possible that refactoring will introduce new bugs. Thus, the refactoring process really needs a good automated test suite. After every refactoring, you’d like to be able to press a button and see at a glance whether anything broke.

Although test-driven development has been a massive success among traditional programmers, it is not yet so common among web developers, especially those working on the front end. In fact, any automated testing of web sites is probably the exception rather than the rule, especially when it comes to HTML. It is time for that to change. It is time for web developers to start to write and run test suites and to use test-driven development.

The basic test-driven development approach is as follows:

1. Write a test for a feature.
2. Code the simplest thing that can possibly work.
Run all tests. If tests passed, goto 1. Else, goto 2. For refactoring purposes, it is very important that this process be as automatic as possible. In particular: The test suite should not require any complicated setup. Ideally, you should be able to run it with the click of a button. You don’t want developers to skip running tests because they’re too hard to run. The tests should be fast enough that they can be run frequently; ideally, they should take 90 seconds or less to run. You don’t want developers to skip running tests because they take too long. The result must be pass or fail, and it should be blindingly obvious which it is. If the result is fail, the failed tests should generate more output explaining what failed. However, passing tests should generate no output at all, except perhaps for a message such as “All tests passed”. In particular, you want to avoid the common problem in which one or two failing tests get lost in a sea of output from passing tests. Writing tests for web applications is harder than writing tests for classic applications. Part of this is because the tools for web application testing aren’t as mature as the tools for traditional application testing. Part of this is because any test that involves looking at something and figuring out whether it looks right is hard for a computer. (It’s easy for a person, but the goal is to remove people from the loop.) Thus, you may not achieve the perfect coverage you can in a Java or .NET application. Nonetheless, some testing is better than none, and you can in fact test quite a lot. One thing you will discover is that refactoring your code to web standards such as XHTML is going to make testing a lot easier. Going forward, it is much easier to write tests for well-formed and valid XHTML pages than for malformed ones. This is because it is much easier to write code that consumes well-formed pages than malformed ones. 
It is much easier to see what the browser sees, because all browsers see the same thing in well-formed pages and different things in malformed ones. Thus, one benefit of refactoring is improving testability and making test-driven development possible in the first place. Indeed, with a lot of web sites that don’t already have tests, you may need to refactor them enough to make testing possible before moving forward. You can use many tools to test web pages, ranging from decent to horrible and free to very expensive. Some of these are designed for programmers, some for web developers, and some for business domain experts. They include: HtmlUnit JsUnit HttpUnit JWebUnit FitNesse Selenium In practice, the rough edges on these tools make it very helpful to have an experienced agile programmer develop the first few tests and the test framework. Once you have an automated test suite in place, it is usually easier to add more tests yourself. JUnit JUnit (http://www.junit.org/) is the standard Java framework for unit testing, and the one on which a lot of the more specific frameworks such as HtmlUnit and HttpUnit are built. There’s no reason you can’t use it to test web applications, provided you can write Java code that pretends to be a browser. That’s actually not as hard as it sounds. For example, one of the most basic tests you’ll want to run is one that tests whether each page on your site is well-formed. You can test this simply by parsing the page with an XML parser and seeing whether it throws any exceptions. Write one method to test each page on the site, and you have a very nice automated test suite for what we checked by hand in the previous section. Listing 2.2 demonstrates a simple JUnit test that checks the well-formedness of my blog. All this code does is throw a URL at an XML parser and see whether it chokes. If it doesn’t, the test passes. This version requires Sun’s JDK 1.5 or later and JUnit 3.8 or later somewhere in the classpath. 
You may need to make some modifications to run this in other environments.

Listing 2.2: A JUnit Test for Web Site Well-Formedness

import java.io.IOException;

import junit.framework.TestCase;
import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;

public class WellformednessTests extends TestCase {

    private XMLReader reader;

    public void setUp() throws SAXException {
        reader = XMLReaderFactory.createXMLReader(
            "com.sun.org.apache.xerces.internal.parsers.SAXParser");
    }

    public void testBlogIndex() throws SAXException, IOException {
        reader.parse("http://www.elharo.com/blog/");
    }

}

You can run this test from inside an IDE such as Eclipse or NetBeans, or you can run it from the command line like so:

$ java -cp .:junit.jar junit.swingui.TestRunner WellformednessTests

If all tests pass, you’ll see a green bar as shown in Figure 2.3.

Figure 2.3: All tests pass.

To test additional pages for well-formedness, you simply add more methods, each of which looks exactly like testBlogIndex, just with a different URL.

Of course, you can also write more complicated tests. You can test for validity by setting the http://xml.org/sax/features/validation feature on the parser and attaching an error handler that throws an exception if a validity error is detected. You can use DOM, XOM, SAX, or some other API to load the page and inspect its contents. For instance, you could write a test that checks whether all links on a page are reachable. If you use TagSoup as the parser, you can even write these sorts of tests for non-well-formed HTML pages. You can submit forms using the HttpURLConnection class or run JavaScript using the Rhino engine built into Java 6. This is all pretty low-level stuff, and it’s not trivial to do; but it’s absolutely possible to do it. You just have to roll up your sleeves and start coding.

If nondevelopers are making regular changes to your site, you can set up the test suite to run periodically with cron and to e-mail you if anything unexpectedly breaks.
(It’s probably not reasonable to expect each author or designer to run the entire test suite before every check-in.) You can even run the suite continuously using a product such as Hudson or Cruise Control. However, that may fill your logs with a lot of uninteresting test traffic, so you may wish to run this against the development server instead.

Many similar test frameworks are available for other languages and platforms: PyUnit for Python, CppUnit for C++, NUnit for .NET, and so forth. Collectively these go under the rubric xUnit. Whichever one you and your team are comfortable working with is fine for writing web test suites. The web server doesn’t care what language your tests are written in. As long as you have a one-button test harness and enough HTTP client support to write tests, you can do what needs to be done.

HtmlUnit

HtmlUnit (http://htmlunit.sourceforge.net/) is an open source JUnit extension designed to test HTML pages. It will be most familiar and comfortable to Java programmers already using JUnit for test-driven development. HtmlUnit provides two main advantages over pure JUnit.

* The WebClient class makes it much easier to pretend to be a web browser.
* The HtmlPage class has methods for inspecting common parts of an HTML document.

For example, HtmlUnit will run JavaScript that’s specified by an onLoad handler before it returns the page to the client, just like a browser would. Simply loading the page with an XML parser as Listing 2.2 did would not run the JavaScript.

Listing 2.3 demonstrates the use of HtmlUnit to check that all the links on a page are not broken. I could have written this using a raw parser and DOM, but it would have been somewhat more complex. In particular, methods such as getAnchors to find all the a elements in a page are very helpful.
Listing 2.3: An HtmlUnit Test for a Page’s Links

import java.io.IOException;
import java.net.*;
import java.util.*;

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import junit.framework.TestCase;

public class LinkCheckTest extends TestCase {

    public void testBlogIndex()
      throws FailingHttpStatusCodeException, IOException {
        WebClient webClient = new WebClient();
        URL url = new URL("http://www.elharo.com/blog/");
        HtmlPage page = (HtmlPage) webClient.getPage(url);
        List links = page.getAnchors();
        Iterator iterator = links.iterator();
        while (iterator.hasNext()) {
            HtmlAnchor link = (HtmlAnchor) iterator.next();
            URL u = new URL(link.getHrefAttribute());
            // Check that we can download this page.
            // If we can't, getPage throws an exception and
            // the test fails.
            webClient.getPage(u);
        }
    }

}

This test is more than a unit test. It checks all the links on a page, whereas a real unit test would check only one. Furthermore, it makes connections to external servers. That’s very unusual for a unit test. Still, this is a good test to have, and it will let us know that we need to fix our pages if an external site breaks links by reorganizing its pages.

HttpUnit

HttpUnit (http://httpunit.sourceforge.net/) is another open source JUnit extension designed to test HTML pages. It is also best suited for Java programmers already using JUnit for test-driven development, and is in many ways quite similar to HtmlUnit. Some programmers prefer HttpUnit, and others prefer HtmlUnit. If there’s a difference between the two it’s that HttpUnit is somewhat lower-level. It tends to focus more on the raw HTTP connection whereas HtmlUnit more closely imitates a browser. HtmlUnit has somewhat better support for JavaScript, if that’s a concern. However, there’s certainly a lot of overlap between the two projects.

Listing 2.4 demonstrates an HttpUnit test that verifies that a page has exactly one H1 header, and that its text matches the web page’s title.
That may not be a requirement for all pages, but it is a requirement for some. For instance, it would be a very apt requirement for a newspaper site.

Listing 2.4: An HttpUnit Test That Matches the Title to a Unique H1 Heading

import java.io.IOException;

import org.xml.sax.SAXException;
import com.meterware.httpunit.*;
import junit.framework.TestCase;

public class TitleChecker extends TestCase {

    public void testFormSubmission()
      throws IOException, SAXException {
        WebConversation wc = new WebConversation();
        WebResponse wr = wc.getResponse("http://www.elharo.com/blog/");
        HTMLElement[] h1 = wr.getElementsWithName("h1");
        assertEquals(1, h1.length);
        String title = wr.getTitle();
        assertEquals(title, h1[0].getText());
    }

}

I could have written this test in HtmlUnit too, and I could have written Listing 2.3 with HttpUnit. Which one you use is mostly a matter of personal preference. Of course, these are hardly the only such frameworks. There are several more, including ones not written in Java. Use whichever one you like, but by all means use something.

JWebUnit

JWebUnit is a higher-level API that sits on top of HtmlUnit and JUnit. Generally, JWebUnit tests involve more assertions and less straight Java code. These tests are somewhat easier to write without a large amount of Java expertise, and they may be more accessible to a typical web developer. Furthermore, tests can very easily extend over multiple pages as you click links, submit forms, and in general follow an entire path through a web application.

Listing 2.5 demonstrates a JWebUnit test for the search engine on my web site. It fills in the search form on the main page and submits it. Then it checks that one of the expected results is present.
Listing 2.5: A JWebUnit Test for Submitting a Form

import junit.framework.TestCase;
import net.sourceforge.jwebunit.junit.*;

public class LinkChecker extends TestCase {

    private WebTester tester;

    public LinkChecker(String name) {
        super(name);
        tester = new WebTester();
        tester.getTestContext().setBaseUrl("http://www.elharo.com/");
    }

    public void testFormSubmission() {
        // start at this page
        tester.beginAt("/blog/");
        // check that the form we want is on the page
        tester.assertFormPresent("searchform");
        // check that the input element we expect is present
        tester.assertFormElementPresent("s");
        // type something into the input element
        tester.setTextField("s", "Linux");
        // send the form
        tester.submit();
        // we're now on a different page; check that the
        // text on that page is as expected.
        tester.assertTextPresent("Windows Vista");
    }

}

FitNesse

FitNesse (http://fitnesse.org/) is a Wiki designed to enable business users to write tests in table format. Business users like spreadsheets. The basic idea of FitNesse is that tests can be written as tables, much like a spreadsheet. Thus, FitNesse tests are not written in Java. Instead, they are written as a table in a Wiki.

You do need a programmer to install and configure FitNesse for your site. However, once it’s running and a few sample fixtures have been written, it is possible for savvy business users to write more tests. FitNesse works best in a pair environment, though, where one programmer and one business user can work together to define the business rules and write tests for them.

For web app acceptance testing, you install Joseph Bergin’s HtmlFixture (http://fitnesse.org/FitNesse.HtmlFixture). It too is based on HtmlUnit. It supplies instructions that are useful for testing web applications such as typing into forms, submitting forms, checking the text on a page, and so forth.

Listing 2.6 demonstrates a simple FitNesse test that checks the http-equiv meta tag in the head to make sure it’s properly specifying UTF-8.
The first three lines set the classpath. Then, after a blank line, the next line identifies the type of fixture as an HtmlFixture. (There are several other kinds, but HtmlFixture is the common one for testing web applications.) The external page at http://www.elharo.com/blog/ is then loaded. In this page, we focus on the element named meta that has an id attribute with the value charset. This will be the subject for our tests. The test then looks at two attributes of this element. First it inspects the content attribute and asserts that its value is text/html; charset=utf-8. Next it checks the http-equiv attribute of the same element and asserts that its value is content-type.

Listing 2.6: A FitNesse Test for

!path fitnesse.jar
!path htmlunit-1.5/lib/*.jar
!path htmlfixture20050422.jar

!|com.jbergin.HtmlFixture|
|http://www.elharo.com/blog/|
|Element Focus|charset   |meta|
|Attribute    |content   |text/html; charset=utf-8|
|Attribute    |http-equiv|content-type|

This test would be embedded in a Wiki page. You can run it from a web browser just by clicking the Test button, as shown in Figure 2.4. If all of the assertions pass, and if nothing else goes wrong, the test will appear green after it is run. Otherwise, it will appear pink. You can use other Wiki markup elsewhere in the page to describe the test.

Figure 2.4: A FitNesse page

Selenium

Selenium is an open source browser-based test tool designed more for functional and acceptance testing than for unit testing. Unlike with HttpUnit and HtmlUnit, Selenium tests run directly inside the web browser. The page being tested is embedded in an iframe and the Selenium test code is written in JavaScript. It is mostly browser and platform independent, though the IDE for writing tests, shown in Figure 2.5, is limited to Firefox.

Figure 2.5: The Selenium IDE

Although you can write tests manually in Selenium using remote control, it is really designed more as a traditional GUI record and playback tool.
This makes it more suitable for testing an application that has already been written, and less suitable for doing test-driven development.

Selenium is likely to be more comfortable to front-end developers who are accustomed to working with JavaScript and HTML. It is also likely to be more palatable to professional testers because it’s similar to some of the client GUI testing tools they’re already familiar with.

Listing 2.7 demonstrates a Selenium test that verifies that www.elharo.com shows up in the first page of results from a Google search for “Elliotte”. This script was recorded in the Selenium IDE and then edited a little by hand. You can load it into and then run it from a web browser. Unlike the other examples given here, this is not Java code, and it does not require major programming skills to maintain. Selenium is more of a macro language than a programming language.

Listing 2.7: Test That elharo.com Is in the Top Search Results for Elliotte

elharo.com is a top search results for Elliotte

New Test
|open             |/              |        |
|type             |q              |elliotte|
|clickAndWait     |btnG           |        |
|verifyTextPresent|www.elharo.com/|        |

Obviously, Listing 2.7 is a real HTML document. You can open this with the Selenium IDE in Firefox and then run the tests. Because the tests run directly inside the web browser, Selenium helps you find bugs that occur in only one browser or another. Given the wide variation in browsers’ support for CSS, HTML, and JavaScript, this capability is very useful. HtmlUnit, HttpUnit, JWebUnit, and the like use their own JavaScript engines which do not always have the same behavior as the browsers’ engines. Selenium uses the browsers themselves, not imitations of them.

The IDE can also export the tests as C#, Java, Perl, Python, or Ruby code so that you can integrate Selenium tests into other environments. This is especially important for test automation. Listing 2.8 shows the same test as in Listing 2.7, but this time in Ruby. However, this will not necessarily catch all the cross-browser bugs you’ll find by running the tests directly in the browser.

Listing 2.8: Automated Test That elharo.com Is in the Top Search Results for Elliotte

require "selenium"
require "test/unit"

class GoogleSearch < Test::Unit::TestCase

  def setup
    @verification_errors = []
    if $selenium
      @selenium = $selenium
    else
      @selenium = Selenium::SeleneseInterpreter.new("localhost",
        4444, "*firefox", "http://localhost:4444", 10000);
      @selenium.start
    end
    @selenium.set_context("test_google_search", "info")
  end

  def teardown
    @selenium.stop unless $selenium
    assert_equal [], @verification_errors
  end

  def test_google_search
    @selenium.open "/"
    @selenium.type "q", "elliotte"
    @selenium.click "btnG"
    @selenium.wait_for_page_to_load "30000"
    begin
      assert @selenium.is_text_present("www.elharo.com/")
    rescue Test::Unit::AssertionFailedError
      @verification_errors << $!
    end
  end

end

Getting Started with Tests

Because you’re refactoring, you already have a web site or application; and if it’s like most I’ve seen, it has limited, if any, front-end tests. Don’t let that discourage you.
Pick the tool you like and start to write a few tests for some basic functionality. Any tests at all are better than none. At the early stages, testing is linear. Every test you write makes a noticeable improvement in your code coverage and quality. Don’t get bogged down thinking you have to test everything. That’s great if you can do it; but if you can’t, you can still do something. Before refactoring a particular page, subdirectory, or path through a site, take an hour and write at least two or three tests for that section. If nothing else, these are smoke tests that will let you know if you totally muck up everything. You can expand on these later when you have time. If you find a bug, by all means write a test for the bug before fixing it. That will help you know when you’ve fixed the bug, and it will prevent the bug from accidentally reoccurring in the future after other changes. Because front-end tests aren’t very unitary, it’s likely that this test will indirectly test other things besides the specific bit of buggy code. Finally, for new features and new developments beyond refactoring, by all means write your tests first. This will guarantee that the new parts of the site are tested, and tests will often leak over into the older pages and scripts as well. Automatic testing is critical to developing a robust, scalable application. Developing a test suite can seem daunting at first, but it’s worth doing. The first test is the hardest. Once you’ve set up your test framework and written your first test, subsequent tests will flow much more easily. Just as you can improve a site linearly through small, automatic refactorings that add up over time, so too can you improve your test suite by adding just a few tests a week. Sooner than you know, you’ll have a solid test suite that helps to ensure reliability by telling you when things are broken and showing you what to fix.
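As a concrete starting point, a first smoke test of the kind described above might do no more than confirm that a page's markup parses and that it contains the heading you expect. What follows is only a minimal sketch under stated assumptions, not a recommended framework: the class name SmokeCheck and its two methods are hypothetical, the markup is passed in as a string so that the example runs without a network connection, and it relies solely on the XML parser bundled with the JDK. Because no DTD is loaded, HTML entity references such as &nbsp; would be reported as errors here.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of JUnit or any library.
class SmokeCheck {

    // Parses the markup with the JDK's DOM parser. A well-formedness
    // error surfaces as a SAXException.
    private static Document parse(String markup) throws SAXException {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them,
            // so a passing check stays silent.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if the markup is well-formed XML, false otherwise.
    static boolean isWellFormed(String markup) {
        try {
            parse(markup);
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    // Counts elements with the given tag name; rejects malformed markup.
    static int countElements(String markup, String tagName) {
        try {
            return parse(markup).getElementsByTagName(tagName).getLength();
        } catch (SAXException e) {
            throw new IllegalArgumentException("Markup is not well-formed", e);
        }
    }
}
```

From a JUnit test method you might then assert that SmokeCheck.isWellFormed(pageSource) is true and that SmokeCheck.countElements(pageSource, "h1") equals 1, where pageSource is whatever string your harness has loaded for the page under test.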
Validators Elliotte Rusty Harold — Mon, 16 Jun 2008 14:59:47 +0000 Here’s part 10 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari. There really are standards for HTML, even if nobody follows them. One way to find out whether a site follows HTML standards is to run a page through a validation service. The results can be enlightening. They will provide you with specific details to fix, as well as a good idea of how much work you have ahead of you. The W3C Markup Validation Service For public pages, the validator of choice is the W3C’s Markup Validation Service, at http://validator.w3.org/. Simply enter the URL of the page you wish to check, and see what it tells you. For example, Figure 2.1 shows the result of validating my blog against this service. Figure 2.1: The W3C Markup Validation Service It seems I had misremembered the syntax of the blockquote element. I had mistyped the cite attribute as the source attribute. This was actually better than I expected. I fixed that and rechecked, as shown in Figure 2.2. Now the page is valid. Figure 2.2: Desired results: a valid page It is usually not necessary to check each and every page on your site for these sorts of errors. Most errors repeat themselves. Generally, once you identify a class of errors, it becomes possible to find and automatically fix the problems. For example, here I could simply search for http://xmlsoft.org/. There are advantages and disadvantages to using a generic XML validator to check HTML. One advantage is that you can separate well-formedness checking from validity checking. It is usually easier to fix well-formedness problems first, and then fix validity problems. Indeed, that is the order in which this book is organized. Well-formedness is also more important than validity. The first disadvantage of using a generic XML validator is that it won’t catch HTML-specific problems that are not specifically spelled out in the DTD. 
For instance, it won’t notice an a element nested inside another a element (though that problem doesn’t come up a lot in practice). The second disadvantage is that it will have to actually read the DTD. It doesn’t assume anything about the document it’s checking.

Using xmllint to check for well-formedness is straightforward. Just point it at the local file or remote URL you wish to check from the command line. Use the --noout option to say that the document itself shouldn’t be printed, and --loaddtd to allow entity references to be resolved. For example:

$ xmllint --noout --loaddtd http://www.aw.com
http://www.aw-bc.com/:118: parser error : Specification mandate value for attribute nowrap
^
http://www.aw-bc.com/:118: parser error : attributes construct error
^
http://www.aw-bc.com/:118: parser error : Couldn't find end of Start Tag TD line 118
^
http://www.aw-bc.com/:120: parser error : Opening and ending tag mismatch: IMG line 120 and A
Benjamin Cummings" WIDTH="84" HEIGHT="64" HSPACE="0" VSPACE="0" BORDER="0">
…

When you first run a report such as this, the number of error messages can be daunting. Don’t despair; start at the top and fix the problems one by one. Most errors fall into certain common categories which we will discuss later in the book, and you can fix them en masse. For instance, in this example, the first error is a valueless nowrap attribute. You can fix this simply by searching for nowrap and replacing it with nowrap="nowrap". Indeed, with a multifile search and replace, you can fix this problem on an entire site in less than five minutes. (I’ll get to the details of that a little later in this chapter.)

The next problem is an IMG element that uses a start-tag rather than an empty-element tag. This one isn’t quite as easy, but you can fix most occurrences by searching for BORDER="0"> and replacing it with border="0" />. That won’t catch all of the problems with IMG elements, but it will fix a lot of them.
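The two search-and-replace fixes just described can also be scripted. The following is a rough sketch rather than a finished tool: the class MarkupFixes and its method names are hypothetical, and the regular expressions are my own assumptions about what the markup looks like. The nowrap pattern skips attributes that already have a value. The IMG pattern is a deliberate variation on the literal BORDER="0"> search: it closes any img start-tag, so that other elements that happen to end in BORDER="0"> are left alone. Both are text-based, so they will also touch matching text in comments or scripts, and an attribute value containing > would confuse the second one. Run them on a copy and recheck with xmllint.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class MarkupFixes {

    // A bare "nowrap" that follows whitespace, is not followed by "=",
    // and ends at whitespace, "/" or ">".
    private static final Pattern BARE_NOWRAP = Pattern.compile(
        "(\\s)nowrap(?!\\s*=)(?=[\\s/>])", Pattern.CASE_INSENSITIVE);

    // An img tag up to its closing ">", with or without a trailing "/".
    private static final Pattern IMG_TAG = Pattern.compile(
        "(<img\\b[^>]*?)\\s*/?>", Pattern.CASE_INSENSITIVE);

    // Gives every valueless nowrap attribute the value "nowrap".
    static String fixNowrap(String html) {
        return BARE_NOWRAP.matcher(html).replaceAll("$1nowrap=\"nowrap\"");
    }

    // Rewrites img start-tags as empty-element tags; tags that are
    // already closed with "/>" come out unchanged.
    static String closeImgTags(String html) {
        return IMG_TAG.matcher(html).replaceAll("$1 />");
    }
}
```

For example, fixNowrap turns <td nowrap> into <td nowrap="nowrap">, and closeImgTags turns <IMG SRC="a.gif" BORDER="0"> into <IMG SRC="a.gif" BORDER="0" />. Neither method changes the case of element names, so that remains a separate cleanup step.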
After each change, you run the validator again. You should see fewer problems with each pass, though occasionally a new one will crop up. Simply iterate and repeat the process until there are no more well-formedness errors. It is important to start with the first error in the list, though, and not pick an error randomly. Often, one early mistake can cause multiple well-formedness problems. This is especially true for omitted start-tags and end-tags. Fixing an early problem often removes the need to fix many later ones.

Once you have achieved well-formedness, the next step is to check validity. You simply add the --valid switch on the command line, like so:

$ xmllint --noout --loaddtd --valid valid_aw.html

This will likely produce many more errors to inspect and fix, though these are usually not as critical or problematic. The basic approach is the same, though: Start at the beginning and work your way through until all the problems are solved.

Editors

Many HTML editors have built-in support for validating pages. For example, in BBEdit you can just go to the Markup menu and select Check/Document Syntax to validate the page you’re editing. In Dreamweaver, you can use the context menu that offers a Validate Current Document item. (Just make sure the validator settings indicate XHTML rather than HTML.) In essence, these tools just run the document through a parser such as xmllint to see whether it’s error-free.

If you’re using Firefox, you should install Chris Pederick’s Web Developer plug-in (https://addons.mozilla.org/firefox/60/). Once you’ve done that, you can validate any page by going to Tools/Web Developer/Tools/Validate HTML. This loads the current page in the W3C validator. The plug-in also provides a lot of other useful options in Firefox.

Whatever tool or technique you use to find the markup mistakes, validating is the first step to refactoring into XHTML. Once you see what the problems are, you’re halfway to fixing them.
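The validity check can be scripted as well as run from the command line or an editor. Here is a minimal sketch, assuming the validating SAX parser bundled with the JDK: the class name ValidityCheck is hypothetical, and to keep the example self-contained the test documents carry a tiny DTD in their internal subset instead of pointing at the real XHTML DTDs. The important detail is the error handler. Validity errors are reported through error() and are ignored by default, so the handler has to rethrow them.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
class ValidityCheck {

    // True if the markup is well-formed and valid against its own DTD.
    static boolean isValid(String markup) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // treat a validity error as a failure
                }
            });
            reader.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A document with no DOCTYPE at all is also reported as invalid by this sketch, because a validating parser has no grammar to check it against.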
Continued tomorrow…

Chapter 2: Tools

Elliotte Rusty Harold — Thu, 12 Jun 2008 14:00:26 +0000

Today we start Chapter 2 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

Automatic tools are a critical component of refactoring. Although you can perform most refactoring manually with a text editor, and although I will sometimes demonstrate refactoring that way for purposes of illustration, in practice we almost always use software to help us. To my knowledge no major refactoring browsers are available for HTML at the time of this writing. However, a lot of tools can assist in many of the processes. In this section, I’ll explain some of them.

Backups, Staging Servers, and Source Code Control

Throughout this book, I’m going to show you some very powerful tools and techniques. As the great educator Stan Lee taught us, “With great power comes great responsibility.” Your responsibility is to not irretrievably break anything while using these techniques. Some of the tools I’ll show can misbehave. Some of them have corner cases where they get confused. A huge amount of bad HTML is out there, not all of which the tools discussed here have accounted for. Consequently, refactoring HTML requires at least a five-step process.

1. Identify the problem.
2. Fix the problem.
3. Verify that the problem has been fixed.
4. Check that no new problems have been introduced.
5. Deploy the solution.

Because things can go wrong, you should not use any of these techniques on a live site. Instead, make a local copy of the site before making any changes. After making changes to your local copy, carefully verify all pages once again before you deploy.

Most large sites today already use staging or development servers where content can be deployed and checked before the public sees it. If you’re just working on a small personal static site, you can make a local copy on your hard drive instead; but by all means work on a copy and check the changes before deploying them.
How to check the changes is the subject of the next section. Of course, even with the most careful checks, sometimes things slip by and are first noticed by end-users. Sometimes a site works perfectly well on a staging server and has weird problems on the production server due to unrecognized configuration differences. Thus, it’s a very good idea to have a full and complete backup of the production site that you can restore to in case the newly deployed site doesn’t behave as expected. Regular, reliable, tested backups are a must. Finally, you should very seriously consider storing all your code, including all your HTML, CSS, and images, in a source code control system. Programmers have been using source code control for decades, but it’s a relatively uncommon tool for web developers and designers. It’s time for that to change. The more complex a site is, the more likely it is that subtle problems will slip in unnoticed at first. When refactoring, it is critical to be able to go back to previous versions, maybe even from months or years ago, to find out which change introduced a bug. Source code control also provides timestamped backups so that it’s possible to revert your site to its state at any given point in time. I strongly recommend Subversion for web development, mostly because of its strong support for moving files from one directory to another, though its excellent Unicode support and decent support for binary files are also helpful. Most source code control systems are set up for programmers who rarely bother to move files from one directory to another. By contrast, web developers frequently reorganize site structures (more frequently than they should, in fact). Consequently, a system really needs to be able to track histories across file moves. 
If your organization has already set up some other source code control system such as CVS, Visual SourceSafe, ClearCase, or Perforce, you can use that system instead; but Subversion is likely to work better and cause you fewer problems in the long run. The topic of managing Subversion could easily fill a book on its own; and indeed, several such books are available. (My favorite is Pragmatic Version Control Using Subversion by Mike Mason [The Pragmatic Bookshelf, 2006].) Many large sites hire people whose sole responsibility is to manage the source code control repository. However, don’t be scared off. Ultimately, setting up Subversion or another source code control repository is no harder than setting up Apache or another web server. You’ll need to read a little documentation. You’ll need to tweak some config files, and you may need to ask for help from a newsgroup or conduct a Google search to get around a rough spot. However, it’s eminently doable, and it’s well worth the time invested. You can check files into or out of Subversion from the command line if necessary. However, life is usually simpler if you use an editor such as BBEdit that has built-in support for Subversion. Plug-ins are available that add Subversion support to editors such as Dreamweaver that don’t natively support it. Furthermore, products such as TortoiseSVN and SCPlugin are available that integrate Subversion support directly into Windows Explorer or the Mac Finder. Some content management systems (CMSs) have built-in version control. If yours does, you may not need to use an external repository. For instance, MediaWiki stores a record of all changes that have been made to all pages. It is possible at any point to see what any given page looked like at any moment in time and to revert to that appearance. This is critical for MediaWiki’s model installation at Wikipedia, where vandalism is a real problem. 
However, even private sites that are not publicly editable can benefit greatly from a complete history of the site over time. Although Wikis are the most common use of version control on the Web, some other CMSs such as Siteline also bundle this functionality.

Objections to Refactoring

Elliotte Rusty Harold — Tue, 10 Jun 2008 14:24:15 +0000

Here’s part 8 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

It is not uncommon for people ranging from the CEO to managers to HTML grunts to object to the concept of refactoring. The concern is expressed in many ways, but it usually amounts to this:

We don’t have the time to waste on cleaning up the code. We have to get this feature implemented now!

There are two possible responses to this comment. The first is that refactoring saves time in the long run. The second is that you have more time than you think you do. Both are true.

Refactoring saves time in the long run, and often in the short run because clean code is easier to fix and easier to maintain. It is easier to build on bedrock than quicksand. It is much easier to find bugs in clean code than in opaque code. In fact, when the source of a bug doesn’t jump out at me, I usually begin to refactor until it does. The process of refactoring changes both the code itself and my view of the code. Refactoring can move my eyes into new places, and allow me to see old code in ways I didn’t see it before.

Of course, for maximum time savings, it’s important to automate as much of the refactoring as possible. This is why I’m going to emphasize tools such as Tidy and TagSoup, as well as simple, regular-expression-based solutions. Although some refactorings require significant human effort (converting a site from tables to CSS layouts, for example), many others can be done with the click of a button (converting a static page to well-formed XHTML, for example). Many refactorings lie somewhere in between.
Less well recognized is that a lot more time is actually available for refactoring than most managers count on their timesheets. Writing new code is difficult, and it requires large blocks of uninterrupted time. A typical work day filled with e-mail, phone calls, meetings, smoking breaks, and so on sadly doesn’t offer many long, uninterrupted blocks in which developers can work. By contrast, refactoring is not hard. It does not require large blocks of uninterrupted time. Sixty minutes of refactoring done in six-minute increments at various times during the day has close to the same impact as one 60-minute block of refactoring. Sixty minutes of uninterrupted time is barely enough to even start to code, though, and six-minute blocks are almost worthless for development.

It’s also worth taking developers’ moods into account. The simple truth is that you’re more productive at some times than at other times. Sometimes you can bang out hundreds of lines of code almost as fast as you can type, and other times it’s an effort to force your fingers to the keyboard. Sometimes you’re in the zone, completely focused on the task at hand. Other times you’re distracted by an aching tooth, an upcoming client meeting, and your weekend plans. Coding, design, and other complex tasks don’t work well when you’re distracted; but refactoring isn’t a complex task. It’s an easy task. Refactoring enables you to get something done and move forward, even when you’re operating at significantly reduced efficiency.

Perhaps most important, I find that when I am operating at less than peak efficiency, refactoring enables me to pick up speed and reach the zone. It’s a way to dip a toe into the shallow end of the pool and acclimatize to the temperature before plunging all the way in. Taking on a simple task such as refactoring allows me to mentally move into the zone to work on more challenging, larger problems.

Getting Things Done

Refactoring is not unique in this, by the way.
There are a lot of productive tasks you can use to nudge yourself into the zone. Writing tests, measuring and improving code coverage, fixing known bugs, using static code analyzers, and even spellchecking can help you to be productive and get things done when you just aren’t in the mood to perform major tasks. The key is to not become blocked on any one task. Always have something else (ideally several something elses) ready to go at any time. Sometimes you just need to find the task that fits the time rather than finding the time to fit the task.

Refactoring is really a classic case of working smarter, not harder. Although that maxim can be a cliché ripe for Dilbert parody, it really does apply here.

This concludes Chapter 1. Chapter 2 commences tomorrow.

Why REST?

Elliotte Rusty Harold — Mon, 09 Jun 2008 13:18:55 +0000

Here’s part 7 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

Representational State Transfer (REST) is the oldest and yet least familiar of the three refactoring goals I present here. Although I’ll mostly focus on HTML in this book, one can’t ignore the protocol by which HTML travels. That protocol is HTTP, and REST is the architecture of HTTP. (To be pedantic, REST is actually the architectural style by which HTTP is designed.)

Understanding HTTP and REST has important consequences for how you design web applications. Anytime you place a form in a page, or use AJAX to send data back and forth to a JavaScript program, you’re using HTTP. Use HTTP correctly and you’ll develop robust, secure, scalable applications. Use it incorrectly and the best you can hope for is a marginally functional system. The worst that can happen, however, is pretty bad: a web spider that deletes your entire site, a shopping center that melts down under heavy traffic during the Christmas shopping season, or a site that search engines can’t index and users can’t find.
Although basic static HTML pages are inherently RESTful, most web applications that are more complex are not. In particular, you must consider REST anytime your application involves the following common things:

Forms
User authentication
Cookies
Sessions
State

These are very easy to get wrong, and more applications to this day get them wrong than right. The Web is not a LAN. The techniques that worked for limited client/server systems of a few dozen to a few hundred users do not scale to web systems that must accommodate thousands to millions of users. Client/server architectures based on sessions and persistent connections are simply not possible on the Web. Attempts to re-create them fail at scale, often with disastrous consequences.

REST, as implemented in HTTP, has several key ideas. In brief:

All Resources Are Identified by URLs

Tagging distinct resources with distinct URLs enables bookmarking, linking, search engine storage, and painting on billboards. It is much easier to find a resource when you can say, “Go to http://www.example.com/foo/bar” than when you have to say, “Go to http://www.example.com/. Type ‘bar’ into the form field. Then press the foo button.”

Do not be afraid of URLs. Most resources should be identified only by URLs. For example, a customer record should have a URL such as http://example.com/patroninfo/username rather than http://example.com/patroninfo. That is, each customer should have a separate URL that links directly to their record (protected by a password, of course), rather than all your customers sharing a single URL whose content changes depending on the value of some login cookie.

Safe, Side-Effect-Free Operations Such As Querying or Browsing Operate via GET

Google can only index pages that are accessed via GET. Users can only bookmark pages that are accessed via GET. Other sites can only link to pages with GET. If you care about raising your site traffic at all, you need to make as much of it as possible accessible via GET.
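The two ideas so far, one URL per resource and read-only access via GET, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PATRONS dictionary, the handle_patron_request function, and the record fields are all hypothetical, and the password protection mentioned above is left out to keep the sketch short.

```python
from urllib.parse import urlsplit

# Hypothetical in-memory records; a real site would use a database.
PATRONS = {"alice": {"name": "Alice", "books_out": 2}}


def handle_patron_request(method, url):
    """Return (status, body) for a request to /patroninfo/<username>.

    Each customer record has its own URL, and GET only reads data.
    """
    parts = [p for p in urlsplit(url).path.split("/") if p]
    if len(parts) != 2 or parts[0] != "patroninfo":
        return 404, "Not Found"
    if method != "GET":
        # This sketch exposes the record for reading only.
        return 405, "Method Not Allowed"
    record = PATRONS.get(parts[1])
    if record is None:
        return 404, "Not Found"
    return 200, "{}: {} books out".format(record["name"], record["books_out"])


if __name__ == "__main__":
    print(handle_patron_request("GET", "http://example.com/patroninfo/alice"))
```

Because the GET branch never touches PATRONS except to read it, a search engine, a bookmark, or a link from another site can fetch the URL any number of times without changing anything.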
Nonsafe Operations Such As Purchasing an Item or Adding a Comment to a Page Operate via POST

Web spiders routinely follow links on a page that are accessible via GET, sometimes even when they are told not to. Users type URLs into browser location bars, and then edit them to see what happens. Browsers prefetch linked pages. If an operation such as deleting content, agreeing to a contract, or placing an order is performed via GET, some program somewhere is going to do it without asking or consulting an actual user, sometimes with disastrous consequences. Entire sites have disappeared when Google discovered them and began to follow “delete this page” links, all because GET was used instead of POST.

Each Request Is Independent of All Others

The client and server may each have state, but neither relies on the other side remembering what its state is. All necessary information is transferred in each communication. Statelessness enables scalability through caching and proxy servers. It also enables a server to be easily replaced by a server farm as necessary. There’s no requirement that the same server respond to the same client two times in a row.

Robust, scalable web applications work with HTTP rather than against it. RESTful applications can do everything that more familiar client/server applications do, and they can do it at scale. However, implementing this may require some of the largest changes to your systems. Nonetheless, if you’re experiencing scalability problems, these can be among the most critical refactorings to make.

Continued tomorrow…
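As a rough illustration of the GET/POST split and of stateless requests described above, here is a minimal Python sketch. The PAGES store, the handle_request function, the /delete URL suffix, and the confirm form field are all hypothetical names invented for the example; they are not taken from any particular framework.

```python
# Hypothetical in-memory content store; a real site would use a database.
PAGES = {"/pages/1": "First page", "/pages/2": "Second page"}


def handle_request(method, path, form=None):
    """Return (status, body) for one request.

    Everything needed to answer arrives in the arguments, so no
    per-client session has to be remembered between calls.
    """
    form = form or {}
    if path.endswith("/delete"):
        target = path[: -len("/delete")]
        if method != "POST":
            # A spider or prefetcher following this link with GET
            # changes nothing.
            return 405, "Method Not Allowed"
        if target not in PAGES:
            return 404, "Not Found"
        if form.get("confirm") != "yes":
            return 400, "Confirmation required"
        del PAGES[target]
        return 200, "Deleted " + target
    if method == "GET" and path in PAGES:
        # Safe operation: reads only.
        return 200, PAGES[path]
    return 404, "Not Found"


if __name__ == "__main__":
    print(handle_request("GET", "/pages/1/delete"))
    print(handle_request("POST", "/pages/1/delete", {"confirm": "yes"}))
```

Since the function keeps no memory of earlier calls, any copy of it running on any server in a farm could answer the next request, provided all copies share the same underlying store.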

New Test
open	/
type	q	elliotte
clickAndWait	btnG
verifyTextPresent	www.elharo.com/