Chapter 2: Tools

Today we start Chapter 2 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

Automatic tools are a critical component of refactoring. Although you can perform most refactoring manually with a text editor, and although I will sometimes demonstrate refactoring that way for purposes of illustration, in practice we almost always use software to help us. To my knowledge, no major refactoring browsers are available for HTML at the time of this writing. However, many tools can assist with parts of the process. In this section, I’ll explain some of them.

Backups, Staging Servers, and Source Code Control

Throughout this book, I’m going to show you some very powerful tools and techniques. As the great educator Stan Lee taught us, “With great power comes great responsibility.” Your responsibility is to not irretrievably break anything while using these techniques. Some of the tools I’ll show can misbehave. Some of them have corner cases where they get confused. A huge amount of bad HTML is out there, not all of which the tools discussed here have accounted for. Consequently, refactoring HTML requires at least a five-step process.

  1. Identify the problem.
  2. Fix the problem.
  3. Verify that the problem has been fixed.
  4. Check that no new problems have been introduced.
  5. Deploy the solution.

Because things can go wrong, you should not use any of these techniques on a live site. Instead, make a local copy of the site before making any changes. After making changes to your local copy, carefully verify all pages once again before you deploy.
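As a minimal sketch of "work on a copy," the following Python snippet duplicates a site directory before any refactoring tool touches it. The function name `make_working_copy` and the directory names are hypothetical, chosen only for illustration; any copy method (rsync, your file manager, a checkout from source control) serves the same purpose.

```python
import shutil
import tempfile
from pathlib import Path


def make_working_copy(site_dir: Path, work_dir: Path) -> Path:
    """Copy the whole site tree to work_dir and return the copy's path.

    Refuses to overwrite an existing directory so that an earlier
    working copy (possibly with unfinished edits) is never clobbered.
    """
    if work_dir.exists():
        raise FileExistsError(f"{work_dir} already exists; refusing to overwrite it")
    shutil.copytree(site_dir, work_dir)
    return work_dir


if __name__ == "__main__":
    # Demonstration on a throwaway directory, not a real site.
    with tempfile.TemporaryDirectory() as tmp:
        site = Path(tmp) / "site"
        site.mkdir()
        (site / "index.html").write_text("<p>Hello</p>", encoding="utf-8")
        copy = make_working_copy(site, Path(tmp) / "site-working-copy")
        print(sorted(p.name for p in copy.iterdir()))
```

All refactoring then happens inside the working copy; the original tree stays untouched until you have verified the results.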

Most large sites today already use staging or development servers where content can be deployed and checked before the public sees it. If you’re just working on a small personal static site, you can make a local copy on your hard drive instead; but by all means work on a copy and check the changes before deploying them. How to check the changes is the subject of the next section.

Of course, even with the most careful checks, sometimes things slip by and are first noticed by end-users. Sometimes a site works perfectly well on a staging server and has weird problems on the production server due to unrecognized configuration differences. Thus, it’s a very good idea to have a full and complete backup of the production site that you can restore to in case the newly deployed site doesn’t behave as expected. Regular, reliable, tested backups are a must.
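One way to approach "regular, reliable, tested backups" is sketched below in Python: archive the site under a timestamped name, then reopen the archive to confirm that the files you expect are really in it. The function names `backup_site` and `verify_backup` and the archive naming scheme are assumptions made for this example, not part of any particular backup product, and a real production backup would also need to cover databases and server configuration.

```python
import shutil
import tarfile
from datetime import datetime, timezone
from pathlib import Path, PurePosixPath


def backup_site(site_dir: Path, backup_dir: Path) -> Path:
    """Write site_dir to a timestamped .tar.gz in backup_dir; return its path.

    backup_dir should live outside site_dir, or the archive would
    try to include itself.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    base_name = backup_dir / f"site-{stamp}"
    archive = shutil.make_archive(str(base_name), "gztar", root_dir=str(site_dir))
    return Path(archive)


def verify_backup(archive: Path, expected_names: set) -> bool:
    """Return True if the archive opens and contains every expected file.

    expected_names holds paths relative to the site root, e.g. "index.html".
    """
    with tarfile.open(archive, "r:gz") as tar:
        # Member names may carry a leading "./"; PurePosixPath drops it.
        names = {str(PurePosixPath(member.name)) for member in tar.getmembers()}
    return expected_names <= names
```

A check like `verify_backup(archive, {"index.html"})` only proves that the archive is readable and complete enough to list; the real test of a backup is still an occasional full restore to a staging server.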

Finally, you should very seriously consider storing all your code, including all your HTML, CSS, and images, in a source code control system. Programmers have been using source code control for decades, but it’s a relatively uncommon tool for web developers and designers. It’s time for that to change. The more complex a site is, the more likely it is that subtle problems will slip in unnoticed at first. When refactoring, it is critical to be able to go back to previous versions, maybe even from months or years ago, to find out which change introduced a bug. Source code control also provides timestamped backups so that it’s possible to revert your site to its state at any given point in time.
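To illustrate why a complete history matters, here is a rough Python sketch of a binary search through old revisions for the one that introduced a bug. It assumes the revisions are listed oldest to newest, that the first one is known to be good and the last one known to be broken, and that the bug persists once introduced. The `is_broken` callback is hypothetical: in practice it would check out the given revision and run whatever check exposes the problem.

```python
from typing import Callable, Sequence


def first_bad_revision(
    revisions: Sequence[int],
    is_broken: Callable[[int], bool],
) -> int:
    """Return the earliest revision for which is_broken() is True.

    Assumes revisions[0] is good, revisions[-1] is broken, and every
    revision after the first broken one is also broken.
    """
    good, bad = 0, len(revisions) - 1  # indexes of a known-good and known-bad revision
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_broken(revisions[mid]):
            bad = mid   # the bug was introduced at mid or earlier
        else:
            good = mid  # the bug was introduced after mid
    return revisions[bad]


if __name__ == "__main__":
    # Stand-in check: pretend the bug first appeared in revision 113.
    revisions = list(range(100, 121))
    print(first_bad_revision(revisions, lambda rev: rev >= 113))
```

Each step halves the range, so even a history of a thousand revisions needs only about ten checkouts to locate the offending change.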

I strongly recommend Subversion for web development, mostly because of its strong support for moving files from one directory to another, though its excellent Unicode support and decent support for binary files are also helpful. Most source code control systems are set up for programmers who rarely bother to move files from one directory to another. By contrast, web developers frequently reorganize site structures (more frequently than they should, in fact). Consequently, a system really needs to be able to track histories across file moves. If your organization has already set up some other source code control system such as CVS, Visual SourceSafe, ClearCase, or Perforce, you can use that system instead; but Subversion is likely to work better and cause you fewer problems in the long run.
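As a small sketch of what tracking history across a move looks like, the Python helper below builds (and optionally runs) the Subversion commands for relocating a file: `svn move`, `svn commit`, and then `svn log` on the new path, which by default follows the file's history back through the move. The helper name `move_tracked` and the example paths are hypothetical, and the sketch assumes the `svn` client is installed, the current directory is a working copy, and the destination directory is already under version control.

```python
import subprocess
from typing import List


def move_tracked(src: str, dst: str, message: str, dry_run: bool = True) -> List[List[str]]:
    """Build the svn commands that move src to dst while keeping its history.

    With dry_run=True (the default) the commands are only returned so
    they can be reviewed; with dry_run=False they are also executed.
    """
    commands = [
        ["svn", "move", src, dst],               # schedule the move in the working copy
        ["svn", "commit", "-m", message],        # record the move in the repository
        ["svn", "log", dst],                     # history, including changes made before the move
    ]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)  # stop at the first failure
    return commands


if __name__ == "__main__":
    for command in move_tracked("about.html", "company/about.html",
                                "Move about page into company section"):
        print(" ".join(command))
```

Moving the file with an ordinary `mv` or by dragging it in a file manager would instead look to Subversion like a deletion plus an unrelated new file, which is exactly the loss of history the paragraph above warns about.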

The topic of managing Subversion could easily fill a book on its own; and indeed, several such books are available. (My favorite is Pragmatic Version Control Using Subversion by Mike Mason [The Pragmatic Bookshelf, 2006].) Many large sites hire people whose sole responsibility is to manage the source code control repository. However, don’t be scared off. Ultimately, setting up Subversion or another source code control repository is no harder than setting up Apache or another web server. You’ll need to read a little documentation. You’ll need to tweak some config files, and you may need to ask for help from a newsgroup or conduct a Google search to get around a rough spot. However, it’s eminently doable, and it’s well worth the time invested.

You can check files into or out of Subversion from the command line if necessary. However, life is usually simpler if you use an editor such as BBEdit that has built-in support for Subversion. Plug-ins are available that add Subversion support to editors such as Dreamweaver that don’t natively support it. Furthermore, products such as TortoiseSVN and SCPlugin are available that integrate Subversion support directly into Windows Explorer or the Mac Finder.

Some content management systems (CMSs) have built-in version control. If yours does, you may not need to use an external repository. For instance, MediaWiki stores a record of all changes that have been made to all pages. It is possible at any point to see what any given page looked like at any moment in time and to revert to that version. This is critical for MediaWiki’s model installation at Wikipedia, where vandalism is a real problem. However, even private sites that are not publicly editable can benefit greatly from a complete history of the site over time. Although wikis are the most common use of version control on the Web, some other CMSs such as Siteline also bundle this functionality.

2 Responses to “Chapter 2: Tools”

  1. John Cowan Says:

    Oh please, please tell them not to use Visual SourceSafe! Even @#$% Microsoft doesn’t use VSS any more — that is one can of dog food that is way past its pull date. It stinks on ice. Can you say “random database corruption”?

    If they are using VSS, tell them the first priority is to switch to Subversion before refactoring anything.

  2. Samuel A. Falvo II Says:

    I find distributed version control (e.g., Mercurial or Git) to be overwhelmingly superior to Subversion. It is my personal opinion that anyone considering the use of Subversion should look into a DVCS, for the simple reason that it allows collaborators to share an ad-hoc repository amongst themselves without affecting the rest of the organization. When merges are required, the release manager can then pull directly from the ad-hoc repository just as if it were any other employee’s repository. This also greatly facilitates code reviews, particularly amongst geographically dispersed coding centers.

    In fact, using a DVCS with a _pull_ (rather than _push_) policy results in all sorts of benefits from the release manager’s (RM’s) point of view. First, the RM always pulls from trusted sources. This means that any change has to go through at least one trusted engineer before it lands in the RM’s patch queue, which pretty much enforces a policy of peer code review. Second, the pull policy reduces the process documentation and certification effort (and thus the training) needed for commit access, which brings new hires up to speed faster.