Regular Expressions
Here’s part 12 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.
Manually inspecting and changing each file on even a small site is tedious and often cost-prohibitive. It is much more effective to let the computer do the work by searching for mistakes and, when possible, automatically fixing them. A number of tools support this, including command-line tools such as grep, egrep, and sed; text editors such as jEdit, BBEdit, TextPad, and PSPad; and programming languages such as Java, Perl, and PHP. All these tools provide a specialized search syntax known as regular expressions. Although there are small differences from one tool to the next, the basic regular expression syntax is much the same.
For purposes of illustration, I’m going to use the jEdit text editor as my search and replace tool in this section. I chose it because it provides pretty much all the features you need, it has a reasonable GUI, it’s open source, and it’s written in Java, so it runs on essentially any platform you’re likely to want. You can download a copy from http://jedit.org/.
However, the techniques I’m showing here are by no means limited to that one editor. In my work, I normally use BBEdit instead because it has a slightly nicer interface. However, it’s payware and only runs on the Mac. There are numerous other choices. If you prefer a different program, by all means use it. What you’ll need are:
- Full regular expression search and replace
- The ability to recursively search a directory
- The ability to filter the files you search
- A tool that shows you what it has changed, but does not require you to manually approve each change
- Automatic recognition of different character encodings and line-ending conventions
Any tool that meets these criteria should be sufficient.
Searching
The first goal of a regular expression is to find things that may be wrong. For example, I recently noticed that I had mistyped some dates as 20066 instead of 2006 in one of my files. That’s an error that’s likely to have happened more than once, so I checked for it by searching for that string.
In jEdit, you perform a multifile search using the Search/Search in Directory menu item. Selecting this menu item brings up the dialog shown in Figure 2.6. This is normally configured more or less as shown here.
- The string you’re searching for (the target string) goes in the first text field.
- The string that will replace the target string goes in the second text field. Here I’m just going to find, not replace, so I haven’t entered a replacement string.
- The Directory radio button is checked to indicate that you’re going to search multiple files. You can also search just in the current file, or even the current selection.
- The filter is set to *.html to search only those files that end in .html. You can modify this to search different kinds of files or subsets of files. For instance, I often want to search only my old news files, which are named news2000.html, news2001.html, news2002.html, and so on. In that case, I would set the filter to news2.*html. I could search even older files including news1999.html by rewriting the filter regular expression in a form such as news\d\d\d\d.html.
- I specify the directory where I’ve stored my local copy of the files I’m searching. In my case, this is /Users/elharo/Cafe au Lait/javafaq.
- “Search subdirectories” is checked. If it weren’t, jEdit would search only the javafaq directory, but not any directories that directory contains.
- “Keep dialog” is checked. This keeps the dialog box open after the search is completed.
- “Ignore case” is checked. This will allow the regular expression to match regardless of case. This isn’t always what you want, but more often than not it is.
- “Regular expressions” is checked. You don’t need to check this when you’re only searching for a constant string, as here. However, most searches are more complex than that.
- HyperSearch is checked. This will bring up a window showing all matches, rather than just finding the next match.
Figure 2.6: jEdit multifile search
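If you would rather script the search than use an editor, the same settings can be approximated in a few lines of Python. This is only a rough sketch: the function name search_directory is made up for illustration, and it assumes the files are UTF-8 (undecodable bytes are replaced rather than detected, which falls short of the automatic encoding recognition listed earlier).

```python
import fnmatch
import os
import re


def search_directory(root, pattern, file_filter="*.html", ignore_case=True):
    """Return (path, line_number, line) for every line that matches pattern."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    hits = []
    # os.walk descends into subdirectories, like "Search subdirectories".
    for dirpath, _dirnames, filenames in os.walk(root):
        # fnmatch.filter applies a glob such as *.html to the file names.
        for name in fnmatch.filter(filenames, file_filter):
            path = os.path.join(dirpath, name)
            # Assumption: UTF-8 input; bad bytes are replaced so the scan continues.
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if regex.search(line):
                        hits.append((path, number, line.rstrip("\n")))
    return hits
```

Calling search_directory("javafaq", "20066") would collect every matching line under that directory, much as HyperSearch lists all matches instead of stopping at the first one.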
Fortunately, that particular problem seems to have been isolated. However, I also recently noticed another, more serious problem. For some unknown reason, I had managed to write links with double equals signs, as shown here, throughout one of my sites:
<a href=="../../index.html">Cafe au Lait</a>
Consequently, links were breaking all over the place. The first step was to find out how broad the problem was. In this case, the mistaken string was constant, and was unlikely to appear in correct text, so it was easy to search for. This problem turned up 4,475 times in 476 files, as shown in the HyperSearch results in Figure 2.7.
Figure 2.7: jEdit search results
When there aren’t a lot of mistakes, you can click on each one to open the document and fix it manually. Sometimes this is needed. Sometimes this is even the easiest solution. However, when there are thousands of mistakes, you have to fix them with a tool. In this case, the solution is straightforward. Put href= in the “Replace with” field; then click the “Replace all” button.
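The same replacement can be sketched in Python with re.subn, which returns the number of substitutions along with the new text, so you can compare the count against what the search reported. The function name fix_double_equals is purely illustrative.

```python
import re


def fix_double_equals(html):
    """Replace every href== with href= and return (fixed_text, count)."""
    return re.subn(r"href==", "href=", html)


fixed, count = fix_double_equals('<a href=="../../index.html">Cafe au Lait</a>')
print(fixed)  # <a href="../../index.html">Cafe au Lait</a>
print(count)  # 1
```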
Do be careful when performing this sort of operation, though. A small mistake can cause bigger problems. A bad search and replace likely caused this problem in the first place. You should test your regular expression search and replace on a few files first before trying it on an entire site.
Most important, always work on a backup copy of the site; always run your test suite after each change; and always spot-check at least some of the files that have been changed to make sure nothing went wrong. If something does go wrong, an editor with undo capability can be very useful. Not all editors support multifile undo with a buffer that’s large enough to handle thousands of changes. If yours doesn’t, be ready to delete your working copy and replace it with the original in case the search goes wrong. As with any other complex bit of code, you sometimes have to try several times to fully debug a regular expression.
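One way to follow this advice in a script is to operate only on a copy and to preview the changes before writing anything. The sketch below is an illustration under stated assumptions, not a finished tool: replace_in_copy and its parameters are invented names, the files are assumed to be UTF-8, and only names ending in .html are touched.

```python
import os
import re
import shutil


def replace_in_copy(site_dir, work_dir, pattern, replacement, dry_run=True):
    """Copy site_dir to work_dir, then replace pattern in the copy's .html files.

    Returns {path: replacement_count} for the files that would change
    (dry run) or did change. The original site_dir is never modified.
    """
    shutil.copytree(site_dir, work_dir)  # raises an error if work_dir already exists
    regex = re.compile(pattern)
    changed = {}
    for dirpath, _dirnames, filenames in os.walk(work_dir):
        for name in filenames:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            # newline="" preserves the file's own line endings on read and write.
            with open(path, encoding="utf-8", newline="") as handle:
                text = handle.read()
            fixed, count = regex.subn(replacement, text)
            if count:
                changed[path] = count
                if not dry_run:
                    with open(path, "w", encoding="utf-8", newline="") as handle:
                        handle.write(fixed)
    return changed
```

Run it once with the default dry_run=True and read the per-file counts. If they look right, delete the working copy and run again with dry_run=False, then spot-check a few of the changed files.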
Search Patterns
Often, you don’t know exactly what you’re searching for, but you do know its general pattern. For example, if you’re searching for years in the recent past, you might want to find any four-digit number beginning with 200. You may want to search for attribute name=value pairs, but you’re not sure whether they’re in the format name=value, name='value', or name="value". You may want to search for all <p> start-tags, whether they have attributes or not. These are all good candidates for regular expressions.
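To make the last two cases concrete, here is one possible pair of patterns, written in Python. Neither pattern comes from the text. They are hypothetical starting points, and both can be fooled by unusual markup such as a > character inside an attribute value.

```python
import re

# Hypothetical pattern: one attribute written as name=value, name='value',
# or name="value". Group 1 is the name; group 2 is the value, quotes included.
ATTRIBUTE = re.compile(r"""([\w:-]+)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)""")

# Hypothetical pattern: a <p> start-tag with or without attributes.
# The \b keeps it from matching other tags that start with p, such as <pre>.
P_START_TAG = re.compile(r"<p\b[^>]*>", re.IGNORECASE)

tag = """<a href=index.html title='Home' class="nav">"""
print(ATTRIBUTE.findall(tag))
# [('href', 'index.html'), ('title', "'Home'"), ('class', '"nav"')]

page = '<p>one</p><P class="x">two</p><pre>three</pre>'
print(P_START_TAG.findall(page))
# ['<p>', '<P class="x">']
```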
In a regular expression, certain characters and patterns stand in for a set of other characters. For example, \d means any digit. Thus, to search for any year from 2000 to 2009, one could use the regular expression 200\d. This would match 2000, 2001, 2002, and so on through 2009.
However, the regular expression 200\d also matches 12000, 200032, 12320056, and other strings that are probably not years at all. (To be precise, it matches the substrings in the form 200\d, not the entire string.) Thus, you might want to indicate that the string you’re matching must be preceded and trailed by whitespace of some kind. The metacharacter \s matches whitespace, so we can now rewrite the expression as \s200\d\s to match only those strings that look like years in this decade.
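A quick experiment in Python shows the difference between the two expressions. The sample sentence is invented for illustration.

```python
import re

text = "In 2006 we sold 12000 units; item 200032 shipped in 2007."

loose = re.findall(r"200\d", text)
strict = re.findall(r"\s200\d\s", text)

print(loose)   # ['2006', '2000', '2000', '2007']
print(strict)  # [' 2006 ']
```

The stricter pattern drops the false matches inside 12000 and 200032, but it also includes the surrounding whitespace in the match and misses 2007, which is followed by a period instead of a space.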
Of course, there’s still no guarantee that every string you match in this form is a year. It could be a price, a population, a score, a movie title, or something else. You’ll want to scan the list of matches to verify that it is what you expect. False positives are a real concern, especially for simple cases such as this. However, it’s normally possible to either further refine the regular expression to avoid any false positives or manually remove the accidental matches.
There is usually more than one way to write a search. For instance, we could write this one as \b200\d\b. The metacharacter \b matches the beginning or end of a word, without actually selecting any characters, so the match does not include the whitespace around the year. This would also allow us to recognize a year that came at the end of a sentence right before a period, as in “This is 2008”. However, it can’t distinguish periods from decimal points and would also match the 2005 in 2005.3124.
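The word-boundary behavior is easy to check in Python. The second pattern below is not from the text. It is one possible refinement, using a negative lookahead to reject a match that is followed by a decimal point and a digit, and it assumes a regular expression engine that supports lookahead.

```python
import re

text = "This is 2008. The ratio was 2005.3124, not 12000."

word_bounded = re.findall(r"\b200\d\b", text)
# Assumed refinement: (?!\.\d) fails when ".digit" follows the match.
no_decimals = re.findall(r"\b200\d\b(?!\.\d)", text)

print(word_bounded)  # ['2008', '2005']
print(no_decimals)   # ['2008']
```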
You could even simply list the years separated by the OR operator, |, like so:
2000|2001|2002|2003|2004|2005|2006|2007|2008|2009
However, this still has the word boundary problems of the previous matches.
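A short Python check, using an invented sample string, confirms that the alternation behaves like 200\d here, including the unwanted match inside a longer number.

```python
import re

# Build 2000|2001|...|2009 rather than typing it out.
years_pattern = "|".join(str(year) for year in range(2000, 2010))

sample = "12000 visitors in 2006"
print(re.findall(years_pattern, sample))  # ['2000', '2006']
print(re.findall(r"200\d", sample))       # ['2000', '2006']
```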
Sometimes you stop with a search. In particular, if the content is generated automatically from a CMS, template page, or other program, the search is used merely to find bugs: places where the program is generating incorrect markup. You then must change the program to generate correct markup. If this is the case, false positives don’t worry you nearly so much because all changes will be performed manually anyway. The search only identifies the bug. It doesn’t fix it.
If you don’t stop with a search, and you go on to a replacement, you need to be cautious. Regular expressions can be tricky, and ones involving HTML are often much trickier than the textbook examples. Nonetheless, they are invaluable tools in cleaning up HTML.
Note
If you don’t have a lot of experience with regular expressions, please refer to Appendix 1 for many more examples. I also recommend Mastering Regular Expressions, 3rd Edition, by Jeffrey E. F. Friedl (O’Reilly, 2006).
June 25th, 2008 at 4:40 pm
JEdit has been my favourite editor for several years. Its search and replace functions are really nice, but there’s one thing you need to be careful about: when you replace text in files, the editor loads all the modified files. If the number of files is very large, this behaviour can choke the editor. So always perform a hypersearch to check the number of files before you start batch-replacing strings. For string replacements across hundreds of files, there are other tools; batch replacement across large numbers of files also works quite well in Eclipse.