TagSoup

June 27th, 2008

Here’s part 14 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

John Cowan’s TagSoup (http://home.ccil.org/~cowan/XML/tagsoup/) is an open source HTML parser written in Java that implements the Simple API for XML, or SAX. Cowan describes TagSoup as “a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.”

Tidy

June 26th, 2008

Here’s part 13 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

Regular expressions are well and good for individual, custom changes, but they can be tedious and difficult to use for large quantities of changes. In particular, they are designed more to work with plain text than with semistructured HTML text. For batch changes and automated corrections of common mistakes, it helps to have tools that take advantage of the markup in HTML. The first such tool is Dave Raggett’s Tidy (www.w3.org/People/Raggett/tidy/), the original HTML fixer-upper. It’s a simple, multiplatform command-line program that can correct most HTML mistakes.

Regular Expressions

June 22nd, 2008

Here’s part 12 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

Manually inspecting and changing each file on even a small site is tedious and often cost-prohibitive. It is much more effective to let the computer do the work by searching for mistakes and, when possible, automatically fixing them. A number of tools support this, including command-line tools such as grep, egrep, and sed; text editors such as jEdit, BBEdit, TextPad, and PSPad; and programming languages such as Java, Perl, and PHP. All these tools provide a specialized search syntax known as regular expressions. Although there are small differences from one tool to the next, the basic regular expression syntax is much the same.
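To make the idea concrete, here is a minimal sketch of a regular expression search in Python’s standard re module. The pattern, the function name find_imgs_missing_alt, and the sample markup are illustrative assumptions rather than anything from the book. The pattern is deliberately rough: it would be fooled by an attribute such as data-alt or by a > inside an attribute value.

```python
import re

# Hypothetical pattern: an <img ...> tag whose attributes do not include alt=.
# The negative lookahead rejects any tag in which "alt=" appears before the closing ">".
IMG_WITHOUT_ALT = re.compile(r"<img\b(?![^>]*\balt\s*=)[^>]*>", re.IGNORECASE)


def find_imgs_missing_alt(html):
    """Return every <img> tag in html that has no alt attribute."""
    return IMG_WITHOUT_ALT.findall(html)


sample = '<p><img src="a.png"><img src="b.png" alt="B"><IMG SRC="c.png"></p>'
print(find_imgs_missing_alt(sample))
# prints ['<img src="a.png">', '<IMG SRC="c.png">']
```

Much the same pattern should work in grep, jEdit, or Perl with at most small syntax changes, which is the point about the basic syntax being shared.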

For purposes of illustration, I’m going to use the jEdit text editor as my search and replace tool in this section. I chose it because it provides pretty much all the features you need, it has a reasonable GUI, it’s open source, and it’s written in Java, so it runs on essentially any platform you’re likely to want. You can download a copy from http://jedit.org/.

However, the techniques I’m showing here are by no means limited to that one editor. In my own work, I normally use BBEdit because it has a slightly nicer interface, though it’s payware and runs only on the Mac. There are numerous other choices. If you prefer a different program, by all means use it. What you’ll need is:

  • Full regular expression search and replace
  • The ability to recursively search a directory
  • The ability to filter the files you search
  • A tool that shows you what it has changed, but does not require you to manually approve each change
  • Automatic recognition of different character encodings and line-ending conventions

Any tool that meets these criteria should be sufficient.
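For readers who would rather script this than use an editor, the following is a rough sketch of how several of these criteria (regular expression replace, recursive search, file filtering, and a report of what changed without per-change approval) might look in Python. The function name replace_in_tree and its parameters are hypothetical. It does not detect character encodings automatically; it simply assumes UTF-8, and it preserves line endings by turning off newline translation.

```python
import re
import tempfile
from pathlib import Path


def replace_in_tree(root, pattern, replacement,
                    suffixes=(".html", ".htm"), encoding="utf-8"):
    """Apply a regex replacement to every matching file under root.

    Returns a list of (path, number_of_replacements) for files that changed.
    """
    regex = re.compile(pattern)
    changed = []
    for path in sorted(Path(root).rglob("*")):
        # Filter: only regular files with one of the wanted extensions.
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # newline="" disables newline translation, so CRLF or LF survive as-is.
        with open(path, encoding=encoding, newline="") as f:
            text = f.read()
        new_text, count = regex.subn(replacement, text)
        if count:
            with open(path, "w", encoding=encoding, newline="") as f:
                f.write(new_text)
            changed.append((path, count))
    return changed


# Demo on a throwaway directory so the sketch is safe to run as-is.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "sub" / "page.html").write_bytes(b"one<br>two<br>\r\n")
    (root / "notes.txt").write_bytes(b"<br>")  # wrong extension, so it is skipped

    report = replace_in_tree(root, r"<br>", "<br />")
    page_after = (root / "sub" / "page.html").read_bytes()
    notes_after = (root / "notes.txt").read_bytes()

for path, count in report:
    print(f"{path.name}: {count} replacement(s)")
```

Run on a real site, a script like this should be pointed at a copy or a version-controlled checkout, since it rewrites files in place.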


Incompetent Boobs Part 2

June 20th, 2008

Stupid user stories are a tradition in I.T., and there’s a whole subgenre of clueless manager/boss/executive stories. For once, however, this is a story where the manager was absolutely right, and the I.T. staff (or at least the incompetent boobs who built this system, if not the poor schmucks who had to maintain it) were colossally wrong, with devastating consequences. And to make matters worse, they still don’t realize what they did wrong or how to fix it.

Here’s the story from Andrew Brandt at InfoWorld:

Being part of an online community can reap rich rewards. Allowing the tools that fuel those communities to wreak havoc on your company Web site — well, that’s probably not what you had in mind.

Of course, when it’s your boss who is insisting on tapping those tools, sometimes you have to buck hierarchy and sneak behind his back to help him toe the prudent IT line, as the administrator of a business-to-business Web site quickly found out.

The tool in question was a toolbar called Alexa, which tracks the surfing habits of its users and spiders Web sites to build a ranking system for comparing the popularity of Web sites. The admin debated the value of the toolbar with his boss often, though perhaps “debate” is too delicate a term.

“I told him time and again to uninstall it, and even did so myself a number of times, but he’d put it back every time,” the admin says.

“Then, one day, all dynamic content on the main page [of the b-to-b’s Web site] just vanished. I brought it back from backup and chalked it up to a bug. Then it happened again a little while later. I started snooping around our logs,” he says.

As it turns out, Alexa’s spiders had been ignoring the robots.txt file — and were instead capturing usernames and passwords.

“It logged into the administrative area and followed the ‘delete’ link for every entry,” the admin says. “My dumb-ass boss still didn’t want to uninstall Alexa — could have strangled the man.”

Fallout: The data was restored, with some difficulty, and Alexa’s spider was prevented, through other means, from accessing the administrative side of the Web site.

Moral: When confronted with the classic pointy-haired boss, Machiavellian subterfuge sometimes becomes necessary. Try using the Image File Execution Options registry key to prevent Alexa — or whatever undesirable, dangerous, or obnoxious program he or she keeps using to make your life miserable — from running.

Unfortunately, Brandt draws the wrong moral from this story, or at least not the most important one.

Testing

June 18th, 2008

Here’s part 11 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

This part’s a little funny because really it deserves an entire book on its own, and that book has yet to be written. I didn’t have space or time to write a complete second book about test driven development of web sites and web applications, but perhaps this small piece will inspire someone else to do it. If not, maybe I’ll get to it one of these days. 🙂

In theory, refactoring should not break anything that isn’t already broken. In practice, it isn’t always so reliable. To some extent, the catalog later in this book shows you what changes you can safely make. However, both people and tools do make mistakes, and it’s always possible that refactoring will introduce new bugs. Thus, the refactoring process really needs a good automated test suite. After every refactoring, you’d like to be able to press a button and see at a glance whether anything broke.

Although test-driven development has been a massive success among traditional programmers, it is not yet so common among web developers, especially those working on the front end. In fact, any automated testing of web sites is probably the exception rather than the rule, especially when it comes to HTML. It is time for that to change. It is time for web developers to start to write and run test suites and to use test-driven development.

The basic test-driven development approach is as follows:

  1. Write a test for a feature.
  2. Code the simplest thing that can possibly work.
  3. Run all tests.
  4. If tests passed, goto 1.
  5. Else, goto 2.

For refactoring purposes, it is very important that this process be as automatic as possible. In particular:

  • The test suite should not require any complicated setup. Ideally, you should be able to run it with the click of a button. You don’t want developers to skip running tests because they’re too hard to run.
  • The tests should be fast enough that they can be run frequently; ideally, they should take 90 seconds or less to run. You don’t want developers to skip running tests because they take too long.
  • The result must be pass or fail, and it should be blindingly obvious which it is. If the result is fail, the failed tests should generate more output explaining what failed. However, passing tests should generate no output at all, except perhaps for a message such as “All tests passed”. In particular, you want to avoid the common problem in which one or two failing tests get lost in a sea of output from passing tests.

Writing tests for web applications is harder than writing tests for classic applications. Part of this is because the tools for web application testing aren’t as mature as the tools for traditional application testing. Part of this is because any test that involves looking at something and figuring out whether it looks right is hard for a computer. (It’s easy for a person, but the goal is to remove people from the loop.) Thus, you may not achieve the perfect coverage you can in a Java or .NET application. Nonetheless, some testing is better than none, and you can in fact test quite a lot.

One thing you will discover is that refactoring your code to web standards such as XHTML is going to make testing a lot easier. Going forward, it is much easier to write tests for well-formed and valid XHTML pages than for malformed ones. This is because it is much easier to write code that consumes well-formed pages than malformed ones. It is much easier to see what the browser sees, because all browsers see the same thing in well-formed pages and different things in malformed ones. Thus, one benefit of refactoring is improving testability and making test-driven development possible in the first place. Indeed, with a lot of web sites that don’t already have tests, you may need to refactor them enough to make testing possible before moving forward.
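A brief sketch of why well-formedness helps: once a page is well-formed, an ordinary XML parser can load it, and a test can ask precise questions about its structure. The example below uses Python’s standard xml.etree.ElementTree; the helper name is_well_formed and both sample documents are assumptions made up for illustration. The samples avoid named entities such as &nbsp;, which this parser would not resolve without the XHTML DTD.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def is_well_formed(markup):
    """Return True if markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


good = ('<html xmlns="http://www.w3.org/1999/xhtml">'
        '<head><title>Test</title></head>'
        '<body><p>Hello<br /></p></body></html>')

# Tag soup: the <br> is never closed, so </p> arrives while <br> is still open.
bad = '<html><body><p>Hello<br></p></body></html>'

print(is_well_formed(good))  # prints True
print(is_well_formed(bad))   # prints False

# With a well-formed page, a test can inspect the structure directly.
title = ET.fromstring(good).find(f"{XHTML_NS}head/{XHTML_NS}title").text
print(title)                 # prints Test
```

A malformed page gives the test nothing reliable to inspect, which is one reason the refactoring sometimes has to come before the tests.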

You can use many tools to test web pages, ranging from decent to horrible and free to very expensive. Some of these are designed for programmers, some for web developers, and some for business domain experts. They include:

  • HtmlUnit
  • JsUnit
  • HttpUnit
  • JWebUnit
  • FitNesse
  • Selenium

In practice, the rough edges on these tools make it very helpful to have an experienced agile programmer develop the first few tests and the test framework. Once you have an automated test suite in place, it is usually easier to add more tests yourself.
