Why Refactor HTML?

Here’s part 2 of the ongoing serialization of Refactoring HTML, also available from Amazon and Safari.

How do you know when it’s time to refactor? What are the smells of bad code that should set your nose to twitching? There are quite a few symptoms, but these are some of the smelliest.

Smell: Illegible Code

The most obvious symptom is that you do a View Source on the page and it might as well be written in Greek (unless, of course, you’re working in Greece). Most coders know ugly code when they see it. Ugly code looks ugly. Which would you rather see, Listing 1.1 or Listing 1.2? I don’t think I have to tell you which is uglier, and which is going to be easier to maintain and update.

Listing 1.1 Dirtier Code

<TABLE BORDER="0" CELLPADDING="0" CELLSPACING="0" WIDTH="100%">
<TR><TD WIDTH="70"> <A HREF="http://www.example.com/" TARGET=
"_blank"
>
<IMG SRC="/images/logo-footer.gif"
HSPACE = 5 VSPACE="0" BORDER="0"></A></TD>
<td class="footer" VALIGN="top"> &#169;2007 <A HREF="http://www.example.com/" TARGET="_blank">Example Inc.</A>.
All rights reserved.<br>
<A HREF="http://www.example.com/legal/index.html" TARGET="_blank">Legal Notice</A> -
<A HREF="http://www.example.com/legal/privacy.htm" TARGET="_blank">Privacy Policy</A> - <A HREF="http://www.example.com/legal/permissions.html"
TARGET="_blank">

Permissions</A>
</td>
</TR></TABLE>

Listing 1.2 Cleaner Code

<div id="footer">
  <a href="http://www.example.com/">
    <img src="/images/logo-footer.gif" alt="Example Inc." width="70" height="41" />
  </a>
  <ul>
    <li>© 2007 <a href="http://www.example.com/">Example Inc.</a>.
      All rights reserved.</li>
    <li><a href="http://www.example.com/legal/index.html">
      Legal Notice
    </a></li>
    <li><a href="http://www.example.com/legal/privacy.htm">
      Privacy Policy
    </a></li>
    <li><a href="http://www.example.com/legal/permissions.html">
      Permissions
    </a></li>
  </ul>
</div>

Now, you may object that in Listing 1.2 I haven’t merely reformatted the code. I’ve also changed it. For instance, a table has turned into a div and a list, and some hyphens have been removed. However, Listing 1.2 is actually much closer to the meaning of the content than Listing 1.1. You can assume that Listing 1.2 uses an external CSS stylesheet to supply all the formatting details I removed from Listing 1.1. As you’ll see, moving presentation into stylesheets is going to be one of the major techniques you use to refactor pages and clean up documents.

I’ve also thrown away the TARGET="_blank" attributes that open links in new windows or tabs. This is usually not what the user wants, and it’s rarely a good idea. Let the user use the back button and history list if necessary, but otherwise open most links in the same window. If users want to open a link in a new window, they can easily do so from the context menu, but the choice is now theirs. Sometimes half the cleanup process consists of no more than removing pieces that shouldn’t have been there in the first place.

Line Length

Listing 1.2 is still a little less than ideal. I’m a bit constrained by the need to fit code within the margins of this printed page. In real source code, I could fit a little more onto one line. However, don’t take this to extremes. Lines longer than 80 or so characters become hard to read, and are themselves a minor code smell.
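
A check for overlong lines is easy to automate. The following is a minimal Python sketch rather than a standard tool: the function name find_long_lines is made up for this example, and the default limit of 80 simply mirrors the rule of thumb above.

```python
def find_long_lines(markup, limit=80):
    """Return (line_number, length) pairs for lines longer than `limit`.

    `find_long_lines` and its default limit are illustrative choices,
    not part of any standard tool.
    """
    long_lines = []
    for number, line in enumerate(markup.splitlines(), start=1):
        # Trailing whitespace is ignored so it does not inflate the count.
        length = len(line.rstrip())
        if length > limit:
            long_lines.append((number, length))
    return long_lines


if __name__ == "__main__":
    sample = '<div id="footer">\n<p>' + "x" * 100 + "</p>\n</div>"
    for number, length in find_long_lines(sample):
        print(f"line {number}: {length} characters")
```

Run against the hand-written source (or the CMS input) rather than generated output, since that is the code people actually have to read.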

A small exception can be made here for code generated out of a content management system (CMS) of some kind. In this case, the code you see with View Source is not really the source code. It’s more of a compiled machine format. It’s the input to the CMS that should look good and appear legible.

Nonetheless, it’s still better if tools such as CMSs and web editors generate clean, well-formed code. Surprisingly often, you’ll find that the code the tool generates is a start, not an end. You may want to add stylesheets, scripts, and other things to the code after the tool is through with it. In this case, you’ll have to deal with the raw markup, and it’s a lot easier to do that when it’s clean.

Smell: The CEO Can’t Fill Out His Travel Expense Vouchers

Usability on the Web has improved in the past few years, but not nearly as much as it can or should. All but the best sites can benefit by refocusing more on the readers and less on the writers and the designers. A few simple changes aimed at improving usability—such as increasing the font size (or not specifying it at all) or combining form fields—can have disproportionately great returns in productivity. This is especially important for intranet sites, and any site that is attempting to sell to consumers.

Smell: Slow Page-Rendering Times

If any major browser takes more than half a second to display a page, you have a problem. This one can be a little hard to judge, because many slow pages are caused by network latency or overloaded databases and HTTP servers. These are problems too, though they are ones you usually cannot fix by changing the HTML. However, if a page saved on a local file system takes more than half a second to render in the web browser, you need to refactor it to improve that time.

Smell: Pages Appear Different in Different Browsers

Pages do not need to look identical in different browsers. However, all content and functionality should be accessible to everyone using any reasonably current browser. If the page is illegible or nonfunctional in Safari, Opera, Internet Explorer, or Firefox, you have a problem. For instance, you may see the page starting with a full-screen-width sidebar, followed by the content pane. Alternatively, the sidebar may show up below the content rather than above it. This usually means the page looks perfectly fine in the author’s browser. However, she did not bother to check it in the one you’re using. Be sure to check your pages in all the major browsers.

Anytime you see something like “Best Viewed with Internet Explorer,” you have a code smell and refactoring is called for. Anytime you see something like Figure 1.1, you have a huge code smell, and one that all your readers can smell too. Internet Explorer has less than 80% market share, and that’s dropping fast. In fact, even that is probably vastly overestimated because most spiders and bots falsely identify themselves as IE, and they account for a disproportionate number of hits. Mac OS X and Linux users don’t even have an option to choose Internet Explorer. The days when you could design your site for just one browser are over.

Figure 1.1 Wal-mart locks out non-IE users.

We're sorry. Our website requires the use of Internet Explorer version 6 or higher. It appears that you are using Firefox, Safari, or another browser that wal-mart Video Downloads doesn't currently support. Click here to get Internet Explorer for free from Microsoft.

A common variant of this is requiring a particular screen size—for instance, “This page is best viewed with a screen resolution of 1024 x 768. To change your monitor/display resolution, go to…” Well-designed web pages do not require any particular screen size or browser.

Smell: Pages Require Dangerous or Nonstandard Technologies

Many sites require cookies, JavaScript, Flash, PDF, Java, or other non-HTML technologies. Although all of these have their place, they are vastly overused on the Web. They are not nearly as interoperable or reliable in the wild as most web designers think. They are all the subject of frequent security notices telling users to turn them off in one browser or another to avoid the crack of the week. They are often unsupported by Google and most other search engine robots. Consequently, you should strive to make sure that most pages on your site function properly even if these technologies are unavailable.

Fortunately, the code smells here are really obvious and really easy to detect. Anytime you see a notice such as this, you have a problem:

Cookies Required

Sorry, you must accept cookies to access this site.

In order to proceed on this site, you must enable cookies on your Internet browser. We use cookies to tailor our website to your needs, to deliver a better, more personalized service, and to remember certain choices you’ve made so you don’t have to re-enter them.

Not only is this annoying to users, but these sites are routinely locked out of Google and get hideous search engine placement.

Embarrassingly, this next example actually comes from a page that’s talking about cleaning up HTML:

Notice: There is a Table of Contents, but it is dynamically generated. Please enable JavaScript to see it.

The right way to do dynamic content is to use server-side templating while still sending static HTML to the client.
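
As a rough illustration, a table of contents like the one in that notice could be built on the server and sent as plain HTML. This Python sketch assumes that sections are marked with h2 elements carrying id attributes; the names HeadingCollector and build_toc are hypothetical and exist only for this example.

```python
from html import escape
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (id, text) for every <h2> that has an id attribute."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current_id = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            heading_id = dict(attrs).get("id")
            if heading_id:
                self._current_id = heading_id
                self._text = []

    def handle_data(self, data):
        # Only text inside a tracked <h2> is kept.
        if self._current_id is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._current_id is not None:
            text = "".join(self._text).strip()
            self.headings.append((self._current_id, text))
            self._current_id = None


def build_toc(page_html):
    """Return a static <ul> of links to the page's identified headings."""
    collector = HeadingCollector()
    collector.feed(page_html)
    collector.close()
    items = [
        f'  <li><a href="#{escape(hid, quote=True)}">{escape(text)}</a></li>'
        for hid, text in collector.headings
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

The resulting list is ordinary markup, so it reaches readers and search engine robots whether or not JavaScript is enabled.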

One site I found managed to hit almost all of these:

This site uses JavaScript, Cookies, Flash, Pop-up windows, and is designed for use with the latest versions of Internet Explorer, Netscape Navigator (NOT Netscape 6), and Opera.

If only they had asked for a specific screen size, they would have hit the superfecta.

This site also added a new one. I had forgotten about pop-ups. Given the rampant abuse of pop-ups and the consequent wide deployment of pop-up blockers, no legitimate site should rely on them.

Of course, some things you can only do with JavaScript or other non-HTML technologies. I don’t intend to tell you not to design the next Google Maps or YouTube, if that is indeed what you’re trying to do. Just try to keep the fancy tricks to a minimum, and make sure everything you can do without Java/JavaScript/Flash/and so on is done without those technologies. This Flickr message is a lot less troublesome:

To take full advantage of Flickr, you should use a JavaScript-enabled browser and install the latest version of the Macromedia Flash Player.

The key difference is that I saw this on a page that still managed to show me the content I’d come to see, despite disabling JavaScript and Flash. I may not see everything, or have full functionality, but I’m not locked out. This is much friendlier to the reader and to search engines such as Google.

As a site developer, I’d still take a second look at this page to see if I might be able to remove some of the requirements on clients. However, it wouldn’t be my first priority.

Smell: Your Company’s Home Page Suddenly Says, “Pwned by Elite Doodz”

Web-site defacements are a major wakeup call, and one that gets everybody’s attention really quickly. This can happen for a number of reasons, but by far the most common is a code injection attack directed at a poorly designed form processing script.

Frankly, if all that happens is that your web site is defaced, you’re lucky and you should thank the hackers who pointed this out to you. More serious attacks can steal confidential data or erase critical information.
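
The text above does not say which kind of injection is involved. Taking SQL injection as one common case, the usual defense in a form processing script is to bind submitted values as parameters instead of pasting them into the query text. Here is a minimal Python sketch using the standard sqlite3 module; the comments table and the function names are hypothetical.

```python
import sqlite3


def make_demo_database():
    """Create an in-memory database with a hypothetical comments table."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE comments (author TEXT, body TEXT)")
    return connection


def save_comment(connection, author, body):
    """Store a submitted comment, treating the form values purely as data."""
    # The ? placeholders make the driver bind the values. They are never
    # spliced into the SQL text, so quotes in the input cannot alter the query.
    connection.execute(
        "INSERT INTO comments (author, body) VALUES (?, ?)",
        (author, body),
    )
    connection.commit()
```

With this approach, a hostile value such as x'); DROP TABLE comments; -- is stored as an odd-looking comment rather than executed.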

Smell: Your First Appearance on Google Is on Page 17

Search engine optimization is a major driver for web-site refactoring. Search engines value text over images, and early text over later text. They don’t understand table layouts, and they don’t much care for cookies or JavaScript. However, they do love unique titles and maybe even meta tags.
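
Checking that titles are unique is simple to script. The sketch below is one possible approach in Python, assuming you can load each page’s source into a dictionary keyed by path; TitleParser and find_duplicate_titles are names invented for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Extract the text of the first <title> element in a page."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._in_title = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            # Collapse runs of whitespace so formatting differences
            # do not hide a duplicate.
            self.title = " ".join("".join(self._parts).split())
            self._in_title = False


def find_duplicate_titles(pages):
    """Map each title shared by several pages to the sorted page paths.

    `pages` is a dict of {path: html_source}. Pages with no title are
    grouped under the empty string.
    """
    by_title = defaultdict(list)
    for path, source in pages.items():
        parser = TitleParser()
        parser.feed(source)
        parser.close()
        by_title[parser.title or ""].append(path)
    return {t: sorted(p) for t, p in by_title.items() if len(p) > 1}
```

Any title that comes back in the result is shared by more than one page and is a candidate for rewriting.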

Smell: Readers Send E-mail Saying Your Site Is Broken

This is one of the absolute best ways to find out you have a problem. For example, I recently received this e-mail from one of my readers:

The links in the “Further Reading” section of Cafe au Lait to “The Next Big Language?” and “Testing HopStop” are broken.

Best regards,

Kent

That was a bit of a surprise because the section Kent was complaining about was automatically generated using XSLT that transformed an Atom feed from another site. I checked the feed and it was correct. However, Kent was right and the link was broken. I eventually tracked it down to a bug in the XSLT stylesheet. It was reading an element that was usually but not always the same as the link, rather than the element that was indeed the link. Five minutes later the site was fixed.

Ninety-nine percent of your readers will just grumble and never tell you that your site is broken. The 1% who do complain are gold. You need to treat them well and listen to what they say. Do not make it hard for them to find you. Every site and almost every page should have an obvious person to contact for any problems that arise. These responses need to be carefully considered and acted on quickly.

Readers may also send you e-mail about many things not directly related to the site: canceled orders, shipping dates, link requests, political disagreements, and a thousand other things. You need to be able to separate the technical problems from the nontechnical ones so that the correspondence can be routed appropriately. Some sites use an online form and ask readers to self-classify the problem. However, this is unreliable because readers don’t usually think of the site in the same way the developers do. For example, if a customer can’t enter a nine-digit ZIP Code with a hyphen into your shipping address form, you may think of that as a technical mistake (and you’d be right), but the customer is likely to classify it as a shipping problem and direct it to a department that won’t even understand the question, much less know what to do about it. You may need a triage person or team that identifies each piece of e-mail and decides who in your organization is the right person to respond to it. This is a critical function that should not be outsourced to the lowest bidder.
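
The ZIP Code problem in that example is the kind of technical mistake that is cheap to fix on the server. The following Python sketch accepts five-digit and nine-digit U.S. ZIP Codes, with or without the hyphen, and normalizes them to one form. The function name normalize_zip and the choice to also accept a space as separator are assumptions made for illustration.

```python
import re

# Five digits, optionally followed by a four-digit extension.
# A hyphen or a whitespace character before the extension is optional.
ZIP_PATTERN = re.compile(r"(\d{5})(?:[-\s]?(\d{4}))?")


def normalize_zip(value):
    """Return the ZIP Code as '12345' or '12345-6789', or None if invalid."""
    match = ZIP_PATTERN.fullmatch(value.strip())
    if match is None:
        return None
    base, extension = match.groups()
    return f"{base}-{extension}" if extension else base
```

Accepting several input forms and normalizing them yourself means the customer never has to guess which format the form expects.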

Whatever you do, do not let problem reports drop into the black hole of customer service. Make sure that the people who have the power to fix the problems receive feedback directly from the users of the site, and that they pay attention to it. Too many sites use e-mail and contact forms to prevent users from reaching them and firewall developers off from actual users. Do not fall into this trap. Web sites pay a lot of money to hire QA teams. If people volunteer to do this for you, love them for it and take advantage of them.

Continued tomorrow…

10 Responses to “Why Refactor HTML?”

  1. jonathan peterson Says:

    Please tell me you’re going to replace that © with an &copy in the next installment… 🙂

  2. Mark A Hershberger Says:

    I hope he doesn’t. Unicode > entities.

  3. Adrian Says:

    Great series of articles, I’m nodding in more or less total agreement, and looking forward to the next!

    I’d just like to note Flash has been indexable by Google et al for a while now. With the coming mass adoption of RIA via Flex I think some Flash myths need to be debunked, Flash being non indexable being one of them.

  4. Asbjørn Ulsberg Says:

    Excellent run-through! I agree on all points and don’t have much to add, because the list is already so extensive. Well done! 🙂

  5. Bahador Says:

    Great read!
    Thanks!
    About browser compatibility; I find it really hard to justify testing against Opera (and even Safari) in my company; there’s just a lot of people that say it’s a waste of time.

  6. Theo Says:

    If I have understood the theme of this book correctly I don’t think that the you-have-been-pwnd-smell is a valid smell, not in the context of refactoring HTML. Websites don’t get hacked because they have badly crafted HTML, but because the web application that generates that HTML is insecure. HTML and the code that generates it are often intertwined, but it doesn’t matter how much you refactor the HTML, the security issues (like SQL injection for example) will still be there, because they are about what data the site accepts and lack of proper checks on that data (which cannot be solved by client side validation of forms, although that helps).

    @Adrian: I’m sorry but you are actually contributing to the spread of another Flash myth: that Google indexing SWF-files is actually going to change anything. SWF-files generally contain very little content (content is usually loaded in XML files once the Flash application has started running), and even those SWF-files that do contain content do not have any semantic structure that Google can use to make sense of the file, it’s more or less equivalent to running the “strings” command on a desktop application, you get a lot of text, but you can’t make any sense of it. Moreover, any web application that requires the user to log in is even more opaque to Google. That Google indexes web applications is as useful as Google Desktop indexing desktop applications. I’m not saying that it doesn’t, just that it doesn’t help.

    Thanks for an interesting read, I will dive right in to the next post now.

  7. Elliotte Rusty Harold Says:

    Theo,

    The later chapters of the book actually go beyond pure HTML into the realm of web applications; and in those areas SQL injection, form misuse, and other issues are very relevant. Chapter 7 specifically addresses these issues.

  8. Theo Says:

    Then I take my comment back.

  9. Kent Says:

    Hey, I’m a published author! I’ll have to keep sending those bug reports, maybe I’ll make it into your next book 🙂

  10. Flüge Los Angeles Says:

    Well-written article/series of articles. I really appreciate that you are sharing your knowledge. I really dislike the missing standards, which mean you need to have 6 different browsers to use the internet properly. Regarding Walmart, I really believe that they are paid by Microsoft to allow only IE for their website. BTW, I’m looking for a disassembler to translate into C++, do you know something good? Thanks