Google and XHTML

Apparently Google does not recognize XHTML, at least not when served as application/xhtml+xml. Try this search, which should return exactly one hit pointing to an XHTML document. Notice that the file format is “unrecognized” and they offer to let you “View it as HTML”.

XOM XPath Mapping

File Format: Unrecognized – View as HTML
XOM 1.1 supports XPath 1.0 reasonably faithfully. However there are some differences between the XPath data model and the XOM data model you need to be
www.ibiblio.org/xml/XOM/xpath.xhtml – Similar pages

I’m not sure what effect if any this has on search engine placement.

Their View as HTML option mucks up the page somewhat: it makes it malformed, gets the character encoding wrong, and replaces the semantic markup with lots of presentational markup.

Some of these problems are repeated on all cached pages, not just the XHTML ones. For one, they include the entire cached page below their additional content (“This is G o o g l e’s cache of http://www.nathansfamous.com/ as retrieved on Mar 6, 2007 03:27:38 GMT. G o o g l e’s cache is the snapshot that we took of the page as we crawled the web…”). This means almost every one of their cached pages has an out-of-place html element, head, and DOCTYPE. Here’s an example. The Google additions, blue in the original, are the lines that precede the DOCTYPE:


<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">

<BASE HREF="http://www.nathansfamous.com/">
<table border=1 width=100%>
Google text...
</table>

<hr>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
...
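
A rough way to spot this problem in a saved copy of a cached page is to check whether anything precedes the DOCTYPE. The Python sketch below is illustrative only: the function name and the sample string (abbreviated from the example above) are assumptions of mine, and a regular expression is no substitute for a real HTML parser.

```python
import re

# Matches the start of a DOCTYPE declaration, in any letter case.
DOCTYPE = re.compile(r"<!DOCTYPE\b", re.IGNORECASE)


def content_before_doctype(page: str) -> str:
    """Return whatever precedes the first DOCTYPE, stripped of whitespace.

    An empty string means the DOCTYPE comes first, or that the page
    has no DOCTYPE at all.
    """
    match = DOCTYPE.search(page)
    if match is None:
        return ""
    return page[:match.start()].strip()


# Hypothetical sample, abbreviated from the cached page quoted above.
SAMPLE = """<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
<BASE HREF="http://www.nathansfamous.com/">
<table border=1 width=100%>
Google text...
</table>
<hr>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
"""

if __name__ == "__main__":
    stray = content_before_doctype(SAMPLE)
    print("DOCTYPE is out of place" if stray else "DOCTYPE comes first")
```

Anything the function returns for a cached page is markup that was prepended to the original document, which is exactly what pushes the page’s own DOCTYPE, html, and head out of position.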

On the other hand, they do have their own XHTML search page that’s even sparer than their usual page. It seems to be designed for mobile phones.

7 Responses to “Google and XHTML”

  1. Cormac Says:

    Perhaps you should update your MIME type article with the Google spider user-agent 🙂
    Google in general doesn’t seem to “get” XHTML.

  2. Josh Peters Says:

    That’s weird. Can you verify this for other pages too? I’m personally curious about two things with the page I saw: the content-style-type header (maybe that’s throwing the spider a wrench?) and the DOCTYPE containing a carriage return. Not that either are incorrect I think, but I’m wondering if either one is messing with Google’s ability to determine the MIME type.

  3. SLV Dweller Says:

    The same kind of results are presented when searching Google for our web site, SLV Dweller.

    search for SLV Dweller

  4. Ian Gregory Says:

    I have been aware of this ever since I switched my site to XHTML. The strange thing is that if you register a mobile sitemap with Google then one of the only three supported file formats is XHTML Mobile Profile, which should of course be served as application/xhtml+xml (the other two are WML and cHTML). You might think Google would recognize the formats that they supposedly support!

    Fortunately it does not seem to adversely affect search rankings, and the “view as HTML” option is useful for IE users who cannot view my pages directly due to the crippled nature of their browser.

  5. links for 2007-03-09 « My Weblog Says:

    […] The Cafes » Google and XHTML (tags: google xhtml web20) […]

  6. Sebastian Says:

    As far as I know, Google is represented in the W3C XHTML task force, and the same goes for XHTML 2.0. So it is just a matter of time before they catch up.
    Best regards
    Sebie http://www.t4tw.info

  7. asartalo Says:

    Forgive me if this is late. I’m interested in content negotiation as well. I’ve checked the search URL you gave and it seems Google now recognizes the content properly. Do you think they tweaked something in their content negotiation algorithm to make that work or has Google fixed their end?
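
Several comments touch on content negotiation, so here is a minimal Python sketch of one common approach: serve application/xhtml+xml only to clients whose Accept header lists it explicitly, and fall back to text/html for everyone else, including clients that send only */* or no Accept header at all. The function names are hypothetical. This is not necessarily what the MIME type article mentioned in the first comment does, and it makes no claim about which headers Google’s spider actually sends.

```python
XHTML = "application/xhtml+xml"
HTML = "text/html"


def parse_accept(header: str) -> dict:
    """Map each media range in an Accept header to its q-value (default 1.0)."""
    prefs = {}
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q-value as "not acceptable"
        prefs[media_type] = q
    return prefs


def choose_media_type(accept_header) -> str:
    """Pick XHTML only when the client names it explicitly and does not
    rank it below text/html; otherwise serve text/html."""
    prefs = parse_accept(accept_header or "")
    xhtml_q = prefs.get(XHTML, 0.0)
    html_q = prefs.get(HTML, 0.0)
    if xhtml_q > 0 and xhtml_q >= html_q:
        return XHTML
    return HTML


if __name__ == "__main__":
    print(choose_media_type("text/html,application/xhtml+xml,*/*;q=0.8"))
    print(choose_media_type("*/*"))
```

Wildcards are deliberately ignored here, so a client that merely says it accepts anything gets text/html. That is the conservative choice when you are unsure whether a client really handles XHTML.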