a test of restructured text in zwiki - this is a copy of http://www.hixie.ch/advocacy/xhtml .

Sending XHTML as text/html Considered Harmful

Author: Ian Hickson <ian@hixie.ch> (Comments welcome.)

Abstract

A number of problems resulting from the use of the text/html MIME type in conjunction with XHTML content are discussed. It is suggested that XHTML delivered as text/html is broken and XHTML delivered as text/xml is risky, so authors intending their work for public consumption should stick to HTML 4.01, and authors who wish to use XHTML should deliver their markup as application/xhtml+xml.

Other versions

Une traduction française est disponible:
http://www.hixie.ch/advocacy/xhtml.fr
The Safari development team posted a blog entry on this topic:
http://webkit.org/blog/?p=68

Context

This was originally written in September 2002 in the context of this Web log entry:

http://ln.hixie.ch/?start=1031465247&count=1

It has since been regularly updated to correct errors that have been brought up in various mailing lists and other discussion forums. As of late 2004, it is still just as relevant as when it was originally written.

Note that this document compares XHTML 1.0 that complies with Appendix C against HTML 4.01, because that is the only variant of XHTML that may be sent as text/html.

Executive Summary

If you use XHTML, you should deliver it with the application/xhtml+xml MIME type. If you do not do so, you should use HTML4 instead of XHTML. The alternative, using XHTML but delivering it as text/html, causes numerous problems that are outlined below.

Unfortunately, IE6 does not support application/xhtml+xml (in fact, it does not support XHTML at all).

Why using text/html for XHTML is bad

What usually happens to authors who decide to send XHTML as text/html is the following:

  1. Authors write XHTML that makes assumptions that are only valid for tag soup or HTML4 UAs, and not XHTML UAs, and send it as text/html. (The common assumptions are listed below.)
  2. Authors find everything works fine.
  3. Time passes.
  4. Author decides to send the same content as application/xhtml+xml, because it is, after all, XHTML.
  5. Author finds site breaks horribly. (See below for a list of reasons why.)
  6. Author blames XHTML.

Steps 1 to 5 have been seen by every single person I have spoken to who has switched to using the XHTML MIME type. The only reason step 6 didn't happen in those cases is that they were advanced authors who understood how to fix their content.

SPECIFIC PROBLEMS

These are the issues that affect documents when they are switched from text/html to application/xhtml+xml:

  • <script> and <style> elements in XHTML sent as text/html have to be escaped using ridiculously complicated strings.

    This is because in XHTML, <script> and <style> elements are #PCDATA blocks, not #CDATA blocks, and therefore <!-- and --> really _are_ comment delimiters, and are not ignored by the XHTML parser. To escape script in an XHTML document which may be handled as either HTML4 or XHTML, you have to use:

    <script type="text/javascript"><!--//--><![CDATA[//><!--

    ...

    //--><!]]></script>

    To embed CSS in an XHTML document which may be handled as either HTML4 or XHTML, you have to use:

    <style type="text/css"><!--/*--><![CDATA[/*><!--*/

    ...

    /*]]>*/--></style>

    Yes, it's pretty ridiculous. If documents _aren't_ escaped like this, then the contents of <script> and <style> elements get dropped on the floor when parsed as true XHTML.

    (This is all assuming you want your pages to work with older browsers as well as XHTML browsers. If you only care about XHTML and HTML4 browsers, you can make it a bit simpler.)

  • A CSS stylesheet written for an HTML4 document is interpreted slightly differently in an XHTML context (e.g. the <body> element is not magical in XHTML, tag names must be written in lowercase in XHTML). Thus documents change rendering when parsed as XHTML.

  • A DOM-based script written for an HTML4 document has subtly different semantics in an XHTML context (e.g. element names are case insensitive and returned in uppercase in HTML4, case sensitive and always lowercase in XHTML; you have to use the namespace-aware methods in XHTML, but not in HTML4). BUT, if you send your documents as text/html, then they will use the HTML4 semantics DESPITE being XHTML! Thus, scripts are highly likely to break when the document is parsed as XHTML.

  • Scripts that use document.write() will not work in XHTML contexts. (You have to use DOM Core methods.)

  • Current UAs are, for text/html content, HTML4 user agents (at best) and certainly not XHTML user agents. Therefore if you send them XHTML you are sending them content in a language which is not native to them, and instead relying on their error handling. Since this is not defined in any specification, it may vary from one user agent to the other.

  • XHTML documents that use the "/>" notation, as in "<link />", have very different semantics when parsed as HTML4. So if there were to be a fully compliant HTML4 UA, it would be quite correct to show ">" characters all over the page.

    For more details on this see the first bullet point in the section entitled 'The Myth of "HTML-compatible XHTML 1.0 documents"'.
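
The script/style and DOM points above can be checked with any conforming XML parser. The following is a minimal sketch, assuming Python's standard xml.dom.minidom as a stand-in for an XHTML UA's parser; the helper names (style_text, wrap) and the sample markup are made up for illustration and are not part of the original text.

```python
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"


def style_text(xhtml: str) -> str:
    """Return the character data an XML parser keeps inside the first <style>."""
    doc = minidom.parseString(xhtml)
    style = doc.getElementsByTagName("style")[0]
    kept = (minidom.Node.TEXT_NODE, minidom.Node.CDATA_SECTION_NODE)
    # In XML, <!-- ... --> inside <style> is a real comment node, so it is skipped.
    return "".join(n.data for n in style.childNodes if n.nodeType in kept).strip()


def wrap(style_body: str) -> str:
    """Build a tiny XHTML document around a <style> body (illustrative helper)."""
    return (
        f'<html xmlns="{XHTML_NS}"><head><title>t</title>'
        f'<style type="text/css">{style_body}</style></head>'
        "<body><p>Hello</p></body></html>"
    )


# The classic "hide from old browsers" idiom: the rule sits inside a comment.
HIDDEN = wrap("<!-- p { color: red; } -->")
# The long escape string: the rule sits inside a CDATA section instead.
ESCAPED = wrap("<!--/*--><![CDATA[/*><!--*/ p { color: red; } /*]]>*/-->")

print(repr(style_text(HIDDEN)))              # nothing is left of the rule
print("color: red" in style_text(ESCAPED))   # the rule survives

# Element names are case sensitive in an XML DOM.
doc = minidom.parseString(HIDDEN)
print(len(doc.getElementsByTagName("p")), len(doc.getElementsByTagName("P")))
```

A browser's HTML parser would hand the first stylesheet to the CSS engine intact, which is exactly why the same markup behaves differently under the two MIME types.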

COPY AND PASTE

The worst problem, and the main reason (I suspect) for most of the REALLY invalid XHTML pages out there, is that authors who have no clue about XHTML simply copied and pasted their DOCTYPE from another document. So even if you write valid XHTML, by using XHTML, you are likely to encourage authors who do not know enough to write valid XHTML to claim to do so.

Why trying to use XHTML and then sending it as text/html is bad

These are not likely to be problems for authors who regularly validate their pages, but other authors will run into these problems.

  • Documents sent as text/html are handled as tag soup [1] by most UAs.

    This is the key. If you send XHTML as text/html, as far as browsers are concerned, you are just sending them Tag Soup. It doesn't matter if it validates; they are just going to treat it the same way as plain old HTML 3.2 or random HTML garbage.

    Since most authors only check their documents using one or two UAs, rather than using a validator, this means that authors are not checking for validity, and thus most documents that claim to be XHTML on the web now are invalid.

    See, for example, this study:

    http://www.goer.org/Journal/2003/Apr/index.html#results

    ...but if you don't believe it, feel free to do your own. In any random sample of documents that appear to claim to be XHTML, the overwhelming majority of documents are invalid.

    Therefore the main advantage of using XHTML, that errors are caught early because it _has_ to be valid, is lost if the document is then sent as text/html. (Yes, I said _most_ authors. If you are one of the few authors who understands how to avoid the issues raised in this document and does validate all their markup, then this document probably does not apply to you -- see Appendix B.)

  • If you ever switch your documents that claim to be XHTML from text/html to application/xhtml+xml, then you will in all likelihood end up with a considerable number of XML errors, meaning your content won't be readable by users. (See above: most of these documents do not validate.)

  • If a user saves such a text/html document to disk and later reopens it locally, triggering the content type sniffing code since filesystems typically do not include file type information, the document could be reopened as XML, potentially resulting in validation errors, parsing differences, or styling differences. (The same differences as if you start sending the file with an XML MIME type.)

  • The only real advantage to using XHTML rather than HTML4 is that it is then possible to use XML tools with it. However, if tools are being used, then the same tools might as well produce HTML4 for you. Alternatively, the tools could take SGML as input instead of XML. (SGML is over a decade older than XML and the tools have existed for years.)

  • HTML 4.01 contains everything that XHTML 1.0 contains, so there is little reason to use XHTML in the real world. It appears the main reason is simply "jumping on the bandwagon" of using the latest and (perceived) greatest thing.
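
To see why errors go unnoticed under text/html and then surface under an XML MIME type, here is a minimal sketch. It assumes Python's standard html.parser as a rough stand-in for a lenient tag soup parser and expat (via xml.dom.minidom) as a stand-in for an XML UA; real browsers differ in detail, and the class and function names are illustrative only.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError


class TagLogger(HTMLParser):
    """Record start and end tags without checking that they nest properly."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_soup_events(markup: str) -> list:
    """Tokenize leniently, the way a tag soup parser would: no error is raised."""
    parser = TagLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


def is_well_formed_xml(markup: str) -> bool:
    """Return True only if a conforming XML parser accepts the markup."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


# Misnested inline elements, as in the footnote's example.
MISNESTED = "<p><b> foo <i> bar </b> baz </i></p>"

print(tag_soup_events(MISNESTED))     # every tag is reported, no complaint
print(is_well_formed_xml(MISNESTED))  # the XML parser rejects the document
```

The lenient path never tells the author anything is wrong, so the mistake only becomes visible once the content is served as XML.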

The Myth of "HTML-compatible XHTML 1.0 documents"

RFC 2854 refers to "a profile of use of XHTML which is compatible with HTML 4.01". There is no such thing. Documents that follow the guidelines in appendix C are not valid HTML 4.01 documents. They just happen to be close enough that tag soup parsers are able to handle them just like most of the other pages on the Web.

The simplest examples of this are:

  • The "/>" empty tag syntax actually has totally different meaning in HTML4. (It's the SHORTTAG minimisation feature known as NET, if I recall the name correctly.) Specifically, the XHTML

    <p> Hello <br /> World </p>

    ...is, if interpreted as HTML4, exactly equivalent to:

    <p> Hello <br>&gt; World </p>

    ...and should really be rendered as:

    Hello > World

  • Script and style elements cannot have their contents hidden from legacy UAs. The following XHTML:

    <style type="text/css"> <!-- /* hide from old browsers */

    p { color: red; }

    --> </style>

    ...is exactly equivalent to the following HTML4:

    <style type="text/css">

    </style>

    ...because comments are not ignored in XHTML <style> blocks.

  • The "xmlns" attribute is invalid HTML4.

  • The XHTML DOCTYPEs are not valid HTML4 DOCTYPEs.

Using XHTML and sending it as text/html is effectively the same, from an HTML4 point of view, as writing tag soup (see "Why UAs can't handle XHTML sent as text/html as XML" below).

Note: This is covered by HTMLWG issue XHTML-1.0/6232:
http://hades.mn.aptest.com/cgi-bin/voyager-issues/XHTML-1.0?id=6232;expression=appendix%20c;user=guest

Why UAs can't handle XHTML sent as text/html as XML

  • Documents sent as text/html are handled as tag soup by most UAs. This means that authors are not checking for validity, and thus most XHTML documents on the web now are invalid. A conforming XML UA would thus be unable to show as many documents as current UAs, and would therefore never get enough market share to be relevant.

  • It is impossible to reliably autodetect XHTML when sent as text/html. This is why UAs could not ever treat text/html documents as XML, even if they did not care about not being usable (see the first point in this section).

    • You can't sniff for the five characters "<?xml" because:
      • The <?xml ... ?> header is optional per Appendix C, and it is recommended not to include it as it causes IE6 to trigger quirks mode.
      • SGML can also contain PIs (see the example below).
    • You can't trigger from the DOCTYPE since the W3C might introduce new XHTML DOCTYPEs in future, so you don't know which DOCTYPEs to look for. (Not to mention that DOCTYPEs are optional for well-formed XHTML documents, DOCTYPE parsing is hard, DOCTYPEs may be hidden in comments, and DOCTYPE sniffing has been called harmful by many leading figures at the W3C and elsewhere.)
    • You can't trigger off the "<html xmlns" string because it might be there but hidden in a comment (you'd need a complete XML parser to step past comments, PIs, internal subsets, etc).

    e.g. what language is this text/html document in?:

    <?xml this is not?> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0//EN"

    [ <!-- SYSTEM "not XHTML" --> ]?>

    <!-- -- -->

    This is a comment. This document is not XHTML. <html xmlns="http://www.w3.org/1999/xhtml"/> Ok, I'm done now. -->

    <html>

    <title> Need a title in HTML4! </title> <p> This is a valid HTML4 document.

    </html>

  • Even if you could detect XHTML, what do you do with a document that is not well formed (such as the example above)? If you fall back on HTML4, then there is no advantage to using an XML processor, and you might as well always treat it as HTML4.

  • The HTML working group said that UAs should not do this:

    http://lists.w3.org/Archives/Public/www-html/2000Sep/0024.html
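
The autodetection problem can be illustrated with a deliberately naive sniffer. This is only a sketch, assuming a substring-based heuristic that no particular UA is claimed to use; naive_is_xhtml, the 1024-character window, and both sample documents are hypothetical (the first is a shortened variant of the tricky example above).

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def naive_is_xhtml(body: str) -> bool:
    """Guess 'this is XHTML' from substrings near the start of the document."""
    head = body[:1024]
    return head.lstrip().startswith("<?xml") or "<html xmlns" in head


def is_well_formed_xml(body: str) -> bool:
    """Ground truth: does a real XML parser accept the document?"""
    try:
        minidom.parseString(body)
        return True
    except ExpatError:
        return False


# Plain HTML4 with the XHTML root element hidden inside a comment.
TRICKY_HTML4 = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">\n'
    '<!-- <html xmlns="http://www.w3.org/1999/xhtml"/> -->\n'
    "<html><title>Plain HTML4</title><p>No XML here.</html>"
)

# Well-formed XHTML whose xmlns attribute is not the first attribute.
REAL_XHTML = (
    '<html lang="en" xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>t</title></head><body><p>Hi</p></body></html>"
)

for name, body in (("tricky HTML4", TRICKY_HTML4), ("real XHTML", REAL_XHTML)):
    print(name, naive_is_xhtml(body), is_well_formed_xml(body))
```

The heuristic is wrong in both directions here, and patching it for these two cases just moves the problem: getting it right requires stepping past comments, PIs and internal subsets, which means running a full XML parser anyway.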

The advantages of XHTML

When sent as application/xhtml+xml, XHTML has several advantages:

  1. XHTML content will be able to be mixed-and-matched with content from other well-known namespaces (in particular, MathML). This is the main advantage for content authors.
  2. UAs will immediately catch well-formedness errors.
  3. Tools interacting with XHTML documents are guaranteed a well-formed document.
  4. XHTML content can be parsed with a simpler parser than tag soup can, and a _much_ simpler parser than SGML can.

However, none of these apply when an XHTML document is sent as text/html, and since authors feel their pages should be readable on the most popular Web browser, which does not support application/xhtml+xml, there is basically no point in using XHTML at the moment.

Conclusion

There are few advantages to using XHTML if you are sending the content as text/html, and many disadvantages.

In addition, currently, the majority (over 90% by most counts) of the UA market is unable to correctly render real XHTML content sent as text/xml (or other XML MIME types). For example, point IE at:

http://www.mozillaquestquest.com/

Only Mozilla, Mozilla-based browsers such as Netscape 6 and 7, recent versions of Opera, and Safari, are able to correctly render that site. (IE6 shows a DOM tree!)

Authors who are not willing to use one of the XML MIME types should stick to writing valid HTML 4.01 for the time being. Once user agents that support XML and XHTML sent as one of the XML MIME types are widespread, then authors may reconsider learning and using XHTML.

(Advanced authors should also see appendix B.)

Further Reading

I wrote another document on a related matter: people wanting UAs to treat XHTML documents sent as text/html as XML and not tag soup.

http://www.damowmow.com/playground/xhtml-in-uas.xhtml

Henri Sivonen wrote a similar document asking what is the point of XHTML:

http://www.hut.fi/u/hsivonen/xhtml-the-point

There are also many mailing list posts on this matter, e.g. on www-talk. The following post summarises some issues relating to using text/html for XHTML content containing XML extensions:

http://lists.w3.org/Archives/Public/www-talk/2001MayJun/0046.html

Some people have run into the problems this document mentions, for example:

http://flrant.com/index.php?id=P21

There are also some interesting points made in other posts, for example:

> But does Mozilla call its xml parser for http://www.w3.org/ ?

Nope. If it did, it would render the page without any expanded
character entity references, since Mozilla is not a validating
parser and thus skips parsing the DTD and thus doesn't know what
&nbsp;, &middot; and &copy; are. Not to mention that it would end up
ignoring the print-media specific section of the stylesheet, which
uses uppercase element names and thus wouldn't match any of the
lower case elements (line 138 of the first stylesheet), and it would
use an unexpected background colour for the page because the
stylesheet sets the background on <body> and not <html>, which in
XHTML will result in a different rendering to the equivalent in HTML4.

Or this post, near the end of the thread:

I'm still looking for a good reason to write websites in XHTML _at
the moment_, given that the majority of web browsers don't grok
XHTML. The only reason I was given (by Dan Connolly [1]) is that it
makes managing the content using XML tools easier... but it would be
just as easy to convert the XML to tag soup or HTML before
publishing it, so I'm not sure I understand that. And even then,
having the content as XML for content management is one thing, but
why does that require a minority of web browsers to have to treat
the document as XML instead of tag soup? What's the advantage of
doing that? And even _then_, if the person in control of the content
is using XML tools and so on, they are almost certainly in control
of the website as well, so why not do the content type munging on
the server side instead of campaigning for UA authors to spend their
already restricted resources on implementing content type sniffing?

Appendix A: application/xhtml+xml

See: http://ln.hixie.ch/?start=1036767231&count=1

Appendix B: Advanced Authors

Some advanced authors are able to send back XHTML as application/xhtml+xml to UAs that support it, and as text/html to legacy UAs.

Assuming you are using XHTML 1.0 compliant to Appendix C (or have otherwise checked that the XHTML 1.0 you send is compatible with Tag Soup processors), then that's fine. All I am saying in this document is that sending XHTML as text/html ONLY is harmful.
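
One way to do this kind of negotiation is to inspect the HTTP Accept request header on the server. The following is a minimal sketch, assuming that a UA which supports application/xhtml+xml lists it explicitly in Accept with a non-zero q value; the function name choose_content_type and the sample header strings are illustrative, and wildcard entries such as */* are deliberately not treated as support.

```python
def choose_content_type(accept_header: str) -> str:
    """Return the MIME type to serve an Appendix C document with."""
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != "application/xhtml+xml":
            continue  # wildcards and other types do not count as support
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        if q > 0:
            return "application/xhtml+xml"
    return "text/html"


# Sample headers (assumed for illustration): one explicit listing, one wildcard only.
XML_AWARE_UA = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
LEGACY_UA = "image/gif, image/jpeg, */*"

print(choose_content_type(XML_AWARE_UA))
print(choose_content_type(LEGACY_UA))
```

A server doing this should also send "Vary: Accept" so that caches do not hand the XML version to a legacy UA.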

Note: Sending XHTML 1.1 as text/html is NEVER fine. There is no spec that allows this. Sending XHTML 2.0 as anything in a production (non-testing) context is NEVER fine either, since that spec has not reached CR yet.

Also note that I would personally suggest that even advanced authors not use XHTML sent as text/html, since many authors copy and paste markup from others and thus may easily end up copying the valid XHTML markup but using it as HTML4.

Appendix C: Acknowledgements

Thanks to Nick Boalch for the abstract. Thanks to Dan Connolly for pedantry that has improved the quality of this document. Thanks to Ted Shaneyfelt and many others for suggesting improvements to the text.

Appendix D: Footnotes

[1] The term "handled as tag soup" refers to the fact that UAs typically are very lenient in their error handling, and do not support any of the "advanced" SGML features. For example, browsers treat the string "<br/>" as "<br>" and not "<br>&gt;", the latter being what HTML4/SGML says they should do. Similarly, real world UAs have no problem dealing with content such as "<b> foo <i> bar </b> baz </i>" even though according to the HTML4 spec that is meaningless.