The fear of X
August 01, 2001.
By Molly E. Holzschlag.
Despite the buzz surrounding XML (eXtensible Markup Language) since it emerged as a standard, most Web designers have yet to incorporate it into their repertoire of site-building skills. HTML is still the most prevalent markup language used to build Web sites. According to the World Wide Web Consortium (W3C), a group that handles the process of examining and writing formal Web markup recommendations, HTML is no longer the recommended markup methodology. XHTML (eXtensible Hypertext Markup Language), which is based on XML, took the place of HTML as the recommended markup more than a year ago (see www.w3.org/MarkUp/ for more information). In fact, the most current Web markup standard is XHTML 1.0. Very few designers, however, are using XHTML, because of the vast confusion XML—and by extension XHTML—has caused Web site builders.
To many designers, XML just doesn't make sense—at least not immediately. After all, most of us did not start out as programmers, and XML seems a lot more abstract and complex than HTML. And XML, unlike HTML, isn't immediately gratifying in terms of creating Web pages. But XHTML is a great intermediate step—a bridge between HTML and XML that not only is useful but can help Web builders conquer their fears of XML. In fact, understanding what X means and how to start using it isn't as difficult as you might think.
In the beginning
HTML is an application of SGML (Standard Generalized Markup Language). SGML is what is known as a metalanguage; its purpose is to create other document markup methods such as HTML. SGML is complicated and syntactically strict, but its applications span industry and business. When HTML was created, most of SGML's complexity and strictness were discarded. The idea that forged HTML was a simple language that worked with Internet protocols.
HTML focuses on the structure of a document. Originally designed for text, it was never meant to be a language of design. HTML included very few facilities for formatting documents, though some tags (such as <font> and <table>) were added to make documents more user-friendly. But the increasing popularity of the Web meant that designers had to do more: Everyone wanted sites that were attractive and interactive. Too often, HTML was stretched far beyond its limits, resulting in complex code that was difficult to maintain. And despite the introduction of Cascading Style Sheets (CSS), which are intended to separate document structure from presentation, browser support has been extremely problematic, forcing Web developers to rely on HTML alone for presentation. This has also resulted in extremely bloated Web browsers that carry a great deal of code for forgiving poorly written markup.
HTML 4.0 sought to solve some of these problems by reexamining document structure. Initiatives in HTML 4.0 included:
- Adherence to one of three document type definitions, or DTDs (Strict, Transitional, and Frameset), which form the basis of XHTML 1.0 (see "The XHTML DTDs" below).
- Greater accessibility for people with disabilities.
- Separation of presentation and content via style sheets.
- An awareness of the growing need to internationalize documents.
HTML 4.01, the latest incarnation of HTML, fixes a number of problems in the 4.0 recommendation.
Interestingly, very few Web designers have a thorough understanding of HTML's rules or principles, because many of them missed out on critical information, through no fault of their own. Most designers bootstrapped their way into the industry, learning from viewing source code, watching coworkers, visiting a Web site, or reading a book. And time has always been a critical factor: When you have to get a site to a client quickly, worrying about markup is less important than simply making the site work! The fact is, a fair amount of the HTML created today does not conform to HTML's rules. This is not necessarily a problem in today's browsers, but it will certainly create problems in the newer user agents such as PDAs, cell phones, and other devices.
Unfortunately, the fact that so many Web developers and designers haven't been concerned with adhering to underlying HTML 4.0 concepts has contributed to the fear of X. For anyone building a Web site with only a general understanding of HTML recommendations or relying only on a visual editor such as Adobe GoLive, Macromedia's Dreamweaver, or Microsoft FrontPage, the emergence of XML and the focus on it can easily seem unnecessary, abstract, and daunting.
Extending the idea
While Web designers were wrestling with cross-browser problems and dealing with markup and design elements, developers interested in more efficient markup brought XML to life. Unlike HTML, which is an application of SGML, XML is itself a metalanguage: a simplified subset of SGML. XML was conceived as a language that would have many of SGML's capabilities but without its complexity. XML would also inherit SGML's strictness so that the markup would be accurate and complete. Although SGML is very detailed, XML is more concise and especially suitable for Web applications.
In all cases, XML's focus is data. The markup is simply meant as a way for user agents to take that data and do something with it. Presentation is left to style sheets and not included in the document itself. The key to XML is that you can customize the language to suit your needs by combining the tags, DTDs, and other elements you create. It may sound complicated, but it can be easier than you think.
In Figure 1, the XML markup contains nothing you can't immediately identify—except possibly the XML declaration on the first line. We're simply labeling data with a set of tags. At its core, this is probably not very different from what you do with HTML now. The tags, however, define data types that are meaningful. Not only can humans understand them but machines can also deal with them.
Figure 1: In this example of XML markup, the tags are both machine-readable and intelligible to humans.
<?xml version="1.0" standalone="yes"?>
<AddressBook>
<entry>
<name>Jon E. Persen</name>
<address>4445 East Hilltop Road</address>
<city>Soulville</city>
<state>CA</state>
<zip>000000</zip>
</entry>
</AddressBook>
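To make the "machines can also deal with them" point concrete, here is a minimal sketch in Python that reads the address book from Figure 1 with the standard library's xml.etree.ElementTree module. The choice of Python, the ADDRESS_BOOK variable, and the read_entries function are assumptions made for illustration; they are not part of the original figure.

```python
import xml.etree.ElementTree as ET

# The document from Figure 1, held in a string for the sake of the example.
ADDRESS_BOOK = """<?xml version="1.0" standalone="yes"?>
<AddressBook>
<entry>
<name>Jon E. Persen</name>
<address>4445 East Hilltop Road</address>
<city>Soulville</city>
<state>CA</state>
<zip>000000</zip>
</entry>
</AddressBook>"""


def read_entries(xml_text):
    """Return each <entry> as a dict that maps a tag name to its text."""
    root = ET.fromstring(xml_text)
    return [
        {field.tag: field.text for field in entry}
        for entry in root.findall("entry")
    ]


entries = read_entries(ADDRESS_BOOK)
for entry in entries:
    # The program finds each value by its label, not by its position.
    print(entry["name"], "lives in", entry["city"] + ",", entry["state"])
```

Run as written, this prints "Jon E. Persen lives in Soulville, CA". Because the tags name the data they hold, the program never has to guess which line is the city and which is the state.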
Making the transition
If HTML is where we're starting and XML is the ideal, how do we get there from here? Facing the problems inherent to HTML, W3C members studied HTML in the context of XML's paradoxical strictness and flexibility. What they came up with is XHTML, a reformulation of HTML into an XML application.
In other words, HTML 4.01 has been reworked to conform to XML syntax rules. XHTML is an XML document type with an HTML vocabulary, which is why it's readable across platforms as well as on past and present browsers. Figure 2 shows an XHTML 1.0 document. You'll immediately see some differences. As in Figure 1, there's an XML declaration at the top. Just below that, you'll find the document type declaration, which states that the document conforms to the XHTML 1.0 Transitional DTD. In the opening html tag, there's the attribute xmlns (XML Namespace), which in this case identifies the XHTML namespace. (In XML and XHTML documents, all elements belong to a particular namespace, which is like a list with a unique name. The idea is that elements can be used in other documents and that the tags you define won't conflict with other people's tags.)
Figure 2: This sample document points to the XHTML 1.0 Transitional DTD and the XHTML namespace.
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Welcome to Danny's Web Site</title>
</head>
<body>
<h1>Hi There!</h1>
<p>Welcome to my Web site. Here you can:</p>
<ul>
<li>Read about me</li>
<li>See pictures of me and my family</li>
<li>Listen to my favorite music</li>
</ul>
<p>Cool! Let's <a href="next.html">get started</a></p>
</body>
</html>
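To see what the namespace means to a program, here is a small sketch in Python using the standard library's xml.etree.ElementTree module. It parses a shortened version of the Figure 2 document (the document type declaration and most of the body are left out to keep it brief). The PAGE and XHTML_NS names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A shortened version of the Figure 2 document, without the DOCTYPE.
PAGE = """<?xml version="1.0" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Welcome to Danny's Web Site</title>
</head>
<body>
<h1>Hi There!</h1>
</body>
</html>"""

root = ET.fromstring(PAGE)

# ElementTree writes a namespaced name as {namespace}localname,
# so this prints {http://www.w3.org/1999/xhtml}html
print(root.tag)

# Lookups have to name the namespace too.
title = root.find("{%s}head/{%s}title" % (XHTML_NS, XHTML_NS))
print(title.text)

# A lookup without the namespace finds nothing.
print(root.find("head/title"))
```

The last lookup returns None: to the parser, a title element in the XHTML namespace and a title element with no namespace are different things, which is how tags from different vocabularies avoid colliding.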
When you look at the document itself, however, you see familiar HTML tags, because XHTML uses HTML vocabulary. Are there any differences at all? Actually, there are quite a few, thanks to the influence of XML. But you shouldn't have much trouble if you keep a few rules in mind. Some of the most significant syntax rules in XHTML 1.0 include:
- All element and attribute names must be lowercase: <p align="right"> ... </p>.
- All nonempty elements must have closing tags: <li> ... </li>.
- All empty elements must terminate with a trailing slash: <br />.
- All attribute values must be quoted: <div align="center">.
In HTML, most of these practices are optional, and several methods are acceptable. In XHTML, the rules have no exceptions.
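One way to get a feel for this strictness is to hand both styles of markup to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the snippets and the two helper functions are hypothetical examples, not taken from the XHTML recommendation. Note that a plain XML parser only checks well-formedness: it catches missing closing tags, unterminated empty elements, and unquoted attribute values, but the lowercase rule comes from the XHTML DTDs, so the sketch checks it separately.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def uses_lowercase_names(markup):
    """Return True if every element and attribute name is lowercase."""
    root = ET.fromstring(markup)
    return all(
        el.tag == el.tag.lower() and all(k == k.lower() for k in el.attrib)
        for el in root.iter()
    )


# Hypothetical snippets: each HTML-style line breaks one of the rules.
html_style = [
    "<ul><li>Read about me</ul>",        # <li> is never closed
    "<p>First line<br>Second line</p>",  # empty element not terminated
    "<div align=center>Hello</div>",     # attribute value not quoted
]
xhtml_style = [
    "<ul><li>Read about me</li></ul>",
    "<p>First line<br />Second line</p>",
    '<div align="center">Hello</div>',
]

for snippet in html_style + xhtml_style:
    print(is_well_formed(snippet), snippet)

# Uppercase names are well-formed XML but are not XHTML names.
shouting = '<P ALIGN="right">Hello</P>'
print(is_well_formed(shouting), uses_lowercase_names(shouting))
```

The three HTML-style snippets are rejected and the three XHTML-style snippets are accepted. A browser of the day would have displayed all six without complaint, which is exactly the forgiveness that XHTML takes away.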
XHTML, like HTML 4.0 and XML, asks its authors to keep presentation separate from content. To that end, you can employ style sheets via CSS or XSLT (Extensible Stylesheet Language Transformations). With the resulting streamlined syntax, XHTML becomes very attractive as a means of marking up documents for use in devices such as PDAs.
The XHTML DTDs
Document Type Definitions (DTDs) specify the markup rules for particular types of documents so that they can be understood by user agents. Validating a document means checking its markup against a DTD and reporting errors.
For a document to conform to XHTML 1.0, it must conform to one of the three DTDs first described in the HTML 4.0 standard:
- Strict DTD. A DTD that excludes the presentation attributes and elements that the W3C expects to phase out as support for style sheets matures.
- Transitional DTD. A DTD that includes the aforementioned presentation attributes and elements.
- Frameset DTD. This DTD is typically used for documents with frames and is identical to the Transitional DTD except that in frameset documents, the FRAMESET element replaces the BODY element.
The three DTDs are very similar, and the W3C recommends that you use the Strict DTD if possible. Because the Transitional DTD is the most forgiving, however, it's likely to be the most used DTD for some time.
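A document selects one of these DTDs through the document type declaration at the top of the page, as in Figure 2. As a rough sketch, the Python below builds that declaration for each of the three DTDs. The public and system identifiers are the ones published in the XHTML 1.0 recommendation; the XHTML10_DTDS table and the doctype_for function are names made up for illustration.

```python
# Public identifier and system identifier (DTD location) for each
# XHTML 1.0 DTD, keyed by the names used in this article.
XHTML10_DTDS = {
    "Strict": (
        "-//W3C//DTD XHTML 1.0 Strict//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
    ),
    "Transitional": (
        "-//W3C//DTD XHTML 1.0 Transitional//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd",
    ),
    "Frameset": (
        "-//W3C//DTD XHTML 1.0 Frameset//EN",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd",
    ),
}


def doctype_for(flavor):
    """Build the document type declaration for one of the three DTDs."""
    public_id, system_id = XHTML10_DTDS[flavor]
    return '<!DOCTYPE html PUBLIC "%s"\n"%s">' % (public_id, system_id)


for flavor in XHTML10_DTDS:
    print(doctype_for(flavor))
    print()
```

Switching a page from Transitional to Strict is therefore a two-line change in the declaration; the real work is removing the presentation elements and attributes that the Strict DTD no longer allows.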
Into the future
Should you be writing XHTML 1.0 for all your documents? There's no single correct answer. If you want to conform to current recommendations fully, then you should write XHTML 1.0 instead of HTML. On the other hand, you have to consider XHTML carefully, because its DTDs leave out numerous proprietary elements and attributes that you, or your software, use all the time. These presentation concerns require adaptation: You must either let go of the attribute (a topmargin attribute in a body tag, for example, is not allowed in XHTML) and find a workaround via style sheets or choose not to create conforming documents. If you're motivated to think beyond the browser, however, you'll see clearly why the separation of presentation and structure is such an imperative for the extensibility of Web markup.
In terms of best practices, XHTML is an opportunity for all Web designers to learn markup methods that adhere to standards and that ease them into the world of extensibility. XHTML can open your horizons to other XML applications; you'll be able to look at WML, SVG, or any other XML-related markup and grasp it quickly. Learning XHTML is extremely useful for Web authors seeking backward compatibility and future vision.
This just in
On May 31, 2001, the World Wide Web Consortium (W3C) released XHTML 1.1 as a recommendation. This recommendation is part of an ongoing effort by the W3C to reformulate HTML into a truly extensible XML application. According to the W3C, XHTML 1.1 defines a new, forward-looking document type "cleanly separated from the deprecated, legacy functionality of HTML 4." XHTML 1.1 is built from a set of modules that developers can use to extend the markup to include other tag sets. It also goes further on internationalization through Ruby Annotation, a way of marking up the short annotations that run alongside base text in East Asian documents. For more information on XHTML 1.1, visit www.w3.org/TR/2001/REC-xhtml11-20010531/.



