Among web developers, anticipation is mounting for HTML 5, the overhaul of the web markup language currently under way at the World Wide Web Consortium (W3C). For many, the revamping is long overdue. HTML hasn't had a proper upgrade in more than a decade. In fact, the last markup language to win W3C Recommendation status, the final stage of the web standards process, was XHTML 1.1 in 2001.

In the intervening years, web developers have grown increasingly restless. Many claim the HTML and XHTML standards have become outdated, and that their document-centric focus does not adequately address the needs of modern web applications.

HTML 5 aims to change all that. When it is finalized, the new standard will include tags and APIs for improved interactivity, multimedia, and localization. As experimental support for HTML 5 features has crept into the current crop of web browsers, some developers have even begun voicing hope that this new, modernized HTML will free them from reliance on proprietary plug-ins such as Flash, QuickTime, and Silverlight.

But although some prominent web publishers -- including Apple, Google, the Mozilla Foundation, Vimeo, and YouTube -- have already begun tinkering with the new standard, W3C insiders say the road ahead for HTML 5 remains a rocky one. Some parts of the specification are controversial, while others have yet to be finalized. It may be years before a completed standard emerges and even longer before the bulk of the web-surfing public moves to HTML 5-compatible browsers. In the meantime, developers face a difficult challenge: how to build rich web applications with today's technologies while paving the way for a smooth transition to HTML 5 tomorrow.

Modernising HTML for the rich web

Rich applications and HTML have not always been a natural fit. The father of the web, Tim Berners-Lee, envisioned HTML as "a simple markup language used to create hypertext documents that are platform independent." With the advent of XHTML, the pure XML formulation of the language, the W3C maintained this focus on web pages as documents, with the proposed XHTML standards emphasising such issues as document structure, compatibility with XML tools, and Berners-Lee's vision of the Semantic Web.

This frustrated many developers who saw greater potential in the web as an application platform. In 2004, representatives of Apple, the Mozilla Foundation, and Opera Software founded the Web Hypertext Application Technology Working Group (WHATWG), an independent web standards consortium. Working outside the W3C, WHATWG began a parallel effort to revamp HTML for a more application-centric view of the web.

In 2007, with its XHTML 2 work mired in seemingly endless debate, the W3C voted to adopt WHATWG's work as the starting point for a new HTML 5 standard. By this time, even Berners-Lee had come around to the notion of an application-centric web. "Some things are clearer with hindsight of several years," he wrote in 2006. "It is necessary to evolve HTML incrementally. The attempt to get the world to switch to XML... all at once didn't work."

That's not to say the concept of a pure-XML web markup language is dead. Although HTML has retaken the lead role in the standards effort, an XML formulation of HTML 5, to be known as XHTML 5, is being developed at the same time. The difference is that while XHTML 5 will be available for those who have already made the switch, developers will no longer be required to observe the rigorous syntax of XHTML to take advantage of web markup's latest features.

HTML 5: Markup gets a makeover

Even so, HTML 5 has inherited many additions originally proposed for XHTML 2, including a number of features designed to improve document structure. For example, new HTML tags such as header, footer, dialog, aside, and figure allow content authors to specify common document elements in a consistent way. Previously, developers had to mark such elements using div tags with custom class attributes, an ad hoc convention that varied from site to site and made HTML documents difficult to parse.
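
To illustrate why named structural elements are easier for software to handle than div tags with site-specific class names, here is a minimal Python sketch built on the standard library's html.parser module. The sample page, the STRUCTURAL set, and the extract_sections helper are all invented for this illustration; they are not part of any specification.

```python
from html.parser import HTMLParser

# Hypothetical page fragment using some of the structural elements named above.
PAGE = """
<header><h1>Site title</h1></header>
<aside>Related links</aside>
<figure><img src="chart.png" alt="Chart"></figure>
<footer>Copyright notice</footer>
"""

# Element names the parser treats as document structure (an assumption for
# this sketch, not a complete list from the HTML 5 draft).
STRUCTURAL = {"header", "footer", "aside", "figure"}


class StructureParser(HTMLParser):
    """Collects the text found inside each structural element."""

    def __init__(self):
        super().__init__()
        self.stack = []     # structural elements that are currently open
        self.sections = {}  # element name -> list of text chunks

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL:
            self.stack.append(tag)
            self.sections.setdefault(tag, [])

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.sections[self.stack[-1]].append(text)


def extract_sections(html):
    """Return a dict mapping each structural element to the text inside it."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser.sections


print(extract_sections(PAGE))
```

Because the element names are fixed, the same parser would work on any site that uses them. With div tags, the parser would first need to know each site's own class names.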

HTML 5 also continues the effort to separate web content from presentation. Developers might be surprised to see the b and i elements available in the new standard, but these elements are now used to offset portions of text in generic ways, without implying any specific typographic treatment. Where the i element once implied italic type, in HTML 5 it merely means "a span of text in an alternate voice or mood." Similarly, the b element does not imply specifically boldfaced text, but text that is stylistically offset without having any additional importance.

By comparison, the u tag, which referred specifically to underlined text, has been dropped from HTML 5, along with other presentation-specific elements, including font, center and strike. Such stylistic attributes are now considered the exclusive domain of CSS.
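
As a rough sketch of what moving presentation into CSS can look like in practice, the Python snippet below swaps three of the dropped elements for neutral elements with class names and generates matching CSS rules. The class names, the REPLACEMENTS table, and the move_presentation_to_css function are hypothetical. The regular expressions only handle bare tags without attributes, so this is an illustration rather than a migration tool.

```python
import re

# Hypothetical mapping:
# dropped element -> (replacement element, class name, CSS declaration)
REPLACEMENTS = {
    "u": ("span", "underline", "text-decoration: underline"),
    "strike": ("span", "struck", "text-decoration: line-through"),
    "center": ("div", "centered", "text-align: center"),
}


def move_presentation_to_css(html):
    """Return (markup, stylesheet) with presentational tags replaced by classes.

    Only attribute-free tags such as <u> and </u> are recognised.
    """
    rules = []
    for tag, (element, css_class, declaration) in REPLACEMENTS.items():
        opening = re.compile(rf"<{tag}>", re.IGNORECASE)
        closing = re.compile(rf"</{tag}>", re.IGNORECASE)
        if opening.search(html):
            html = opening.sub(f'<{element} class="{css_class}">', html)
            html = closing.sub(f"</{element}>", html)
            rules.append(f".{css_class} {{ {declaration}; }}")
    return html, "\n".join(rules)


markup, stylesheet = move_presentation_to_css("<p>An <u>underlined</u> word</p>")
print(markup)
print(stylesheet)
```

The markup then records only that a span of text is set apart, while the stylesheet decides how it looks.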

The new standard introduces additional data types for form input elements, including dates, URLs and email addresses. Still other elements improve support for non-Latin character sets, including tags for specifying the "ruby text" that appears in some Asian languages. HTML 5 also introduces the concept of microdata, a method of annotating HTML content with machine-readable tags, making it easier to process for the Semantic Web. Together, these structural enhancements allow content authors to build cleaner, more manageable web pages that play nicely with search engines, screen readers and other automated content parsers.
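
The new form input types can be sketched with a small example. The Python snippet below reads a hypothetical form and reports the type each field declares. It then shows what a browser limited to the HTML 4.01 input types would be expected to do, on the assumption that such a browser treats any type it does not recognise as a plain text field. The form, the field names, and the helper functions are invented for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical form using three of the new input types plus one classic field.
FORM = """
<form>
  <input type="email" name="contact">
  <input type="url" name="homepage">
  <input type="date" name="birthday">
  <input type="text" name="nickname">
</form>
"""

# Input types defined by HTML 4.01.
HTML4_TYPES = {"text", "password", "checkbox", "radio", "submit",
               "reset", "file", "hidden", "image", "button"}


class InputCollector(HTMLParser):
    """Records the declared type of every named input element."""

    def __init__(self):
        super().__init__()
        self.fields = {}  # field name -> declared type

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            name = attributes.get("name")
            if name:
                # A missing type attribute defaults to "text".
                self.fields[name] = (attributes.get("type") or "text").lower()


def declared_types(html):
    """Return a dict mapping each field name to its declared input type."""
    collector = InputCollector()
    collector.feed(html)
    collector.close()
    return collector.fields


def fallback_types(fields):
    """Assumed behaviour of a browser that only knows the HTML 4.01 types."""
    return {name: (kind if kind in HTML4_TYPES else "text")
            for name, kind in fields.items()}


fields = declared_types(FORM)
print(fields)
print(fallback_types(fields))
```

If that assumption holds, a form written with the new types should still be usable in an older browser, only without the extra validation or input controls a newer one might provide.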