When the site GDGT.com went live this past summer, Ryan Block was expecting a lot of interest.
Prior to launch, the former Engadget.com editor in chief had built up momentum for the site -- which allows everyday users to write gadget reviews -- by informing bloggers and online publications. "We were excited but wary, because there's always an x factor," says Block. "We did weeks of performance and load testing, but lab testing will always differ from real-world usage, and we knew there would still be issues here and there that we wouldn't find until thousands of people were actually using the site."
Indeed, on 4 August, GDGT went live -- and a few hours later Block was forced to post a message explaining that the site was not available because of unanticipated levels of interest, which included thousands of users signing up for accounts and visiting the home page. Block says the problem was related to database performance.
Joe Skorupa, a Gartner analyst, says GDGT experienced what he calls "catastrophic success" -- an unusual surge in traffic that can bring a website to its knees. It seems like there's another story about a site experiencing colossal failure every week: a Twitter outage, Facebook downtime or Gmail problems. (Twitter, Facebook and Google representatives all declined to comment on outages.)
Skorupa says there is a common misunderstanding about the public Internet -- which is notoriously flaky and consists of many interconnected networks. It's not the same as corporate cloud computing, private networks that occasionally use the public Internet, or financial services on the web, which are mandated to be available 24/7. In his view, the public Internet should not be viewed as being as reliable as, say, a private connection between bank offices.
There is also a misunderstanding about a site "going down". Typically, a server has not crashed entirely; it's more likely a data center problem, says James Staten, a Forrester Research analyst.
"A service doesn't go down, but gets so slow that it's viewed as nonresponsive," says Staten. "Load balancers take all incoming requests and route them to the web servers based on their responsiveness. This architecture can become unresponsive when it's overwhelmed by the number of requests for content."
In the end, the GDGT traffic problems calmed down after the initial launch, thanks to improved database speed and caching techniques that were employed to address future problems with traffic.
In other cases, a denial-of-service (DoS) attack, such as the one that caused Twitter and other sites to go dark for several hours in August, can create the same kind of overload and congestion. Staten says other causes of website failure include poorly configured system components and out-of-date patches and updates on web servers.
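Staten's description of load balancers routing requests by responsiveness, and of a site looking "down" once every server is too slow, can be illustrated with a minimal Python sketch. It is not how any particular load balancer product works; the class names, the smoothing factor and the 2,000-millisecond cutoff are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """A hypothetical web server tracked by the balancer."""
    name: str
    avg_response_ms: float = 0.0  # smoothed response time
    samples: int = 0

    def record(self, response_ms: float, alpha: float = 0.3) -> None:
        """Fold a new measurement into an exponentially weighted average."""
        if self.samples == 0:
            self.avg_response_ms = response_ms
        else:
            self.avg_response_ms = (
                alpha * response_ms + (1 - alpha) * self.avg_response_ms
            )
        self.samples += 1


class ResponsivenessBalancer:
    """Routes each request to the backend that has been answering fastest."""

    def __init__(self, backends, timeout_ms=2000.0):
        self.backends = list(backends)
        self.timeout_ms = timeout_ms  # assumed cutoff for "nonresponsive"

    def pick(self):
        """Return the fastest backend, or None if all look nonresponsive."""
        healthy = [b for b in self.backends if b.avg_response_ms < self.timeout_ms]
        if not healthy:
            return None  # nothing crashed, but the site appears to be down
        return min(healthy, key=lambda b: b.avg_response_ms)
```

When `pick()` returns `None`, no server has actually failed; each one has simply slowed past the cutoff, which is the situation Staten describes.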
For most web users, the occasional outage is one thing, but frequent downtime can cause serious business delays. As we rely more and more on web applications, even those related to social networking, Internet uptime is becoming more critical.
The following strategies for dealing with public Internet outages -- which admittedly include some that are more controversial than others -- will help you pave a smoother superhighway to your company's website.
1. Use speedier Ethernet connections
Most organizations currently connect their web servers to the public Internet via Ethernet networks running at 10Gbit/sec. Tom Daly, president of Dyn, a network services company based in Manchester, N.H., says the IEEE -- which controls the 802.3 Ethernet spec -- is evaluating standardized 40Gbit/sec. Ethernet as a remedy to traffic outages. He says this increased performance would dramatically help high-transaction customers.
Some users aren't waiting for the final specification. One success story is the Amsterdam Internet Exchange, which links 16 10Gbit/sec. switches together to handle about 500Gbit/sec. of daily traffic without disruption. "40Gbit/sec. and 100Gbit/sec. is dearly needed to reduce overall congestion on the Internet," says Daly.
Networks running at 40Gbit/sec. or 100Gbit/sec. won't solve bottlenecks at specific sites, such as Twitter.com, that are overloaded by a spike in requests. Still, these networks will become more important as more and more users gain access to the Web and use it for watching television shows online, making online backups and visiting the most popular sites. All of that activity will increase the stress on 10Gbit/sec. Ethernet and 10Gbit/sec. routers.
"40Gbit/sec. router-to-router links are already being utilized in backbones to help alleviate traffic," says Cory Crosland, the president of CROSCON, a New York-based Web development company. "As Internet usage continues to explode, the backbones will need to be continually upgraded to keep up." This means faster Internet speeds for consumers and better connectivity. Verizon's FiOS is one example of the type of network that could deliver the necessary speeds and connectivity. "Moving to 40Gbit/sec. will help the Internet as a whole move faster, but not improve [any] one site's uptime or speed," Crosland says.
2. Use content delivery networks
For handling large amounts of media -- such as iTunes music or Amazon.com and Netflix video-on-demand services -- the public Internet relies heavily on CDN (content delivery network) providers such as Limelight and Akamai. Microsoft recently launched a free CDN to cache AJAX libraries and boost website performance.
CDNs are designed to quickly route traffic onto private networks, easing or eliminating the burden on the public Web site itself, Forrester's Staten explains.
Without a CDN, massive media files would cripple a site like Netflix.com or Cinemanow.com almost immediately.
A CDN provider handles congestion by adding "last mile" communication centers in cities that, according to the provider's own data models, absorb most of the traffic. For example, in areas of California, video-over-the-web is more common than elsewhere in the country, so Akamai or Limelight might install a center in San Francisco to handle the load.
More use of content delivery networks can definitely help, because these services keep traffic at the edges of the Internet rather than having to route it all the way through, says Staten.
According to Staten, one issue with a CDN is that not all content can be cached. Crosland explains that "CDNs are great for static content" such as videos and music but can't be used for dynamic or database-driven information such as search results and Twitter updates. "That's where intelligent caching comes into play," says Crosland.
Jason Mayes, a senior engineer at XMOS, says that working with Akamai Technologies played a major role in alleviating pressure when XMOS started offering online videos. XMOS would post a 200MB video, and thousands of users would attempt to access it at the same time, stressing XMOS's 10Mbit/sec. connection. "Videos made our site slower for regular users. This was a major concern, as a Web site reflects a company's ethos," he says.
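A quick back-of-the-envelope calculation shows why that combination hurts. The Python sketch below uses the 200MB file size and 10Mbit/sec. link from Mayes' account; the figure of 1,000 simultaneous viewers is a hypothetical stand-in for "thousands of users," and the math assumes decimal megabytes and a link running at its full rated speed with no protocol overhead.

```python
def transfer_seconds(file_megabytes: float, link_mbit_per_sec: float) -> float:
    """Seconds to move one file over a link at full line rate (1 MB = 8 Mbit)."""
    return file_megabytes * 8 / link_mbit_per_sec


# One viewer, with the whole link to themselves.
single_download = transfer_seconds(200, 10)  # 160.0 seconds

# Hypothetical crowd: the link has to carry every copy of the file.
concurrent_viewers = 1000  # assumption, not a figure from XMOS
total_hours = single_download * concurrent_viewers / 3600

print(f"One download: {single_download:.0f} seconds")
print(f"{concurrent_viewers} downloads: about {total_hours:.1f} hours of link time")
```

Even under these generous assumptions, a single download ties up the entire connection for more than two and a half minutes, leaving nothing for ordinary page requests.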
After implementing Akamai's CDN, Mayes says, all page-delivery times, including those for text and video, went from 17 seconds to 5 seconds in Asia, where many of XMOS's site visitors originate. He says he might also look at redesigning his company's content management system (CMS) to make it faster now that the main bottlenecks have been fixed.
3. Use more and better caching
Another common tactic for dealing with Internet problems is to cache content. This technique is becoming more common, according to Pieter Poll, chief technology officer at Qwest Communications International Inc. in Denver, because it enables a site to scale up more easily when users flock to popular content, such as a new episode of CSI: Miami on Hulu.com.
Caching on the Internet works just like memory caching in your computer -- holding the most popular content in a cached storage allocation on the server for fast access. (A CDN is also a type of cache, in that popular content is delivered from a separate node.) Staten says tier-caching products such as GigaSpaces, Oracle Coherence and Memcached help cache content within the Web site infrastructure by making sure that content in a database is accessible at all times, even during the worst traffic spikes.
In fact, caching is one of the main ways that sites like GDGT, Twitter, Facebook and others deal with surges -- the technique is perfect for handling small chunks of data that change often, such as popular articles, forum posts or news items.
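The basic idea can be sketched in a few lines of Python. This is a simplified, in-process illustration, not the design of Memcached or of any site named here: the `TTLCache` class, its `get_or_load` method and the 30-second lifetime in the usage note are all hypothetical. Entries expire after a fixed time so that frequently changing data does not stay stale for long.

```python
import time


class TTLCache:
    """A tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without waiting
        self._store = {}    # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value, calling loader(key) only on a miss."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the database is never touched
        value = loader(key)  # cache miss: one trip to the database
        self._store[key] = (now + self.ttl_seconds, value)
        return value
```

A site might wrap its database query in something like `cache.get_or_load(post_id, fetch_post_from_db)` with a 30-second lifetime, so that a surge of readers for one popular post results in roughly one database query every 30 seconds rather than one per visitor.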
Yet caching is still not as widespread as it could be. GDGT uses it to solve ongoing congestion -- although it wasn't enough to keep the site running at launch. But many sites that are just gaining traction, such as the video delivery site Crackle, are still tweaking cache settings.
Poll and other industry observers note that Web usage increases at a rate of about 42% each year, but broadband has not increased at that speed, so caching and CDNs are increasingly important.
4. Use better programming methods
One emerging method of dealing with traffic problems is to program using techniques that can withstand spikes. Brian Sutherland, a managing partner at Vanguardistas, a company that provides a scaling architecture for sites such as that of US News and World Report, says that a vast majority of websites are poorly programmed and aren't able to withstand unanticipated traffic spikes.
"There's a large engineering effort required to make sure that a Web site is capable of withstanding a large and sudden load," says Sutherland.
A few examples of things that website software developers rarely do, according to Sutherland, include regularly benchmarking a representative copy of their servers against a simulated load; having an experienced developer review and approve every software change; and designing for the target performance level right from the start. "When you really want your Web site to stay up, you have to do these things. Twitter grew faster than anticipated and brought in a company after the fact to improve its uptime, which has worked."
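The first of those practices, benchmarking against a simulated load, can be sketched roughly as follows in Python. This is an illustration only, not Vanguardistas' tooling: `run_load_test` and `fake_page_render` are hypothetical names, and the stand-in handler merely sleeps for two milliseconds where a real test would request pages from a representative copy of the servers. The sketch assumes at least one request is made.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(handler, total_requests, concurrency):
    """Call handler() total_requests times across `concurrency` threads
    and return latency statistics in milliseconds."""

    def timed_call(_):
        start = time.perf_counter()
        handler()
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(total_requests)))

    p95_index = max(0, int(len(latencies) * 0.95) - 1)
    return {
        "requests": len(latencies),
        "median_ms": latencies[len(latencies) // 2],
        "p95_ms": latencies[p95_index],
        "max_ms": latencies[-1],
    }


def fake_page_render():
    """Stand-in for a real page request; sleeps 2 ms to simulate work."""
    time.sleep(0.002)


if __name__ == "__main__":
    print(run_load_test(fake_page_render, total_requests=200, concurrency=20))
```

Running the same test after every significant change, and watching whether the 95th-percentile figure creeps upward, is one way to catch a slowdown before real users do.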
According to Sutherland, these techniques -- which mirror methods used in enterprise computing -- might have to wait until Web expansion slows down a bit because developers tend to put speed ahead of reliability. He points out that banking Web sites are good examples of development initiatives that emphasized reliability from the outset, likely because they were subject to heavy federal regulations and faced customer demands for reliable financial transactions.
5. Use HTML5 and other emerging standards
Not every method of dealing with Web outages is centered on the hardware or the connection between the site and the user. New standards, especially HTML5, have built-in mechanisms for making a site more reliable. Many of those mechanisms involve the use of advanced programming techniques to address site-to-site transmissions.
"HTML5 is a very important advance in browser capabilities," says Michael Gordon, the chief strategy officer and co-founder of Limelight Networks, a CDN provider. Features likely to be important for enterprises, he says, include the canvas tag, which provides dynamic rendering of bitmap images (think Flash-like 2D drawings) that will "significantly advance user interfaces"; the postMessage API, which will allow one Web server to communicate with another through a user's browser; and the client-side storage API, which will allow Web applications to store files on a user client.
In general, Web programming will become more like desktop programming, where data exchange, interface elements and APIs are more solid, and emerging technologies go through a rigorous testing process. One example of this is OpenID, which provides desktop-like functionality (in this case, authenticating one site with the log-in from another) to streamline development. The reusable code and predictable structure for OpenID, OpenSocial, OAuth and other Web standards will make the Web more reliable in the long run.
Not everyone agrees that these standards will promote better Internet uptime, however. Clearly, new standards encourage better programming methods, but they may also lead to even more Web applications and a greater strain.
"HTML5 will allow web application developers to build richer desktop-like applications, and we will continue to see less dependence on operating systems and more dependence on the web and web browsers" to perform the most common tasks, such as managing data shared between web applications, says web developer Crosland. All this, in turn, could ironically make things even more congested.
In the end, most experts -- including analysts Skorupa and Staten -- insist that 100% uptime for every site on the public Internet is not necessarily the goal, and that site operators should still plan for occasional outages.
"You can't ensure your site will never go down," says Forrester's Staten. "Each company has to find the right balance between best efforts for availability and the cost of doing so."
Key uptime tips
Jason Mayes, senior Web development engineer at XMOS, offers his own list of tips for dealing with site congestion and other potential server outage problems:
- Optimise your static content. Compress images to get every last kilobyte out of them while retaining visual quality.
- Add "expires" headers to content to prevent browsers from continually downloading the same files as a user browses your website.
- Ensure that your web server delivers content in a compressed state -- for example, mod_deflate for Apache. Clearly this should not be applied to files such as images -- which are already compressed -- so make sure you set up rules correctly.
- Optimise your content management system. Reduce the number of database calls you need to make for each page request, for example. In Drupal, this can be as simple as disabling modules you do not need. Also, make any custom code more efficient if possible. A change of one-tenth of a second in an algorithm that is run thousands of times adds up.
- Support caching of data that is frequently accessed. Use Memcache or something similar. Many CMS packages support this, but be careful with dynamic data.
- Load-balance your web server.
- Separate your read/write databases so that you can have a master/slave database setup, allowing your database infrastructure to be scalable.
- If applicable, split your database vertically or horizontally (or potentially a hybrid of these if this model suits your database structure) over several servers. This may not be suitable for everyone, however.
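
Two of the tips above, adding "expires" headers and compressing only content that is not already compressed, can be sketched in Python as follows. On Apache these jobs are normally handled in server configuration rather than application code, so treat this purely as an illustration of the logic; the function names, the list of already-compressed types and the one-hour lifetime in the usage note are assumptions.

```python
import gzip
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Assumed examples of formats that are already compressed.
ALREADY_COMPRESSED = {
    "image/jpeg", "image/png", "image/gif", "video/mp4", "application/zip",
}


def caching_headers(max_age_seconds, now=None):
    """Build headers telling browsers how long they may reuse a file."""
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(seconds=max_age_seconds)
    return {
        "Cache-Control": f"public, max-age={max_age_seconds}",
        "Expires": format_datetime(expires, usegmt=True),
    }


def maybe_compress(body: bytes, content_type: str):
    """Gzip text-like content; pass already-compressed formats through."""
    if content_type in ALREADY_COMPRESSED:
        return body, {}
    return gzip.compress(body), {"Content-Encoding": "gzip"}
```

For example, `caching_headers(3600)` produces headers that let a browser reuse a stylesheet for an hour, while `maybe_compress(page_bytes, "text/html")` shrinks an HTML page but leaves a JPEG untouched. A production server would also check the browser's Accept-Encoding header before compressing, which this sketch omits.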
John Brandon is a veteran of the computing industry, having worked as an IT manager for 10 years and as a tech journalist for another 10. He has written more than 2,500 articles and is a regular Computerworld contributor.