In an IT world full of elusive goals, there's probably no target as slippery as server uptime. Keeping servers alive and awake, or at least ready to spring into action whenever needed, is an ambition close to the heart of virtually all data centre leaders.

Six steps to maximising server uptime

  1. Plan carefully. Aggressively enforce lifecycle management, and double-check the work, including system configurations and maintenance schedules. Server acquisitions and upgrades should be scheduled and coordinated with an eye toward system availability as well as performance.
  2. Practice routine preventive maintenance. This is perhaps the easiest and least painful way of bolstering server reliability. As the old car repair commercial warned, "You can pay now or pay later."
  3. Use management and monitoring tools. Without adequate oversight, you can't get to the root of uptime-robbing server problems or measure downtime's impact on critical business services.
  4. Bolster security. Don't let attackers interfere with your uptime goals. Anti-malware products, firewalls and independent audits are among the many security tools and practices that have a positive influence on server uptime.
  5. Acquire quality hardware. The road to downtime is paved with trashy servers.
  6. Use common sense. Don't waste time, energy and money trying to squeeze the last drop of life out of an aging or problem-prone server.
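To make step 3 concrete, here is a minimal sketch of how downtime could be turned into an availability figure for a reporting period. The function name `availability` and the idea of feeding it a list of (start, end) outage times are illustrative assumptions, not something any particular monitoring product provides; it also assumes the outages do not overlap.

```python
from datetime import datetime, timedelta


def availability(outages, period_start, period_end):
    """Return uptime as a percentage of the reporting period.

    `outages` is a list of (start, end) datetime pairs. This hypothetical
    helper assumes the outages do not overlap one another.
    """
    period = (period_end - period_start).total_seconds()
    if period <= 0:
        raise ValueError("period_end must be later than period_start")

    down = 0.0
    for start, end in outages:
        # Count only the part of each outage that falls inside the period.
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()

    return 100.0 * (period - down) / period
```

As a point of reference, a single outage of 43 minutes 12 seconds in a 30-day period works out to 99.9 percent availability.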

Yet few managers can honestly say that they are doing absolutely everything to squeeze the most uptime out of their systems. Indeed, many managers needlessly lavish time and funds on technologies and practices that have little or no positive impact on uptime, experts say.

Achieving server uptime excellence is both a science and a management art, says Walter Beddoe, vice president of IT and logistics at Six Telekurs USA, a financial data provider. "It's a combination of many different things, including having a competent staff, using fault-tolerant hardware, adopting dynamic security practices and embracing good maintenance and change management practices," he says. "Most of all, you must have a commitment to doing your very best."

Alan Howard, IT director at Princeton Radiology, a diagnostic medical imaging firm, urges managers not to waste time and resources on activities and tools that don't directly contribute to uptime enhancement. The effort put into clustering, for example, can be "pretty wasteful," he says, noting that redundancy is better achieved with a tool that provides full automation.

Clustering that is not automated, where the synchronisation is done manually, can cause more problems than it's worth, Howard says. "A failure of a primary node can cause havoc; we'd have been better off simply recovering from the primary node failure than failing to the standby node," he says.

For instance, his shop had a Windows Server cluster that upon failover would cause the application to crash because a change to an application configuration file had not been applied to the standby server. "The effort to fix the cause of the application crash tended to be much more than the effort to fix the cause of the cluster node failure," Howard says.
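The kind of drift Howard describes, where a configuration change reaches the primary but not the standby, is the sort of thing a simple automated comparison could flag before a failover. The sketch below is one possible approach, not a description of what his shop did: it assumes copies of each node's configuration directory are reachable as local paths, and the function names `file_digest` and `config_drift` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def config_drift(primary_dir, standby_dir):
    """List relative paths that differ between two config directories.

    A path is reported if its contents differ or if it exists on only
    one side. Both directory arguments are assumed to be local paths.
    """
    def index(root):
        # Map each file's path relative to the root onto its full path.
        return {p.relative_to(root): p
                for p in Path(root).rglob("*") if p.is_file()}

    primary = index(primary_dir)
    standby = index(standby_dir)

    drift = []
    for rel in sorted(set(primary) | set(standby)):
        if rel not in primary or rel not in standby:
            drift.append(rel.as_posix())
        elif file_digest(primary[rel]) != file_digest(standby[rel]):
            drift.append(rel.as_posix())
    return drift
```

Run on a schedule, a check like this would report an empty list while the two nodes match and would name the offending files once they diverge.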

His shop no longer provisions clustered servers in the traditional sense. Instead, he has a "cluster" of standalone servers, all mapped to a dual-controller Compellent Storage Center SAN, "among which we can migrate virtual machines on demand quite seamlessly."