Your customers—and your managers—don’t actually care one iota about the reliability of your network. What they’re interested in is the availability of the services it offers. In other words, they want it to be working when they need it, and the fact that you can quote MTBF figures in excess of ten years for the individual network components is irrelevant if they can’t access the email server when they decide to.
So what are we talking about here? Let’s start with some numbers. The ‘five nines’ figure (99.999 percent) we hear about all the time actually means that in any one year, you can afford a total of just over five minutes of downtime: in effect, one five-minute outage. That’s a scary thought. So how can we possibly build a network that even comes close to this?
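To put figures on that, here is a minimal Python sketch of the arithmetic. It assumes a 365-day year and a service required around the clock; the function name `downtime_budget_minutes` is purely illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year that a target allows."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare three nines, four nines and five nines.
for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% allows {budget:.1f} minutes of downtime per year")
```

Under those assumptions, five nines leaves roughly 5.3 minutes a year, while three nines leaves nearly nine hours.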
First off, if you’re going to offer anything termed ‘high availability’, or anything with any reference to SLAs, make sure you can measure it first. Are you guaranteeing basic connectivity, application availability, or specific measures such as jitter and response times? If you can’t measure it, and prove you’re matching (or not) what you’ve contracted to provide, you’re wasting your time.
Then get things in perspective. Suppliers will use all the hype around availability to try and sell you two of everything ‘for resilience’. Is this really appropriate? As an example, say a switch failure takes out 50 people for half an hour while you wheel in the cold standby. You’ll be blamed for 25 person-hours of downtime (50 people times half an hour). Sounds a lot, and maybe you should have had a dedicated switch ready to take over in seconds.
But what if your company has 1,000 users? Over a 720-hour month (assuming a 24-hour network requirement), that’s 720,000 person-hours. The loss of 25 person-hours is about 0.0035 percent, so you’re still giving them a 99.9965 percent service. Weigh that up against the cost of that extra switch before you start beating yourself up.
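The same sum can be written as a short Python sketch, which makes it easy to rerun with your own numbers. The function name and parameters are illustrative, and it assumes every user needs the network for every hour of the period.

```python
def user_weighted_availability(users_affected: int, outage_hours: float,
                               total_users: int, period_hours: float) -> float:
    """Availability as a percentage of person-hours actually delivered."""
    lost_person_hours = users_affected * outage_hours
    total_person_hours = total_users * period_hours
    return 100 * (1 - lost_person_hours / total_person_hours)


# The example above: 50 of 1,000 users down for half an hour in a 720-hour month.
pct = user_weighted_availability(50, 0.5, 1000, 720)
print(f"{pct:.4f} percent")  # prints "99.9965 percent"
```

Change the number of users affected or the outage length to see how quickly (or slowly) the figure moves, and set that against the price of the spare switch.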
Of course these are business decisions, not technical ones. The unseen costs (loss of business if it’s a trading floor, customer satisfaction in the case of an ecommerce site) may well justify the extra expense. All you can do is highlight the risks and what it would cost to mitigate them, and let the business decision makers have the final say.
So how do you alleviate these risks and design for a high availability network? We can’t get into the nitty gritty technical detail (see related article on Building a Resilient Server Farm for specifics) until some basics have been understood.
Factors Affecting Availability
Things you have to consider are hardware and software reliability, your carrier (if you have WAN links), power and environmentals. Then design accordingly.
Which suits your requirements best: redundant boxes for resilience, or single, more expensive devices with inbuilt redundancy? The first is more complex to design around. The second is limited more by the component architecture, and is vulnerable to external factors such as flooding in the comms room.
Software upgrades are a necessary evil, but don’t need to happen as frequently as your suppliers make out. New code, with new features, will have new bugs. Choose a stable version that has been out for long enough for someone else to have found the problems first.
What SLAs does your carrier offer? And how do they report on them? Don’t promise your remote branches 99.999 percent availability if their server is at head office and the Service Provider is only offering 99.9 percent on the links. Have you simulated failure situations with them to make sure the escalation process works?
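A rough way to see why the carrier’s figure caps yours: when a remote user depends on both the head-office service and the WAN link, the availabilities multiply. The Python sketch below assumes the two fail independently and that both must be up at once; the 99.999 percent head-office figure is an assumed input, and the function name is illustrative.

```python
from math import prod


def serial_availability(*percentages: float) -> float:
    """Combined availability (percent) of components that must all be up."""
    return 100 * prod(p / 100 for p in percentages)


# Assumed: head-office server at 99.999 percent, carrier link at 99.9 percent.
end_to_end = serial_availability(99.999, 99.9)
print(f"End-to-end availability: {end_to_end:.4f} percent")
```

The result comes out just under 99.9 percent, so the branch can never be promised more than the weakest link in the chain offers.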
Basic and boring, but do you have a decent cable management program? Are all cables labelled, and plugged into the correct places? Have you checked with the network equipment that what it thinks it’s plugged into agrees with your plan? Having cables crossed and disconnecting the one surviving working link by mistake is ridiculously common.
And power. How many core switches do you have hanging off a multiblock in the back of a rack? Who ‘borrowed’ the power lead for the second PSU on your server or switch, meaning to replace it instantly, and then forgot? Is the air-conditioning enough to cope with new hardware? Even if it doesn’t actually get hot enough to cause equipment to shut itself down, component failures will increase significantly if it’s just too hot to be comfortable.
Last but not least, operations. According to a Gartner report a couple of years ago, 40 percent of factors affecting availability were related to people and processes (or lack of them). Poor change management and control will undo all your good planning work. If you don’t know what changes were made, how can you back them out? If they’re not tested, how do you know what effect they will have on your existing applications? Do you profile new applications to determine expected changes in traffic patterns and loads?
Don’t jump merrily into Visio diagrams showing lots of jazzy new switches with super features and impressive architectures until you fully understand what you’re expected to deliver, and have the mechanisms in place to monitor it. Get the basic design to suit your environment before you even start to look at data sheets, and you’ll be doing yourself, and your company, a big favour.