Network vendors tell you that you must build "fully redundant, resilient networks" to ensure maximum uptime. Are they just trying to sell you extra kit, and what can you do to improve network availability without paying too much?
Resilience versus redundancy
First off, what do vendors mean when they talk about redundancy and resilience - aren't they the same thing? Well, no. Redundancy means installing backup systems, such as power supplies, processors and WAN links, that kick in when the primary fails. In many cases they're actually used all the time to share the load, but it's important to remember what they're there for. If you start relying on your second PSU to power all the line cards in a switch, for instance, be prepared for trouble: lose one supply and you may find a single PSU can't drive every card you've installed, so some of them will go down.
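That power-budget check is simple arithmetic, and worth doing before you fill a chassis. A minimal sketch, where both the PSU capacity and per-card draw are assumed figures for illustration, not from any vendor's data sheet:

```python
# Illustrative sketch only: wattage figures are assumptions, not taken
# from any particular vendor's data sheet.
PSU_CAPACITY_W = 1400   # assumed output of a single power supply
CARD_DRAW_W = 250       # assumed worst-case draw per line card

def cards_survivable(installed_cards: int) -> bool:
    """Return True if one PSU alone can power every installed card."""
    return installed_cards * CARD_DRAW_W <= PSU_CAPACITY_W

# With these numbers a single supply covers five cards, not six:
print(cards_survivable(5))   # True  - chassis rides out a PSU failure
print(cards_survivable(6))   # False - lose a PSU and some cards go dark
```

The point isn't the exact numbers but the habit: redundancy only counts if either supply can carry the whole load on its own.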
Resilience means the methodology you employ, and the configurations you use, to make your network tolerant of failure. Having a redundant link between two sites, for example, doesn't do a thing for you if you haven't configured your routers to use it when the primary fails. ISDN backups are a prime example: there have been instances where a carrier ceased an ISDN line because it was never used - they assumed it wasn't actually live - only for it to become rather urgently needed when a WAN link failed.
So how do you configure resilience? First you have to decide how much you need, and where you need it. Although some resilience can be configured at no or minimal cost, in many cases you still need to pay extra, and shelling out for a level of availability that nobody really needs is as bad as failing to provide the resilience the network does need.
If you have a switched LAN, chances are you've run redundant links between your switches, since this doesn't cost much extra. You'll run Spanning Tree to avoid loops, and create multiple VLANs to let you make use of those 'spare' links.
But while Spanning Tree does give you an alternative path through the network if a link or card fails, it takes time to sort itself out - with classic 802.1D timers, reconvergence can take 30 to 50 seconds - and in the meantime, no traffic will be carried. It used to be that an outage of a few minutes was acceptable, since the alternative was waiting for someone to physically replace, repatch or reconfigure something, but on a resilient network the users shouldn't even notice if a cable snaps or a switch blows up. So you have to make sure you've tuned the setup - even if you're running Rapid Spanning Tree (802.1w), which cuts convergence to a few seconds at most, you can experiment with timer settings to suit your environment.
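It's worth knowing where that 30-to-50-second figure comes from. In classic 802.1D, a bridge waits up to max_age to declare the root lost, then moves the replacement port through the listening and learning states, each lasting forward_delay. A quick sketch of that budget (the protocol defaults are real; the "tuned" values are a hypothetical example, not a recommendation):

```python
# Classic 802.1D worst-case recovery: max_age to detect the failure,
# then listening + learning, each lasting forward_delay.
def stp_worst_case_outage(max_age: int, forward_delay: int) -> int:
    """Seconds of potential outage before the backup path forwards."""
    return max_age + 2 * forward_delay

print(stp_worst_case_outage(20, 15))  # 802.1D defaults -> 50 seconds
print(stp_worst_case_outage(6, 4))    # hypothetical tuned timers -> 14 seconds
```

Shrinking the timers cuts the outage, but set them too tight and a busy switch can miss BPDUs and flap - which is exactly why you test changes like this out of hours.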
Installing multiple links between two devices lets you configure link aggregation (802.3ad - EtherChannel in Cisco-speak), so that you don't even have to worry about Spanning Tree timeouts if one of these links fails. As far as the Layer 2 protocol is concerned, it's just one link, so if one individual connection fails, there's no outage. Most hardware allows you to spread these links over multiple line cards in a chassis-based switch, so even if a card fails, your users shouldn't notice.
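The reason an aggregated bundle recovers so cleanly is that each frame is simply hashed onto one of the surviving member links. The sketch below mimics the common source/destination-MAC hashing scheme - the exact algorithm is vendor-specific, and the port names are made up for illustration:

```python
import zlib

# Sketch of 802.3ad-style member selection: a flow is hashed onto one
# live member link. The CRC32-of-MACs hash stands in for the real,
# vendor-specific algorithm; port names are hypothetical.
def pick_link(src_mac: str, dst_mac: str, live_links: list) -> str:
    """Return the bundle member that carries this flow."""
    key = (src_mac + dst_mac).encode()
    return live_links[zlib.crc32(key) % len(live_links)]

bundle = ["gi1/1", "gi2/1"]   # members spread over two line cards
flow = ("00:11:22:33:44:55", "66:77:88:99:aa:bb")
print(pick_link(*flow, bundle))       # flow pinned to one member
print(pick_link(*flow, ["gi2/1"]))    # card 1 fails: rehash, no STP wait
```

Because the bundle looks like a single link to Spanning Tree, losing a member just changes the hash result - there's no topology change for the Layer 2 protocol to converge around.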
Your routing protocols also offer resilience, but again it may be a matter of tuning to make recovery quick enough. The maximum number of equal-cost WAN links a router will load-balance over is vendor-dependent, and can often be raised from its default. If your paths have unequal costs (different bandwidths, for instance), you may find the second one isn't being used at all. It can be argued that a backup link significantly smaller than your main one is a bit pointless, but at least you can set filters to let only the vital traffic through if the primary fails.
Again, timers that determine how long it takes for your routing protocol to notice a failure can be tweaked - it's best to practise this in a test environment or out of hours though - and the algorithm itself will define how quickly a backup route can be brought into service, so make sure you know how they work by default and what you can alter.
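To see why those timers matter, add up the pieces of a recovery: the dead (or hold) interval before the neighbour is declared down, plus any delay before the protocol recomputes its routes. The hello/dead figures below are the well-known OSPF defaults for broadcast networks; the recomputation delays are illustrative assumptions:

```python
# Rough upper bound on routing recovery: the dead timer must expire
# before the failure is even noticed, then the protocol waits out its
# recomputation throttle and runs the calculation itself.
def recovery_estimate(dead_interval: int, spf_delay: int, spf_run: int) -> int:
    """Seconds from failure to new routes, all inputs in seconds."""
    return dead_interval + spf_delay + spf_run

print(recovery_estimate(40, 5, 1))  # OSPF default dead timer: ~46 s blind
print(recovery_estimate(4, 1, 1))   # hypothetical tuned timers: ~6 s
```

As with Spanning Tree, the trade-off is stability: aggressive hello timers mean more protocol chatter and a greater risk of declaring a busy neighbour dead by mistake.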
Going back to redundancy for a moment: if you have multiple processors in a switch or router, how quickly does the secondary come online when the primary fails? You may have to pay extra for the software that provides the high-availability option. Claims that a backup processor mirrors the state of the main one, and so can take over instantly, need verification. That may (or may not) be true at Layer 2, but however clever the hardware, there are limits to how quickly neighbouring routers can learn the new topology and recompute their routing decisions. Some of those limits can only be removed by changes to the routing protocols themselves, so, short of proprietary mechanisms, it's unrealistic to expect completely hit-free recovery.
The main thing to understand is how the various network protocols on your network behave, and what can - and should - be tuned. Investigate all the non-default settings, try some changes out (not in your live network!), and see what you can do to improve recovery times so your users don't spot the outages.