In the olden days of serial and coax connections, it was dead simple for network kit to figure out if a connection went down. Everything was direct point to point, and if a cable broke or an interface failed, you lost the carrier signal and knew instantly.
Then came the heady days of shared and now switched networks, and it’s not so easy any more.
Just because your server has an active connection into its local switch doesn’t really mean anything now. The switch could be isolated from the rest of the network, so although you have a physical link, you’re not going to get anywhere.
If you have two routers, for instance, connected via their LAN interfaces through a switch, neither can tell directly whether the path to the other is still up. Instead we have to rely on higher-layer protocols to keep track of reachability. And with the push for sub-second convergence and constant uptime, we need to do some pretty aggressive tuning to make sure our network kit can determine as quickly as possible if the link to a neighbour fails and do something about it.
You need to detect that a link has failed before you can do anything about rerouting and converging around the failure. All the summarisation, stub areas and clever network design can only do so much if it takes 30 seconds to even notice a link has failed in the first place.
There are ways to do this, including things like OSPF Fast Hellos or BFD, but these add extra overhead to your network: if you want to detect failures in a second or so, you’re going to have to send out an awful lot of hellos very quickly.
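To put rough numbers on that overhead, here is a minimal sketch of the arithmetic. It assumes the usual hello-style model, where a neighbour is declared dead after a fixed number of consecutive missed hellos, so detection time is roughly hello interval times that multiplier. The function name, the multiplier of 4 and the count of 20 neighbours are all illustrative assumptions, not figures from any particular network.

```python
def hello_load(detect_ms, multiplier, neighbours):
    """Return (hello interval in ms, hellos sent per second to all neighbours).

    detect_ms:  target failure detection time in milliseconds
    multiplier: missed hellos tolerated before declaring the neighbour dead
    neighbours: number of neighbours being monitored (assumed value)
    """
    interval_ms = detect_ms / multiplier
    per_second = neighbours * 1000 / interval_ms
    return interval_ms, per_second


# A leisurely 40-second dead timer with a 10-second hello interval.
print(hello_load(40_000, 4, 20))  # (10000.0, 2.0)

# One-second detection with the same multiplier and neighbour count.
print(hello_load(1_000, 4, 20))   # (250.0, 80.0)
```

Going from a 40-second to a 1-second detection target multiplies the hello rate by 40 in this model, and every one of those packets has to be generated and processed by the control plane at both ends.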
So one of the great things about the move to point-to-point Gigabit links between switches or routers, quite apart from the additional speed, is that if you connect two devices via a direct Gigabit Ethernet link, failure detection times are pretty much instantaneous. Upper layer protocols are notified as soon as the physical link goes down, so you’ve now given yourself pretty-near instant failure detection, and now all you have to deal with is the rerouting part.
This is great in the LAN, where you can just run fibres through your buildings, but what about campus networks, or inter-site links? Well, if you have a Metro-Area Network, you can do the same thing. You can get dark (or managed) fibre from telcos, which allows you to build GigEthernet-based networks that span cities, or even further if you’re prepared to pay for them.
But therein lies a tale, as they say. A company I’ve been working with has several Head Office buildings scattered throughout a large city. These buildings have been operating as a MAN since the days they had an FDDI ring running round the streets. Now it’s all GigEthernet, but the concept’s the same and they operate it pretty much as a very large LAN.
Except that when you buy a fibre link between buildings, that isn’t necessarily what you get. Telcos aren’t always all that keen on running dedicated point-to-point fibres for individual customers, so they use DWDM optical multiplexers to provide a circuit per wavelength, which means they don’t need so many physical fibres.
As far as the network switches are concerned, they’re just plugged into a direct Gigabit link. They’re unaware of anything magical happening with mirrors mid-stream.
But if there’s a failure in the DWDM environment, it takes a small amount of time (about 50 milliseconds) for the DWDM kit to recover. Which is pretty fast to be honest. Just not fast enough for the switches. Because they think they’re plugged into a direct link, they spot the failure straight away and start rerouting.
Meanwhile at the physical layer, the DWDM kit is swapping over to its backup path. Which, at 50 ms, is a lot faster recovery than even the best Layer 3 reroute. So the service is interrupted while Layer 3 does its thing, even though the physical path is restored much faster.
Turns out there is a solution to this though. You can configure something called Carrier Delay on the switch interfaces, which lets you say that rather than reporting a link failure immediately, the interface should wait a short time (50 milliseconds, say) before it notifies any higher-layer protocols that the link is down. Which gives the DWDM kit time to switch over to another path, and so is transparent to any user or management traffic. And if it turns out that it’s something the DWDM kit can’t recover from, well, all you’ve added is a 50 millisecond delay, which in the scheme of reconvergence is pretty insignificant.
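The trade-off is easy to see with a toy model. The sketch below is a deliberate simplification: it assumes the interface reports link-down only if the carrier is still missing when the delay timer expires, and that Layer 3 then needs a fixed time to reroute. The function name and the 2-second reconvergence figure are assumptions for illustration; only the 50 ms DWDM switchover comes from the story above. The carrier delay is set to 60 ms here, a little above the switchover time, purely so the boundary case is unambiguous in the model.

```python
def traffic_outage_ms(carrier_loss_ms, carrier_delay_ms, reconverge_ms):
    """Simplified estimate of how long traffic is interrupted, in ms.

    carrier_loss_ms:  how long the physical signal is gone (None = never returns)
    carrier_delay_ms: how long the interface waits before reporting link down
    reconverge_ms:    assumed time for Layer 3 to reroute once notified
    """
    recovered_in_time = (
        carrier_loss_ms is not None and carrier_loss_ms < carrier_delay_ms
    )
    if recovered_in_time:
        # Upper layers never hear about the blip; only the physical gap is felt.
        return carrier_loss_ms
    # Link down is reported when the timer expires, then routing reconverges.
    return carrier_delay_ms + reconverge_ms


DWDM_SWITCHOVER_MS = 50  # from the text
RECONVERGE_MS = 2000     # assumed, for illustration only

# No carrier delay: the switches react at once and Layer 3 reroutes.
print(traffic_outage_ms(DWDM_SWITCHOVER_MS, 0, RECONVERGE_MS))   # 2000

# Carrier delay just above the switchover time: the blip is ridden out.
print(traffic_outage_ms(DWDM_SWITCHOVER_MS, 60, RECONVERGE_MS))  # 50

# A failure the DWDM kit cannot recover from: the delay is simply added on.
print(traffic_outage_ms(None, 60, RECONVERGE_MS))                # 2060
```

In this model the worst case costs you the carrier delay on top of normal reconvergence, while the common case shrinks from a full reroute to the length of the optical switchover.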
Of course the default value for this Carrier Delay setting is zero, so if you don’t know (or bother) to change it, you’re going to be getting outages when you don’t need to, and blaming your telco for not providing the service you expected, when in fact it is.
This company I mentioned couldn’t figure out why they kept getting outages when they didn’t think they should. Until someone pointed out this 50 millisecond recovery time and they realised they hadn’t configured their network kit to account for it. They found this a bit embarrassing. Maybe something worth checking in your network too?