Anyone installing a server in a business considers the issue of resilience – to avoid unwanted failures we buy RAID disk arrays, redundant power supplies, redundant fan units, even extra CPUs. But what happens if something goes wrong on the network side? What happens, for instance, if the LAN adaptor blows up, or the Ethernet switch that the server's connected to dies? Wouldn't it be great if you could build as much resilience into the server's network connection as you have in its internal hardware?
Note at this point that there are plenty of ways of addressing this problem expensively and without the users noticing any outages at all, save perhaps for a few seconds' uncertainty while routers and switches auto-reconfigure. A popular approach is to use switches, routers and network adaptors that support redundancy protocols such as Cisco's HSRP (Hot Standby Router Protocol), whose entire existence is due to the desire to keep things working with zero downtime.
This isn't what we're setting out to do here. We're more concerned with providing acceptable automatic failover for little or no extra cost in a network where a short outage is acceptable so long as everything does continue to work after a few minutes.
Two network adaptors
The first obvious thing to do is install a second network adaptor in the server – after all, it's only 20 quid. This is, of course, the easy bit – the interesting part of the story is how you get the world to understand that there are two ways to get into the server.
The main issue with connecting a server through two network adaptors is how you get the machines on the network to understand (a) which adaptor to send stuff to by default; and (b) how to get to the server through the secondary adaptor if the main one dies.
Getting packets out
Before we consider this, we should consider how the packets get out of the server onto the network. If we have a pair of similar interfaces (eg they're both Gigabit Ethernet) the server's built-in IP drivers will pretty well deal with this for us by choosing an arbitrary interface to send packets through. In some installations the primary interface may be faster than the secondary (we may have a Gigabit connection that we want to use by default, but a Fast Ethernet card connected to a 100Mbit/s switch for the backup link). In this case, we could change the metrics of the server's internal routing table such that the route to the local subnet has a lower metric for the Gigabit card than for the Fast Ethernet card. Either way, should the interface out of which the system is sending its packets lose its link, the operating system will think for a few seconds, then flag that interface as unavailable and update its routing table appropriately.
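To make the metric idea concrete, here is a minimal Python sketch of the selection rule described above: of the routes whose link is still up, the one with the lowest metric wins. It is a toy model rather than the operating system's actual routing code, and the interface names and metric values are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    interface: str  # hypothetical interface name
    metric: int     # lower metric = preferred route
    link_up: bool   # False once the OS has flagged the link as unavailable


def pick_route(routes: List[Route]) -> Optional[Route]:
    """Return the usable route with the lowest metric, or None if every link is down."""
    usable = [route for route in routes if route.link_up]
    return min(usable, key=lambda route: route.metric, default=None)


routes = [
    Route("gigabit0", metric=10, link_up=True),  # preferred Gigabit card
    Route("fast0", metric=20, link_up=True),     # Fast Ethernet backup
]
print(pick_route(routes).interface)  # gigabit0

routes[0].link_up = False            # the Gigabit link is lost
print(pick_route(routes).interface)  # fast0
```

With equal metrics the model simply returns the first usable entry, which mirrors the "arbitrary interface" behaviour for a pair of similar cards.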
Getting packets in – Windows
That's the easy bit done; how about the other way around – that is, how do our workstations see the server when its main interface dies?
Let's set up a server called "PIII-1000" (it's a Pentium-III 1GHz machine and we're not very imaginative) that's connected to the network in two places. Both connections are in the same subnet (the Class C subnet 192.168.1.0); one has the address 192.168.1.222, the other 192.168.1.223. Both adaptors are connected and our client computers see the server just fine.
Now we pull the plug on one of the connections. After a few seconds the server gets its bearings and disables the disconnected NIC, but as this happened to be the connection the server was using to talk to the LAN, the client gets horribly confused. This is because although Windows systems refer (as far as the user's concerned) to remote systems by name, behind the scenes it's all down to IP addresses.
Note at this point that the mapping between names and addresses is being handled by Windows' NetBIOS naming system – we're not touching the domain name service at all (yet). So when we wanted to talk to server "PIII-1000", Windows did some NetBIOS lookup magic and translated the name into the address 192.168.1.223. It then remembered this address for future reference, so that it didn't have to do the lookup for every transmission. Fortunately for us, it only stores the information for a finite time (generally a few minutes) – and so when its "NetBIOS cache timeout" has expired, it'll have to do another lookup.
Which indeed it does. So at the very worst, we'll be without connectivity to the server for whatever time the NetBIOS name cache timeout happens to be set to – and when the NetBIOS cache timer ticks down, Windows will re-fetch the server's address and we'll be back up and running.
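The behaviour is easy to model. The Python sketch below is a toy name cache with an expiry timer, not a description of how Windows implements NetBIOS caching; the class name, the 300-second timeout and the fake clock are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class NameCache:
    """Toy name cache: remembers an address until its timeout expires."""

    def __init__(self, resolve: Callable[[str], str], timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve      # performs the real (slow) lookup
        self._timeout_s = timeout_s  # how long an answer is trusted
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def lookup(self, name: str) -> str:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]  # entry still fresh: no new lookup
        address = self._resolve(name)
        self._entries[name] = (address, now + self._timeout_s)
        return address


# Simulated network: the name currently answers on 192.168.1.223.
now = [0.0]
answers = {"PIII-1000": "192.168.1.223"}
cache = NameCache(lambda name: answers[name], timeout_s=300, clock=lambda: now[0])

print(cache.lookup("PIII-1000"))  # 192.168.1.223
answers["PIII-1000"] = "192.168.1.222"  # the .223 link dies
now[0] = 120
print(cache.lookup("PIII-1000"))  # 192.168.1.223 (stale entry, client is stuck)
now[0] = 301
print(cache.lookup("PIII-1000"))  # 192.168.1.222 (timer expired, fresh lookup)
```

The stretch between the link dying and the timer expiring is the outage the client sees, which is why the worst case equals the cache timeout.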
Getting packets in – Unix/Linux
If we're not using Windows on the server, we have a potential problem – namely that we're probably using the Domain Name Service instead of NetBIOS. This means that we almost certainly don't have the ability for the address corresponding to a server name to change on the fly without human intervention.
We therefore have to be a bit sneaky; there are two ways to proceed and both are variations on a theme. First of all, we fit the server with two adaptors as in the Windows example above, ideally with each adaptor linked to a different switch so that a switch failure is covered too. What we do with the interfaces depends on the option we want to use, though.
Option one is to have both adaptors active, each with a different IP address, and to make the DNS change its settings should the primary adaptor fail. This involves setting up a simple script on the DNS server that "pings" the server's primary address every minute or two and, if it doesn't get a reply, modifies the address entry for the server's name in the DNS server's configuration files. So long as the timeout on that entry (its time-to-live, or TTL) is set to something small (five minutes isn't unreasonable), the machines trying to contact the server will soon perform a fresh DNS lookup and be told the new address.
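As a rough illustration of option one, the Python sketch below pings the primary address and, if there's no reply, rewrites the server's A record in some zone-file text. It assumes a BIND-style zone file with one record per line and a Linux-style ping command (the -c and -W options); the function names and the sample records are hypothetical. A real DNS server would also need its zone serial number increased and the zone reloaded, which the sketch leaves out.

```python
import re
import subprocess
from typing import Callable


def is_reachable(address: str) -> bool:
    """Send one ping (Linux-style options assumed) and report whether a reply came back."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def repoint_a_record(zone_text: str, host: str, new_address: str) -> str:
    """Replace the address in lines such as 'piii-1000 300 IN A 192.168.1.222'."""
    pattern = re.compile(
        rf"^({re.escape(host)}\s+(?:\d+\s+)?(?:IN\s+)?A\s+)\S+",
        re.MULTILINE | re.IGNORECASE,
    )
    return pattern.sub(lambda match: match.group(1) + new_address, zone_text)


def update_zone_if_down(zone_text: str, host: str, primary: str, secondary: str,
                        probe: Callable[[str], bool] = is_reachable) -> str:
    """Return the zone text unchanged if the primary answers, else repointed at the secondary."""
    if probe(primary):
        return zone_text
    return repoint_a_record(zone_text, host, secondary)


# Demonstration with a fake probe, so nothing is actually pinged here.
zone = "piii-1000 300 IN A 192.168.1.222\nmail 300 IN A 192.168.1.10\n"
print(update_zone_if_down(zone, "piii-1000", "192.168.1.222", "192.168.1.223",
                          probe=lambda address: False))
```

The 300 in the sample records is the five-minute TTL; the smaller it is, the sooner clients notice the change and the more lookups the DNS server has to answer.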
Getting the DNS to change itself is a pain, though, and if you have several servers the scripting exercise that allows it to change its own configuration is non-trivial. It's a much nicer idea to have each server deal with its own problems. So in the DNS we allocate one IP address to our server, and configure this address into the primary interface. We disable the secondary interface, and don't allocate it an address at all. We then run a script on the server that "pings" a machine we know we can rely on to be up (maybe a router on the network, as routers are rarely rebooted). In the event of a problem on the primary interface, our script disables the interface, allocates the primary interface's address to the secondary card, and enables the secondary interface. Et voila, the server is magically still there, with the same IP address (so no problem with the clients seeing a change of address).
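A minimal Python sketch of such a watchdog follows. It assumes a Linux server with the iproute2 "ip" command and a Linux-style ping; the interface names, the addresses in the usage comment and the rule of waiting for three missed pings in a row are illustrative assumptions rather than anything prescribed above.

```python
import subprocess
import time
from typing import Callable, List


def ping_ok(target: str) -> bool:
    """One ping with a two-second wait (Linux-style options assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def failover_commands(address_cidr: str, primary_if: str,
                      secondary_if: str) -> List[List[str]]:
    """Commands that move the server's address from the primary card to the secondary."""
    return [
        ["ip", "link", "set", "dev", primary_if, "down"],        # disable the primary
        ["ip", "addr", "del", address_cidr, "dev", primary_if],  # release its address
        ["ip", "addr", "add", address_cidr, "dev", secondary_if],
        ["ip", "link", "set", "dev", secondary_if, "up"],        # enable the secondary
    ]


def run_checked(command: List[str]) -> None:
    subprocess.run(command, check=True)  # needs root privileges on a real server


def monitor(target: str, commands: List[List[str]],
            probe: Callable[[str], bool] = ping_ok,
            run: Callable[[List[str]], None] = run_checked,
            max_failures: int = 3, interval_s: float = 30.0,
            sleep: Callable[[float], None] = time.sleep) -> None:
    """Ping the target until max_failures pings in a row are missed, then fail over."""
    failures = 0
    while failures < max_failures:
        failures = 0 if probe(target) else failures + 1
        if failures < max_failures:
            sleep(interval_s)
    for command in commands:
        run(command)


# Example call (hypothetical names; 192.168.1.1 stands in for the reliable router):
# monitor("192.168.1.1", failover_commands("192.168.1.222/24", "eth0", "eth1"))
```

Waiting for several missed pings stops a single dropped packet from triggering a failover, at the cost of a slightly longer outage when the fault is real.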
There's just one problem with this approach, though: although the server name still maps to the same IP address, the IP address doesn't map to the same MAC address as before – and nor is it in the same place on the network as far as the ARP caches of the client PCs and the address tables of the layer 2 switches are concerned. If you're using this option, then, you need to make sure that the cache timeouts for everything on the LAN are set to a sensible value that gives a good compromise between minimising ARP broadcasts and minimising refresh times.
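The compromise can be put into rough numbers. The Python sketch below treats the ARP cache as a simple fixed timer, which real network stacks may refine, and assumes the failover script takes about 90 seconds to notice the fault; both figures and the function name are illustrative only.

```python
from typing import Dict


def arp_tradeoff(arp_timeout_s: float, detection_delay_s: float) -> Dict[str, float]:
    """Rough worst-case outage and ARP refresh rate for a given cache timeout."""
    return {
        # A client that refreshed its entry just before the failure waits the
        # full timeout after the server has moved its address to the other card.
        "worst_case_outage_s": detection_delay_s + arp_timeout_s,
        # Each cached entry is re-requested roughly this often.
        "refreshes_per_hour": 3600 / arp_timeout_s,
    }


for timeout_s in (60, 300, 1200):
    print(timeout_s, arp_tradeoff(timeout_s, detection_delay_s=90))
```

Short timeouts shrink the outage but multiply the refresh traffic; long ones do the reverse.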
If you want invisible (or, realistically, near-invisible) failover, it'll cost you money, because you'll need proper hardware that inherently supports failover protocols such as HSRP. You can, however, give yourself a modest level of resilience for next to no cost, either by relying on the built-in capabilities of the operating systems (if you have Windows servers) or with a little bit of ingenuity (if you have Linux).
The trick is to be sensible about your use of addressing systems. Hard-coding IP addresses into equipment is a recipe for disaster – for ease of administration and support you should always use some kind of naming mechanism, be it NetBIOS/WINS under Windows or the DNS under Unix/Linux.
And one last thing: when you're writing your routines to handle the automatic failover, don't forget that you'll also need to have some scripts or procedures for getting everything back onto the primary interface once you've fixed the problem.
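For a Linux server that has moved its address onto the secondary card, a failback procedure could be sketched as the reverse sequence. The Python below only builds and prints the commands so that an administrator can review them before running anything; it assumes the iproute2 "ip" command, and the interface names and address are hypothetical.

```python
from typing import List


def failback_commands(address_cidr: str, primary_if: str,
                      secondary_if: str) -> List[List[str]]:
    """Commands that hand the server's address back to the repaired primary card."""
    return [
        ["ip", "link", "set", "dev", secondary_if, "down"],        # retire the backup
        ["ip", "addr", "del", address_cidr, "dev", secondary_if],  # release its address
        ["ip", "addr", "add", address_cidr, "dev", primary_if],
        ["ip", "link", "set", "dev", primary_if, "up"],            # back on the primary
    ]


# Print the commands for review rather than executing them.
for command in failback_commands("192.168.1.222/24", "eth0", "eth1"):
    print(" ".join(command))
```

Running this by hand at a quiet time, rather than automatically, keeps the second short outage under the administrator's control.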