I’ve asked a number of people recently whether they’d installed clustering on Windows servers. None of them had – though one had inherited a clustered server pair that he looked after (or which, it turned out, pretty well looked after itself). In many cases people think that clustering is either (a) too expensive or (b) too hard to set up for the benefits to justify the effort.
It can’t be denied that clustering is expensive when compared to the alternative (ie buying a single server with redundant everything and plenty of RAID). Not only do you have to buy more than one server, but if you’re in a Windows world you need the Enterprise version of Windows Server – which has a hefty price tag compared to the Standard edition. And what about the benefits? Well, in a server with redundant disks, PSUs and fans, and with a decent architecture that can even swap out duff RAM, the main thing you’re gaining is protection against the server motherboard going a bit Pete Tong. Which isn’t all that common.
Something else I discovered is that the difference between clustering and replication is widely misunderstood. Clustering is often taken as synonymous with the act of having two servers with copies of all the programs and data, with replication turned on so that server B keeps a copy of server A’s world (and potentially vice versa) such that one machine can take over in the event of the other failing. This isn’t, in fact, what you do if you’re using Clustering Services under Windows – it’s just data replication.
In a Windows cluster, you have two or more servers running the programs, with a shared disk (or disks) holding the data. All the servers are able to see the data (which could be on a SAN of some description or could be connected directly to each server via SCSI or Fibre Channel) and the servers decide between themselves which is allowed to use the shared disk(s) at any time. Both servers have copies of the application programs on their own internal disks, and they negotiate between themselves which machine is running each application at any given time. In the event of a problem, each affected application is switched to a different server. Let’s go through how it all hangs together step by step.
1. The shared storage
The shared storage subsystem is potentially the weak link in the system – after all, there’s only one of them. It’s essential, therefore, that you’re able to use redundant power supplies and external connection adaptors (Fibre Channel, SCSI, etc) and that you have disks in a sensible RAID configuration (at least mirrored or RAID 5, preferably a higher level in order to allow for multiple drive failures).
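To put rough numbers on those RAID choices, here is a small Python sketch. The function name and the eight-disk, 300GB example are made up for illustration; the capacity and fault-tolerance figures are the standard ones for a mirrored pair (RAID 1), RAID 5 and RAID 6, assuming every disk in the array is the same size.

```python
def raid_summary(level, disks, disk_gb):
    """Return (usable_gb, failures_survived) for an array of identical disks.

    Illustrative helper only: covers a two-disk mirror (RAID 1), RAID 5 and RAID 6.
    """
    if level == 1:
        if disks != 2:
            raise ValueError("this sketch treats RAID 1 as a two-disk mirror")
        return disk_gb, 1
    if level == 5:
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        # One disk's worth of space goes on parity.
        return (disks - 1) * disk_gb, 1
    if level == 6:
        if disks < 4:
            raise ValueError("RAID 6 needs at least four disks")
        # Two disks' worth of space goes on parity.
        return (disks - 2) * disk_gb, 2
    raise ValueError("unsupported RAID level in this sketch")


if __name__ == "__main__":
    # Hypothetical shelf sizes, purely for comparison.
    for level, disks in [(1, 2), (5, 8), (6, 8)]:
        usable, survives = raid_summary(level, disks, 300)
        print(f"RAID {level}, {disks} x 300GB: {usable}GB usable, "
              f"survives {survives} disk failure(s)")
```

The trade-off is plain once it's written down: tolerating a second drive failure costs you one more disk's worth of capacity.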
Now, there are two approaches you can use in the RAID array. One is to buy an array which has built-in RAID capabilities – that is, the box itself deals with the RAID task and presents the disk set to external devices as one or more logical volumes. The other is to buy an array which is simply a box of disks, and implement the RAID function on adaptor cards in your servers. The first option is certainly the neatest, since all you have to do is bung a standard (non-RAID) adaptor in each server and the disk array does the RAID. The second option’s supported by a lot of popular kit, though (if you buy something like a pair of PowerEdge servers and a PowerVault 220 from Dell, this is the way you’ll be working), and it’s not rocket science to get working.
If you’re using a pair of servers and a single disk array, SCSI connection’s probably the way you’ll choose to go, with each server connecting to a SCSI card in the disk array (don’t forget to set the card in each server to a different SCSI ID, or you’ll break things). With more servers, or perhaps multiple arrays, you’ll be heading for a SAN of some sort to achieve your many-to-many connections.
2. The servers
On the face of it, it sounds like you can buy a bunch of cheap servers and cluster them so that they take over from each other in the event of a failure. It’s not quite that simple, though, when you realise that you’re not actually clustering servers, but services.
I’ll explain. The common way to implement clustering is an active/passive setup – that is, one entity is the live server dealing with the applications, and the other is watching the live one and can take over if the former goes away. The thing is, though, that you don’t have to run all your live stuff on one server. To contrive a scenario, say you have SQL Server and Exchange and a pair of servers. It seems a shame to have both applications running on one machine, with the second machine sitting idle 99.9% of the time. Why not have server A run the active SQL Server and a passive instance of Exchange, and server B run the active Exchange and the passive SQL Server? That way you get good use of your kit, but if one server dies you retain both services – albeit with some speed degradation. Incidentally, you can have the same application running actively on multiple servers in some circumstances, but we won’t go there just now – have a look at http://technet2.microsoft.com/WindowsServer/en/Library/8846a72b-0882-4a24-8eee-a768e52925281033.mspx?mfr=true if you’re interested in doing this.
Because you have multiple servers running live instances of applications, then, you’ll want some redundancy. Redundant fans and PSUs are a definite must, and you’ll probably want to have a mirrored pair of disks as the boot volume, just in case.
3. Addressing
This is where it gets clever. Each server will, of course, have its own IP address. What you do with clustering is to add “virtual servers” that are understood by the operating system. A virtual server has its own little resource set (basically an IP address and a shared disk volume), and it is this “virtual” IP address that is used by the cluster to decide who answers what calls from the network. So if your SQL Server service has the address 10.1.1.10, and server A is normally the active server for that service, it will answer requests aimed not only at its own individual IP address but at 10.1.1.10 as well. If server A dies, server B starts answering requests to 10.1.1.10 – so as soon as clients’ ARP caches get flushed, their requests get magically dealt with by server B.
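If it helps to see the idea as a data structure, the following Python sketch models a virtual server as an address plus a shared volume that move together from node to node. It is a toy model, not how the Windows cluster service is implemented: the class names, the node addresses 10.1.1.1 and 10.1.1.2 and the "S:" volume are all invented for illustration, and only 10.1.1.10 comes from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualServer:
    name: str
    ip: str       # the "virtual" address that clients connect to
    volume: str   # the shared disk volume that travels with the service


@dataclass
class Node:
    name: str
    own_ip: str
    alive: bool = True
    hosted: list = field(default_factory=list)

    def answers_for(self):
        """Addresses this node responds to: its own, plus any virtual servers it hosts."""
        if not self.alive:
            return set()
        return {self.own_ip} | {vs.ip for vs in self.hosted}


def fail_over(failed, survivor):
    """Mark a node as dead and hand all its virtual servers to the survivor."""
    failed.alive = False
    survivor.hosted.extend(failed.hosted)
    failed.hosted = []


if __name__ == "__main__":
    sql = VirtualServer("SQL Server", "10.1.1.10", "S:")
    node_a = Node("A", "10.1.1.1", hosted=[sql])
    node_b = Node("B", "10.1.1.2")

    print("Before:", sorted(node_a.answers_for()), sorted(node_b.answers_for()))
    fail_over(node_a, node_b)
    print("After: ", sorted(node_a.answers_for()), sorted(node_b.answers_for()))
```

The point to take away is that clients only ever talk to the virtual address, so they never need to know which physical box is behind it.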
4. How the servers interconnect
The sensible way to go for clustering is to have two LAN interfaces in each machine – not a problem these days, as most servers come with two interfaces anyway. One interface on each machine is used to connect to the active LAN and service requests; the other sits on a private LAN and simply deals with cluster-related communications that the servers throw between them (mainly the frequent “heartbeat” that is sent between machines to check there’s life). You’ll also find that the servers in a cluster use a small (500MB or so) shared disk volume (called the “quorum” in Windows parlance) to store shared data.
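The heartbeat idea is simple enough to sketch in a few lines of Python. This is a generic illustration of timeout-based failure detection, not the actual protocol or timings the Windows cluster service uses; the class name and the timeout value are assumptions, and the clock is injectable so the logic can be exercised without waiting around.

```python
import time


class HeartbeatMonitor:
    """Tracks when each peer was last heard from over the private LAN."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_seen = {}

    def record(self, peer):
        """Call this whenever a heartbeat arrives from a peer."""
        self.last_seen[peer] = self.clock()

    def dead_peers(self):
        """Peers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=0.2)  # hypothetical timeout
    monitor.record("A")
    monitor.record("B")
    time.sleep(0.3)
    monitor.record("B")  # B is still talking; A has gone quiet
    print("Presumed dead:", monitor.dead_peers())
```

A node that goes quiet for longer than the timeout is presumed dead, which is the trigger for its services to be picked up elsewhere.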
5. How you get started
Starting up clustering is actually dead easy. First, in a Windows world all your servers need to be in an Active Directory (AD) domain, and you’ll need to set up a user ID that they can all use. Then you turn on just one server and use the normal OS and vendor tools to configure the RAID arrangement and define/format the various disks that are to be shared. Next, run up the Cluster Administrator application and walk through the wizard answering some dead simple questions (what user ID to log in as, which volumes to treat as shared, etc). Within five minutes you’ll have a one-node cluster. Then it’s just a case of turning all the other servers on and hitting “Add node” in the management application.
6. Applications
In theory, there’s nothing to prevent you from clustering pretty well any application by defining a virtual server by hand, allocating resources, installing the application to the virtual server, and so on. Nine times out of ten, though, clusters are used to run applications that are inherently cluster-aware. This means that if you install, say, SQL Server, the installer will say: “Ah, that’s a cluster – would you like me to create a virtual server for you and install myself on it?” Not only this, but because you’re installing to a virtual server, the software gets installed on all servers at once – that is, you don’t have to walk around every machine with the CD.
7. Dealing with failovers
In a two-server cluster, it’s pretty obvious how the nodes will behave in the event of a server problem. With more than two nodes, though, you can manually define how each service (virtual server) is to be dealt with in the event of a failover. How complex you make your approach is entirely up to you – with shedloads of servers you can have every machine actively running something, or you can choose to have one or more "hot spare" machines that sit idle until something breaks. Oh, and if you want to take a server offline for maintenance, you can fail services over manually.
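One way to picture a failover policy for more than two nodes is as an ordered preference list per service. The Python sketch below is just that picture: the node names, the PREFERENCES table and both function names are hypothetical, and node C plays the part of a hot spare that runs nothing until A or B goes away.

```python
def choose_owner(preferred_nodes, alive_nodes):
    """Return the first node in the preference list that is still up, or None."""
    for node in preferred_nodes:
        if node in alive_nodes:
            return node
    return None


def place_services(preferences, alive_nodes):
    """Work out which node should run each service, given who is alive."""
    return {
        service: choose_owner(order, alive_nodes)
        for service, order in preferences.items()
    }


# Hypothetical policy: A and B each run one service, C is the hot spare.
PREFERENCES = {
    "SQL Server": ["A", "C"],
    "Exchange": ["B", "C"],
}

if __name__ == "__main__":
    print("All up:     ", place_services(PREFERENCES, {"A", "B", "C"}))
    print("A has died: ", place_services(PREFERENCES, {"B", "C"}))
    # Taking B offline for maintenance looks the same as B failing.
    print("B offline:  ", place_services(PREFERENCES, {"A", "C"}))
```

Treating a node as absent is also a fair model of a manual failover: take it out of the alive set and its services land on the next node in their lists.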
8. Will the user notice?
In a connectionless application (ie one that doesn’t retain a constant connection to the server, but instead connects as required) there’s every chance the users won’t even notice a failure – or if they do, it’ll simply be a slight pause while the virtual server addresses get swapped. For connection-oriented applications, though (eg Telnet sessions) the connection will be terminated and the user may get a "Connection lost" error message. So long as the application has been written to deal with lost connections (and let’s face it, it’s no different from what would happen if someone kicked a network cable out) it should retry the connection and will reconnect quite happily to the new machine that’s taken on the server’s address.
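What "written to deal with lost connections" means in practice is usually a retry loop along the following lines. This is a minimal Python sketch using the standard library's socket.create_connection; the function name, the retry count, the delay and the timeout are all arbitrary choices for illustration, and the connect parameter exists only so the logic can be tried out without a real server.

```python
import socket
import time


def connect_with_retry(host, port, attempts=5, delay_seconds=2.0,
                       connect=socket.create_connection):
    """Try to connect to the service address, pausing between failed attempts."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect((host, port), timeout=5)
        except OSError as error:
            # The old node has gone; the new one may not have taken over yet.
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    raise ConnectionError(
        f"could not reach {host}:{port} after {attempts} attempts"
    ) from last_error
```

A client would call something like connect_with_retry("10.1.1.10", 1433), using the virtual address from the earlier example and SQL Server's default port, and carry on with whichever node answers. The client never needs to know that the machine behind the address has changed.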
Summary
Don’t be scared of clustering. It’s easy to set up, and although you need multiple servers, they don’t have to be massively expensive. It’s not unreasonable to expect to go from a pair of servers still in the cardboard boxes to a basic two-node SQL Server cluster in a day – and once it’s up and running, the management tools make it easy to monitor machines and add/change/remove services as you wish.