LifeKeeper is an application clustering and disaster recovery system that supports Linux (Red Hat and SuSE) and Windows (2000 Server and Server 2003).

There are two parts to the system. The core component deals with disk replication, which it does at block level – that is, you define a "master" partition on machine A and a "slave" partition on machine B, and the low-level replication engine keeps the two partitions identical. (Incidentally, the replicator sits just above the filesystem level, so it simply plonks on top of whatever filesystem type you've chosen – ext3, reiserfs or whatever). You then have a collection of application-specific components which work with the applications' own monitoring, startup and shutdown routines in order to make failovers happen gracefully.
The range of supported applications differs from Windows to Linux (unsurprisingly, given that you get different applications on each) but is extensive, from basic stuff like the Apache Web server right up to heavy-duty DBMSs. The full list of supported applications is on the company website – and if you have an application that's not on the list, there's an SDK for both Windows and Linux that lets you write your own (or, of course, pay someone to do it).

The system runs as an active-passive setup – that is, you have only one half of a master-slave setup running at once. This doesn't mean that you have to have idle servers, though, because in (say) a two-server setup you might have server A as the master for Exchange and the slave for SQL Server, and server B as the slave for Exchange and the master for SQL Server. The only real issue is that you'll have to have each application using a different disk partition, as you'll have two replications happening in opposite directions.
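
As a rough illustration of that mutual-standby arrangement, here is a minimal Python sketch. The server names, device paths and dictionary layout are invented for the example and are not LifeKeeper's own configuration format. It simply checks that no two applications share a partition and that both servers are doing useful work.

```python
# Hypothetical layout: each application replicates its own partition,
# and the two replications run in opposite directions.
replication = {
    "exchange":   {"master": "serverA", "slave": "serverB", "partition": "/dev/sdb1"},
    "sql_server": {"master": "serverB", "slave": "serverA", "partition": "/dev/sdc1"},
}


def shared_partitions(layout):
    """Return (first_app, second_app, partition) for every partition used twice."""
    seen = {}
    clashes = []
    for app, cfg in layout.items():
        part = cfg["partition"]
        if part in seen:
            clashes.append((seen[part], app, part))
        else:
            seen[part] = app
    return clashes


def active_servers(layout):
    """Return the set of servers that are master for at least one application."""
    return {cfg["master"] for cfg in layout.values()}


print(shared_partitions(replication))
print(sorted(active_servers(replication)))
```

With this layout neither server sits idle: each is master for one application and slave for the other.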

To set the system up, you define "resources" – which are basically the applications to which you want to provide failover. A resource is then given a set of dependencies – things like disk partitions and IP addresses (the system defines virtual IP addresses on the servers' network adaptors, which LifeKeeper moves between machines as required) – and a priority level for each server, which tells the system which machines are preferred for each application.
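
To make the idea of resources, dependencies and priorities concrete, here is a small Python sketch. It is only a model of the concept: the `Resource` class, its field names, the example address and the convention that a lower number means a more-preferred server are all assumptions made for illustration, not LifeKeeper's real data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Resource:
    """Hypothetical model of a protected application."""
    name: str
    # Things the application needs, e.g. a replicated partition and a virtual IP.
    dependencies: List[str] = field(default_factory=list)
    # Server name -> priority. Assumption: lower number = more preferred.
    priorities: Dict[str, int] = field(default_factory=dict)

    def preferred_server(self, available: Set[str]) -> Optional[str]:
        """Return the most-preferred server that is currently available."""
        candidates = [s for s in self.priorities if s in available]
        if not candidates:
            return None
        return min(candidates, key=lambda s: self.priorities[s])


exchange = Resource(
    name="exchange",
    dependencies=["/dev/sdb1", "192.168.0.50"],  # made-up partition and virtual IP
    priorities={"serverA": 1, "serverB": 10},
)

print(exchange.preferred_server({"serverA", "serverB"}))
print(exchange.preferred_server({"serverB"}))
```

When both servers are up the resource prefers serverA; if serverA drops out, serverB is the next choice.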

Although the basic setup is a pair of servers, there's nothing to stop you defining one-to-many relationships so that (for example) your main Exchange server replicates to an on-site backup, which itself replicates to an off-site version in case the office burns down. (Incidentally, this is where the "disaster recovery" concept comes in – it's not a separate product, just the fact that you're replicating off-site). You can have up to 32 servers in a cluster, though according to SteelEye the only 32-node one they've come across is in their own lab!

There are two types of failover: manual and automatic. If, for example, you want to pull your master SQL Server host apart to add hardware, you simply tell the GUI (a Java application that you can run from wherever you wish) which slave you want to take over. It gracefully shuts down the application on the master, ships the necessary network information to the nominated slave (notably the IP address) and starts up the application there. Automatic failover happens, of course, when LifeKeeper knows that something's gone wrong with the master, either because the server crashes and disappears completely or because LifeKeeper has been monitoring the application itself and sees that it's stopped responding even though the server is still up. In that case the slave simply takes over the network identity of the master and fires up the application. Although you can tell the system to automatically switch back to the master when it comes back up, most cautious people will choose to make the switch by hand just so they're in control.
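
The two failover paths described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not LifeKeeper's actual logic: the function names, the two health flags and the step strings are all hypothetical.

```python
def needs_failover(server_alive: bool, app_responding: bool) -> bool:
    """Automatic failover is warranted if the server has vanished, or if the
    server is up but the monitored application has stopped responding."""
    return not (server_alive and app_responding)


def takeover_steps(app, master, slave, virtual_ip, graceful):
    """List the steps of a takeover in order.

    graceful=True models a manual switchover, where the application is shut
    down cleanly on the master first. graceful=False models an automatic
    failover, where the master may be unreachable and the slave just takes over.
    """
    steps = []
    if graceful:
        steps.append(f"stop {app} on {master}")
    steps.append(f"move {virtual_ip} from {master} to {slave}")
    steps.append(f"start {app} on {slave}")
    return steps


# Manual switchover before taking the master apart to add hardware.
for step in takeover_steps("sqlserver", "serverB", "serverA", "10.0.0.20", graceful=True):
    print(step)

# Server is up but the application has hung.
print(needs_failover(server_alive=True, app_responding=False))
```

The only difference between the two paths in this model is the graceful shutdown; in both cases the slave ends up with the master's network identity before the application starts.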

The length and nature of application interruptions vary with the application type, and the length is governed by the time it takes to fire up the appropriate application(s) on the slave. If you have a TCP-centric application, your connection will definitely be interrupted when the swap happens – but so long as the application has a sensible retry algorithm, it'll reconnect and you're away. UDP applications, being connectionless, don't have this trouble – but because it takes a tangible amount of time to actually fire up the relevant application on the slave, you may see the system pause for thought (in this case it depends how much data the application buffers at each read – we had just a five-second pause in a video we were playing over an NFS connection, for instance). According to SteelEye, Exchange is the service that users will most notice as it fails over, but from what we could see this is only really because the Outlook client complains that its server looks different all of a sudden and asks to be restarted.
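
For what a "sensible retry algorithm" might look like on the client side, here is a minimal Python sketch using exponential backoff. The function name, the default delays and the injected `connect` and `sleep` callables are assumptions made for the example; a real client would pass in its own connection routine.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure.

    connect is any callable that returns a connection or raises ConnectionError.
    The last failure is re-raised once the attempts are used up.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise
            sleep(delay)   # give the slave time to start the application
            delay *= 2


# Stand-in for a service that is still starting up on the slave:
# the first two attempts are refused, the third succeeds.
state = {"calls": 0}


def flaky_connect():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionRefusedError("service still starting")
    return "connected"


waits = []
print(connect_with_retry(flaky_connect, sleep=waits.append))
```

Because the virtual IP address moves with the application, a client that retries like this reconnects to the same address and never needs to know which physical server answered.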

Once you get to grips with the product, it's not difficult to comprehend. There's not actually a great deal to do in the GUI – you configure your servers and pretty much leave everything alone unless you're swapping things in and out. This doesn't mean that there isn't a lot to understand, though, and SteelEye prefers you to purchase a software-and-services bundle through one of their distributors because they've seen it all before and will get the job done in a fraction of the time and with far less hassle than trying to bumble through it on your own (particularly if you're looking at off-site replication, as you need to think hard about WAN throughput requirements). This won't break the bank – the distributor we spoke to when writing this review told us that a couple of days isn't atypical for a modest setup – and particularly if you want to support home-built applications, you'll probably want to enlist the expertise of a reseller's developers to build you the necessary components for your bespoke software.

While we're on the subject of not breaking the bank, it's also worth bearing in mind the potential for saving money on software licences. Getting failover out of major applications like Oracle and Exchange themselves requires the Enterprise Edition, and usually extra server licences; with LifeKeeper you can stick with the cheaper version and let LifeKeeper do the failover for you. So even if you do have to pay for a few days' consultancy, who cares?

We really, really like LifeKeeper. If you've never heard of SteelEye (and we hadn't) it may come as a surprise that, for instance, IBM resells their software with its servers, and that LifeKeeper is SAP's official failover product for mySAP. The software itself is ridiculously cheap so if you need to pay a distributor to help you install it, well, frankly it doesn't really matter.

OUR VERDICT

As we've said, go through a distributor – at least for your first installation when everything's new to you. Once you've been shown it, it's pretty simple, but the learning curve can be steep and it's best to consult someone who's done it loads of times before.