I got a phone call yesterday (Sunday) from a company for which I did a server installation a few months ago. The server was a pair of Dell PowerEdges sharing an external StorEdge disk array, configured as a clustered pair with SQL Server running on top; trouble is, they'd moved all their kit across London from one datacentre to another, and they couldn't get the disk array to work.
Now, many storage arrays have the RAID capability built in. That is, you configure the array to tell it what disks you want in what RAID configurations, and it presents these as virtual disks to the hosts - so you just need a normal SCSI adaptor in the host, as the RAID capability is offloaded to the external array.
Not so with this Dell box, where the disk array is just that - a box of disks with a pair of SCSI interfaces. The RAID functionality has to be done on the servers, which therefore need RAID adaptors in them. These particular adaptors were Dell's own cluster-capable ones (two servers on a single disk array means trouble with non-cluster-aware adaptors, obviously). Setting them up's easy:
1. With the disk array disconnected, go into the BIOS of the RAID adaptor on each server and put it in "cluster" mode.
2. Change the SCSI ID of one of the adaptors (they all ship with the same default setting, and duplicate IDs on a SCSI bus are bad news).
3. Connect the disk array to server 1 and fire them up.
4. Configure the RAID volumes.
5. Fire up server 2 and tell it to figure out the RAID configuration by requesting it from the other server.
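Just for illustration, the bring-up sequence above can be sketched as a tiny Python model. Everything here is hypothetical (the class and the names are mine, not Dell's) — the real work happens in the adaptor's BIOS, not in code — but it captures the ordering and the duplicate-ID rule:

```python
# Hypothetical model of the cluster bring-up steps described above.
# Real-world setup is done in the RAID adaptor BIOS, not in code.

class RaidAdaptor:
    def __init__(self, name, scsi_id=7):
        self.name = name
        self.scsi_id = scsi_id          # all cards ship with the same default ID
        self.cluster_mode = False
        self.volumes = []

def validate_bus(adaptors):
    """Duplicate SCSI IDs on a shared bus are bad news - refuse them."""
    ids = [a.scsi_id for a in adaptors]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate SCSI IDs on the shared bus: %r" % ids)

# 1. With the array disconnected, put both adaptors in cluster mode.
srv1, srv2 = RaidAdaptor("server1"), RaidAdaptor("server2")
srv1.cluster_mode = srv2.cluster_mode = True

# 2. Change one adaptor's SCSI ID so the two don't clash on the bus.
srv2.scsi_id = 6
validate_bus([srv1, srv2])

# 3 & 4. Connect the array to server 1 and configure the RAID volumes.
srv1.volumes = ["RAID5-vol1", "RAID5-vol2"]

# 5. Server 2 requests the configuration from its peer.
srv2.volumes = list(srv1.volumes)
```

The duplicate-ID check is the important bit: skip step 2 and `validate_bus` fails, which is exactly the mess two identically-configured cards would make on a real shared bus.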
Why am I telling you all this? Well, when the RAID adaptors find something wrong, they scream out with a very piercing alarm (for which we've yet to find the "off" button). So it was here: hence the rather worried client with a very dead database cluster and less than a day to make it un-dead.
Finding out problem one was easy. Each server has a two-port SCSI card and a two-port RAID card. They'd inadvertently plugged the storage array into the SCSI card, not the RAID one, and so two things happened:
1. The SCSI card said it could see three unusable disks (as you'd expect - they were part of a hardware-configured RAID5 trio, and thus weren't natively readable by Windows).
2. The RAID card was screaming because it thought all three disks had failed.
Swapping the cables into port 1 of the RAID card helped a bit. This time the SCSI card saw no disks (correctly), but the RAID card still said it had three failed disks. Swapping to port 2, it said it had three offline disks on port 2 and three failed ones on port 1.
Then came the lightbulb moment, and we swapped server 1's cable into server 2, and vice versa. On firing up server 1, we were told that we had three offline disks on port 1. So we crossed our fingers and told the RAID card BIOS to re-read its configuration (the card stores metadata about the RAID configuration on the disks themselves, so that if the adaptor goes "pop" and you replace the card, the new one can read its settings from that metadata). The "offline" switched to "online", the alarm shut up (thank goodness) and when we rebooted, there were the two logical volumes and all the SQL Server database files. Getting server 2 to work was a simple case of starting up, telling it to re-read its configuration from the disk-based metadata, and hitting "OK".
So what had happened? Well, it goes like this:
1. By booting the server with the plug in the wrong hole, the RAID card had decided all its disks had failed (no surprise there - they weren't connected to it!). So there was no way it'd restart without manual intervention anyway.
2. The RAID card in each server knew that it had three disks with certain SCSI IDs, along with other devices such as the SCSI adaptors in the disk array. Because it had been inadvertently switched to the other interface on the disk array, it was confused when it saw unexpected things, and so decided that the disks weren't just offline, but had failed.
3. Plugging everything back into the right holes made the RAID card decide that it saw three disks, but that they were offline because it had had problems (see above) and was unable to rely on them.
4. All we had to do to get the disks online was tell the card: "Yes, you're connected correctly - re-read your metadata and carry on".
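The state transitions in that sequence can be sketched roughly like this. To be clear, this is my guess at the behaviour the card exhibited, as a Python model — not real Dell firmware logic:

```python
# Rough model of the disk-state behaviour described above: missing disks
# get marked failed, reappearing disks become offline (untrusted), and a
# manual metadata re-read brings them online. Guesswork, not PERC code.

FAILED, OFFLINE, ONLINE = "failed", "offline", "online"

class RaidCard:
    def __init__(self, expected_disks):
        self.state = {d: ONLINE for d in expected_disks}

    def scan_bus(self, visible_disks):
        visible = set(visible_disks)
        for d in self.state:
            if d not in visible:
                # Expected disk missing (plug in the wrong hole): failed.
                self.state[d] = FAILED
            elif self.state[d] == FAILED:
                # Disk is back, but the card no longer trusts it.
                self.state[d] = OFFLINE

    def reread_metadata(self):
        # Operator says "you're cabled correctly - carry on": trust the
        # on-disk metadata and bring offline disks back online.
        for d, s in self.state.items():
            if s == OFFLINE:
                self.state[d] = ONLINE

card = RaidCard([0, 1, 2])
card.scan_bus([])            # booted with the array on the wrong card
card.scan_bus([0, 1, 2])     # recabled correctly: disks now offline
card.reread_metadata()       # manual intervention: back online
```

The key design point the story illustrates: a disk that vanishes and then reappears never goes straight back to "online" on its own — it sits at "offline" until a human vouches for the cabling.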
Now, having the RAID functionality in the host is a good idea. If it's built into the disk array and goes "pop", you're stuffed; if it's built into the host and goes "pop", the RAID card in the other host just carries on. And reduced complexity in the storage array equals a lower likelihood of problems in general. The downside is that you have to be very, very careful what you do, or you're likely to break something. And as you only do clustering when you really care about that system, the chances are that whatever you break will be business-critical.
In this case, we were lucky: the client was sensible enough not to fiddle with anything, and fortunately none of the incorrect configurations corrupted the disks' contents, so we didn't have to reformat and rebuild the data.
The moral of the story, though: whenever you move equipment, treat every connection as critical, document it to death, perhaps even put coloured stickers on your plugs and sockets to make sure they go back correctly. In this case it simply cost the client a hefty cheque for my time, and me a drive up and down the M11 in the worst rain we've had for months. But given that they had just a day to do their entire data centre move (a dozen or so servers, plus assorted telecomms and networking kit), it could well have been "penalty clauses at dawn" for their clients when the services wouldn't work.