If you're the captain of a Boeing 747 you have the reassurance, if the red lights and klaxons start up, that you can pull out your checklist and know what to do. The same checklist has already told you how to get the thing in the air safely. It has also, hopefully, told you what to do from time to time in order to see problems developing before they get serious. This, then, is the first in an irregular series of checklists for the network manager. It covers what to do when you're not panicking, what to do to reduce your panic if something goes wrong, and how to deal with the panic when it creeps up and kicks you.
1. When you build the systems
Action | Reason
Define a standard hardware platform. | You want to be able to restore data quickly, which means not having to figure out which boot disks, network drivers and tape software you need.
Procure appropriate disaster-proof storage for your backups. | When your building has burned down and then been flooded by the fire service, you'd like the backups to have survived.
Consider the impact of backup traffic on the network and implement a BAN (Backup Area Network) if necessary. | You need to be sure that emergency restorations don't have excessive performance impact on the rest of the business.
Ensure a ready supply of replacement equipment or parts, or keep spares in a suitable offsite location. | To restore a server that's broken, you first need the hardware to restore to.
Document the restoration processes and train the staff. | You need to be sure that everyone who might need to do a restore knows how to, or has access to usable, comprehensive instructions.
Define a tape rotation process that gives the best compromise between offsite security and accessibility. Document this process so that the location of a tape is quick to look up. | Not all tapes should be stored next to the server, in case of disaster, but if you take them offsite they should be suitably accessible in a rush.
Implement full backups wherever capacity and time permit. If you use incremental backups, back up changes since the last full backup, not since the last incremental. | This restricts the restoration process to a maximum of two tape sets, not an arbitrary number depending on the day of the week.
Ensure all information is available but secure. For example, store passwords in the MD's safe, make it known that's where they are and define who is allowed to have them and under what circumstances. | If you're the only one who knows the system passwords, you need to think about what happens if you're away when disaster strikes.
Turn on the verification option if the backup cycle has the time to spare. | You don't need duff data on the tape when you come to restore from it.
Ensure that offsite versions of all restoration media, software and instructions are easily available. | If the building burns down, you can do without the only server boot CD going up with it.
Arrange for emergency equipment and/or temporary premises to be available, if business requirements so dictate. | If the loss of business due to downtime is significant, you can mitigate this by employing a third party to park a truck full of suitable kit in your car park.
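To make the "maximum of two tape sets" point concrete, here is a minimal Python sketch comparing the two schemes. Backing up changes since the last full backup is usually called a differential backup. The function name `restore_sets`, the dates and the assumption of one backup every night are all invented for illustration; none of this comes from any particular backup product.

```python
from datetime import date, timedelta


def restore_sets(last_full, failure_day, scheme):
    """Return the dates of the backup sets needed to restore to failure_day.

    scheme is "differential" (each night backs up changes since the last
    full backup) or "incremental" (changes since the previous night's backup).
    Assumes one backup is taken every night after the full backup.
    """
    if failure_day < last_full:
        raise ValueError("failure_day is before the last full backup")
    nights = [last_full + timedelta(days=n)
              for n in range(1, (failure_day - last_full).days + 1)]
    if scheme == "differential":
        # The full set plus only the most recent differential.
        return [last_full] + nights[-1:]
    if scheme == "incremental":
        # The full set plus every incremental taken since, in order.
        return [last_full] + nights
    raise ValueError(f"unknown scheme: {scheme}")


full = date(2024, 1, 5)     # hypothetical Friday full backup
crash = date(2024, 1, 11)   # failure after the following Thursday's backup

print(len(restore_sets(full, crash, "differential")))  # 2
print(len(restore_sets(full, crash, "incremental")))   # 7
```

The differential scheme never needs more than two sets, whereas the incremental count grows with every night since the last full backup. The price is that each differential is larger than the one before it.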
2. Whenever you buy a new server
Action | Reason
Ensure it conforms with the standard hardware platform. If it doesn't, modify the standard in as future-proof a way as possible. | If you have a standard platform, you can build a standard boot CD for all your kit.
Test the standard restoration process. | Manufacturers sometimes change the spec of their equipment and you may need to update drivers, etc., for the boot disk to work.
Consider the new equipment's impact on your disaster recovery plan. | Do you need the emergency truckful of kit to include an extra server?
3. Once a month
Action | Reason
Practise the restoration process with as many IT staff as possible, without warning them in advance. Use more than one of the standard restore/boot disks. | This approach ensures that people can react correctly in a crisis. It also ensures that all tapes and information are available, boot disks work and so on.
Review the backup process for capacity planning purposes. | It's better to see spare capacity dwindle over time and react before it becomes a crisis.
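One way to watch spare capacity dwindle is to note the size of the full backup at each monthly review and extrapolate. The sketch below fits a straight line through those readings. The function name `months_until_full` and the figures are made up for illustration, and real growth is rarely this tidy, so treat the result as a prompt to investigate, not as a forecast.

```python
def months_until_full(sizes_gb, capacity_gb):
    """Estimate months until the backup outgrows capacity_gb.

    Fits a least-squares straight line through one size reading per month
    and extrapolates it. Returns None if the backups are not growing.
    """
    n = len(sizes_gb)
    if n < 2:
        raise ValueError("need at least two monthly readings")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(sizes_gb) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, sizes_gb))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Month index at which the fitted line reaches capacity, counted
    # from the most recent reading (index n - 1).
    return max(0.0, (capacity_gb - intercept) / slope - (n - 1))


readings = [300.0, 320.0, 340.0, 360.0]   # made-up monthly sizes in GB
print(months_until_full(readings, capacity_gb=400.0))  # 2.0
```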
4. Once a week
Action | Reason
Verify the data on between 10% and 20% of your tapes, more if time permits. | Even if your backup software has verification turned on, this provides a useful independent check.
Apprise staff and update documentation to reflect any changes in process, hardware, tape location procedure, etc. | Staff must know how the process works and where everything is. It's essential that changes do not go unnoticed, particularly if, for example, you've done the monthly administrator password change and the records in the MD's safe need to be updated.
Ensure that all recovery media and instructions are up-to-date and offsite copies are easily available. | If you change the boot CD, you should ensure that all old ones are dumped and new ones produced.
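Choosing the weekly 10% to 20% at random, instead of always checking the same convenient tapes, gives every tape a chance of being tested over time. A minimal Python sketch follows; the function name `pick_tapes_to_verify` and the tape labels are hypothetical.

```python
import math
import random


def pick_tapes_to_verify(tape_labels, percent=10, rng=None):
    """Choose a random sample of tapes for this week's verification run.

    percent is the share of the library to verify (the checklist suggests
    between 10 and 20). At least one tape is chosen if any exist.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    labels = sorted(tape_labels)
    if not labels:
        return []
    count = max(1, math.ceil(len(labels) * percent / 100))
    rng = rng or random.Random()
    return sorted(rng.sample(labels, count))


library = [f"TAPE-{n:03d}" for n in range(1, 41)]   # hypothetical labels
this_week = pick_tapes_to_verify(library, percent=15)
print(len(this_week))   # 6
```

The sample only says which tapes to check; the verification itself is still done with whatever compare or test-restore facility your backup software provides.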
5. Once a day
Action | Reason
Cycle tapes appropriately, ensuring that movement between onsite and offsite locations is handled correctly, as per the documentation. | This approach ensures that people can react correctly in a crisis. It also ensures that all tapes and information are available, boot disks work, and so on.
Keep the documentation up to date. | Ensure that whoever cycles the tapes also signs the boxes to show that it has been done.
Check for faults and react as appropriate. | If the 'clean drive' light has come on, use the cleaning tape. If the server log reports intermittent SCSI errors, investigate them before the entire backup fails next weekend.
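The daily sign-off and the "quick to look up" tape location can live in the same record. The Python sketch below assumes a simple append-only register; the class names, field names and sample entries are invented for illustration, and a paper log or spreadsheet that captures the same fields would serve just as well.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TapeMove:
    tape: str        # label written on the tape box
    moved_on: date
    location: str    # e.g. "server room", "offsite store"
    signed_by: str   # whoever cycled the tape


class TapeRegister:
    """Append-only record of tape movements."""

    def __init__(self):
        self._moves = []

    def record(self, move):
        self._moves.append(move)

    def where_is(self, tape):
        """Return the most recently recorded move for a tape, or None."""
        for move in reversed(self._moves):
            if move.tape == tape:
                return move
        return None


register = TapeRegister()
register.record(TapeMove("MON-1", date(2024, 1, 8), "server room", "asmith"))
register.record(TapeMove("MON-1", date(2024, 1, 9), "offsite store", "bjones"))

latest = register.where_is("MON-1")
print(latest.location, latest.signed_by)   # offsite store bjones
```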