Disk drives are mechanical, and so inherently unreliable. ATA-class drives are less reliable than SCSI or Fibre Channel drives because they are built for non-24x7 duty cycles, but the actual duty cycles they can withstand are unknown. All you can know is that they will fail. As Stuart Gilks, systems engineering director, Northern Europe, for Network Appliance, says, "Disk drives break. So deal with it."

What can you do to deal with it? A crashed drive cannot be repaired, so 'dealing with it' means acting before the drive fails: detecting a failing drive and copying the data off it while it still works.

If a drive develops bad blocks, areas from which data cannot be read, the data can be recovered as long as the drive has RAID protection. However, Gilks explains that RAID reconstruction can take a fair amount of time, particularly with high-capacity drives. As we head towards 500GB ATA drives, a second drive failure while RAID reconstruction from an initial failure is still under way becomes more likely.

Network Appliance's double-parity RAID scheme can cope with a second drive failure during reconstruction, whereas RAID 5 cannot.
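To put rough numbers on the rebuild-window risk, here is a minimal sketch. It assumes independent drive failures at a constant rate (an exponential model), and the rebuild rate, drive count and MTBF figure are illustrative assumptions, not vendor data. Real failures in one array are often correlated, so treat the results as a lower bound at best.

```python
import math


def rebuild_hours(capacity_gb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite one full drive at a sustained rebuild rate."""
    return capacity_gb * 1000 / rebuild_mb_per_s / 3600


def second_failure_probability(surviving_drives: int,
                               window_hours: float,
                               mtbf_hours: float) -> float:
    """Chance that at least one surviving drive fails during the window,
    assuming independent failures at a constant rate."""
    return 1 - math.exp(-surviving_drives * window_hours / mtbf_hours)


# Hypothetical array: 7 surviving drives, 20 MB/s effective rebuild rate,
# 500,000-hour MTBF. All three values are assumptions for illustration.
for capacity_gb in (80, 250, 500):
    hours = rebuild_hours(capacity_gb, 20)
    risk = second_failure_probability(7, hours, 500_000)
    print(f"{capacity_gb} GB: {hours:.1f} h rebuild, "
          f"{risk:.4%} chance of a second failure")
```

Whatever the absolute figures, the exposure grows with capacity because the rebuild window grows with it.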

We are heading towards a situation where bigger drives mean more data loss from drive failures. The good news is that drive failure can be predicted.

Predicting drive failure
Disk manufacturers have added S.M.A.R.T. features to their drives. The acronym stands for Self-Monitoring, Analysis and Reporting Technology. With it, drives can monitor aspects of their own status and report on them. It has become an industry standard. Whether it is used in your direct-attached storage or drive arrays is another matter.

While the drive is spinning, diagnostic checks are made. Anything that is non-standard is noted and, if it persists, the diagnostic system is triggered into sending an alert. Things that are monitored include disk spin speed, sector-level faults, a need for recalibration, spin-up time, head-to-disk distance, the temperature of the drive, and various aspects of the drive motor, the media and the servo mechanisms.
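The "note it, then alert if it persists" behaviour can be sketched as follows. This is not how any particular drive firmware implements it; the class name, the acceptable range and the three-sample persistence rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AttributeMonitor:
    """Tracks one monitored value and alerts only on persistent anomalies."""
    name: str
    low: float             # lowest acceptable reading (assumed)
    high: float            # highest acceptable reading (assumed)
    persistence: int = 3   # consecutive bad samples needed before alerting
    _streak: int = 0       # current run of out-of-range samples

    def observe(self, value: float) -> bool:
        """Record one sample; return True when an alert should be raised."""
        if self.low <= value <= self.high:
            self._streak = 0   # a normal reading clears the anomaly
            return False
        self._streak += 1
        return self._streak >= self.persistence


# Hypothetical temperature readings in degrees Celsius.
temperature = AttributeMonitor("temperature_c", low=5, high=55)
readings = [41, 58, 43, 57, 59, 60]
alerts = [temperature.observe(r) for r in readings]
print(alerts)  # only the final reading completes a run of three bad samples
```

Requiring a run of bad samples keeps a single transient spike from raising a false alarm, at the cost of a slightly later warning.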

The occurrence of errors can be noted and compared to standard performance parameters encoded in the diagnostic system. Suppose the drive begins to take longer to reach spin speed, and needs more retries to attain full rpm. That can indicate that the drive's bearings or motor are likely to fail.
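A comparison against a stored baseline might look like the sketch below. The function name, the 15 percent tolerance and the five-sample window are hypothetical values, not figures from any drive specification.

```python
from statistics import mean


def spin_up_degrading(samples_ms, baseline_ms, tolerance=0.15, window=5):
    """Return True if the mean of the most recent `window` spin-up times
    exceeds the baseline by more than `tolerance` (as a fraction)."""
    if len(samples_ms) < window:
        return False  # not enough history to judge a trend
    recent = mean(samples_ms[-window:])
    return recent > baseline_ms * (1 + tolerance)


# Hypothetical spin-up times in milliseconds against a 4,000 ms baseline.
healthy = [4000, 4100, 3950, 4050, 4020]
worn = [4100, 4500, 4800, 5100, 5300]
print(spin_up_degrading(healthy, baseline_ms=4000))
print(spin_up_degrading(worn, baseline_ms=4000))
```

Averaging over a window rather than reacting to one slow start reflects the point that it is a sustained drift away from the encoded parameters that matters.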

Another example would be an increased need for error correction on read data. This could indicate platter surface contamination (restricted to a few disk blocks) or a read head problem (applies to all blocks on a platter).
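The distinction between the two causes rests on how widely the corrected errors are spread. A rough sketch of that reasoning follows; the function and both cut-off fractions are assumptions for illustration, and a real diagnostic would weigh more evidence than this.

```python
from typing import Mapping


def classify_read_errors(errors_by_block: Mapping[int, int],
                         blocks_sampled: int,
                         localized_fraction: float = 0.05,
                         widespread_fraction: float = 0.5) -> str:
    """Guess a likely cause from how many sampled blocks on one platter
    surface needed error correction. Thresholds are illustrative."""
    affected = sum(1 for count in errors_by_block.values() if count > 0)
    if affected == 0:
        return "healthy"
    share = affected / blocks_sampled
    if share <= localized_fraction:
        return "possible surface contamination"  # confined to a few blocks
    if share >= widespread_fraction:
        return "possible read head problem"      # seen across the platter
    return "inconclusive"


# Hypothetical counts of corrected reads, keyed by block number.
print(classify_read_errors({10: 4, 11: 7}, blocks_sampled=1000))
print(classify_read_errors({b: 1 for b in range(900)}, blocks_sampled=1000))
```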

The S.M.A.R.T. system can only detect about 70 percent of likely failures. That might not seem very smart, but detecting 70 percent of failures is far better than detecting none.

Smart disk vendors
- Hitachi GST has a Drive Fitness Test which uses S.M.A.R.T. diagnostics built into Hitachi drives.
- Maxtor drives include S.M.A.R.T. technology.
- Seagate's SeaTools is its diagnostic suite for Seagate drives. It comes in desktop and enterprise versions and there is even a SeaTools Online version.
- Western Digital has its Data Lifeguard, which is S.M.A.R.T.-enabled.

What about drive array vendors? All good arrays will come with diagnostic monitoring that uses S.M.A.R.T. facilities. Some examples:

- EMC has its CLARalert suite for Clariion arrays,
- Dell's Fibre Channel PowerVault 660F is S.M.A.R.T.-enabled,
- HP's Smart Array 6400 Controller has S.M.A.R.T. technology features,
- LSI Logic's Global Array Manager for RAID arrays has it too.

Often the technology is several layers down inside an overall diagnostic suite.

What can you do if you have a JBOD or small server with internal drives?
SANtools provides an application that it claims is the only S.M.A.R.T. disk monitoring software to support SCSI, Fibre Channel, IDE and SSA peripherals on UNIX and Windows platforms. Techworld has reviewed it.

Shareware S.M.A.R.T. application software is also available (search Google for Ariolic), but it is probably not of interest to enterprise users.

With the increasing use of SATA and ATA drives in nearline or secondary storage applications, the use of S.M.A.R.T. monitoring to help prevent disk crashes and subsequent data loss becomes more important.