Hard disks are far less reliable than disk vendors routinely claim, a study by Carnegie Mellon University has suggested.
The research, presented at the recent 5th USENIX Conference on File and Storage Technologies in San Jose, found that annual replacement rates in a sample of 100,000 drives used in high-performance environments were between two and four percent. That contrasts with the industry's official mean time between failure (MTBF) figures, which are said to work out at a failure rate of 0.88 percent per annum. On some systems, failure rates were observed to be up to 13 percent.
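For readers who want to check the arithmetic, the sketch below shows one common way an MTBF figure is turned into an annual failure rate: divide the hours in a year by the MTBF. It assumes a drive that runs around the clock and a datasheet MTBF of 1,000,000 hours. That MTBF value is an assumption chosen for illustration, not a figure given in this article, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_failure_rate_percent(mtbf_hours: float) -> float:
    """Approximate annual failure rate, in percent, for a drive powered on all year."""
    return HOURS_PER_YEAR / mtbf_hours * 100


def implied_mtbf_hours(annual_rate_percent: float) -> float:
    """Inverse of the above: the MTBF implied by an annual rate given in percent."""
    return HOURS_PER_YEAR / (annual_rate_percent / 100)


# Assumed datasheet MTBF of 1,000,000 hours (illustrative, not from the article).
print(round(annual_failure_rate_percent(1_000_000), 2))  # 0.88

# MTBF implied by the replacement rates reported in the study.
for observed_percent in (2.0, 4.0, 13.0):
    print(observed_percent, round(implied_mtbf_hours(observed_percent)))
```

Under the same simple model, a two percent annual replacement rate would correspond to an MTBF of roughly 438,000 hours, less than half of the assumed datasheet value. This is only a rough linear approximation, and it treats every replacement as a failure, which the study itself does not claim.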
The study also found no evidence that expensive Fibre Channel (FC) drives were any less likely to fail than the cheaper and slower Serial ATA (SATA) drives.
Garth Gibson of Carnegie Mellon emphasised that the study didn’t claim to track real drive failures, only those instances where the customer believed the drive needed to be replaced. The University has not published any information relating the replacement rates to specific vendors.
Adopting a cautious tone, he also repeated vendor claims that as many as half of the drives returned to them turned out to have no problems. That statistic, however, comes from the vendors themselves and can’t be verified.
Drive vendors have mounted a defence of MTBF statistics, shielding themselves by claiming drive failure is dependent on a wide variety of factors from which it is hard to draw general conclusions. Interestingly, according to a Google study presented at the same USENIX conference, one of these factors might not, as is commonly assumed, be heat build-up.
In a study of 100,000 ATA and SATA drives used in its data centres, Google discovered that temperature appeared not to have a bearing on failure rates. This is a surprising finding, as it is typically assumed that keeping temperatures down will reduce failure rates across all computing components.
"That doesn't mean there isn't one," said Luiz Barroso, an engineer at Google and co-author of the paper, on the effect temperature might be playing. But it does suggest "that temperature is only one of many factors affecting the disk lifetime," he said.
Drive vendors have played down the findings of the two studies. "The conditions that surround true drive failures are complicated and require a detailed failure analysis to determine what the failure mechanisms were," said a spokesperson for Seagate Technology, in an e-mail sent to Computerworld US.
"It is important to not only understand the kind of drive being used, but the system or environment in which it was placed and its workload."
At Hitachi Global Storage Technologies, the tone was similar. "Regarding various reliability rate questions, it's difficult to provide generalities. We work with each of our customers on an individual basis within their specific environments, and the resulting data is confidential," said a spokesperson.
It's the second blow that Google has aimed at hard drive vendors recently. Just two weeks ago, Google found that the SMART diagnostic tool used to determine faults could only predict half of all disk failures.
Robert L. Scheier of Computerworld US contributed to this story.