Definition: Mean time between failures (MTBF) and the related mean time to failure (MTTF) are measures of hardware reliability, usually expressed in hours. They indicate in statistical terms the working lifetime of a given component: The higher the figure, the more reliable the product.
It's a cruel world out there in the data centre. Nothing lasts forever, especially not mechanical devices with fast-moving parts, such as disk drives and printers. It would be very useful if we could predict when something might break or, at the very least, determine which of two similar products would be less likely to break in a given period. The answer is MTBF, short for mean time between failures, and the closely related MTTF, short for mean time to failure. Both are measures of reliability that are defined statistically as the average number of hours a component, assembly or system will operate before it fails.
MTTF and MTBF are sometimes used interchangeably, but they are in fact different. MTTF refers to the average (the mean, in arithmetic terms) time until a component fails, can't be repaired and must therefore be replaced, or until the operation of a product, process or design is disrupted. MTBF is properly used only for components that can be repaired and returned to service. This introduces a couple of related abbreviations that are occasionally encountered: MTTR (mean time to repair) and, less commonly, MTTD (mean time to diagnose). With those notions in mind, we could say that MTBF = MTTF + MTTD + MTTR.
MTBF sounds simple: the total time measured divided by the total number of failures observed. For example, let's wring out a new generation of 2.5-in. SCSI enterprise hard drives. We run 15,400 initial units for 1,000 hours each (thus our tests take a little less than six weeks), and we find 11 failures. The MTBF is (15,400 x 1,000) hours/11, or 1.4 million hours. (This is not a hypothetical MTBF; it represents current drive technology in 2005.)
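The arithmetic above can be reproduced in a few lines of Python. This is only a minimal sketch of the "total time divided by total failures" calculation; the function name mtbf_hours and its parameters are illustrative and do not come from any standard or library.

```python
def mtbf_hours(units: int, hours_per_unit: float, failures: int) -> float:
    """Return total accumulated operating hours divided by observed failures.

    Hypothetical helper: assumes every unit ran for the same number of hours.
    """
    if failures <= 0:
        raise ValueError("at least one observed failure is needed to compute MTBF")
    return units * hours_per_unit / failures


# The drive test from the text: 15,400 units, 1,000 hours each, 11 failures.
test_mtbf = mtbf_hours(15_400, 1_000, 11)
print(f"{test_mtbf:,.0f} hours")  # 1,400,000 hours
```

Note that the result depends only on the accumulated unit-hours, so many drives tested briefly and a few drives tested for a long time can produce the same figure.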
What does this calculation really mean? An MTBF of 1.4 million hours, determined in six weeks of testing, certainly doesn't say we can expect an individual drive to operate for 159 years before failing. MTBF is a statistical measure, and as such, it can't predict anything for a single unit. We can use that MTBF rating more accurately, however, to calculate that if we have 1,000 such drives operating continuously in a data centre, we can expect one to fail every 58 days or so, for a total of perhaps 19 failures in three years.
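The fleet figures above follow from dividing the MTBF by the number of drives in service. The sketch below reproduces them under assumptions the text leaves implicit: a constant failure rate, continuous operation, failed drives replaced promptly, and a 365-day year. The function and constant names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def fleet_failure_estimate(mtbf_hours: float, fleet_size: int, years: float):
    """Return (days between failures across the fleet, expected failures).

    Assumes a constant failure rate and that failed units are replaced,
    so the fleet size stays the same for the whole period.
    """
    days_between_failures = mtbf_hours / fleet_size / 24
    expected_failures = fleet_size * years * HOURS_PER_YEAR / mtbf_hours
    return days_between_failures, expected_failures


# 1,000 drives rated at 1.4 million hours MTBF, observed for three years.
days, failures = fleet_failure_estimate(1_400_000, 1_000, 3)
print(f"one failure roughly every {days:.0f} days")  # 58 days
print(f"about {failures:.0f} failures in 3 years")   # 19 failures
```

These are averages over the whole fleet. Like the MTBF rating itself, they say nothing about when any particular drive will fail.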
The MTBF figure for a product can be derived from laboratory testing, actual field failure data or prediction models such as MIL-HDBK-217 (the Military Handbook for Reliability Prediction of Electronic Equipment, published by the US Department of Defense).
MIL-HDBK-217 contains failure-rate models for various parts used in electronic systems, such as integrated circuits, transistors, diodes, resistors, capacitors, relays, switches and connectors. These failure-rate models are based on a large amount of field data that was analyzed and simplified by the Reliability Analysis Center and Rome Laboratory at Griffiss Air Force Base in Rome, N.Y.
Kay is a Computerworld contributing writer in Worcester, Mass.