There's always a reason why things break in IT, and the powers-that-be can usually find someone to blame, be it a data centre staff member, an OEM, a systems integrator or a third party service provider.
An offender often leaves clear fingerprints showing that a component was mislabelled or a process wasn't updated. In other cases, an incident may be the result of oversights by multiple parties. But with the possible exception of a meteor strike, there's always someone to blame for a data centre problem.
The majority of incidents are blamed on outside parties such as contractors or vendors, with a sizeable percentage of fault assigned to data centre operations staff, according to data compiled by the Uptime Institute.
The findings of the Uptime Institute, which has been collecting incident data from its data centre customers since 1994, may draw criticism as few internal IT operators or their vendors take blame easily.
Kicking the vendor
Vendors may be blamed most because they are usually willing to take a bullet for a problem even if they feel the genesis is an internal operations oversight.
"The vendor gets caught up in a sensitive spot," said Ahmad Moshiri, director of power technical support at Emerson Network Power Liebert Services, because it doesn't want to put the client, a facilities manager, in a difficult position. "It's very touchy," he said.
Uptime Institute members, data centre managers from multiple industries, agree to voluntarily report abnormal incidents. The institute has about 5,000 abnormal incidents in its database. Such incidents are defined as any event in which a piece of equipment or infrastructure component did not perform as expected.
The data compiled by Uptime found that 34% of the abnormal incidents in 2009 were attributed to operations staff, followed by 41% in 2010, and 40% last year.
External parties that work on the customer's data centre or supply equipment to it, including manufacturers, vendors, factory representatives, installers, integrators and other third parties, were responsible for 50% to 60% of the incidents reported in those years, according to Uptime.
Some 5% to 8% of the incidents each year were tied to things like sabotage, outside fires, other tenants in a shared facility and various odd anomalies. About 10% of all the reported abnormal incidents resulted in an outage ranging from a system losing power to a data centre going out.
Don't hire stupid
The Uptime data shows that internal staff are responsible for a majority (60%) of those incidents, which can include outages and data loss incidents.
Although the internal staff gets the blame, "it's the design, manufacturing, installation processes that leave banana peels behind and the operators who slip and fall on them," said Hank Seader, managing principal research and education at Uptime.
To Seader's point about banana peels, David Filas, a data centre engineer at healthcare provider Trinity Health, described a situation where a fire system vendor, performing routine maintenance on a fire suppression system in one data centre, triggered an emergency power off (EPO).
Ordinarily, this would not have been a problem, but an error in the construction of the EPO circuit let the signal through, which resulted in an outage. It turned out that the EPO bypass circuit was not constructed to the as-built drawing when the centre was built years earlier.
"The designs and actions of engineers, architects, and installation contractors can have latent effects on operations long after construction," said Filas.
Filas believes that "outside forces can make or break the data centre just as easily as internal forces". But he also sees risk levels rising, particularly as data centres rely more on external suppliers.
Electrical contractors, for instance, may not understand the specific needs of a data centre. "We are frequently questioned on why we provide redundant power to racks," said Filas.
Jeff Pederson, manager of data recovery operations at Kroll Ontrack, looks at the root causes of data loss and sees problems caused by both internal staff and external providers. But, he added, service people attempting to get equipment up and running "tend to cause a lot of the damage we see".
"The sole goal [of some service techs] is to get that equipment working and operational; it is not necessarily to protect the data that the customer has," said Pederson.
Pederson said such attitudes often lead to this complaint from users: "My system works now but my data is all gone."
Down and out
Data losses and outages are about the worst things that data centres can deal with. In most years, Uptime members reported about two dozen outages; last year the number declined to seven.
The drop in outages coincided with the lowest level of data centre equipment installations since 2008, said Seader. He also credits an improved focus on processes and procedures by the reporting companies.
Emerson's Moshiri cites process and procedural issues as a leading cause of problems, particularly when multiple vendors are involved and a high degree of coordination is needed.
Critical pieces of information, such as power diagrams or even the physical location of equipment, are often out of date or incomplete, said Moshiri. Maintenance is another issue, he added. Facility managers may disregard an OEM's recommendation that maintenance on a particular device be conducted, for instance, twice a year.
Steve Fairfax, president of MTechnology, applies Probabilistic Risk Assessment (PRA), a technique used in the nuclear industry, to IT equipment. His analysis concluded that too much maintenance is a major source of problems.
A PRA model takes all the available data about individual components and combines it in a mathematical model that represents how the entire system works, whether that system is a nuclear power plant or a data centre.
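The idea of combining component data into a model of the whole system can be illustrated with a much simpler calculation than a full PRA: a reliability block sketch in Python. It is not Fairfax's model. The function names and availability figures below are hypothetical, and the sketch assumes component failures are independent.

```python
from math import prod


def series(*availabilities):
    """Every component must work, so the availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities):
    """Redundant components: the group fails only if all of them fail."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical component availabilities, for illustration only.
utility = 0.999
generator = 0.99
ups = 0.9995

# Utility and generator back each other up; the UPS sits in series after them.
power_source = parallel(utility, generator)
power_path = series(power_source, ups)

print(f"Power path availability: {power_path:.6f}")
```

A real PRA goes further, using fault trees and event trees and accounting for human error, including mistakes made during maintenance.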
Fairfax says his mathematical models make the case that the amount of maintenance in data centres "is grossly excessive by a factor of 10" and is responsible for a great deal of downtime. "Messing with perfectly functioning equipment is highly profitable," said Fairfax.
Fairfax said that if you want to take data centres to the next level of reliability and have them crash as infrequently as airplanes, "then we have to do the same things that jet airplanes do," and train data centre operators in simulators.
It also means developing different maintenance criteria. "More is not always better because when you do maintenance on an airplane that means taking it apart and when you take it apart you can sometimes put it back together wrong," said Fairfax.