A university network brought to its knees when someone inadvertently plugged two network cables into the wrong hub. An employee injured after an ill-timed entry into a data centre. Overheated systems shut down after a worker changes a data centre thermostat setting from Fahrenheit to Celsius. These are just a few of the data centre disasters that have been caused not by technological malfunctions or natural catastrophes, but by human error.
According to the Uptime Institute, a research and consulting organisation that focuses on data centre performance, human error causes roughly 70% of the problems that plague data centres today. The group analysed 4,500 data centre incidents, including 400 full downtime events, says Julian Kudritzki, a vice president at the Uptime Institute, which recently published a set of guidelines for operational sustainability of data centres.
"I'm not surprised," Kudritzki says of the findings. "The management of operations is your greatest vulnerability, but also is a significant opportunity to avoid downtime. The good news is people can be retrained."
Whether it's due to neglect, insufficient training, end user interference, tight purse strings or simple mistakes, human error is unavoidable. And these days, thanks to the ever-increasing complexity of IT systems, and the related problem of increasingly overworked data centre staffers, even the mishaps that can be avoided often aren't, says Charles King, an analyst at Pund-IT.
"Whenever you mix high levels of complexity and overwork, the results are typically ugly," says King. And as companies become more reliant on technology to achieve their business goals, those mistakes become more critical and more costly.
Wrong worker, wrong cable
Take the example of the university data centre switch that overloaded because an IT worker mistakenly plugged two network cables into a downstream hub. That happened about four years ago at the Indiana University School of Medicine, according to Jeramy Jay Bowers, a security analyst at the school.
The problem arose out of less-than-optimal network design, says Bowers, who worked at the school as a system engineer at the time of the incident. The IT department for the school of medicine was split between two locations, with one room in the school of medicine building and another at the neighbouring university hospital, which was not an ideal setup to begin with, says Bowers.
The department had run fibre, a purple cable to be exact, from a switch in the first building to the second, running it up through the ceiling, through a set of doors and across to the hospital's administrative wing next door. That cable attached to a 12-port switch that sat in the hospital building's IT room, and staffers could easily disconnect from the school of medicine network and connect to the hospital network through a jack in the wall, Bowers explains.
One day, Bowers had taken some personal time and was out for a jog when his iPhone rang: the switch in the school of medicine's server room was overloaded, denying access to every service it hosted. "The green lights go on and off when packets pass through," he explains. "It had ramped up until the lights were more on than off."
Bowers quickly began troubleshooting over the phone. He was able to determine that nothing on the school of medicine's network had changed. Then he remembered that purple cable. He told his co-worker on the phone to unplug it, and activity on the switch went back to normal. Then he had his co-worker plug it back in and the switch overloaded again, proving that the problem was at the other end of the purple cable, in the university hospital building.
It turned out that an IT staffer who was normally based at a satellite location had come to the university hospital's IT room to work on a project and needed extra connectivity. He inadvertently created a loop by plugging two network cables from the university switch into a hub he had added to the network so he could attach additional devices. "So it kept trying to send data around in a circle, over and over," says Bowers. That, in turn, caused the switch in the school of medicine building to overload.
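For readers who want to see why a second cable is enough to cause trouble, the loop can be illustrated with a few lines of Python. This is only a minimal sketch, not a description of the tools the school used: the device names are hypothetical, the wiring is an assumed reading of Bowers' account, and the function simply flags any link that joins two devices that are already connected by some other path.

```python
def find_redundant_links(links):
    """Return the links that close a loop in an undirected cabling layout."""
    parent = {}

    def find(node):
        # Locate the representative of the group this device belongs to.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # shorten the chain as we go
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends are already connected, so this cable creates a loop.
            redundant.append((a, b))
        else:
            parent[root_a] = root_b
    return redundant


# Hypothetical device names; the layout is an assumption based on the account.
links = [
    ("med_school_switch", "hospital_switch"),  # the purple fibre run
    ("hospital_switch", "added_hub"),          # first cable into the hub
    ("hospital_switch", "added_hub"),          # second cable closes the loop
]
print(find_redundant_links(links))
```

In this sketch the first two cables are accepted and the third is reported, because the hub and the switch it plugs into are already linked. Managed switches commonly guard against this with loop-prevention features such as spanning tree; whether anything of the kind was enabled on the school's equipment is not stated.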
Bowers says the network was already cobbled together like that when he began working at the university, so he inherited a setup that a better approach to network planning and design would no doubt have flagged as problematic. But at least now the IT department knows one scenario to avoid going forward: Jury-rigged cabling and travelling techies can be a bad mix.
"We didn't do an official lessons learned [exercise] after this; it was just more of a 'don't do that again,'" says Bowers. However, this event, combined with another incident in which a user unwittingly set up a rogue wireless access point on the school of medicine's network and overloaded the switch, has convinced Bowers of one thing: "I hold to the concept that human errors account for more problems than technical errors," he says.