This is the first in a series of stories on datacentre improvements. Some of the ideas will increase capacity, others will increase redundancy, and the rest will improve the overall efficiency and reliability of the electrical and mechanical infrastructure.

Each of these suggestions has been installed and tested in live environments.

Datacentre staffers are challenged when processing capacity is increased within existing facilities. While the reliability of hardware, software and networks has been improving, electrical and mechanical infrastructure improvements lag behind. Forensic evaluations of datacentre failures demonstrate that operator errors, electrical and mechanical single points of failure, design problems, and construction defects are the leading causes of datacentre disruptions.

This situation is bound to be made worse as more datacentres are relocated or expanded over the next five years. Rakesh Kumar, an analyst at Gartner, said that more than 70 percent of the Global 1,000 organisations will have to modify their datacentre facilities significantly during the next five years.

"These legacy datacentres typically were built to a design specification of about 100 to 150 watts per square foot. Current design needs are about 300 to 400 watts per square foot. By 2011, this could rise to more than 600 watts per square foot," Kumar said. "The implication is that most current datacentres will be unable to host the next generation of high-density equipment, so CIOs will have to refurbish their established sites, build new ones or look for alternatives, such as using a hosting provider."

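To make the density gap concrete, here is a minimal sketch of the arithmetic. The 10,000-square-foot room is a hypothetical figure chosen purely for illustration; the watts-per-square-foot values are the ones Kumar quotes. The sketch assumes the room's power and cooling budget stays fixed at its original design value.

```python
# Illustrative arithmetic only. The room area is a hypothetical example;
# the watts-per-square-foot figures are those quoted by Kumar above.

def usable_area_sq_ft(room_area_sq_ft, design_w_per_sq_ft, equipment_w_per_sq_ft):
    """Floor area that can be filled before the room's design power runs out."""
    power_budget_w = room_area_sq_ft * design_w_per_sq_ft
    # The room cannot host more equipment than it has floor space for.
    return min(room_area_sq_ft, power_budget_w / equipment_w_per_sq_ft)

ROOM_AREA = 10_000    # hypothetical legacy room, square feet
LEGACY_DESIGN = 150   # upper end of the legacy design range, W/sq ft

for density in (300, 400, 600):
    area = usable_area_sq_ft(ROOM_AREA, LEGACY_DESIGN, density)
    print(f"{density} W/sq ft equipment fills {area:,.0f} sq ft "
          f"({area / ROOM_AREA:.0%} of the room)")
```

Under these assumptions, a room built to 150 watts per square foot could fill only a quarter of its floor with 600-watt-per-square-foot equipment before exhausting its design power.
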
Unfortunately, the compaction of space required by IT hardware has resulted in unprecedented increases in power and cooling needs, outstripping facility infrastructure, design standards and space allocations. "Back of the house" spaces for power and cooling to support high-density computing are, in many datacentres, larger than the computer area itself. Electrical and mechanical areas can be 400 percent larger than the raised-floor computing space in 250-watt-per-square-foot environments.

At the same time, facility infrastructure support is short-changed because datacentres represent such a small portion of the real estate market and because their costs are small relative to revenue. Datacentres represent less than one tenth of one percent (0.1 percent) of all real estate construction in the US. In addition, these are lightly occupied or unoccupied buildings. Some are actually "lights-out" facilities that are fully automated, without any occupants.

Also, in a datacentre environment, annual facility costs, including infrastructure depreciation, are as little as 0.5 percent (one-half of one percent) of the IT budget. In a large company, the costs to operate and maintain the electrical and mechanical infrastructure can be less than one-thousandth of one percent of annual revenue, less than a rounding error. These small costs don't generally get much high-level attention.

Furthermore, datacentres may be small areas located within a much larger building, camouflaging the true operational risks and utility expenses. For example, an international pharmaceutical company recently migrated a 1,000-square-foot, high-density server room into its 50,000-square-foot office building. The utility bill for the entire building doubled - and has remained at that level for the past nine months.
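
The pharmaceutical example implies a striking difference in power density. The sketch below is a rough estimate, not a measurement: it assumes the utility bill scales with energy use, that the office load did not change, that the entire increase is due to the server room, and that the server room occupies 1,000 of the building's 50,000 square feet. The function name and parameters are hypothetical.

```python
# Rough estimate only; every assumption is stated in the text above.

def density_ratio(building_sq_ft, server_room_sq_ft, bill_multiplier):
    """How many times more energy per square foot the server room uses
    than the office space, with the pre-migration bill normalised to 1.0."""
    office_sq_ft = building_sq_ft - server_room_sq_ft
    # Everything above the original bill is attributed to the server room.
    server_share = bill_multiplier - 1.0
    # (server_share / server area) divided by (1.0 / office area)
    return server_share * office_sq_ft / server_room_sq_ft

print(density_ratio(50_000, 1_000, 2.0))
```

Under these assumptions, the server room uses roughly 49 times as much energy per square foot as the office space around it, which is why a single small room can hide inside a large building's utility bill.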

EPO problems

This leads us to the first low-cost, low-risk, high-benefit opportunity for improving the reliability of your datacentre's critical power system: inspect your emergency power off (EPO) switch.

These innocuous buttons are located at the exits of most datacentres. Once pushed, critical power is shut down and can be reactivated only manually, typically by an electrician who knows the system. Disruptive events due to EPO include abnormal incidents that have shut down emergency 911 access and that have interrupted international trading, corporate accounting, pharmaceutical research and air traffic control.

Virtually every industry that relies on central datacentre functions has experienced EPO disruptions.

While some EPO disruptions were caused by faulty wiring, under-floor cable pulls snagging the EPO conduit, water leaks and poor maintenance, the majority of EPO shutdowns were caused by a person pushing an EPO button in error. In many cases, the occupant was pushing buttons near the exit in the belief that they would deactivate magnetic security locks.

In at least one recent case, the EPO disruption was done on purpose: a systems administrator shut down a datacentre that controls the Californian electrical grid.

Hundreds of EPO incidents are reported annually in datacentres across the US. These are the same facilities where millions of dollars were originally invested to achieve electrical fault-tolerance and continuous availability. Every IT, network and telecommunications component powered in a raised-floor area is at risk.

Still, in the US, the EPO button is required by Articles 645.10 and 645.11 of the National Electrical Code. These rules mandate that computer rooms have an EPO system at each exit to disable power under the raised floor as well as to disable power to air conditioning that supplies cooling to the raised floor. By code, the disconnection mechanism may be a single button or two adjacent buttons - one for power, the other for cooling. The Provision And Use Of Work Equipment Regulations in the UK aren't as specific but the implementation is often the same or similar.

But all too often, these EPO buttons are placed next to the many other exit-mounted devices, including fire-suppression release/abort buttons, light switches, security card readers, fire extinguishers, fire alarm panels, telephones, security intercoms and exit buttons.

This confusing conglomeration next to the exit door can easily allow datacentre occupants to select the EPO when they were simply trying to turn on the lights or call security.

Even momentary pushes on the EPO button will shut down the datacentre and require maintenance staffers to reset all tripped electrical devices. Electrical reset could take up to 30 minutes - this in an environment where a fraction of a second can cause irreparable damage to hardware, databases and corporate profits.

It is probable that this single point of failure is one of the leading causes of critical power loss in the US. These electrical disruptions occur with the same regularity as utility disruptions, engine-generator failures and nuisance circuit-breaker trips, but they are generally not seen as failures. Because someone physically pushed the button, even if by mistake, these events are written off as accidents rather than counted alongside utility disruptions.

EPO button

But there is a way of making the EPO button less hazardous to your datacentre's health. There is a protocol that has been tried for more than a decade in dozens of datacentres around the country. It could be implemented in your datacentre within a few hours and for a few hundred dollars per exit - truly a small price to pay to eliminate a common source of risk in a modern datacentre.

In the photo above, note that the EPO is clearly marked as the "Emergency Power Off Button." The intent is to distinguish it from the other devices at the doorway of the datacentre. Note that the cover over the EPO has a keyed lock, but the key is already inserted. Opening the case will need to be very intentional - but if a real emergency existed, the lock would not be an impediment.

Under the cover is a battery-operated micro-switch that sounds an audible 90-decibel (piercingly loud) alarm, while instantly alerting security through a second micro-switch that the cover has been lifted. A phone is a few feet away from the EPO switch.
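
The behaviour of the covered EPO station can be sketched as a small state model. This is an illustration of the logic only, not control code for real hardware; the class and method names are hypothetical, and it assumes the closed cover physically blocks the button.

```python
from dataclasses import dataclass, field

@dataclass
class GuardedEpo:
    """Hypothetical model of the covered EPO station described above."""
    cover_open: bool = False
    power_on: bool = True
    events: list = field(default_factory=list)

    def lift_cover(self):
        # Two micro-switches sit under the cover: one drives the local
        # 90-decibel alarm, the other signals security.
        self.cover_open = True
        self.events.append("local alarm sounding")
        self.events.append("security alerted")

    def press_button(self):
        # Assumption: the closed cover keeps the button out of reach.
        if not self.cover_open:
            self.events.append("press blocked by cover")
            return
        self.power_on = False
        self.events.append("critical power shut down")

epo = GuardedEpo()
epo.press_button()   # accidental bump with the cover closed: no effect
epo.lift_cover()     # deliberate act: alarm and security alert fire first
epo.press_button()   # only now is critical power cut
for event in epo.events:
    print(event)
```

The point of the design is the ordering: anyone who reaches the button has already triggered a loud alarm and a security alert, which gives a mistaken occupant a chance to stop.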

Additional EPO requirements include having a system that can be serviced and maintained fail-safe. That means it can be maintained while critical load is being powered. Many clients are terrified of changing a burned-out light bulb on their EPOs for fear of accidentally shutting down their datacentres.

Some other designs we have developed require the closure of two latching push buttons that need a key to release. Others have the alarm switch in the EPO cover simultaneously cause a video camera to rotate and film the EPO because disgruntled employees have perpetrated some malicious EPO activations.
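
The two-button variant can be sketched the same way. The code below reflects one reading of that design, namely that power is cut only while both buttons are latched and that a latched button stays latched until released with a key. The names are hypothetical and the sketch is illustrative only.

```python
class LatchingButton:
    """Hypothetical latching push button that needs a key to release."""

    def __init__(self):
        self.latched = False

    def push(self):
        self.latched = True

    def release(self, key_present):
        # Without the key the button stays latched.
        if key_present:
            self.latched = False
        return not self.latched

def shutdown_requested(first, second):
    # Assumption: both buttons must be latched before power is cut,
    # so a single stray push cannot shut down the room.
    return first.latched and second.latched

a, b = LatchingButton(), LatchingButton()
a.push()
print(shutdown_requested(a, b))   # one button alone is not enough
b.push()
print(shutdown_requested(a, b))   # both latched: shutdown is requested
```

Requiring two deliberate actions trades a little speed in a real emergency for a large reduction in accidental activations.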

As a final consideration, the label can be expanded to read "EPO (emergency power off) button. This will shut down all equipment in this room. Use for life saving emergency only." It may be a good idea to have the sign in an alternative language that is used by non-English-speaking occupants.

More than 30 years ago, code officials determined the need for EPOs because power installed under a raised floor could start a concealed fire. Also, since there are so many circuit breakers in a datacentre, it is difficult to determine a source for disconnecting if someone is being electrocuted. Modern components that mitigate the need for EPO include fire-/smoke-detection systems under the raised floor and "ground fault (GFI)" settings in circuit breakers.

Actual cases where EPO activation has saved lives are non-existent. The Canadians are intentionally attempting to remove this requirement from their codes. Unfortunately, as with many building-code provisions, once a requirement is approved it is very difficult to exorcise from the code books.

Edward C. Koplin, professional engineer and certified energy manager, is a principal of consultancy X-nth. He has been an advisor to the Site Uptime Institute and has evaluated, designed and commissioned over 3 million square feet of Fortune 500 datacentres. He can be reached at [email protected]