A system configuration mistake caused the outage that affected Windows Azure customers in western Europe last week, according to Microsoft.
As a result, Microsoft's public cloud application hosting and development platform was unavailable for about two and a half hours on August 2. Microsoft didn't say how many customers were affected.
At issue was a "safety valve" mechanism in the Azure network infrastructure designed to prevent cascading network failures. It does so by capping the number of connections that network hardware devices accept.
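Microsoft has not published how the mechanism is implemented. As a rough illustration only, a connection cap of this kind can be sketched as a counter that refuses new connections once a fixed limit is reached. The class name `ConnectionCap` and its methods are hypothetical and are not taken from Azure.

```python
class ConnectionCap:
    """Hypothetical "safety valve": refuses connections beyond a fixed limit."""

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.active = 0

    def try_accept(self):
        # Reject the connection if the device is already at its cap.
        if self.active >= self.max_connections:
            return False
        self.active += 1
        return True

    def release(self):
        # Free a slot when a connection closes.
        if self.active > 0:
            self.active -= 1


# Illustrative usage: a cap of two accepts two connections and rejects the third.
cap = ConnectionCap(max_connections=2)
results = [cap.try_accept() for _ in range(3)]
```

In this sketch the cap is a static number, so it only protects the device as long as someone keeps it in line with the capacity behind it.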
"Prior to this incident, we added new capacity to the West Europe sub-region in response to increased demand. However, the limit in corresponding devices was not adjusted during the validation process to match this new capacity," said Mike Neil, Windows Azure general manager.
A sudden rise in the affected cluster's usage led to the "safety valve" threshold being exceeded, which generated a storm of network management alerts. "The increased management traffic in turn triggered bugs in some of the cluster's hardware devices, causing these to reach 100% CPU utilisation impacting data traffic," Neil said.
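The mismatch Neil describes, capacity going up while device limits stay put, is the kind of thing a pre-deployment check could flag. The following is a minimal sketch under stated assumptions: the device records, the field names `capacity_units` and `connection_limit`, and the `connections_per_unit` ratio are all invented for illustration and do not reflect Microsoft's validation process.

```python
def find_stale_limits(devices, connections_per_unit):
    """Return devices whose connection limit is below what their capacity needs.

    `devices` is a list of dicts with hypothetical keys:
    "name", "capacity_units" and "connection_limit".
    """
    stale = []
    for device in devices:
        required = device["capacity_units"] * connections_per_unit
        if device["connection_limit"] < required:
            stale.append((device["name"], device["connection_limit"], required))
    return stale


# Illustrative inventory: capacity on "switch-b" was raised, its limit was not.
inventory = [
    {"name": "switch-a", "capacity_units": 10, "connection_limit": 1000},
    {"name": "switch-b", "capacity_units": 20, "connection_limit": 1000},
]
stale = find_stale_limits(inventory, connections_per_unit=100)
```

Here `stale` lists `switch-b` with its current limit of 1000 and a required limit of 2000, which is the signal an operator would want before the new capacity goes live.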
Microsoft resolved the immediate problem by raising the affected cluster's "safety valve" limits. To prevent a recurrence, the company is patching the identified bugs in the networking hardware devices and improving its network monitoring systems so that they can identify and address connectivity issues before they cause outages.
Forrester Research analyst James Staten said that PaaS (platform as a service) clouds such as Azure are very complex and highly automated environments, and sometimes glitches crop up in production that can't be anticipated in test environments. "This appears to be one of those cases," he said.
Over time, as new features, greater use and other factors enter the equation, administrators have to adjust and optimise the running system, and occasionally something will break, he said.
"Should it be something clients should be concerned about? Not really. It is an example of the kinds of things that can happen in a cloud environment. But far worse things are more common in a typical enterprise data center," Staten said.
IT chiefs and developers planning to host applications in the cloud need to configure and design them to be fault tolerant. "That is a fundamental shift in thinking most developers and enterprise operations teams need to understand when embarking on cloud deployments," he said.
"These types of outages are learning opportunities for both the cloud admins and cloud customers. Rather than view these incidents as indictments of cloud, they should be seen as opportunities to improve your use of the cloud," he added.
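Staten does not prescribe a particular technique for fault tolerance. One common pattern, shown here only as a sketch, is to retry a failed call with exponential backoff and fall over to a deployment in another region. The function `call_with_failover`, the region names and the `fake_request` stand-in are hypothetical; a real application would substitute its own client call and error types.

```python
import time


def call_with_failover(endpoints, request, retries=3, base_delay=0.1, sleep=time.sleep):
    """Try each endpoint in order; back off and retry if all of them fail."""
    last_error = None
    for attempt in range(retries):
        for endpoint in endpoints:
            try:
                return request(endpoint)
            except ConnectionError as exc:
                last_error = exc  # remember the failure and try the next endpoint
        if attempt < retries - 1:
            # Exponential backoff between rounds: 0.1s, 0.2s, 0.4s, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all endpoints failed") from last_error


# Illustrative usage: the primary region is down, the secondary one answers.
def fake_request(endpoint):
    if endpoint == "west-europe":
        raise ConnectionError("region unavailable")
    return f"served by {endpoint}"


result = call_with_failover(["west-europe", "north-europe"], fake_request)
```

With the primary region failing, `result` comes from the secondary endpoint, so callers are not affected as long as at least one region responds.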