Microsoft is granting customers affected by the string of outages of its Azure cloud services on February 29 a 33% credit for the entire billing month.
The problem stemmed from two overlapping circumstances: that February 29 comes around only every four years and that when Azure initialises virtual machines for customer applications, a certificate is exchanged and given a valid-to date of one year. When certificates were issued starting 4pm PST on February 28, they were given a valid-to date of February 29, 2013 which won't occur and was therefore interpreted as invalid.
This glitch set off a series of retries that also failed and led to the conclusion by the system that the hardware on which the virtual machines were running had failed. That led to attempts to migrate the failed virtual machines to other server hardware within the same Azure cluster, which consists of about 1,000 physical servers.
The migrated VMs also failed to initialise for the same reason and more and more hardware was judged failed until a threshold was reached that halted all attempts to reincarnate virtual machines anywhere in affected clusters. That allowed those clusters to stay in service at reduced levels, the Azure blog said.
Azure also shut down the customer service management platform so customers couldn't add applications or expand capacity for running applications, both of which would have made the problem worse by calling for new virtual machines. "This is the first time we've ever taken this step," Microsoft explained. Running applications were left intact.
It took 13 hours and 23 minutes to patch the bug in all but seven Azure clusters. Those seven were in the midst of a software upgrade, and so posed a separate problem. Should the host agents and guest agents that were exchanging the invalid certificates be upgraded to the newest patched versions or restored to the old versions but patched?
They decided on the latter, but that didn't work out because they didn't also revert to an earlier version of the network plug-in that configures a virtual machine's network. The new network plug-in was incompatible with the old host agents and guest agents. The result was that all virtual machines in those seven clusters were disconnected from the network.
The affected clusters included servers for Access Control Service (ACS) and Windows Azure Service Bus, both of which failed as a result. Cleaning up this problem entirely took until early morning on March 1.
Microsoft is taking three steps to prevent a similar problem. First, it will test for time incompatibilities in its software. It will also change its fault isolation so the system doesn't assume a hardware failure in this type of circumstance. And third, it will allow for a graceful degradation of customer management rather than turning the platform off altogether. This will allow blocking new virtual machines or expansion of old ones but continue to allow management of existing virtual machines.
The company is also upgrading its detection so issues are discovered and addressed more quickly. It is also upgrading the customer dashboard to remain more available in crises.
Because customer service lines were swamped, customers had to wait a long time for help, so the company is reevaluating staffing and considering better use of blogs, Twitter and Facebook to get the word out about problems.
To help during recovery from outages, the company is creating internal software tools, setting priorities to reestablish customer services more quickly and give customers better visibility into what progress is being made to restore services.