The Outlook.com partial outage lasting 16 hours on Tuesday and Wednesday morning was caused by a firmware update gone awry that triggered a temperature spike in a Microsoft data centre, resulting in automatic safeguards that made a large number of servers inaccessible.
Because of the unspecified safeguards, downed servers couldn't fail over on their own, so restoration work had to be done manually, slowing down the process, according to a blog post by Microsoft Outlook.com Vice President Arthur de Haan.
De Haan apologized for the disruption of email access. "Outages are something we take very seriously and invest a significant amount of our time and energy in doing our best to prevent," he says.
His description of what actually happened doesn't detail what software was being updated, what went wrong, what overheated, what safeguards kicked in or how many servers were involved: "On the afternoon of the 12th, in one physical region of one of our datacenters, we performed our regular process of updating the firmware on a core part of our physical plant. This is an update that had been done successfully previously, but failed in this specific instance in an unexpected way. This failure resulted in a rapid and substantial temperature spike in the datacenter. This spike was significant enough before it was mitigated that it caused our safeguards to come in to place for a large number of servers in this part of the datacenter," de Haan's blog says.
"These safeguards prevented access to mailboxes housed on these servers and also prevented any other pieces of our infrastructure to automatically failover and allow continued access. This area of the datacenter houses parts of the Hotmail.com, Outlook.com, and SkyDrive infrastructure, and so some people trying to access those services were impacted."
There was no way to restore the affected infrastructure without human intervention, which he says "added significant time to the restoration."
Microsoft is working on improvements to prevent the same scenario from playing out in the future. "Now that we're through the resolution, we're also hard at work on ensuring this doesn't happen again," he says.