I recently joined my team in troubleshooting a complex infrastructure problem affecting the private cloud that hosts our community electronic health records system. The incident put me in mind of the things I have learned from such experiences over the years.
1. Once the problem is identified, ascertain the scope. Call the users and ask them what they are experiencing. Test the application or infrastructure yourself. Do not trust the monitoring tools if they indicate all is well but the users are complaining.
2. If the scope of the outage is large and the root cause is unknown, raise the alarm early. It's far better to make an early all-hands intervention with occasional false alarms than to intervene too late and have an extended outage because of a slow response.
3. Bring visibility to the process by having hourly updates, frequent bridge calls and multiple eyes on the problem. Sometimes technical people become so focused that they lose their sense of time passing and their insight into what they do not know. A multidisciplinary approach with predetermined progress reports prevents working in isolation and the pursuit of solutions that are unlikely to succeed.
4. Although frequent progress reports are important, you must allow the technical people to do their work. Senior management feels a great deal of pressure to resolve the situation. However, if 90% of the incident response effort is spent informing senior management and managing hovering stakeholders, then the heads-down work to resolve the problem cannot get done.
5. Remember Occam's razor: The simplest explanation is usually the correct one. In our recent incident, all the evidence pointed to a malfunctioning firewall component. But all vendor testing and diagnostics indicated the firewall was functioning perfectly. Some hypothesised that we had a very specific denial of service attack. Others suggested a failure of Windows networking components within the operating systems of the servers. Others thought we had an unusual virus attack. We tested the simplest explanation by removing the firewall from the network, and everything came back up instantly. It's generally true that complex problems can be explained by a single simple failure.
6. It's very important to set deadlines in the response plan to avoid the "just one more hour and we'll solve it" problem. This is especially true if the outage is the result of a planned infrastructure change. Set a backout deadline and stick to it. It is the same when I climb or hike: I set a time to turn around. Summiting is optional, but returning to the car is mandatory. Setting milestones for changes in course and sticking to your plan regardless of emotion is key.
7. Over-communicate to users. Most stakeholders are willing to tolerate downtime if you explain the actions being taken to restore service. Members of senior management need to show their commitment, presence and leadership during the incident.
8. Do not let pride get in the way. It's hard to admit mistakes, and challenging to acknowledge what you do not know. There should be no blame or finger-pointing during an outage resolution. After-action debriefs can examine the root cause and suggest process changes to prevent outages in the future. Focus on getting the users back up rather than maintaining your ego.
9. Do not declare victory prematurely. It's tempting to assume the problem has been fixed and tell the users all is well. I recommend at least 24 hours of uninterrupted service under full user load before declaring victory.
10. Overall, IT leaders should focus on their trajectory, not their day-to-day position. Outages can bring many emotions: fear for your job, anxiety about your reputation, sadness for the impact on the user community. Realise that time heals all and that individual outage incidents will be forgotten. By taking a long view of continuous quality improvement and evolution of functionality rather than being paralysed by short-term outage incidents, you will succeed over time.
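Lessons 6 and 9 both come down to clock rules agreed before emotion takes over, so they are easy to write down ahead of time. The sketch below is only an illustration: the function names and parameters are hypothetical, and the only figure taken from the lessons above is the 24-hour window.

```python
from datetime import datetime, timedelta

# Lesson 9: at least 24 hours of uninterrupted service under full load.
STABILITY_WINDOW = timedelta(hours=24)


def must_back_out(now: datetime, backout_deadline: datetime, resolved: bool) -> bool:
    """Lesson 6: if the agreed deadline arrives without a fix, back out."""
    return not resolved and now >= backout_deadline


def can_declare_victory(now: datetime, last_disruption: datetime) -> bool:
    """Lesson 9: wait a full stability window after the last disruption."""
    return now - last_disruption >= STABILITY_WINDOW
```

The backout deadline itself is a judgment call made when the change is planned. The value of writing it down is that nobody renegotiates it at 3 a.m.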
Outages are painful, but they can bring people together. They can build trust, foster communication and improve processes by testing downtime plans in a real-world scenario. The result of our recent incident was a better plan for the future, improved infrastructure and a universal understanding of the network design among the entire team: an excellent long-term outcome. I apologised to all the users for a very complex firewall failure, and we've moved on to the next challenge: regaining the trust of our stakeholders and enhancing clinical care with secure, reliable and robust infrastructure.