Anyone with experience working on a complicated problem with a tight time constraint has probably wished for “the answer”, or at least “a hint”.

This feeling likely rings true with many IT personnel. Downtime and failures are unavoidable because of complicated systems, the finite lifespans of underlying hardware resources, and human error. A single bug in a code snippet could bring down a whole site [1, 2]. No matter your safeguards, failures will occur. How quickly you can detect and recover from system failures is critical to maintaining system availability.

Consider an online retail store. At a certain point, the website stops responding to user requests. It may take a while before the IT administrator becomes aware of the problem, by which time a number of users have lost patience waiting for the webpage to load. The administrator then needs to pinpoint the cause of the problem. Speed is of the essence.

The administrator will run a number of checks, but troubleshooting is hampered by system complexity. Disparate systems that seem functional as individual parts are somehow dysfunctional as a whole.

As the IT administrator works to determine the problem, there are two possible scenarios. The fault may be a known one: the administrator knows what to check and how to fix it. Resolving this scenario relies largely on the experience of your IT personnel.

Or the fault is unknown and the administrator has never experienced a similar problem. In this instance, it is best to probe each component and rule out possibilities. Resolving this scenario quickly relies on the resourcefulness and aptitude of your IT personnel.

Indeed, this troubleshooting qualifies as a complicated problem with a tight time constraint. Today’s troubleshooting is largely a manual effort. Humans play a critical role in the whole process: from observing abnormal system signals, to identifying the problem, to applying the fix.

The task of identifying the system failure is urgent, yet completing the diagnostics to pinpoint the issue is time-consuming. The administrator will likely need to repeat each check-up step, and each repetition delays failure recovery, during which time the system may go offline completely. The situation can also worsen: as enterprise systems become more dynamic, a failure may trigger automated scaling mechanisms that work to alleviate the issue but also multiply the number of machines that need to be checked.

But the truth is out there. Systems monitoring captures the resource usage of the physical server, the virtual machine, and the application. Log files record the events that occur. This information can identify the fault; however, the answer is buried within a daunting amount of data, and a human has a hard time making sense of it all.

The key is to distinguish the failure from the system’s normal behaviors as well as from other failures. Each failure has a failure pattern: a profile of related events or resource usage metrics. Detecting this pattern is hard for humans, but is relatively easy for machines.

With the detected failure patterns, a running fault detection system can automatically identify a failure condition when the monitored symptoms match a pattern. This is not a big data play, which helps after the failure and may be useful in generating the patterns, but a real-time detection problem. With a bit of reasoning, even an unknown system failure could raise an alert if it comes close enough to a detected failure pattern.
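To make the matching idea concrete, here is a minimal sketch in Python. It assumes a failure pattern can be represented as a set of symptom labels and that "close enough" can be measured with Jaccard similarity against a threshold. The pattern names, symptom labels, and the 0.5 threshold are all hypothetical, chosen only for illustration; a real detection system would likely use richer profiles and a tuned similarity measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailurePattern:
    name: str
    symptoms: frozenset  # event or metric labels seen together during this fault


def jaccard(a, b):
    """Overlap between two symptom sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect(observed, patterns, threshold=0.5):
    """Return (pattern name, score) pairs scoring at or above threshold, best first."""
    observed = frozenset(observed)
    scored = [(p.name, jaccard(observed, p.symptoms)) for p in patterns]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical patterns; the labels are illustrative, not taken from a real system.
PATTERNS = [
    FailurePattern(
        "db_connection_exhaustion",
        frozenset({"db_conn_pool_full", "http_5xx_spike", "request_latency_high"}),
    ),
    FailurePattern(
        "disk_full",
        frozenset({"disk_usage_high", "log_write_error", "http_5xx_spike"}),
    ),
]

# Symptoms that only partly overlap a known pattern still raise an alert.
observed_now = {"db_conn_pool_full", "request_latency_high", "cpu_high"}
alerts = detect(observed_now, PATTERNS)
print(alerts)
```

Here the observed symptoms share two of four distinct labels with the first pattern, so it is reported as a partial match even though no pattern matches exactly. Lowering the threshold makes the detector more sensitive to unfamiliar failures at the cost of more false alerts.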

With an automated fault detection system, when the website stops responding to user requests, the administrator no longer needs to come up with the list of check-up steps. Instead, the fault diagnostic system generates a checklist automatically, so certain steps that were in the fixed checklist can be skipped. Other steps can be prioritised and automatically executed based on the suspected fault.
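One way such a checklist could be generated is sketched below in Python. It assumes each known fault maps to a list of check-up steps and that each suspected fault carries a match score; the fault names, step names, and scores are hypothetical. Steps tied to no suspected fault are skipped, and the rest are ordered by the combined score of the faults they help confirm.

```python
# Hypothetical mapping from fault to the check-up steps that help confirm it.
CHECKS_BY_FAULT = {
    "db_connection_exhaustion": ["check_db_connections", "check_app_logs"],
    "disk_full": ["check_disk_usage", "check_app_logs"],
    "network_partition": ["check_network_routes"],
}


def build_checklist(suspects):
    """Build an ordered checklist from suspected faults.

    suspects: dict mapping fault name to a match score.
    Returns step names, highest combined score first (ties broken by name).
    """
    priority = {}
    for fault, score in suspects.items():
        for step in CHECKS_BY_FAULT.get(fault, []):
            priority[step] = priority.get(step, 0.0) + score
    return sorted(priority, key=lambda step: (-priority[step], step))


checklist = build_checklist({"db_connection_exhaustion": 0.75, "disk_full": 0.25})
print(checklist)
```

In this example the network check is dropped because no suspected fault calls for it, and the log check moves to the top because it is relevant to both suspects.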

Beyond these tools, we can use an approach that focuses on the relationships between the monitoring metrics and log events to identify faults and create fault patterns. Specifically, a pattern is a grouping of events correlated with an experienced fault. These patterns don’t rely on deviation from a previously running system, but instead leverage events across running instances. Using the pattern, a running failure detection system can identify past faults and their triggers, which may provide early notification of possible unknown faults.
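As a rough illustration of how a pattern might be built from experienced faults, the Python sketch below keeps the events that recur across most recorded incidents of the same fault. This is an assumption about one simple way to group correlated events, not a description of a specific product; the incident data, event labels, and the 0.6 support threshold are hypothetical.

```python
from collections import Counter


def learn_pattern(incidents, min_support=0.6):
    """Keep events that appear in at least min_support of the incidents of one fault.

    incidents: list of event-label lists, one list per recorded occurrence.
    """
    # Count each event at most once per incident.
    counts = Counter(event for incident in incidents for event in set(incident))
    needed = min_support * len(incidents)
    return {event for event, n in counts.items() if n >= needed}


# Hypothetical event logs from three past occurrences of the same fault.
disk_full_incidents = [
    ["disk_usage_high", "log_write_error", "http_5xx_spike"],
    ["disk_usage_high", "log_write_error", "cron_job_started"],
    ["disk_usage_high", "http_5xx_spike", "log_write_error"],
]

pattern = learn_pattern(disk_full_incidents)
print(sorted(pattern))
```

The event that shows up in only one incident is treated as incidental noise and left out, while events seen in two or three of the incidents become part of the pattern.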

Humans will always play a critical role in identifying system failures and applying the remedy. However, as systems become more complex and IT becomes more critical, failure detection becomes an ever more complicated problem. Automated fault detection that looks for patterns within the log and monitoring data will give you the answer, or at least a hint.

Posted by Teresa Tung, Senior Manager, Accenture Technology Labs, and Qian Zhu, Researcher, Accenture Technology Labs