Netflix’s streaming service was knocked offline again on Christmas Eve 2012, when an Amazon developer inadvertently deleted part of the Elastic Load Balancing (ELB) state data used to manage the configuration of the ELB load balancers. The accident caused a number of ELBs to fail and eventually brought down businesses that rely on Amazon ELB, including Netflix.

Netflix has a strong reputation for designing and building highly resilient architectures for its services in the cloud. Netflix’s open-sourced Simian Army (which includes the famous Chaos Monkey) aims to improve resiliency by proactively failing parts of Netflix’s services to make sure the system still functions correctly. If Netflix observes any abnormal behavior during this proactive failure injection, its engineers address the issue so that the same failure won’t happen again. So why was Netflix impacted even with its proactive approach?

As mentioned in our previous article, simulating IT failures goes beyond turning off virtual machines (VMs) or adding delays randomly. In this case, the failed ELBs could not pass requests to the servers behind them. It may be that Netflix never expected the ELB to cause problems. Or perhaps Netflix did take ELB failures into account, but never had the opportunity to test this particular order of events.

Given that Netflix implements its streaming services for over a thousand different streaming devices (e.g., PS3, iPad and iPhone), where each group of similar devices depends on specific ELBs, it is hard to test all the combinations of scenarios. It may be that only certain combinations of requests and failed ELBs reveal the problems and so allow defenses to be built around them.

Netflix’s current Simian Army lacks a way to systematically simulate failure scenarios that consist of a sequence of events (rather than single failures) and to automatically run test cases that confirm the system’s resilience. We have identified some ways to address these limitations of the Simian Army.

First, what types of failures do we have? Beyond Chaos Monkey’s VM failures and delays, there are other failures, including machine crashes, network failures, disk failures, and CPU overload.

Second, where do failures occur? Failures can manifest at different levels in the cloud, e.g., at the VM level when a VM is overloaded, at the physical machine level when a disk drive becomes corrupted, at the availability zone level when a power outage causes all co-located machines to shut down, or at the region level when a natural disaster strikes. Failures can also happen within the services provided by the cloud provider or third-party vendors. To capture different levels of failures, we can map each virtual machine instance to groups, where instances in the same group are associated with the same cloud provider, region, physical machine, ELB, Auto Scaling Group (ASG), etc. Then we simulate different levels of failures by applying failures to these groups of VMs.
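The grouping idea can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the `Instance` fields, the `LEVELS` tuple, the function names and the sample fleet are all hypothetical, and nothing here calls a real cloud API. The physical machine level is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """A VM plus the groups it belongs to (all field names are hypothetical)."""
    instance_id: str
    cloud: str
    region: str
    zone: str
    asg: str
    elb: str


# Levels at which a failure can be applied, broadest first.
LEVELS = ("cloud", "region", "zone", "asg", "elb")


def build_groups(instances):
    """Map (level, group name) -> set of instance ids in that group."""
    groups = defaultdict(set)
    for inst in instances:
        for level in LEVELS:
            groups[(level, getattr(inst, level))].add(inst.instance_id)
    return dict(groups)


def blast_radius(groups, level, name):
    """Return the instances hit when the named group fails as a whole."""
    return sorted(groups.get((level, name), set()))


# Made-up fleet used only to exercise the functions above.
fleet = [
    Instance("i-1", "aws", "us-east-1", "us-east-1a", "asg-web", "elb-ps3"),
    Instance("i-2", "aws", "us-east-1", "us-east-1a", "asg-api", "elb-ipad"),
    Instance("i-3", "aws", "us-east-1", "us-east-1b", "asg-web", "elb-ps3"),
]
groups = build_groups(fleet)
print(blast_radius(groups, "zone", "us-east-1a"))
print(blast_radius(groups, "elb", "elb-ps3"))
```

Simulating a zone outage or an ELB failure then amounts to applying the chosen failure to every instance in the returned list.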

Third, how can we leverage the above failure types and failure location groups to mimic both expected and unexpected failures? It’s possible to list all failure scenarios, but it’s hard, if not impossible, to exhaustively try all the combinations. Thus we need a prioritized failure selection strategy. Suppose we create a set of failure coverage criteria where at least one failure is injected in each criterion, for example:

  • Cloud-coverage: When leveraging multiple clouds
  • Region-coverage: Using Amazon EC2’s definition of region representing VMs geographically dispersed in areas and countries
  • Availability Zone (AZ) coverage: AZs are distinct locations in the same region that are connected with low latency but shielded from each other’s failures
  • ASG-coverage: ASG is a service offered by Amazon which allows users to automatically scale up or down the capacity within the group based on the user-defined conditions
  • ELB-coverage: ELB is a service offered by Amazon; alternative load balancers, e.g., HAProxy, serve the same purpose
  • Machine-coverage: at least one failure is injected in each physical machine, if we can discern it
  • VM-coverage

The above ordering of criteria can govern the random selection of the VM and the injected failure. For example, if we kill an instance within a certain ASG this time, we would want to kill an instance in another ASG next time, as instances within one ASG tend to behave similarly. This prioritization limits the number of tests needed to explore the failure scenarios.
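One way to picture this prioritized selection is a greedy loop that keeps choosing the VM whose failure covers the most important group not yet covered, breaking ties at random. The sketch below is our own illustration, not part of the Simian Army: the level names, the `plan_failures` function and the sample fleet are hypothetical, and the VM level is omitted so that the plan stops once every group has seen one failure.

```python
import random

# Coverage levels, highest priority first (hypothetical names).
LEVELS = ["region", "zone", "asg", "elb"]


def plan_failures(instances, levels=LEVELS, seed=0):
    """Greedily order VMs to fail so every group gets at least one failure.

    `instances` maps a VM id to a dict of level -> group name.
    """
    rng = random.Random(seed)
    covered = set()            # (level, group) pairs already hit
    remaining = dict(instances)
    plan = []

    def gain(attrs):
        # One flag per level: 1 if failing this VM covers a new group there.
        # Tuples compare left to right, so higher levels win first.
        return tuple(int((lvl, attrs[lvl]) not in covered) for lvl in levels)

    while remaining:
        candidates = list(remaining.items())
        rng.shuffle(candidates)  # random choice among equally good VMs
        vm_id, attrs = max(candidates, key=lambda kv: gain(kv[1]))
        if not any(gain(attrs)):
            break                # nothing left adds new coverage
        plan.append(vm_id)
        covered.update((lvl, attrs[lvl]) for lvl in levels)
        del remaining[vm_id]
    return plan


# Made-up fleet: i-1 and i-2 sit in exactly the same groups.
fleet = {
    "i-1": {"region": "us-east", "zone": "1a", "asg": "web", "elb": "elb-a"},
    "i-2": {"region": "us-east", "zone": "1a", "asg": "web", "elb": "elb-a"},
    "i-3": {"region": "us-east", "zone": "1b", "asg": "api", "elb": "elb-b"},
    "i-4": {"region": "eu-west", "zone": "2a", "asg": "web-eu", "elb": "elb-c"},
}
plan = plan_failures(fleet)
print(plan)
```

Because i-1 and i-2 belong to identical groups, only one of them ends up in the plan, which is the saving the prioritization is meant to deliver.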

However, failure scenarios go beyond just one VM failure at a time. There are many cases where multiple failures occur simultaneously. For example, a network failure may prevent ELBs from routing requests to any VMs. As in the single-failure case, we should have a coverage criterion across groups of VMs where failures are injected at the same time.
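As a rough illustration of such a criterion, the sketch below enumerates every pair of groups and picks one VM from each to fail together. It assumes the groups are disjoint (for instance, ASGs), and the function name, group names and instance ids are made up for the example.

```python
import random
from itertools import combinations


def simultaneous_scenarios(groups, k=2, seed=0):
    """Build one scenario per combination of k groups.

    `groups` maps a group name to its VM ids; groups are assumed disjoint.
    Each scenario is (group names, one victim VM per group).
    """
    rng = random.Random(seed)
    scenarios = []
    for combo in combinations(sorted(groups), k):
        victims = tuple(rng.choice(sorted(groups[g])) for g in combo)
        scenarios.append((combo, victims))
    return scenarios


# Hypothetical ASGs and their instances.
asgs = {
    "asg-api": ["i-1", "i-2"],
    "asg-web": ["i-3", "i-4"],
    "asg-db": ["i-5"],
}
scenarios = simultaneous_scenarios(asgs)
for combo, victims in scenarios:
    print(combo, "->", victims)
```

With three groups and pairs of failures this yields three scenarios; raising `k` covers larger simultaneous failures at the cost of more tests.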

Last but not least, how do we generate test cases that determine whether our resiliency solution works when a failure happens? In the ELB failure, only requests from devices connected to the problematic ELBs were affected. Other devices, connected to undisturbed ELBs, experienced no change. To that end, we can associate test cases with each component so that when we fail a component we can select the right test cases to target it. This association provides a way to automatically test the correct functioning of the system.
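A minimal sketch of this association, using invented ELB and test names, is a lookup table from component to test cases. The tests tied to failed components are the ones that must show the system recovering; the rest can act as a control that should see no change.

```python
def select_tests(failed, tests_by_component):
    """Split tests into those targeting failed components and the rest."""
    targeted, control = [], []
    for component, tests in sorted(tests_by_component.items()):
        (targeted if component in failed else control).extend(tests)
    return targeted, control


# Hypothetical mapping of ELBs to the device tests that depend on them.
TESTS_BY_COMPONENT = {
    "elb-ps3": ["test_ps3_login", "test_ps3_playback"],
    "elb-ipad": ["test_ipad_playback"],
    "elb-iphone": ["test_iphone_playback"],
}

targeted, control = select_tests({"elb-ps3"}, TESTS_BY_COMPONENT)
print("run against the failure:", targeted)
print("expected to be unaffected:", control)
```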

The Simian Army is the right approach. But let’s give the monkey some directions. Sure, if Shakespeare can happen, so can random testing, but we don’t have forever. With some guidance, we can direct the Simian Army to systematically simulate and cover as many failure scenarios as possible.

Posted by Teresa Tung, Senior Manager, Accenture Technology Labs, and Qing Xie, Researcher, Accenture Technology Labs