Cloud providers strive to offer high reliability for their services, but part of this promise assumes that tenants use the services correctly. Large-scale cloud failures do happen, and services can be interrupted. Tenants that do not properly architect for these cases can be greatly affected during these outages.

Cloud tenants cannot rely on traditional data centre-based availability solutions: tenants typically have little to no direct visibility into or control over the underlying infrastructure resources, so resolving failures in the underlying compute environment is left to the cloud provider. Cloud-based applications must therefore rely on other means.

Tenants augment reliability with high availability (HA) mechanisms that provision additional resources and redirect to them when existing resources in the compute environment are compromised. These HA mechanisms often rely on the tenant’s correct implementation of mechanisms offered by the cloud provider (e.g., auto-scaling).

Consider the case study published by Netflix. As a cloud tenant, Netflix has built and deployed its core business services (e.g. streaming movies, recommendation) on Amazon Web Services (AWS) since 2010. Bearing in mind that it’s ultimately the cloud tenants’ responsibility to architect resiliency into their applications to operate through and recover from failures, Netflix developed Chaos Monkey, a service that randomly turns off virtual machines (VMs) to proactively mimic environmental failures in order to test Netflix’s recovery mechanisms.  

In this way, Netflix can detect and fix flaws in the implementation of its recovery mechanisms for as many failure scenarios as possible. The goal is to learn from the failures so that Netflix won’t fail the same way twice.

Chaos Monkey targets VMs that run in AWS’s auto scaling groups. By default, an auto scaling group should automatically detect the termination of an instance and replace it with a new, identically configured instance.
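To make the terminate-and-replace cycle concrete, here is a minimal sketch in Python. It is a toy simulation, not Netflix's Chaos Monkey and not an AWS API: the `SimulatedAutoScalingGroup` class, the `unleash_monkey` function, and the launch configuration values are all hypothetical names invented for illustration.

```python
import random


class SimulatedAutoScalingGroup:
    """Toy stand-in for an auto scaling group (hypothetical, not an AWS API)."""

    def __init__(self, desired_capacity, launch_config):
        self.desired_capacity = desired_capacity
        self.launch_config = launch_config
        self.instances = {}  # instance id -> configuration
        self._next_id = 0
        self.reconcile()

    def _launch(self):
        # Every new instance gets a copy of the same launch configuration.
        self._next_id += 1
        instance_id = "i-%04d" % self._next_id
        self.instances[instance_id] = dict(self.launch_config)
        return instance_id

    def terminate(self, instance_id):
        del self.instances[instance_id]

    def reconcile(self):
        """Launch replacements until the group is back at desired capacity."""
        launched = []
        while len(self.instances) < self.desired_capacity:
            launched.append(self._launch())
        return launched


def unleash_monkey(group, rng):
    """Terminate one randomly chosen instance and return its id."""
    victim = rng.choice(sorted(group.instances))
    group.terminate(victim)
    return victim


def run_experiment(seed=0):
    rng = random.Random(seed)
    group = SimulatedAutoScalingGroup(3, {"image": "web-v1", "size": "small"})
    victim = unleash_monkey(group, rng)
    degraded_size = len(group.instances)  # capacity right after the failure
    replacements = group.reconcile()      # what the group should do on its own
    return victim, degraded_size, replacements, group
```

In a real deployment the reconcile step is performed by the provider, and the purpose of the experiment is to confirm that the tenant has configured the group so that this step actually happens.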

In addition to Chaos Monkey, Netflix created a Simian Army to verify HA mechanisms in response to other types of environment-based failures, such as an outage of an entire availability zone or artificial delays introduced into REST services to simulate service degradation. The point is to proactively simulate disruption to test the implementation of recovery mechanisms.
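The idea of introducing artificial delays can be sketched as a thin wrapper around a service call. This is an illustrative assumption about how such a delay might be injected, not a description of Netflix's tooling: `with_injected_delay` and `get_recommendations` are hypothetical names, and the sleep function is passed in so the behaviour can be checked without actually waiting.

```python
import random
import time


def with_injected_delay(func, delay_seconds, probability, rng=None, sleep=time.sleep):
    """Wrap func so that a fraction of calls are delayed before they run.

    probability is the chance (0.0 to 1.0) that a given call is delayed.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < probability:
            sleep(delay_seconds)  # simulated service degradation
        return func(*args, **kwargs)

    return wrapper


def get_recommendations(user_id):
    """Hypothetical service call standing in for a REST request."""
    return ["title-%d" % n for n in range(3)]


# Delay roughly one call in ten by two seconds.
degraded_recommendations = with_injected_delay(get_recommendations, 2.0, 0.1)
```

Callers of the wrapped function can then be observed to see whether their timeouts, retries, or fallbacks behave as intended when the dependency slows down.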

Chaos Monkey has helped Netflix improve its resiliency against cloud outages. Netflix services ran without interruption or intervention (albeit with higher latency and a higher-than-usual error rate) through an outage on 21 April.

However, an approach like Netflix’s is just the first step, and does not cover many scenarios.  Environment-based failures go beyond those based on turning off VMs or adding delays.  Other failures to consider include the following: 

  1. Network failures may cause a set of VMs to be unreachable.
  2. Problems can be caused by overloaded VMs rather than completely dead VMs.
  3. The purely random approach of turning off VMs does not guarantee that all VMs, or even the critical VMs, are tested.
  4. Outages can still be caused by security vulnerabilities in the systems.
  5. VM granularity is not enough: it is too small to verify data centre or regional failures, and too large to verify degradation or failures of the individual services running on a VM.
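As one illustration of the third point, a test plan can replace purely random selection with a shuffled schedule that visits every VM exactly once and puts the critical ones first. The sketch below is only one possible approach under that assumption; `plan_rounds` and the VM names are hypothetical.

```python
import random


def plan_rounds(vm_ids, critical_ids, rng):
    """Return an order in which to fail VMs: critical ones first, each VM once.

    Order within each group is still randomised, so the schedule stays
    unpredictable while guaranteeing full coverage.
    """
    critical = [vm for vm in vm_ids if vm in critical_ids]
    others = [vm for vm in vm_ids if vm not in critical_ids]
    rng.shuffle(critical)
    rng.shuffle(others)
    return critical + others


plan = plan_rounds(["web-1", "web-2", "db-1", "cache-1"], {"db-1"}, random.Random(0))
```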

Testing against a library of “what if” failure scenarios becomes an essential step in application deployments to the cloud.
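A library of “what if” scenarios could be organised along the following lines. This is a minimal sketch against a simulated environment, with every name (`Scenario`, `set_state`, `naive_recovery`, `run_library`, the VM names and states) invented for illustration. The recovery routine is deliberately naive so that the run shows which scenarios it fails to handle.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict

# VM name -> state: "healthy", "down", "overloaded" or "unreachable"
Environment = Dict[str, str]


@dataclass
class Scenario:
    name: str
    inject: Callable[[Environment], None]  # perturbs the environment in place


def set_state(vm_names, state):
    """Build an injector that puts the named VMs into the given state."""
    def inject(env):
        for name in vm_names:
            env[name] = state
    return inject


SCENARIOS = [
    Scenario("single VM terminated", set_state(["web-1"], "down")),
    Scenario("VM overloaded", set_state(["web-2"], "overloaded")),
    Scenario("network partition", set_state(["web-1", "web-2"], "unreachable")),
]


def naive_recovery(env):
    """Hypothetical recovery mechanism: only replaces VMs that are fully down."""
    for name, state in list(env.items()):
        if state == "down":
            env[name] = "healthy"


def run_library(baseline, scenarios, recover):
    """Apply each scenario to a fresh copy of the baseline and check recovery."""
    results = {}
    for scenario in scenarios:
        env = copy.deepcopy(baseline)
        scenario.inject(env)
        recover(env)
        results[scenario.name] = all(s == "healthy" for s in env.values())
    return results


baseline = {"web-1": "healthy", "web-2": "healthy", "db-1": "healthy"}
report = run_library(baseline, SCENARIOS, naive_recovery)
```

Here the report marks only the terminated-VM scenario as recovered, which is the kind of gap a scenario library is meant to expose before a real outage does.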

To ensure reliability, applications in the cloud require a new way of testing, one that perturbs the underlying environment. The role of the monkey is to cause problems and to create chaos.

Only then can you verify the effectiveness of your automated recovery mechanisms. When testing for the cloud, introducing failures becomes an essential step. You need to fail often so you don’t fail when it counts.

Posted by Teresa Tung, Manager, Accenture Technology Labs and Qing Xie, Researcher, Accenture Technology Labs