Views from the Lab

Accenture Technology Labs Staff

Testing for cloud - Unleashing a barrel of monkeys


Cloud reliability is an important issue, highlighted by the National Institute of Standards and Technology (NIST) in May 2012.

Indeed, cloud providers strive to offer high reliability for their services, but part of this promise assumes that tenants use the services correctly. Large-scale cloud failures do happen and services can be interrupted. Tenants that do not properly architect for these cases can be severely affected during these outages.

Cloud tenants cannot rely on traditional data centre-based availability solutions: tenants typically have little to no direct visibility or control over the underlying infrastructure resources, so resolution of failures in the underlying compute environment is left to the cloud provider. Cloud-based applications must therefore leverage other means.

Tenants augment reliability with high availability (HA) mechanisms that provision new resources, and redirect traffic to them, when existing resources in the compute environment are compromised. These HA mechanisms often rely on the tenant's correct use of facilities offered by the cloud provider (e.g. auto-scaling).

Consider the case study published by Netflix. As a cloud tenant, Netflix has built and deployed its core business services (e.g. movie streaming and recommendations) on Amazon Web Services (AWS) since 2010. Bearing in mind that it is ultimately the cloud tenant's responsibility to architect resiliency into applications so they operate through and recover from failures, Netflix developed Chaos Monkey, a service that randomly turns off virtual machines (VMs) to proactively mimic environmental failures and test Netflix's recovery mechanisms.

In this way, Netflix can detect and fix flaws in its recovery mechanisms across as many failure scenarios as possible. The goal is to learn from failures so that Netflix won't fail the same way twice.

Chaos Monkey targets the scenario of VMs running in AWS auto scaling groups. By default, an auto scaling group should automatically detect the termination of an instance and replace it with a new, identically configured instance.
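The interaction between the monkey and the auto scaling group can be sketched in a few lines. The sketch below is a toy simulation, not Netflix's actual implementation or a real AWS API: the `AutoScalingGroup` class and its methods are illustrative stand-ins for the provider's behaviour.

```python
import random

class AutoScalingGroup:
    """Toy stand-in for a cloud auto scaling group (illustrative, not a real AWS API)."""
    def __init__(self, desired_capacity):
        self.desired_capacity = desired_capacity
        self.instances = {f"i-{n:04d}" for n in range(desired_capacity)}
        self._next_id = desired_capacity

    def terminate(self, instance_id):
        self.instances.discard(instance_id)

    def reconcile(self):
        # The group detects missing capacity and launches identically
        # configured replacement instances.
        while len(self.instances) < self.desired_capacity:
            self.instances.add(f"i-{self._next_id:04d}")
            self._next_id += 1

def chaos_monkey(group, rng):
    """Randomly terminate one running instance, mimicking an environment failure."""
    victim = rng.choice(sorted(group.instances))
    group.terminate(victim)
    return victim

rng = random.Random(42)
group = AutoScalingGroup(desired_capacity=3)
victim = chaos_monkey(group, rng)   # failure injected
group.reconcile()                   # HA mechanism restores capacity
```

The test of the tenant's HA implementation is whether, after `reconcile`, capacity is back to the desired level without any manual intervention.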

In addition to Chaos Monkey, Netflix created a whole Simian Army to verify HA mechanisms against other types of environment-based failure, such as an outage of an entire availability zone, or artificial delays introduced into REST services to simulate service degradation. The point is to proactively simulate disruption in order to test the implementation of recovery mechanisms.
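The delay-injection idea can be illustrated with a small wrapper around a service call. This is a minimal sketch of the technique, not Netflix's Latency Monkey itself; the decorator name and the sample service are invented for illustration.

```python
import functools
import time

def inject_latency(delay_seconds):
    """Wrap a service call with an artificial delay to simulate degradation."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(delay_seconds)  # the injected "failure"
            return func(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency(0.05)
def get_recommendations(user_id):
    # Hypothetical downstream REST call, stubbed out for the sketch.
    return ["movie-a", "movie-b"]

start = time.monotonic()
result = get_recommendations("user-1")
elapsed = time.monotonic() - start
```

Callers still get correct answers, just slower; the question under test is whether timeouts, fallbacks, and circuit breakers upstream cope with the degradation.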

Chaos Monkey has helped Netflix improve its resiliency against cloud outages. Netflix services ran without interruption or manual intervention (albeit with higher latency and a higher-than-usual error rate) through an outage on 21 April.

However, an approach like Netflix's is only a first step, and does not cover many scenarios. Environment-based failures go beyond turning off VMs or adding delays. Other failures to consider include the following:

  1. Network failures may cause a set of VMs to be unreachable.
  2. Problems can be caused by overloaded VMs rather than completely dead VMs.
  3. The purely random approach of turning off VMs does not guarantee that all VMs, or even the critical VMs, are tested.
  4. Outages can still be caused by security vulnerabilities in the systems.
  5. VM granularity is not enough: it is not coarse enough to verify data centre or regional failures, nor fine enough to verify degradation or failures of the services running on the VM.
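Moving beyond purely random termination suggests a library of targeted scenarios, each pairing a specific failure injection with a check of the recovery mechanism. The sketch below shows one way such a scenario runner could look; the toy "system", the scenario names, and the checks are all hypothetical.

```python
def run_scenarios(scenarios, system):
    """Run each 'what if' failure scenario against the system and record
    whether its recovery check passed. Purely illustrative harness."""
    results = {}
    for name, inject, verify_recovery in scenarios:
        inject(system)                     # perturb the environment
        results[name] = verify_recovery(system)  # did HA cope?
    return results

# A toy system: a set of reachable nodes plus a load figure per node.
system = {"nodes": {"a", "b", "c"}, "load": {"a": 0.2, "b": 0.3, "c": 0.4}}

scenarios = [
    ("network partition: node c unreachable",
     lambda s: s["nodes"].discard("c"),
     lambda s: len(s["nodes"]) >= 2),                      # quorum survives?
    ("overload node a",
     lambda s: s["load"].update(a=0.99),
     lambda s: any(v < 0.5 for v in s["load"].values())),  # spare capacity left?
]

report = run_scenarios(scenarios, system)
```

Unlike random VM termination, this style guarantees that each named failure mode, including network partitions and overload rather than outright death, is exercised at least once.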

Testing against a library of "what if" failure scenarios becomes an essential step in deploying applications to the cloud.

To ensure reliability, cloud applications require a new way of testing, one that perturbs the underlying environment. The role of the monkey is to cause problems and create chaos.

Only then can you verify the effectiveness of your automated recovery mechanisms. When testing for the cloud, introducing failures becomes an essential step. You need to fail often so you don't fail when it counts.

Posted by Teresa Tung, Manager, Accenture Technology Labs and Qing Xie, Researcher, Accenture Technology Labs
