Cloud’s on-demand value proposition is attractive, but concerns remain about its reliability. The service levels offered by most providers are basic: prominent providers promise 99.95% availability. Service response also varies.

Our recent measurements of a widely known Infrastructure-as-a-Service (IaaS) provider find that the time-to-start an instance, that is, the time from requesting a new virtual machine instance until it is ready to use, varies from a couple of minutes for a Linux machine to 10-20 minutes for a Windows machine. I/O performance varies by a factor of 6, and network performance by a factor of 10. Cloud can be predictably unpredictable.
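
The post does not say how these measurements were taken. As one plausible way to measure time-to-start, here is a minimal sketch assuming AWS and its boto3 SDK; the region, AMI IDs and instance type are placeholders, and the same timing loop would apply to any provider's provisioning API.

    import time
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # region is a placeholder

    def time_to_start(image_id, instance_type="t3.micro"):
        """Time from requesting a new instance until it is ready to use."""
        start = time.time()
        resp = ec2.run_instances(ImageId=image_id, InstanceType=instance_type,
                                 MinCount=1, MaxCount=1)
        instance_id = resp["Instances"][0]["InstanceId"]
        # Block until the instance passes its status checks, i.e. is usable.
        ec2.get_waiter("instance_status_ok").wait(InstanceIds=[instance_id])
        elapsed = time.time() - start
        ec2.terminate_instances(InstanceIds=[instance_id])  # keep the probe cheap
        return elapsed

    # Hypothetical AMI IDs; compare a Linux image against a Windows image.
    # print(time_to_start("ami-linux-placeholder"))
    # print(time_to_start("ami-windows-placeholder"))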

But this unpredictability may be unacceptable, especially for applications with stringent service level agreements (SLAs). What is a cloud consumer to do? We can push IaaS providers to guarantee the needed SLA outright, which, if it is possible at all, may come at a steep price. Or we can architect our application to achieve the needed SLA at the application level while using commodity cloud components at the infrastructure level.

Let’s borrow a page from the playbook of content delivery networks (CDNs), such as Akamai or Limelight Networks. These CDNs monitor the network conditions of various service providers, then meet custom transport guarantees by deciding which Internet links to use and how heavily to use them. The assignment is not static; it adapts to real-time measurements of the underlying network links. To meet custom SLAs on commodity cloud, we can extend the same overlay concept to enterprise applications.

First, we need to know what the cloud conditions are and how they vary over time. To find the state-of-the-cloud, simply use the cloud: start a few virtual machine instances, ping the network, run some workloads. One approach is to host the same webpage workload across many cloud providers and then monitor availability and response times across geographies (important because the wide-area network is sometimes the bottleneck). This information determines the best provider to use, which will change depending on where the users are and when we want to use the service.
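
The post does not name a monitoring tool. A minimal sketch of such a probe, with placeholder URLs standing in for the same webpage workload deployed on different providers and geographies, might look like this:

    import time
    import requests

    # Placeholder targets: the same small webpage workload hosted on several
    # providers and geographies.
    ENDPOINTS = {
        ("provider-a", "us-east"): "http://a-us-east.example.com/",
        ("provider-a", "eu-west"): "http://a-eu-west.example.com/",
        ("provider-b", "us-east"): "http://b-us-east.example.com/",
    }

    def probe(url, timeout=5.0):
        """Return (available, response_time_in_seconds) for one request."""
        start = time.time()
        try:
            available = requests.get(url, timeout=timeout).status_code == 200
        except requests.RequestException:
            available = False
        return available, time.time() - start

    def state_of_the_cloud():
        """One sweep; run it periodically and keep a history per provider/region."""
        return {target: probe(url) for target, url in ENDPOINTS.items()}

Run from vantage points near the users, the history of these probes is the state-of-the-cloud information the rest of this post relies on.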

We can then adjust enterprise applications according to these measurements. Operations like on-demand provisioning are essential design elements for combating the unreliability and instability of the lower-level cloud. For example, we can add a scaling rule like "if system utilisation hits 80%, then start one more new instance", or a detect-restart rule that monitors running nodes and automatically restarts a node when a failure is detected.
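
The post states these rules only in prose. A minimal sketch of what they might look like, with start_instance, is_healthy and restart_node standing in for provider-specific calls, is:

    SCALE_UP_THRESHOLD = 0.80  # "if system utilisation hits 80%..."

    def scaling_rule(utilisation, start_instance):
        """...then start one more new instance."""
        if utilisation >= SCALE_UP_THRESHOLD:
            start_instance()

    def detect_restart_rule(nodes, is_healthy, restart_node):
        """Monitor running nodes and restart any node that has failed."""
        for node in nodes:
            if not is_healthy(node):
                restart_node(node)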

By themselves, these rules do not guarantee any service level. But taken together with measurements of the underlying IaaS, the thresholds and settings of the rules can be mapped to application-level SLAs.

For example, to meet demand with high probability, instead of always triggering scaling at 60% utilisation, we trigger at 50% when the underlying IaaS reliability degrades. Or instead of starting one new instance, we start three when the time-to-start an instance gets longer. We need to adapt both the trigger and the response to the actual measurements of the state-of-the-cloud. In this scaling case, we adapt based on the time-to-start an instance, how reliably each instance delivers its expected capacity, and the demand.
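
Here is a sketch of how the trigger and response might be adapted; the 10-percentage-point tightening and the proportional step-up are illustrative choices, not figures from this post.

    def adapt_scaling_parameters(baseline_threshold, baseline_step,
                                 measured_reliability, expected_reliability,
                                 measured_time_to_start, expected_time_to_start):
        """Tighten the trigger and enlarge the response as the measured
        state-of-the-cloud degrades relative to what we planned for."""
        threshold, step = baseline_threshold, baseline_step
        if measured_reliability < expected_reliability:
            threshold -= 0.10  # e.g. trigger at 50% utilisation instead of 60%
        if measured_time_to_start > expected_time_to_start:
            # Each new instance takes longer to arrive, so start more per trigger,
            # e.g. three instead of one when the time-to-start has tripled.
            step = round(baseline_step * measured_time_to_start
                         / expected_time_to_start)
        return threshold, step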

Let’s take a look at a specific scenario. We need a service with 99.999% availability but the underlying infrastructure only provides 99.95% availability. What can we do? Here are our options:

  1. High Availability: running the workload on two independent 99.95% commodity services satisfies 99.999% (to be precise, the combined availability is 1 - (1 - 0.9995)^2 = 99.999975%).
  2. Detect-Restart: consider the detect-restart rule that restarts a machine when it fails. Taken in conjunction with estimates of the mean-time-before-failure (MTBF) and the mean-time-to-recover (MTTR) of an instance, we can map the availability, MTBF/(MTBF + MTTR), to the SLA. Using the time-to-start measurements above (recovery is slower for Windows than for Linux), this scheme achieves 99.95% for Windows and 99.99% for Linux. A worked calculation of both options follows this list.
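
Here is a short worked calculation of the two options. The 720-hour (30-day) MTBF and the recovery times are illustrative assumptions; the recovery times simply echo the time-to-start measurements above plus detection time.

    def parallel_availability(a, n=2):
        """Availability of n independent replicas, each with availability a."""
        return 1 - (1 - a) ** n

    def restart_availability(mtbf_hours, mttr_hours):
        """Steady-state availability under a detect-restart rule."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Option 1: two independent 99.95% services.
    print(parallel_availability(0.9995, 2))                         # 0.99999975, beyond five nines

    # Option 2: detect-restart on a single instance (illustrative 720 h MTBF).
    print(restart_availability(mtbf_hours=720, mttr_hours=4/60))    # ~0.9999, Linux-like recovery
    print(restart_availability(mtbf_hours=720, mttr_hours=20/60))   # ~0.9995, Windows-like recovery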

In other words, we can meet 99.95%, 99.99% and 99.999% under various schemes based on different underlying provider conditions. Using the measured conditions, we compare the high availability option with one that results in slightly less availability but also reduces the provisioned resources by half—are the extra 9s worth it?

Cloud can be predictably unpredictable, but we can meet high availability goals if we architect applications in the right way. Armed with state-of-the-cloud information, we can map our operations rules to meet custom SLAs. We adapt the parameters of our rules as the state-of-the-cloud varies to maintain fixed SLAs. In this case, the best defence is a good offence: If our virtual machine goes down, we turn it back on. If the capacity drops, we get another one. If our current provider doesn’t have one, or goes down entirely, we use another provider. As the cloud consumer, we monitor conditions to determine what to do and when to do it.

Why is there this service variation? It has something to do with shared resources being consumed by disparate workloads. But really, who cares? The question of "why" is better left to the cloud provider, who can actually do something about it. For the cloud consumer, the question should instead be "what are you going to do" to thrive in the predictably unpredictable.


Posted by Teresa Tung, Ph.D.