Amazon's S3 - Simple Storage Service - is supposed to operate with a four nines service level; that's 99.99 percent availability. Last week it did not, and the two failure types and longish fix time illustrate how a storage service can turn to storage starvation very quickly.
The S3 service is described by Amazon thus: 'Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers.'
It is cheap. You pay only for what you use. There is no minimum fee, and no start-up cost. The running costs are $0.15 per GB-Month of storage used and $0.20 per GB of data transferred.
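To put those two rates in context, here is a rough sketch of a monthly bill. It assumes flat pricing with no tiers or other charges, and the usage figures are hypothetical; the function and constant names are illustrative only.

```python
# Rates quoted above, in US dollars.
STORAGE_RATE_PER_GB_MONTH = 0.15
TRANSFER_RATE_PER_GB = 0.20


def monthly_cost(stored_gb, transferred_gb):
    """Estimate one month's bill from the two quoted rates.

    Assumes a flat rate with no tiers, minimums or other charges.
    """
    storage = stored_gb * STORAGE_RATE_PER_GB_MONTH
    transfer = transferred_gb * TRANSFER_RATE_PER_GB
    return round(storage + transfer, 2)


if __name__ == "__main__":
    # Hypothetical usage: 100GB stored and 50GB transferred in the month.
    print(monthly_cost(100, 50))
```

On those assumed figures the bill comes to $25 for the month: $15 for storage and $10 for transfer.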
The service quality is described with regard to availability and speed thus:
Store data durably, with 99.99 percent availability. There can be no single points of failure. All failures must be tolerated or repaired by the system without any downtime.
Amazon S3 must be fast enough to support high-performance applications. Server-side latency must be insignificant relative to Internet latency. Any performance bottlenecks can be fixed by simply adding nodes to the system.
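It is worth working out what 99.99 percent availability permits. The sketch below converts an availability figure into a downtime budget and back again. The measurement period is an assumption, since the quoted text does not say whether availability is reckoned over a month, a year or per request, and the function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_percent, period_days):
    """Minutes of downtime allowed over the period at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return period_days * MINUTES_PER_DAY * unavailable_fraction


def achieved_availability_percent(downtime_minutes, period_days):
    """Availability actually delivered, given total downtime in the period."""
    total_minutes = period_days * MINUTES_PER_DAY
    return 100 * (1 - downtime_minutes / total_minutes)


if __name__ == "__main__":
    # Assumed periods: a 365-day year and a 30-day month.
    print(downtime_budget_minutes(99.99, 365))
    print(downtime_budget_minutes(99.99, 30))
    # Hypothetical case: one 24-hour outage in an otherwise clean year.
    print(achieved_availability_percent(24 * 60, 365))
```

Measured over a 365-day year, four nines allows roughly 52.6 minutes of downtime, or about 4.3 minutes in a 30-day month. A single 24-hour outage in an otherwise trouble-free year would leave availability at about 99.73 percent.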
Last week S3 users noted problems in their forum:
On January 4th, at 7.30pm: "Is there something going on with s3 over the past day or so? Puts and retrievals have been going slower recently, and today they're almost intolerably slow. This needs to be fixed ASAP."
Other users followed suit. Amazon's Brian Flood replied to the forum about forty minutes later: "We are looking into this. Thanks for your patience while we investigate."
In the middle of the night Flood posted another message: "This issue has been fixed. We'll follow up on this thread with more details, but in the meantime, if you're still experiencing difficulties, please post to this thread. Thanks for your patience, and our apologies for any inconvenience this caused."
Unfortunately the access problems weren't fixed. During the morning of 5th January users reported file fetch errors such as this: "I've got the problem too. In some cases 80% of the images on my pages are not loading. Five minutes later it's ok, then it is down again, etc..."
It was still failing in the afternoon: "I am becoming increasingly aggrevated by this. Yesterday S3 was horridly slow and now today it is randomly failing to load images for my site. We are using it for a production business site with many customers whose pages are failing to display properly now. This needs to be fixed!" Users were experiencing Error 500-type messages.
Flood posted another message at 4:36pm on the 5th: "We are investigating the issues you are experiencing. Stay tuned to this thread for more information."
Some suspicion of Amazon's marketing claims regarding S3 surfaced in the forum: "Yes. I agree. Amazon needs to be more honest with us. I know there is no SLA, but they stated that the service is fault tolerant and expect to be up 99.99 percent of the time. Clearly that is not the case. AWS says that amazon.com uses the same service, I have seen multiple times where S3 is down and amazon.com is running quite well."
Fixed the second time
At 9.15pm on the 5th Flood posted his second fix notice: "We believe this issue has been resolved. Look for more details here. If you continue to experience problems, please post to this thread. Sorry for the inconvenience."
The S3 service suffered an outage lasting a little over 24 hours. Why was that?
The reasons became clear in a posting by the Amazon S3 team at 6:23pm on January 27th:-
"Dear Amazon S3 Developers,
"We wanted to provide a little more detail on the two related issues that affected some of our customers on Thursday and Friday.
"The Amazon S3 team has been adding large amounts of hardware over the past several weeks in order to meet and stay ahead of high and rapidly increasing demand. Unfortunately, our most recent hardware order contained several sub-standard machines. We have been working closely with our supplier to identify and resolve problems with the new hardware, but in the interim, we haven't been able to add as much hardware to our fleet as we intended.
"On Thursday morning, we discovered a networking problem in one of our data centers that was responsible for the high latencies that affected some of our customers. We would normally have routed around this data center while we restored it, which would have returned performance to normal within a few minutes. However, since we weren't carrying the excess capacity we would have liked (due to the hardware problem we mentioned above), removing a portion of our fleet from service was not a viable option, and we had to restore the data center in place. We took several steps to return the system to normal, including adding new hardware that had recently cleared our testing and burn-in process.
"These actions resolved the underlying problem, and performance returned to normal for several hours. However, even though we'd been rigorously testing all new hardware before adding it to our fleet, some of the new machines we added were defective and failed late on Thursday night. These machines appeared healthy and continued to accept requests, but were unable to process them. This caused the second issue: a period of heightened error rates for some of our customers. Removing the failed hosts from the system brought error rates back to normal.
"To be safe, we're removing all other hardware from the defective batch (even if apparently functioning well) and adding proven hardware from other parts of Amazon.com's fleet to replace the suspect hardware and increase our capacity buffer. We are also in the process of implementing software to detect and handle the unprecedented failure mode we saw on Thursday night.
"Thanks for your patience. We know that you and your businesses depend on Amazon S3, and apologize for any inconvenience this has caused. We remain committed to providing you with the customer experience that you expect.
"Sincerely, the Amazon S3 Team
There were good things about Amazon's response in general. First of all, it did respond, and responded quite well, even if the full explanation came a little late. Amazon is also to be commended for making this forum public. Certainly the responses posted after the second fix notice were complimentary and talked of rebuilding trust.
However, users of any hosted storage service will probably find this S3 customer's posting instructive: "We've switched to using s3 in production and we have millions of files on their servers now. We're paying a LOT of money for this service and need it to be stable and reliable. I'm not looking forward to moving everything off s3 to something else, but if it's not reliable, that's what we'll need to do."
It seems to me that a hosted storage service not offering guaranteed SLAs with monetary compensation for SLA failures ought to be thought about very carefully indeed, especially if you are thinking of committing production data to it.