Amazon’s S3 Web Service, our #1 cause of failure
July 21st, 2008 by Lukas Biewald
FaceStat uses Amazon’s S3 service to store and serve most images. Today S3 went down for 7 hours. During this time, FaceStat was completely broken: users couldn’t upload or view images, which is the point of the site.
Amazon first mentioned “elevated error rates” at 9:05am, but our own logs indicate S3 went down 20 minutes before that. We guess “elevated error rates” is the new euphemism for “we are completely f’d and taking you down with us.”
Using Amazon’s S3 has about the same cost and complexity as hosting the images ourselves, but we had assumed that Amazon would be significantly more reliable. That now seems wrong. Here’s a chart of this month’s FaceStat downtimes by cause:
It wasn’t just us — dozens or hundreds of useful websites were down today. SmugMug, the photosharing site, was broken. Avatars didn’t show up in Twitter. Drop.io, the excellent service we use to send and receive data from clients, was completely down. Scribd had no document data, rendering the site basically worthless. We decided to make the best of it and replace the homepage with an embedded flash game — since every feature of our site is broken, why not give our users something else to do?
How many sites will start moving off of S3 after this? SmugMug says they’re still happy with Amazon. But we’re not sure if this is warranted. According to Amazon’s SLA, even if the July uptime rate is below 99%, we only get a 25% discount off future AWS costs. This comes nowhere close to compensating us for the headaches they’ve caused us. It’s astonishing that serving content off our own boxes can be more reliable than serving content off of Amazon.
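To put rough numbers on that SLA complaint, here is a minimal sketch of the arithmetic. The 99% threshold comes from the SLA as quoted above; the 99.9% target is an assumption added for comparison, and the function names are ours, purely for illustration.

```python
# Back-of-the-envelope uptime arithmetic for a 31-day month.
# The 99.9% target below is an assumption; only the 99% threshold
# is quoted in the post itself.

HOURS_IN_JULY = 31 * 24  # 744 hours


def monthly_uptime_percent(downtime_hours, hours_in_month=HOURS_IN_JULY):
    """Percentage of the month the service was up."""
    return 100.0 * (hours_in_month - downtime_hours) / hours_in_month


def downtime_budget_hours(target_percent, hours_in_month=HOURS_IN_JULY):
    """Hours of downtime allowed while still meeting target_percent."""
    return hours_in_month * (100.0 - target_percent) / 100.0


if __name__ == "__main__":
    print(round(monthly_uptime_percent(7), 2))    # uptime after a 7-hour outage
    print(round(downtime_budget_hours(99.9), 2))  # budget at the assumed 99.9% target
    print(round(downtime_budget_hours(99.0), 2))  # budget at the 99% threshold
```

By this estimate, a single 7-hour outage in a 744-hour month leaves uptime at roughly 99.06%, which is far below a 99.9% target but still slightly above the 99% line, so the 25% tier would only apply if there were further downtime in the same month.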
Google App Engine, an Amazon Web Service competitor, appears to be just as bad if not worse.
(For other folks trying to figure out how to get off of Amazon, Park Place is interesting — a server that clones the S3 API, but hosts the content on your own machine.)
The decision whether to use a cloud service like S3 is more complicated than the hype makes it out to be. We have 150,000 images for about that many users, but it still fits on a single hard drive, so it’s not too hard to set up our own image system. But if you’re growing and have to start scaling, you start needing something like S3. On the other hand, S3 is supposed to be for small folks like us who don’t want to spend time worrying about administration of such a system. So in order for it to pay off, we need reasonably high guarantees of reliability. Amazon has failed to deliver.



July 21st, 2008 at 6:53 am
For gods sake, if your business is so fragile it cant withstand some downtime your product must be rubbish.
look at what people put up with for a product they love: twitter!
July 21st, 2008 at 7:18 am
I don’t think Park Place is a serious option, but you could look at mogilefs. It’s a very similar concept with some large production deployments.
July 21st, 2008 at 7:38 am
“For gods sake, if your business is so fragile it cant withstand some downtime your product must be rubbish.”
A 7 hour block is not “some” downtime. It’s a massive truckload of downtime. If we went down for even one hour my inbox would be full and the phone would be ringing constantly. If your product is good people use it a lot and are very unhappy when it’s taken away.
July 21st, 2008 at 8:17 am
I stopped using them long ago. The speeds were not impressive and it had frequent downtimes. I’ve moved everything to dedicated servers nowadays with high quality bandwidth and it costs less than S3 at high traffic :).
July 21st, 2008 at 8:22 am
Ben is an ass (or a troll). Anyone who’s had an online business knows how painful even a few moments’ downtime can be. You are entirely correct to be so outraged.
July 21st, 2008 at 8:53 am
You wouldn’t have considered by any chance that any system that gets that big is bound to fail at times. Even if you were using your own dedicated hosting service, listen because here is some news, it could fail. What impresses me is that you were naive enough not to design some kind of fallback* so that a hard failure like that could have been mitigated. You want cheap but you ask for NASA-style reliability, get a grip.
* You don’t mention it in your article, so I assume that even if it does exist it just didn’t actually help at all.
July 21st, 2008 at 10:08 am
Waaah waaah, quit whining you losers.
July 21st, 2008 at 12:04 pm
Instead of jumping ship entirely, wouldn’t it make more sense to keep taking advantage of what Amazon S3 has been offering you, but combine that with a backup plan? For example, install ParkPlace (or something like it) on some servers at a more traditional ISP. If S3 goes down again your system could automatically cut over to your backup. Sure you might be serving stuff a bit slower, but you’ll be serving stuff. If you switched to a normal ISP entirely then you’d have exactly the same problem you just had with S3 if the ISP went down for some reason. Much better to have multiple, disparate backups than be on any one system entirely.
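The automatic cut-over described here could be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_from_s3` and `fetch_from_backup` are hypothetical stand-ins for real storage clients, not any actual S3 or ParkPlace API, and a real setup would also need timeouts and a way to keep the backup in sync.

```python
from typing import Callable, Iterable, Optional

# A fetcher takes an object key and returns its bytes, or raises on failure.
Fetcher = Callable[[str], bytes]


class AllSourcesFailed(Exception):
    """Raised when neither the primary nor any backup could serve the key."""


def fetch_with_fallback(key: str, fetchers: Iterable[Fetcher]) -> bytes:
    """Try each source in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for fetch in fetchers:
        try:
            return fetch(key)
        except Exception as exc:  # broad on purpose: any failure means "try the next source"
            last_error = exc
    raise AllSourcesFailed(key) from last_error


# Hypothetical stand-ins, for illustration only.
def fetch_from_s3(key: str) -> bytes:
    raise ConnectionError("primary store unavailable")  # simulate the outage


BACKUP_STORE = {"faces/1.jpg": b"jpeg-bytes"}


def fetch_from_backup(key: str) -> bytes:
    return BACKUP_STORE[key]  # raises KeyError if the backup lacks the object


if __name__ == "__main__":
    data = fetch_with_fallback("faces/1.jpg", [fetch_from_s3, fetch_from_backup])
    print(len(data))
```

Ordering the list with the primary first means the backup is only touched when the primary fails, so the slower path costs nothing on a normal day.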
July 21st, 2008 at 12:13 pm
You get what you pay for. If you want to pay to host your own stuff on redundant machines, all power to you. If not, then tough rocks.
July 21st, 2008 at 12:28 pm
And exactly what sort of SLA do you have with Amazon S3? What sort of guarantee against downtime? Oh, best effort? Well, then stop your bitching. If you want five-nines, you’re going to pay through the nose for it, as well you should, as it is not easy.
I’ve noticed web guys and bloggers don’t usually understand this. “Well if I can keep Wordpress running on my local machine for testing 24/7, then you datacenter guys should have no problems!”
July 21st, 2008 at 1:44 pm
Masukomi, you’ve hit the nail on the head here. To take it a step further, cloud services need to be ‘aware’ of the instances you are running and gracefully handle failover by moving themselves around as needed. One thing blocking this right now is the lack of a way to quickly shunt the data from the Internet to an alternate location. DNS, load balancing, and databases haven’t matured to that point yet, but they are getting close.
July 21st, 2008 at 1:58 pm
Magic Man: Read what he wrote. The SLA is 99.9% monthly. That’s what we pay for. Not “best effort”.
July 21st, 2008 at 1:59 pm
Oh, maybe he didn’t write that explicitly, but still. Do some research.
July 21st, 2008 at 2:32 pm
you know, stuff goes down. while it is unfortunate that Amazon’s service was out for 7hrs, I’m sure they had staff on-location and more staff on call, working furiously to get the job done. if you were hosting your own content, and it dropped at say, 1am local time, would you even KNOW about it until about 7hrs later?
IT is a business of reducing risk and managing resources. although you may not be happy with Amazon’s service, I challenge you to provide yourself with 24/7/365 hosting.
now, with that said, I am a little unsure how one of the largest organizations on the web lost so much connectivity… terrorism? ;)
July 21st, 2008 at 2:40 pm
Lala, did he not get his service credit as it says in the SLA? If he did, what is he bitching about? Downtime happens. Don’t put all of your eggs in one (very large) basket if uptime is important to you.
More typical webguy thinking. “If my computer can stay on and downloading stuff from the network, it must be really easy to keep a system on the Internet all of the time.”
From his post: “According to Amazon’s SLA, even if the July uptime rate is below 99%, we only get a 25% discount off future AWS costs. This comes nowhere close to compensating us for the headaches they’ve caused us.” Well, then obviously, you aren’t spending enough time/money/manpower to make sure that your content is always available. Stop being a cheap-ass without a failover plan. If it really cost as much as they say it did, it would be worth it to spend more on having multiple sites on multiple networks to avoid losing money. Apparently they just learned that lesson the hard way.
July 21st, 2008 at 2:53 pm
More webguy thinking “What’s with all this ‘web guy’ talk? bittermuch?”
July 21st, 2008 at 2:54 pm
How about using S3 when it’s up and having a local Park Place mirror you can swap out if needed?
July 21st, 2008 at 4:02 pm
Did amazon give you any explanation about why the service went down? Do you have any dialog with them at all? I don’t think I’d be concerned about 8 hours of downtime due to an equipment failure (when I ran a datacenter the best we could get from IBM was a 4-hour replacement part guarantee, so 8 hours of downtime for a failed RAID controller or something isn’t really unreasonable), but I *would* be concerned if the problem was tied to the software that provides the service; say, if another user was able to take the whole system down, or if something about the hosting environment was prone to error. I don’t think being angry about the downtime is going to be productive for anyone, but I do think finding out something about the weak points of your hosting provider can give you the power to make good choices for your application. If hosting it yourself is more reliable or at least more controllable for you, and that’s important to you and your business, then that’s what you have to do. I’m sorry it took a service outage for you to consider that problem, but at least now you know the answer that makes the most sense for your business.
July 21st, 2008 at 4:35 pm
Well at least Amazon S3 is back up and has not lost any data!
Nirvanix, an Amazon S3 competitor, lost half of MediaMax’s files and put them out of business:
http://www.techcrunch.com/2008/07/10/mediamaxthelinkup-closes-its-doors/#comments
What a disaster.
July 21st, 2008 at 4:41 pm
This is something any online business deals with, it’s always better to host it yourself but not really cost effective for a business that’s just starting. As a web business owner currently experiencing inexplicable down time I feel your pain.
If you’re not responsible for getting it to work again it does give you an excuse to go outside for a little while.
July 21st, 2008 at 5:46 pm
Guys, cloud computing is still what should be called beta. Massively parallel systems are difficult and this is going to take a while to get right. If you put all your data on one of these sites, or pin your business model to theirs, you are asking for trouble. Don’t put mission critical applications on new, unproven technology with no failover plan and then start whining that you aren’t getting what you wanted. It’s new, it’s cool, it’s going to be the future, but right now it’s still going through growing pains.
July 21st, 2008 at 9:55 pm
Amazon S3 Down for 7 Hours; S3 Clients Looking for Exit…
Lukas Biewald lays bare his frustrations with Amazon’s S3 service, particularly after the recent S3 service outage that left his FaceStat business offline for more than 7 hours. Actually, Lukas has double posted on this issue - he has…
July 22nd, 2008 at 1:45 am
Yes. Did you read the Amazon SLA information before you started using their service? S3 going down is the fault of Amazon’s engineering team. Your site going down is the fault of yours. You signed up and used a service that gives you tiny rebates for downtime rather than true uptime guarantees and you got what you deserved. Quit whining.
July 22nd, 2008 at 2:04 am
I agree that this deep and long of an outage is not acceptable, even for a service with tempered service level.
However, your “this months FaceStat downtimes by cause” chart is disingenuous and useless. Foremost because looking at the last weeks of data immediately after a huge outage skews the picture severely. Why not show us since launch? Or several other months’ charts for comparison? Also, I don’t get the sense that FaceStat’s load is so heavy as to
> It’s astonishing that serving content off our own boxes can be more reliable than serving content off of Amazon.
Hypothetically, yes. Necessarily? No.
Are you going to get much more reliable at serving content in the next few years? No. Is Amazon S3 going to? Probably. 99.99% is a primary design goal, and unless I’m misinformed, they’ve been hitting that most months. This is obviously a disaster for them, but be real.
Or, you know, jump ship and be sure to also write posts about how unreliable or expensive self-hosting is when it eventually bites you in the ass or becomes too costly to sustain.
July 22nd, 2008 at 2:06 am
Should’ve been:
Also, I don’t get the sense that FaceStat’s load is so heavy as to … stress your other hosting service to the point of failure much.
July 22nd, 2008 at 3:05 am
Get a grip !!! It’s not like you’re running a stock brokerage system on Wall St !
My experience is that it’s not the hosting computers that are the reliability issue, but the ISP who provides the connection. Getting any level of service guarantee is next to impossible. When I build “five nines” systems for clients, we end up putting data centers in multiple cities using multiple carrier connections. It costs a heck of a lot more than what Amazon charge for S3. It’s like you paid for a cheap car, so don’t whinge about the rattles.
July 22nd, 2008 at 3:14 pm
Have you considered using S3 as a backup provider with a local disk as cache? I can envision software that emulates a hard disk with a caching mechanism that copies to and from S3 in the background, as files are requested / uploaded.
I do think though that you haven’t load tested your own storage solution. I have a friend who works for Seagate and hard drives fail ALL THE TIME. It’s just the nature of the beast.
July 22nd, 2008 at 7:15 pm
Look into Akamai, they’ll do this correctly (disclaimer: I’ve worked for them). Or implement their system (quickly deteriorating DNS endpoint lookups) yourself, it’s not hard if your scope is limited. You can just have S3 be your primary and when it fails, switch DNS over to your backup.
July 24th, 2008 at 8:48 am
Isn’t this one of the risks you take in exchange for the convenience of using a centralized system like S3? Unfortunate, but bound to happen and something you need to look into when performing risk assessment.
July 25th, 2008 at 2:48 am
I understand your frustration. However, anytime you use hosted services (even those from a trusted provider) you should plan on implementing some redundancy. I know this can be costly for small orgs, but it could have minimized the impact of this occurrence.