By Ian Withrow
Those who know me well can tell you that I’m hardly a frothy fanboy; indeed, I’m a dyed-in-the-wool skeptic. So it may come as a surprise to hear that I view the fallout of the recent Amazon Web Services (AWS) outage as a very positive sign for Cloud Computing. Sure, some sites got taken down, including one of my personal favorites, Quora. However, another favorite site of mine managed to survive the incident with comparatively minor hiccups: Netflix. This is the bright spot I want to highlight. I just happened to have a performance measurement for Netflix in my Keynote account. On the East Coast, starting at 12 a.m. on April 21st, Netflix’s response time for successful transactions stayed at a consistent couple of seconds, and the site was available 96% of the time. Granted, this isn’t flawless execution: the 27 failed data points are all timeouts that resulted in just a red screen. However, compared to what happened to many sites, this is outstanding. (Y-axis details obscured)
It’s not dumb luck that got Netflix off this easily. It’s the product of hard work and engineering time invested in building their Amazon Web Services deployment the right way. As Netflix has been touting at various cloud conferences this year, their tremendous growth has forced them to fully embrace AWS. Basically, the only thing they still run in their private network is credit card transactions. To ensure they always have enough capacity (and, incidentally, high availability), they have turned provisioning decisions over to their operational systems. Whenever an Amazon instance is performing poorly, they terminate it and get a new one. Likewise, if an availability zone is acting up (as happened here), they automatically switch over to another.
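To make that replace-and-switch policy concrete, here is a minimal sketch of what such an operational loop might look like. To be clear, this is not Netflix's tooling and it makes no real AWS calls: the `Instance` record, the latency cutoff, the zone-failure ratio, and the `fake_launch` stand-in are all hypothetical names and numbers picked purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for this illustration.
LATENCY_LIMIT_MS = 500      # above this, an instance counts as "poorly performing"
ZONE_FAILURE_RATIO = 0.5    # share of bad instances that marks a zone as "acting up"


@dataclass
class Instance:
    name: str
    zone: str
    latency_ms: float


def is_healthy(inst):
    return inst.latency_ms <= LATENCY_LIMIT_MS


def failing_zones(fleet):
    """Return the zones where at least ZONE_FAILURE_RATIO of instances are unhealthy."""
    totals, bad = {}, {}
    for inst in fleet:
        totals[inst.zone] = totals.get(inst.zone, 0) + 1
        if not is_healthy(inst):
            bad[inst.zone] = bad.get(inst.zone, 0) + 1
    return {z for z, n in totals.items() if bad.get(z, 0) / n >= ZONE_FAILURE_RATIO}


def reconcile(fleet, all_zones, launch):
    """Build a new fleet: replace bad instances and move everything out of failing zones.

    `launch` is any callable that takes a zone name and returns a fresh Instance.
    """
    down = failing_zones(fleet)
    good_zones = [z for z in all_zones if z not in down]
    if not good_zones:
        raise RuntimeError("no healthy zone left to fail over to")
    new_fleet = []
    for i, inst in enumerate(fleet):
        if inst.zone in down:
            # Zone-level failover: relaunch in a healthy zone, spread round-robin.
            new_fleet.append(launch(good_zones[i % len(good_zones)]))
        elif not is_healthy(inst):
            # Instance-level replacement: terminate and get a new one in the same zone.
            new_fleet.append(launch(inst.zone))
        else:
            new_fleet.append(inst)
    return new_fleet


def fake_launch(zone):
    """Stand-in for a real provisioning call; always returns a healthy instance."""
    return Instance(name=f"new-{zone}", zone=zone, latency_ms=100.0)
```

Calling `reconcile(fleet, zones, fake_launch)` on every monitoring cycle keeps capacity constant while draining a misbehaving zone. The point of the sketch is simply that nobody has to be paged to make the provisioning decision.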
This is how real high availability has always been done in networking: ensure that you can automatically fail over to logically, physically, and geographically separate resources. Any real engineer will tell you that problems and failures will happen. Your availability track record is based not on how frequently they occur but on how gracefully you recover from them.
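One common way to put numbers on the recovery point is the textbook steady-state availability formula, MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to recover. The formula is standard, but the figures in the sketch below are made up for illustration and are not measurements from this outage.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical comparison: both sites fail about once every 30 days (720 hours).
# One needs 8 hours of manual recovery; the other fails over automatically in 5 minutes.
manual = availability(720, 8)
automatic = availability(720, 5 / 60)

print(f"manual recovery:    {manual:.4%}")
print(f"automatic failover: {automatic:.4%}")
```

With identical failure rates, the site that recovers in minutes lands at roughly 99.99% while the one that takes hours sits near 98.9%. Under these assumed numbers, the whole gap comes from recovery time.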
Herein lies the promise of Cloud Computing: namely, the favorable relationship between cost and failover capabilities. In a private network world you would have to build and pay for a lot of stuff yourself: multiple data centers, double the hardware, internet access connections on opposite sides of the building, etc. Very quickly the cost of high availability gets prohibitive, locking out all but the deepest of pockets. Netflix explicitly said at Cloud Connect that they came to the conclusion that, even with all their growth, they just weren’t big enough to justify building their own network of redundant data centers.
Enter Cloud Computing. Now getting access to redundant data centers is just a matter of purchasing the right performance monitoring tools and investing the engineering time to program your applications and operational systems to take full advantage of on-demand resources. In the end you pay only for the infrastructure you use, not for what you might need, as is the case when doing it yourself. That’s both the real shame and the real promise highlighted by this outage: young companies like Quora and Foursquare could easily have done just what Netflix has done. The barrier to entry here isn’t a huge budget but the knowledge and priorities to do the work. The next step, of course, after fully leveraging Amazon is to be able to fail over to different cloud providers. I’d bet you $100 Netflix is working on exactly this right now.
In a way this drives home a point we’ve known all along. Cloud Computing is not outsourcing, because outsourcing implies a transfer of risk and responsibility. You, not Amazon or Microsoft or Google etc., are responsible for the performance of your applications, whether they are in the cloud or not. Cloud Computing is a powerful tool for increasing performance and availability manyfold while reducing costs, if it’s used correctly. If you don’t use the tool properly, then an outage isn’t Amazon’s fault; it’s yours. I’ll leave you with this thought: Amazon seems to agree. According to Gartner analyst Lydia Leong, this isn’t an outage that generates service credits. (Quote at very end of article)