Amazon's AWS: Not Too Big To Fail
Bigger is not necessarily safer.
Earlier this year, Amazon, that awesome river of e-commerce and cloud computing, sprang a leak. It made news.
When a keystroke “oops” from an AWS engineer took down the world’s largest public cloud for five hours on the last day of February, the glitch exposed a fundamental truth: the bigger you are, the harder you fall.
Those on the receiving end of that typo-generated AWS outage were not amused. It turns out that Amazon, the cloud industry’s Kong with nearly a third of the market, is accountable neither to its users nor to the internet at large. The moral is simple: bigger is not necessarily safer. Or as Yaron Haviv, co-founder of Israel-based big data cloud provider iguazio, put it in SiliconANGLE, “the real question is: why have we created such a dependency on services such as AWS?”
AWS fell on the mighty and the masses alike. With an estimated one-third of all internet traffic passing through AWS servers, sites from Slack to Quora to the U.S. Securities and Exchange Commission were out of commission for much of prime time that Tuesday.
Keep in mind that this wasn’t the result of a hack. It was instead a self-inflicted wound. There was no volumetric attack, no nefarious geek from Moldova anywhere in sight. Said Haviv: “What [Amazon’s statement is] saying is that big chunks of the internet depend on just one or two local services to function.”
According to Gizmodo, “in theory, a series of fail-safes should keep the fallout from such errors localized, but Amazon says that some of the key systems involved hadn’t been fully restarted in many years and ‘took longer than expected’ to come back online. Amazon says that its S3 service is ‘designed to deliver 99.999999999 percent durability’ and ‘99.99 percent availability of objects over a given year.’ But when one piece of the infrastructure fails, AWS fails big.” And that makes Amazon a giant dust cloud.
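To put the quoted 99.99 percent availability figure in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes a 365-day year and treats the five-hour outage as total unavailability, which simplifies whatever definition Amazon applies to "availability of objects." The function names are illustrative.

```python
# Rough arithmetic only; assumes a 365-day (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_after_outage(outage_minutes: float) -> float:
    """Yearly availability (percent) if this outage were the only downtime."""
    return 100 * (1 - outage_minutes / MINUTES_PER_YEAR)


budget = allowed_downtime_minutes(99.99)       # yearly downtime budget
outage = 5 * 60                                # the five-hour outage, in minutes
achieved = availability_after_outage(outage)   # best case for that year

print(f"Downtime budget at 99.99%: {budget:.1f} minutes per year")
print(f"Outage length: {outage} minutes")
print(f"Best-case availability for the year: {achieved:.4f}%")
```

Under these assumptions, a 99.99 percent target leaves less than an hour of downtime per year, so a single five-hour outage consumes more than five years of that budget.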
With all due respect to the AWS organization, just try parsing this, from Amazon’s official statement on the outage: “The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate.”
Where mega providers like Amazon are concerned, no one knows what’s under the covers. People assume Amazon — and Microsoft and IBM, for that matter — are doing things the right way, but the lack of transparency is precisely the problem. Amazon hadn’t rebooted its systems in years?
AWS could have put customers in separate silos, but opted for one big pool in that geography. Mega providers go out of their way to use homegrown products, the designs of which remain trade secrets, and any efficiency gains they achieve can be undone by an absurdly minor human error. Bottom line: just because Amazon and Microsoft are big doesn’t mean they’re safe. Amazon’s share of the public cloud market currently stands at more than 30%, with Microsoft at 9% and growing rapidly, and IBM SoftLayer at 7%. And, as recent earnings reports indicate, this rather large faux pas hasn’t hurt Amazon’s share price. But size and Wall Street wizardry don’t equal smart.
It does matter what products and architecture a provider chooses. Smaller companies are by definition much more transparent, much more open to demonstrating to customers that they’re in a safe place. Some AWS users now want to know how to use Amazon to protect themselves from Amazon’s failures. The irony is just too rich.
It’s conceivable that “fail-safe” no longer has any meaning. Aiming for market dominance, Amazon neglected not only the computing masses it wants most fervently to woo, but also the corporate mandarins whose loyalty, up to now, has been unquestioned. In a climate where DDoS attacks are relentless and wreaking havoc, no provider, least of all the largest among us, can afford to allow wayward fingers to slam into the S&P, as they did earlier this year.
Fact is, multiple geographies need multi-provider redundancy. Where was AWS’s disaster recovery? It’s not a rhetorical question. The Sarbanes-Oxley Act mandates that businesses understand risk, like online outages, and take steps to ensure business continuity.
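What multi-provider redundancy might look like at the application level can be sketched in a few lines. This is a minimal illustration, not a description of any vendor's API: the provider names and the two stand-in fetch functions are hypothetical, and a real deployment would also have to keep data replicated across providers.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider fails to serve the request."""


def fetch_with_failover(
    key: str,
    providers: Sequence[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each provider in order; return (provider name, data) from the first success."""
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure means: move on to the next provider
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for two independent storage providers.
def primary_fetch(key: str) -> bytes:
    raise ConnectionError("region unavailable")  # simulates the outage


def secondary_fetch(key: str) -> bytes:
    return f"contents of {key}".encode()


used, data = fetch_with_failover(
    "reports/q1.csv",
    [("primary", primary_fetch), ("secondary", secondary_fetch)],
)
print(used, data)
```

The point of the design is simply that the caller never depends on a single provider being up; the order of the list expresses preference, not dependence.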
In my view, the lesson here is that the biggest players in the game need to clear the air, get out from under that dust cloud, and model both transparency and accountability. Nobody is too big to fail, and when the behemoths fall, they tend to land on anyone and everyone.