After years of hype, the IT industry finally had a rude awakening this spring, reminding us that cloud computing infrastructures are vulnerable to the same genetic IT flaw that plagues traditional data center operations: Everything fails sooner or later.
In March, a magnitude 8.9 earthquake and the subsequent tsunami disrupted power supplies and network connectivity to data centers across Japan, prompting Japanese companies to rethink their traditional disaster recovery strategies. Several weeks later, the EBS system in one of Amazon’s EC2 data centers in the Eastern U.S. failed due to a faulty router upgrade and a cascade of resulting events, sending hundreds of customers, including many Web 2.0 companies such as Foursquare and Reddit, scrambling to resume service.
Ironically, these events also highlight how cloud infrastructures, when managed correctly, actually provide unprecedented capabilities to deliver high availability, resiliency and business continuity in IT operations.
Planning for Failure in the Cloud
Protecting your organization from unplanned downtime depends largely on building redundancy and diversity directly into your disaster recovery and business continuity systems. Business systems need to be able to run on a number of different infrastructures (whether public clouds such as Amazon or Rackspace, or private clouds using traditional on-premises hardware) and to fail over between them quickly and efficiently as necessary.
Despite the Amazon outage, public clouds now provide organizations with an impressively wide array of options to implement business continuity at a level of affordability that simply did not exist a few years ago. Consider this: Right now, from my laptop, I can launch servers in a dozen disparate locations worldwide – including the U.S., Europe, and Asia – for pennies per hour. As a result, I can design a system for my business that can reasonably withstand localized outages at a lower cost than previously possible.
The key is to design your infrastructures for the possibility of failure. Amazon’s CTO Werner Vogels has been preaching this religion for many years, suggesting the only way to test the true robustness of a system is to ‘pull the plug.’ Netflix — itself a major cloud infrastructure user — has created a process it calls “the Chaos Monkey” that randomly kills running server instances and services just to make sure the overall system continues to operate well without them. Not surprisingly, Netflix’s overall operation saw little impact from the AWS U.S. East outage when it occurred.
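To make the "pull the plug" idea concrete, here is a minimal sketch of a Chaos Monkey-style test. It is not Netflix's implementation: the Instance and Service classes, the min_healthy threshold, and the kill_random_instance helper are all hypothetical names invented for illustration, and the "instances" are in-memory objects rather than real cloud servers.

```python
import random


class Instance:
    """Hypothetical stand-in for a running server instance."""

    def __init__(self, name):
        self.name = name
        self.alive = True


class Service:
    """Hypothetical service that counts as healthy while at least
    `min_healthy` of its instances are alive."""

    def __init__(self, instance_names, min_healthy=1):
        self.instances = [Instance(n) for n in instance_names]
        self.min_healthy = min_healthy

    def alive_instances(self):
        return [i for i in self.instances if i.alive]

    def is_healthy(self):
        return len(self.alive_instances()) >= self.min_healthy


def kill_random_instance(service, rng):
    """Terminate one randomly chosen live instance.

    Returns the terminated instance, or None if nothing is left to kill.
    """
    candidates = service.alive_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.alive = False
    return victim


if __name__ == "__main__":
    rng = random.Random()  # pass a seed here for repeatable runs
    service = Service(["web-1", "web-2", "web-3"], min_healthy=2)
    victim = kill_random_instance(service, rng)
    print(f"killed {victim.name}; service healthy: {service.is_healthy()}")
```

The point of the exercise is the assertion you make afterward: with three instances and a threshold of two, losing any single instance should leave the service healthy. If it does not, the design has a single point of failure you would rather find on a quiet afternoon than during an outage.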
Implementing failure-resilient systems isn’t easy. How can you quickly move your operations from one infrastructure to the next when the pressure is on and the alarm bells are ringing? How do you design a system that not only allows new compute resources to begin to operate as part of your service, but also folds in an up-to-date copy of the data your users and customers depend on?
Redundancy and Automation in the Cloud
There is, of course, no magic bullet. But there is a general approach that does work: combining redundancy in design with automation in the cloud management layer. The first step requires architecting a solution that uses components that can withstand failures of individual nodes, whether those are servers, storage volumes, or entire data centers. Each component (e.g. at the web layer, application layer, data layer) needs to be considered independently, and designed with the realities of data center infrastructure and Internet bandwidth, cost and performance in mind. Solutions for resilient design are almost as many and varied as are the software components they utilize. For example, databases alone comprise a wide range of approaches and resiliency characteristics, including SQL, NoSQL, replication, caching technologies, etc.
But the secret sauce really comes in how your architecture is operated. Which parts of the system can respond to failure automatically, which can respond nearly automatically, and which cannot respond at all? To be more specific, if a given cloud resource goes down, be it a disk drive, a server, a network switch, a SAN, or an entire geographical region, how seamlessly can you launch or fail over to another and keep operations running? Ideally, of course, the more that’s automated (or nearly so), the better your operational excellence.
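One simple way to picture automated failover is a priority-ordered list of infrastructures, each with a health check, where traffic goes to the first one that reports healthy. The sketch below assumes exactly that model. The Infrastructure type, the pick_active function, and the cloud names are hypothetical, and the health checks are plain functions standing in for whatever real probe (HTTP check, API call, heartbeat) you would use.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Infrastructure:
    """Hypothetical description of one place the system can run."""
    name: str
    health_check: Callable[[], bool]  # returns True when usable


def pick_active(infrastructures: List[Infrastructure]) -> Optional[Infrastructure]:
    """Return the first infrastructure, in priority order, whose health
    check passes, or None if every one of them is down."""
    for infra in infrastructures:
        try:
            if infra.health_check():
                return infra
        except Exception:
            # A health check that raises is treated the same as "down".
            continue
    return None


if __name__ == "__main__":
    # Simulated state: the primary region is down, the secondary is up.
    primary = Infrastructure("primary-cloud", lambda: False)
    secondary = Infrastructure("secondary-cloud", lambda: True)

    active = pick_active([primary, secondary])
    print("active:", active.name if active else "none available")
```

Selecting a target is the easy part. The harder questions raised above, such as bringing up compute in the new location and attaching an up-to-date copy of the data, still have to be answered by the architecture itself.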
Achieving that level of automation requires that your system design and configuration be easily replicable. Servers, for example, need to be quickly re-deployable in a predictable fashion across different cloud infrastructures. It’s this automation that gives organizations the life-saving flexibility they need when crisis strikes. Our own RightScale ServerTemplate methodology, as an example, provides this re-deployment capability, allowing a server brought down by an outage to be relaunched in another cloud in a matter of minutes.
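The general idea of a replicable server definition can be sketched as a cloud-neutral description of what a server should be, combined with a small table of per-cloud details. This is an illustration of the concept only, not RightScale's ServerTemplate format or API. ServerSpec, CLOUD_SETTINGS, launch_plan, the cloud names, and the image and size values are all made-up placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ServerSpec:
    """Cloud-neutral description of a server role (hypothetical)."""
    role: str
    packages: List[str]
    boot_scripts: List[str]


# Placeholder per-cloud details; real values would come from each provider.
CLOUD_SETTINGS: Dict[str, Dict[str, str]] = {
    "cloud-a": {"image": "image-a-placeholder", "size": "size-a-placeholder"},
    "cloud-b": {"image": "image-b-placeholder", "size": "size-b-placeholder"},
}


def launch_plan(spec: ServerSpec, cloud: str) -> dict:
    """Combine one server spec with one cloud's settings into a launch plan.

    Only the cloud-specific fields differ between plans, so the same
    spec yields the same software setup wherever it is launched.
    """
    if cloud not in CLOUD_SETTINGS:
        raise ValueError(f"unknown cloud: {cloud}")
    settings = CLOUD_SETTINGS[cloud]
    return {
        "cloud": cloud,
        "image": settings["image"],
        "size": settings["size"],
        "role": spec.role,
        "packages": list(spec.packages),
        "boot_scripts": list(spec.boot_scripts),
    }


if __name__ == "__main__":
    web = ServerSpec("web", ["nginx"], ["configure_web.sh"])
    for cloud in CLOUD_SETTINGS:
        print(launch_plan(web, cloud))
```

Keeping the role definition separate from the provider details is what makes redeployment predictable: moving to another cloud changes a lookup key, not the server's configuration.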
Customizable Best Practices in the Cloud
The right cloud management solution should simplify the process of launching entire deployments through customizable best practices. It should also provide complete visibility into all infrastructures through a central management dashboard – a ‘single pane of glass’ – through which administrators can monitor performance and make capacity changes based on real-time needs. The same automation and control that lets organizations add or remove servers as demand rises and falls also allows them to migrate entire server deployments to a new infrastructure when disaster strikes.
The fallout from the Japanese earthquake and Amazon outage is being felt throughout the business community and is causing organizations to rethink how they ensure business continuity. Cloud architecture provides the distributed structures necessary to counteract regional disasters, but companies also need the cloud management capabilities necessary to fail over their operations to multiple infrastructures in a way that keeps things up and running.
Some may have thought the cloud was a magic bullet. It’s not, and that’s actually good news. By recognizing one of the original founding principles of cloud architectures — that everything fails at some point — businesses are now in a position to design and build services that are more resilient than in the past, at a fraction of the cost. With the right architecture and management layer, cloud-based services can provide unparalleled disaster protection and business continuity.
Michael Crandell is CEO of RightScale, the leader in cloud computing management.