Wednesday, January 18, 2012

Disaster Recovery v. High Availability

A question I often receive from customers and colleagues is:  If ICPSR has a replica of its production delivery system in Amazon's cloud, why is it that the web site is sometimes down due to scheduled maintenance or unplanned outages?

The short answer is:  ICPSR's cloud replica serves a disaster recovery (DR) purpose, but not a high availability (HA) purpose.  More often than not, this generates a look that falls somewhere between "Bah!" and "This sounds like some made-up IT nonsense!"  However, it really is the answer.  That raises the question:  What's the difference between DR and HA?  But first, a trip back in time....

As some long-time ICPSR clients may recall, the ICPSR delivery system was off-line for nearly a week during the holiday break between 2008 and 2009.  The root cause was a long power outage due to a major ice storm in the Midwest which knocked out power to many homes and businesses, including many in Ann Arbor.  And because ICPSR resides in a building just a little bit off the University of Michigan's central campus, we're just like any other home or business that waits for DTE Energy to restore power.

As one might expect, both the ICPSR Director at the time, Myron Gutmann, and I were quite anxious for the power to come back.  The storm had caused so much damage that it wasn't at all clear when the building's power would be restored.  And, after the first few days without power - and heat - the building's pipes were in danger of bursting.  Things were looking pretty bad.

However, as it turned out, we had been experimenting with Amazon's new computing and storage cloud just prior to the storm.  It would be pretty easy to stand up a minimal web server in Amazon's cloud, something that would basically say "Yes, we know our delivery system is down, and we're sorry about that.  And here's the best guess from the local power company about when power will be restored."  We then worked with some of our colleagues at the University of Michigan and the San Diego Computing Center to update the system that maps names to network addresses so that the URLs for ICPSR's web site would point to this new, minimal web server in Amazon's cloud.  That didn't fix the problem, of course, but it let people know that ICPSR knew there was a problem, and it shared the best information we had about it.

Once power was restored and the main delivery system came back on-line, I had a long conversation with Myron about how we wanted to position ICPSR for any future problem like this.  What if the building lost power again for an extended period?  What if a tornado knocked down the whole building?  What if the U-M suffered some catastrophic problem with its network?

One option was to change the architecture of ICPSR's delivery systems.  Rather than having a complex series of simple web applications, we could redesign and rebuild the whole system so that it would also contain a middle layer of technology that would catch and route incoming requests to one of many delivery system components.  And rather than having a single production system at the University of Michigan, we would build a multi-site production system spread across multiple network providers and service providers so that no single problem would disrupt services.  This is essentially the high availability (HA) version of ICPSR's delivery system.  It would have the virtue of providing true 99.99%+ reliability, but would cost plenty of money to design, build, and operate.  If you are running IT systems for a bank or a hospital or an aircraft carrier, you build them with HA.  But what about a data archive?

Another option was to keep the ICPSR delivery architecture the same, but replicate it somewhere off-site.  Automated jobs could keep the web content, data content, and web applications synchronized.  And an easy - but manual - process could be used to redirect traffic to the replica when needed.  In this world there would still be plenty of times where a component of ICPSR's delivery system might be off-line due to maintenance or a fault, but if the maintenance or fault was long-lived, then the replica could be pressed into service.  This type of solution would be inexpensive to design, deploy, and operate, and would deliver a credible disaster recovery (DR) story, but would probably only give us uptime somewhere between 99.0% and 99.9%.  Would that be good enough?
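To put those percentages in concrete terms, here is a minimal Python sketch that converts an availability figure into the downtime it allows.  The function name and the 30-day month are assumptions made for illustration; nothing here describes how ICPSR actually measures uptime.

```python
def allowed_downtime_hours(availability_pct: float, period_hours: float = 30 * 24) -> float:
    """Hours of downtime permitted in a period at a given availability percentage.

    The default period is an assumed 30-day month (720 hours).
    """
    return period_hours * (1 - availability_pct / 100)


# Compare the DR range (99.0% to 99.9%) with the HA figure (99.99%).
for pct in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.0f} minutes) per 30-day month")
```

Under these assumptions the DR range works out to somewhere between about seven hours and about 43 minutes of downtime a month, while 99.99% leaves only a few minutes.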

In the end, of course, we decided that the best use of resources would be to build a system that would still have some outages from time to time, but which would never again be off-line for an entire week.  We set an availability goal of 99.5% for each month across all components.  That is, every time a single component faults - search, download, online analysis, and so on - it counts against the uptime of the WHOLE system.  And we would leave it up to the judgement of the on-call engineer to decide when a problem was likely to be long-lived enough to warrant a switch to the replica.
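One way to read the "every component counts against the whole system" rule is sketched below in Python.  This is an illustration under stated assumptions, not ICPSR's actual monitoring: the function name, the outage records, and the choice to count overlapping outages only once are all hypothetical.

```python
from datetime import datetime, timedelta


def system_availability(outages, period_start, period_end):
    """Percent of the period in which no component was down.

    `outages` is a list of (start, end) datetimes pooled across all
    components.  Overlapping outages are merged so that time when two
    components are down together is counted once (an assumption).
    """
    # Clip each outage to the reporting period and sort by start time.
    clipped = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end in outages
        if start < period_end and end > period_start
    )
    down = timedelta()
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # A new, non-overlapping outage: bank the previous one.
            if cur_end is not None:
                down += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps the current outage: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        down += cur_end - cur_start
    return 100 * (1 - down / (period_end - period_start))


# Hypothetical outages for one month (made-up data).
month_start = datetime(2012, 1, 1)
month_end = datetime(2012, 2, 1)
outages = [
    (datetime(2012, 1, 5, 1), datetime(2012, 1, 5, 3)),     # search
    (datetime(2012, 1, 5, 2), datetime(2012, 1, 5, 4)),     # download, overlaps search
    (datetime(2012, 1, 20, 10), datetime(2012, 1, 20, 11)), # online analysis
]
availability = system_availability(outages, month_start, month_end)
print(f"{availability:.2f}% (goal: 99.5%)")
```

In this made-up month the three component faults add up to four hours of whole-system downtime, which is just enough to miss a 99.5% goal in a 31-day month (the budget is about 3.7 hours).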

So we chose DR instead of HA.

Looking back, my sense is that we made the right decision.  In practice we seem to hit our 99.5% availability goal most months, and because we did not tie up our software and systems development resources on rebuilding the delivery system to guarantee HA, we were able to design and build systems like our Restricted Contract System, Secure Data Environment, and Virtual Data Enclave.  Of course, when we need to perform a major bit of maintenance like last weekend's, where it was important that we continue to point at the production system rather than the replica, it always makes me wonder about the HA alternative.
