Monday, April 23, 2012

Disaster Recovery at ICPSR - Part 1

I'll be running a series on disaster recovery planning (DRP) and execution at ICPSR.  I'm responsible for ensuring that we have a working disaster recovery plan for two key areas of ICPSR:  delivery of content via our web site, and preservation of content via archival storage.  The requirements and solutions of the two areas are quite different, and I'll address each one separately.


This first post will focus on disaster recovery for our web-based delivery system.


Background


After a particularly long outage (3-4 days) in late 2008, caused by a major ice storm that knocked out power to our building, ICPSR decided to invest in a disaster recovery plan for our web-based delivery system.  The idea was to put the process and infrastructure in place so that my team could recover from a disaster befalling our delivery system.  We defined a "disaster" as an outage that could conceivably last for many hours or even days, and we set a goal of recovering from one within an hour.

It is important to note that we were not intending to build a "high availability" delivery system.  The goal of that type of system would be to move ICPSR into the so-called "five nines" level of availability, meaning that our infrastructure would be available at least 99.999% of the time.  Converting ICPSR's plethora of legacy systems and infrastructure into such a high availability system would be a major project requiring a significant investment over several years.

Instead we set the bar lower, but not too low.  What if ICPSR had a goal of 99% availability each month?  In that scenario we would not need the level of investment and infrastructure required to avoid almost all downtime; we would only need to recover from downtime quickly and prevent any long outages.  The investment to reach that goal would be much smaller, and it would still serve our community well.
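To make the difference concrete, here is a small back-of-the-envelope sketch in Python (assuming a 30-day month for round numbers) showing how much downtime each target actually tolerates: five nines works out to a few minutes per year, while a 99% monthly goal allows several hours each month.

```python
# Back-of-the-envelope downtime budgets for the two availability targets
# discussed above.  The 30-day month is an assumption, not a standard.

def allowed_downtime_minutes(availability, period_minutes):
    """How many minutes of downtime a period can absorb at a given availability."""
    return period_minutes * (1.0 - availability)

minutes_per_year = 365 * 24 * 60
minutes_per_month = 30 * 24 * 60

print(f"99.999% over a year : {allowed_downtime_minutes(0.99999, minutes_per_year):.1f} minutes")
print(f"99%     over a month: {allowed_downtime_minutes(0.99, minutes_per_month) / 60:.1f} hours")
# 99.999% over a year : 5.3 minutes
# 99%     over a month: 7.2 hours
```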


The Starting Point


By this point we already had reasonably robust local systems - powerful servers for web and database services, an enterprise-class storage system, and UPS backup for all of our systems.  In addition, the University of Michigan Network Operations Center (NOC) was monitoring our systems 24x7.  The NOC's network monitoring system (NMS) sent automated emails to us whenever a component faulted.
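For illustration only, here is a minimal sketch of the kind of check-and-alert step an NMS performs: probe a service, and send an email if it does not respond.  The URL, addresses, and mail host are hypothetical placeholders, not the NOC's actual configuration or software.

```python
# Illustrative sketch of a monitoring check that emails on a fault.
# All names below (URL, recipients, mail host) are hypothetical.
import smtplib
from email.message import EmailMessage
from urllib.request import urlopen
from urllib.error import URLError

MONITORED_URL = "https://www.example.org/"       # hypothetical service to watch
ALERT_RECIPIENTS = ["oncall@example.org"]        # hypothetical alert address
MAIL_HOST = "localhost"

def check_service(url, timeout=10.0):
    """Return True if the URL answers with a successful HTTP response."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (URLError, OSError):
        return False

def send_alert(subject, body):
    """Email an alert, the way the NMS notified us when a component faulted."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "nms@example.org"
    msg["To"] = ", ".join(ALERT_RECIPIENTS)
    msg.set_content(body)
    with smtplib.SMTP(MAIL_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    if not check_service(MONITORED_URL):
        send_alert("FAULT: delivery web service",
                   f"{MONITORED_URL} did not respond; please investigate.")
```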

However, we did not have any sort of on-call rotation ensuring that a fault would be caught and corrected quickly, and we also did not have any backup or replica system which could be pressed into service if, say, our building lost power for several hours (or days).  So we were exposed to short outages becoming unnecessarily long, and to long outages where we had no control over the recovery time.

We were able to address the first issue quickly and effectively by establishing an on-call rotation, where the "on-call" served one week at a time and carried a cell phone that received SMS alerts from the NOC's NMS.  This meant that faults would now be picked up and acted upon immediately by someone on the ICPSR IT team.  This alone eliminated one class of long-lived outages: for example, a fault occurring late on a weekend night that would otherwise not be picked up for repair until Monday morning.
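A weekly rotation like this is simple to compute mechanically.  Here is a minimal sketch in Python of looking up the current on-call from a roster; the names and start date are hypothetical, not our actual schedule.

```python
# Illustrative weekly on-call lookup: the rotation advances one person
# every seven days from an agreed starting Monday.  Names are placeholders.
from datetime import date

ONCALL_ROTATION = ["alice", "bob", "carol", "dave"]   # hypothetical roster
ROTATION_START = date(2009, 1, 5)                     # a Monday; assumed epoch

def current_oncall(today=None):
    """Return whose week it is, based on whole weeks elapsed since the epoch."""
    today = today or date.today()
    weeks_elapsed = (today - ROTATION_START).days // 7
    return ONCALL_ROTATION[weeks_elapsed % len(ONCALL_ROTATION)]

print(current_oncall(date(2009, 1, 14)))   # second week of the rotation -> "bob"
```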

The next step was to design, build, deploy, and maintain a replica of our delivery system.  But where?

Next up:  Part 2:  Building the replica
