Monday, April 30, 2012

Disaster Recovery at ICPSR : Part 4

Part 3 described how we activate the replica, and how it works.

Using the replica


We've used our replica system several times over the past three years.  Our usage falls into a couple of categories:

Scheduled maintenance.  A couple of times we've pressed the replica into service to cover a scheduled maintenance window.

The most recent instance was on February 12, 2012 when the campus networking guys upgraded the gear that connects ICPSR's home in the Perry Building to the backbone.  We executed the failover early on a Sunday, and then moved traffic back once we got the "all clear" signal.  This scenario tends to produce very good outcomes since we can plan for the transfer, and we aren't simultaneously trying to recover from some other problem.

Emergency failover.  The most common instances in this category are when the Perry Building loses power unexpectedly, and we need to move traffic over to the cloud replica as soon as possible.

This scenario also tends to have good outcomes since we can focus solely on the transfer, and there is relatively little we can do except wait for the power to be restored.  One complication can occur if the on-call engineer is not near a computer, and so there is a delay as s/he gets to the closest one.  Or, if the outage happens during the business day, we may need to execute the failover very quickly, before our UPS systems become drained.

Emergency non-failover.  This is the category that corresponds to those times when we actually do NOT press the replica into service, but should have in retrospect.

A common scenario is that we see an alert for a single service (say, our Solr search engine), and we begin to troubleshoot the problem.  Initially we may not know whether the problem will be fixed in just a few minutes or will turn into a multi-hour process.  My usual rule of thumb is to press the replica into service if 30 minutes have elapsed and it feels like we're not very close to solving the problem.

This can go very wrong, of course, if my "feeling" is wrong, and can go very, very wrong if my "feeling" is wrong and we are short-handed, and I'm the one who is knee-deep in troubleshooting.  It can be very easy to look up 90 minutes later and say, "Oops."
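The rule of thumb above can be written down as a tiny check, which makes its two inputs explicit: the clock and the judgment call.  This is only an illustrative sketch in Python; the names FAILOVER_THRESHOLD, should_fail_over, and fix_is_close are hypothetical and do not describe any actual ICPSR tooling.

```python
from datetime import datetime, timedelta

# Threshold taken from the 30-minute rule of thumb; the name is hypothetical.
FAILOVER_THRESHOLD = timedelta(minutes=30)


def should_fail_over(alert_started, now, fix_is_close):
    """Return True once 30 minutes have elapsed and a fix does not feel close.

    fix_is_close stands in for the human "feeling"; it is an assumption of
    this sketch that someone records that judgment explicitly.
    """
    elapsed = now - alert_started
    return elapsed >= FAILOVER_THRESHOLD and not fix_is_close


# Example: an alert that started at 9:00 on some morning.
start = datetime(2012, 4, 30, 9, 0)
print(should_fail_over(start, start + timedelta(minutes=10), fix_is_close=False))
print(should_fail_over(start, start + timedelta(minutes=45), fix_is_close=False))
print(should_fail_over(start, start + timedelta(minutes=45), fix_is_close=True))
```

The elapsed-time half of the rule is mechanical, so a timer or a pager reminder could handle it.  The fix_is_close half is the part that fails when the person making the call is also the one knee-deep in troubleshooting.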

Managing the replica


In general, managing the replica is very inexpensive and requires little monitoring (by humans).  We have found that the main effort occurs when we are making a major upgrade to a core piece of technology such as the hardware platform (32-bit to 64-bit), the operating system (RHEL 5 to RHEL 6), or the web server itself.  In practice this means that in addition to upgrading the staging and development environments at ICPSR, we also need to upgrade the replica environment, so it adds more of the same type of work, not a new type of work.
