Friday, April 27, 2012

Disaster Recovery at ICPSR : Part 3

Part 2 described the virtual infrastructure we built in Amazon Web Services to deploy a replica of ICPSR's content delivery system.

Monitoring the replica system

This turns out to be pretty tricky.

The University of Michigan Network Operations Center (NOC) monitors both our physical servers located in the ICPSR machine room and our virtual servers in Amazon Web Services (AWS).  Monitoring the physical servers is very straightforward, but monitoring the virtual servers is not.

For the virtual machines we need to pick a URL which does not require authentication or a cookie, and which will not be redirected.  We also need a URL that points to a simple page so that the monitoring system does not grab page elements from the production web server rather than the replica.  In practice we have found the barriers to be so plentiful and so daunting that we have, in fact, done a pretty crummy job of keeping an eye on the health of the replica.

Until recently.

We now run an additional instance in AWS whose sole purpose is to monitor the replica system.  And to make this fool-proof, we add the same little "lie" to /etc/hosts on the monitoring machine and point names like www.icpsr.umich.edu and www.cceerc.org to the cloud replicas rather than the production systems.  This has worked very well for us so far in 2012.
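To make the idea concrete, here is a minimal sketch in Python of the two checks a monitoring machine like this might perform. It is not the code we actually run. The function names, the sample hosts entries, and the 203.0.113.x addresses (reserved for documentation) are all assumptions made for illustration. The first function reads hosts-file text so the monitor can confirm that the overrides are in place; the second applies the URL criteria described above (no redirect, no authentication challenge, no cookie) to a response.

```python
import http.client
from urllib.parse import urlsplit

# Hypothetical /etc/hosts content on the monitoring machine.
# The addresses are documentation-only placeholders, not real replica IPs.
SAMPLE_HOSTS = """\
127.0.0.1     localhost
203.0.113.10  www.icpsr.umich.edu   # cloud replica (assumed address)
203.0.113.11  www.cceerc.org        # cloud replica (assumed address)
"""

def parse_hosts(text):
    """Map each host name in hosts-file text to its IP address."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        for name in names:
            mapping.setdefault(name.lower(), ip)  # keep the first entry for a name
    return mapping

def is_monitorable(status, headers):
    """Return True if a response looks like a usable health-check target."""
    names = {name.lower() for name in headers}
    if status != 200:           # 3xx redirects and 401 challenges fail here
        return False
    if "set-cookie" in names:   # treated as a sign the page may depend on a cookie
        return False
    return True

def probe(url, timeout=10):
    """Fetch url once (http.client does not follow redirects)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()
        return resp.status, dict(resp.getheaders())
    finally:
        conn.close()
```

On the monitoring machine one might call `parse_hosts(open("/etc/hosts").read())` to verify the overrides, then pass the result of `probe(...)` to `is_monitorable(...)` for each candidate URL.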

Initiating failover

Imagine that the ICPSR "on-call" has just received a series of SMS messages on the on-call cell phone.  Everything at ICPSR is down, and the campus alert system reports that the Perry Building (ICPSR's home) has lost power.  There is no estimated time for repair.  The world looks like this:


We initiate the failover procedure by changing the DNS CNAME records for www.icpsr.umich.edu and www.childcareresearch.org.  Instead of "pointing" to the physical machines in the Perry Building, we point them to the cloud replicas.  If the failure did not include the production DNS server, we would make the change there.  However, in this scenario, the entire building has lost power, and so we need to make the change on the stealth slave server in AWS.

Now, as it turns out, the stealth slave server is recognized as a master server by the other slave servers for ICPSR's domain:  one at the University of Michigan central IT and one at the San Diego Supercomputer Center.  Once we make a change to our server here (or in the cloud) those slave servers will pick it up within a few minutes.  And once they do, web requests start hitting our replica system rather than the production system.  And so the world changes from this:


to this:


in just a few minutes.
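The DNS change itself is small. The sketch below shows one way it could be scripted in Python, assuming a BIND-style zone file with one record per line and a serial number marked with a "; serial" comment. The zone contents, host names, and function names are hypothetical, not our actual configuration. Slave servers compare the SOA serial number to decide whether to transfer a zone, so the sketch increments the serial along with the CNAME change.

```python
import re

# Hypothetical zone file; names and values are placeholders.
ZONE = """\
$TTL 300
@    IN SOA ns1.example.org. hostmaster.example.org. (
         2012042701 ; serial
         3600       ; refresh
         600        ; retry
         604800     ; expire
         300 )      ; minimum
www  IN CNAME web-onsite.example.org.
"""

def repoint_cname(zone_text, label, new_target):
    """Return zone_text with the CNAME record for label pointing at new_target."""
    pattern = re.compile(
        r"^(?P<head>" + re.escape(label) + r"\s+(?:\d+\s+)?(?:IN\s+)?CNAME\s+)\S+",
        re.MULTILINE,
    )
    new_text, count = pattern.subn(
        lambda m: m.group("head") + new_target, zone_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one CNAME for {label!r}, found {count}")
    return new_text

def bump_serial(zone_text):
    """Increment the first number that is followed by a '; serial' comment."""
    pattern = re.compile(r"(\d+)(\s*;\s*serial)", re.IGNORECASE)
    new_text, count = pattern.subn(
        lambda m: str(int(m.group(1)) + 1) + m.group(2), zone_text, count=1
    )
    if count != 1:
        raise ValueError("no serial line found")
    return new_text

# Fail over to the replica, then (later) fail back to the on-site system.
failover = bump_serial(repoint_cname(ZONE, "www", "web-replica.example.org."))
failback = bump_serial(repoint_cname(failover, "www", "web-onsite.example.org."))
```

The same two functions handle both directions: failing back is the same edit with the original target, plus another serial increment so the slaves notice the change.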

We can reverse the failover by making the same simple DNS record change, but in reverse.  We change the pointer from the cloud back to the physical systems in the ICPSR machine room.

Next: Part 4: Our experience with the replica over the past three years
