Technology at ICPSR: TRAC: C3.4: Disaster preparedness

C3.4 Repository has suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s).

The repository must have a written plan with some approval process for what happens in specific types of disaster (fire, flood, system compromise, etc.) and for who has responsibility for actions. The level of detail in a disaster plan, and the specific risks addressed need to be appropriate to the repository’s location and service expectations. Fire is an almost universal concern, but earthquakes may not require specific planning at all locations. The disaster plan must, however, deal with unspecified situations that would have specific consequences, such as lack of access to a building.

Evidence: ISO 17799 certification; disaster and recovery plans; information about and proof of at least one off-site copy of preserved information; service continuity plan; documentation linking roles with activities; local geological, geographical, or meteorological data or threat assessments.

Building and documenting systems and procedures for coping with a disaster has a scope well beyond just IT. But there are two key areas worth discussing that fall within the purview of IT.

One area is ensuring that ICPSR is able to deliver its content to its clients, members, and the public at all times. This is an area where we've made significant investments over the past twelve months, and where we also now have a good story to tell.

The main ICPSR delivery mechanism is its web site. The technological resources that power the primary instance of the web site reside at ICPSR itself on the campus of the University of Michigan. This consists of three main systems: a reasonably powerful server running web applications; another powerful server running an Oracle database; and our storage appliance.

Our equipment resides in an eclectic machine room. On the plus side it has items one would expect to find like equipment cabinets, local air handlers providing chilled air, and UPS to protect us from power fluctuations. On the negative side there is no raised floor or cable trays, which makes for a messy machine room, and our connection to Ann Arbor's (not U-M's) electrical grid is somewhat precarious. An off-site network operations center monitors our gear 24 x 7 and notifies us via SMS, pager, and telephone if anything looks broken.

We maintain a replica of our web environment in Amazon's cloud, and we use a simple mechanism to trigger a failover to the replica: the on-call technician changes the DNS record for www.icpsr.umich.edu to point to the replica instead of the primary. The time-to-live on the record is low (five minutes), so once the change has been made, resolvers that honor the TTL pick up the new address within about five minutes. (And the change is made on a "stealth" DNS server that also lives in Amazon's cloud.)

And, finally, we synchronize the replica several times throughout the workday so that software and content stay fresh.
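To make the timing of the failover concrete, here is a minimal sketch in Python. It is an illustration only, not our actual procedure: the on-call technician makes the DNS change by hand, and the IP addresses, the "three failed checks" rule, and the function names below are all assumptions invented for the example. Only the five-minute TTL comes from the setup described above.

```python
# Hypothetical addresses from the documentation ranges, not real ICPSR hosts.
PRIMARY_IP = "192.0.2.10"
REPLICA_IP = "198.51.100.10"
TTL_SECONDS = 300  # the five-minute time-to-live on the www record


def choose_target(recent_checks, failures_needed=3):
    """Return the address the www record should point to.

    recent_checks is a list of booleans, oldest first; True means the
    primary answered a health check. The rule (an assumption for this
    sketch) is to fail over only when the most recent `failures_needed`
    checks have all failed.
    """
    tail = recent_checks[-failures_needed:]
    if len(tail) == failures_needed and not any(tail):
        return REPLICA_IP
    return PRIMARY_IP


def worst_case_cutover_seconds(detect_seconds, change_seconds,
                               ttl_seconds=TTL_SECONDS):
    """Upper bound on the time until TTL-honoring resolvers see the replica.

    A resolver may have cached the old record just before the change, so
    the full TTL is added to the detection and change times.
    """
    return detect_seconds + change_seconds + ttl_seconds


if __name__ == "__main__":
    print(choose_target([True, False, False, False]))
    # Example: 2 minutes to notice, 1 minute to edit the record.
    print(worst_case_cutover_seconds(120, 60))
```

The point of the second function is that the TTL is only one part of the outage a visitor sees: time to notice the problem and time to make the change are added on top of it.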

This setup doesn't create a web environment which promises "five nines" type of uptime (i.e., 99.999% availability), but it does give us the capability to avoid any long multi-day outage like we saw late in 2008, and it also gives us the capability to deliver content indefinitely from the replica should ICPSR suffer a disaster.
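For a sense of scale, "five nines" leaves room for only about 5.26 minutes of downtime per year, which is roughly the length of a single DNS TTL in our setup. The arithmetic can be sketched as follows; the function name is made up for the example, and the year is taken as 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Minutes of downtime per year permitted at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% availability: {minutes:.2f} minutes per year")
```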

The second main area where IT plays a key role is with archival storage. This is less about 24 x 7 availability, and more about ensuring that our archival holdings are available for long-term access in a robust storage fabric.

A post from November 2009 is still an accurate depiction of how we replicate our archival holdings so that we can be guaranteed to have a copy even if something catastrophic happens to our main location in Ann Arbor. I'm also interested in deploying copies outside the United States. We've had some very useful conversations with colleagues at the ANU Supercomputer Facility in Australia, and perhaps some sort of reciprocal storage arrangement might be worked out.

Friday, March 5, 2010
