This first post will focus on disaster recovery for our web-based delivery system.
We were not intending to build a "high availability" delivery system. The goal of that type of system would be to move ICPSR into the so-called "five nines" level of availability, meaning that our infrastructure would be available at least 99.999% of the time. Converting ICPSR's plethora of legacy systems and infrastructure into such a high-availability system would be a major project requiring a significant investment over several years.
Instead we set the bar lower, but not too low. What if ICPSR had a goal of 99% availability each month? In that scenario we do not need the level of investment and infrastructure required to avoid almost all downtime; we only need to be able to recover from downtime quickly and to prevent any long outages. The investment to reach that goal would be much smaller, and it would still serve our community well.
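To make the gap between the two targets concrete, here is a minimal sketch that turns an availability percentage into a monthly downtime budget. The function name `downtime_budget` and the 30-day month are our own assumptions for illustration; they are not part of any ICPSR tooling.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float,
                    period: timedelta = timedelta(days=30)) -> timedelta:
    """Return the maximum downtime allowed in `period` at the given availability.

    `availability_pct` is a percentage such as 99.0 or 99.999.
    The 30-day default period is an assumption standing in for "one month".
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = (100 - availability_pct) / 100
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


if __name__ == "__main__":
    # Compare the "five nines" target with the 99% monthly goal.
    for target in (99.999, 99.0):
        print(f"{target}% over 30 days allows {downtime_budget(target)} of downtime")
```

Under these assumptions, five nines leaves roughly 26 seconds of downtime in a 30-day month, while 99% leaves about 7.2 hours. That is why the lower target is about fast recovery rather than avoiding nearly every fault.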
The Starting Point
When we started, we did not have any sort of on-call rotation ensuring that a fault would be caught and corrected quickly, and we did not have any backup or replica system that could be pressed into service if, say, our building lost power for several hours (or days). So we were exposed to short outages becoming unnecessarily long, and to long outages where we had no control over the recovery time.
We were able to address the first issue quickly and effectively by establishing an on-call rotation, in which the on-call person served one week at a time and carried a cell phone that received SMS alerts from the network operations center's (NOC's) network management system (NMS). This meant that faults would now be picked up and acted upon immediately by someone on the ICPSR IT team. This alone would eliminate one class of long-lived outages: for example, a fault that occurred late on a weekend night but was not picked up for repair until Monday morning.
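A one-week rotation like this is simple enough to compute rather than track by hand. The sketch below is only an illustration of the idea: the function `on_call_for`, the team names, and the rotation start date are all hypothetical, and it says nothing about how alerts actually reach the phone.

```python
from datetime import date


def on_call_for(day: date, team: list[str], rotation_start: date) -> str:
    """Return who is on call on `day`, assuming fixed one-week shifts.

    Shifts are assumed to start on `rotation_start` and to cycle through
    `team` in order, one person per seven-day block.
    """
    if not team:
        raise ValueError("team must not be empty")
    if day < rotation_start:
        raise ValueError("day is before the rotation started")
    weeks_elapsed = (day - rotation_start).days // 7
    return team[weeks_elapsed % len(team)]


if __name__ == "__main__":
    # Hypothetical names and start date, for illustration only.
    team = ["alice", "bob", "carol"]
    start = date(2024, 1, 1)
    print(on_call_for(date(2024, 1, 10), team, start))
```

Keeping the schedule as a pure function of the date means anyone can check who holds the phone on a given weekend without consulting a separate calendar.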
The next step was to design, build, deploy, and maintain a replica of our delivery system. But where?
Next up: Part 2: Building the replica