Monday, April 25, 2011
For crying out cloud it has been an interesting weekend
The timing of the outage wasn't that bad for us. Because we use the Amazon cloud to perform research as part of our NIH "cloud grant," losing access over a long weekend cost us relatively little. At worst we lost one day (Friday) and one afternoon (Thursday) where we wouldn't have been able to do everything we might have wanted.
We also got lucky that our production web site didn't suffer any problems - and wasn't scheduled for any maintenance - this weekend. We run our replica system in Amazon's cloud, and had we lost the main site here at ICPSR, we would have been in very bad shape.
We also keep an encrypted copy of our holdings in Amazon's cloud, and over the past few days we haven't been able to keep it synced with master copies from here. But since we have so many copies in so many different locations, this wasn't all that worrisome. In fact, if we synchronized content weekly instead of daily, and if we always performed the synchronization on, say, Wednesdays, the outage would have been a non-issue for this purpose. (But, of course, if the weekly synchronization was performed on Saturdays instead of Wednesdays, we'd be drifting even further out of sync.)
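The timing argument can be made concrete with a little date arithmetic. This is only an illustrative sketch: the exact outage window and sync dates below are assumptions, not our actual schedule.

```python
# Sketch: how stale does a cloud replica get under different sync
# schedules, given an outage window? The specific dates here are
# illustrative assumptions, not our real sync calendar.
from datetime import date

def staleness_at_end_of_outage(last_sync, outage_start, outage_end):
    """Days between the last successful sync and the end of the outage.
    Any sync scheduled during the outage fails, so the replica reflects
    the last sync that completed before the outage began."""
    assert last_sync <= outage_start
    return (outage_end - last_sync).days

# Assume an outage running roughly Thursday April 21 through
# Sunday April 24, 2011.
start, end = date(2011, 4, 21), date(2011, 4, 24)

# Daily sync: the last good sync was Wednesday the 20th.
print(staleness_at_end_of_outage(date(2011, 4, 20), start, end))  # 4

# Weekly sync on Wednesdays: also last synced the 20th, so the same
# 4 days stale -- no worse than daily, hence "a non-issue".

# Weekly sync on Saturdays: last synced Saturday the 16th.
print(staleness_at_end_of_outage(date(2011, 4, 16), start, end))  # 8
```

The point is simply that with a weekly schedule, how much the outage hurts depends almost entirely on where the sync day falls relative to the outage window.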
The primary pain point was with a production web site we run exclusively in the cloud, TeachingWithData.org. For all practical purposes, this site was offline from Friday morning (EDT) until the middle of the day on Saturday. That's not good. But it's also the case that the service was never designed for 24 x 7 production, and so it's not surprising that it could suffer an extended outage. Building in fault tolerance costs more.
The biggest lesson for us is that we need to migrate some of our oldest, longest-lived instances to EBS-backed instances. This is something we've been meaning to do for some time, and the outage serves as a reminder of why it should be a higher priority.