June 2012 was looking to be a great, great month for uptime. We were on track for our best month of the fiscal year since November 2011: just 60 minutes of downtime across all services and all applications. It was going to be beautiful.
And then Amazon Web Services had another power failure.
And then we wept.
The power failure took the TeachingWithData portal out of action. (To be fair, it was already having significant problems due to its creaky technology platform, but this took it all the way out of action.) The failure also took our delivery replica out of action, and gave Tech@ICPSR the joy of rebuilding it over the weekend.
But the real trouble was with a company called Janrain.
Janrain sells a service called Engage. Engage is what allows content providers (like ICPSR) to use identity providers (like Google, Facebook, Yahoo, and many more) so that their clients (like you) do not need to create yet another account and password. Engage is a hosted solution that we use for our single sign-on service using existing IDs, and it works 99.9% of the time.
However, this hosted solution lives in the cloud. We just point the name signin.icpsr.umich.edu at an IP address we get from Janrain, plug in calls to their API, and then magic happens.
Except when the cloud breaks.
Amazon took Engage off-line for nearly four hours. And then once it came back up, it was thoroughly confused for another three hours. Ick.
So, counting all of that time as "downtime", our fabulous June 2012 numbers suddenly became our awful June 2012 numbers. Here they are:
Of course, during a lot of that downtime, all of the features on the web site except for third-party login worked fine. And most of the problem happened late on a Friday night and Saturday morning during the summer, so that's a good time for something bad to happen, if it has to happen at all.
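For a rough sense of how much one outage like this moves a monthly number, here is a back-of-the-envelope sketch. It is not how our official figures are calculated. It assumes a 30-day June, treats the 60 minutes as if it applied to a single service, and rounds the Engage outage ("nearly four hours" plus "another three hours") up to seven hours. The function names are made up for illustration.

```python
# Back-of-the-envelope availability arithmetic (illustrative only).
# Assumptions: 30-day month, one service, Engage outage rounded to 7 hours.

MINUTES_IN_JUNE = 30 * 24 * 60  # 43,200 minutes


def availability(downtime_minutes, period_minutes=MINUTES_IN_JUNE):
    """Return availability as a percentage of the period."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


def downtime_budget(target_percent, period_minutes=MINUTES_IN_JUNE):
    """Return the minutes of downtime a given availability target allows."""
    return period_minutes * (1 - target_percent / 100.0)


before_outage = availability(60)           # the month we were on track for
after_outage = availability(60 + 7 * 60)   # same month plus ~7 hours of Engage trouble

print(f"Before the outage: {before_outage:.2f}%")  # 99.86%
print(f"After the outage:  {after_outage:.2f}%")   # 98.89%
print(f"Minutes allowed at 99.9%: {downtime_budget(99.9):.1f}")  # 43.2
```

Under those assumptions, seven hours of a broken login service is roughly ten times the downtime budget that "99.9% of the time" allows in a 30-day month.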