Tuesday, August 9, 2011

Amazon outage bites ICPSR too

Amazon reported a problem with its US-EAST region yesterday evening (EDT).  Their service health dashboard reported that instances (virtual machines) in that region were having problems connecting to the Internet.  That was definitely true of our stuff.

We first saw alerts for our (virtual) equipment in US-EAST at 22:25 EDT.  At that time we lost connectivity to every single instance in US-EAST, but could still reach a small number of instances we have in other regions.  This affected the cloud-based replica of our production web server, the Teaching With Data web portal, and our "social login" service.  This last service runs in Amazon's US-EAST region but is operated by a company called Janrain, so it isn't among the instances that ICPSR controls directly.

By 22:56 EDT all of our systems were again reachable from the Internet, and no further action on our part (restoring content, restarting the instance) was necessary.

I have not yet seen a post mortem from Amazon, but based on my time in the data networking biz, my guess is that someone (Amazon or one of its transit or peering partners) made a routing change which blackholed their US-EAST traffic.

All in all, Amazon continues to do a very good job with its cloud infrastructure, but this is a reminder that one would need to replicate services across several regions if one were to build a service with a very high level of availability.
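
As a rough illustration of the point about regions, here is a minimal Python sketch of client-side failover between replicas in different regions. Everything in it is an assumption made for the example: the hostnames, the region labels, and the function names `tcp_reachable` and `first_reachable` are hypothetical, and it is not a description of how ICPSR's services are actually set up. It simply tries each replica in order and picks the first one that accepts a TCP connection.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

# (region label, host, port); all values used below are hypothetical examples.
Endpoint = Tuple[str, str, int]


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def first_reachable(
    endpoints: Iterable[Endpoint],
    probe: Callable[[str, int], bool] = tcp_reachable,
) -> Optional[Endpoint]:
    """Return the first endpoint whose probe succeeds, or None if none do."""
    for endpoint in endpoints:
        _region, host, port = endpoint
        if probe(host, port):
            return endpoint
    return None


# Hypothetical replicas of the same service in two regions, in preference order.
REPLICAS = [
    ("us-east", "www-east.example.org", 80),
    ("us-west", "www-west.example.org", 80),
]
```

Calling `first_reachable(REPLICAS)` during an outage like this one would skip the unreachable US-EAST replica and return the one in the other region, provided the content had already been replicated there. In practice this kind of switch is more often done with DNS or a load balancer than in the client, but the idea is the same.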
