Wednesday, June 20, 2012

Amazon power outage and ICPSR

Amazon suffered a power outage in their northern Virginia data center last week.  Here is my abridged timeline of events from the Amazon Service Health Dashboard:

Jun 14, 8:50 PM PDT We are investigating degraded performance for some volumes in a single AZ in the us-east-1 region.
Jun 14, 10:29 PM PDT We can confirm a portion of a single Availability Zone in the US-EAST-1 Region lost power. We are actively restoring power to the affected EC2 instances and EBS volumes. We are continuing to see increased API errors. Customers might see increased errors trying to launch new instances in the Region.
Jun 15, 12:11 AM PDT As a result of the power outage tonight in the US-EAST-1 region, some EBS volumes may have inconsistent data. As we bring volumes back online, any affected volumes will have their status in the "Status Checks" column in the Volume list in the console listed as "Impaired." You can use the console to re-enable IO by clicking on "Enable Volume IO" in the volume detail section, after which we recommend you verify the consistency of your data by using a tool such as fsck or chkdsk. If your instance is stuck, depending on your operating system, resuming IO may return the instance to service. If not, we recommend rebooting your instance after resuming IO.
Jun 15, 3:26 AM PDT The service is now fully recovered and is operating normally. Customers with impaired volumes may still need to follow the instructions above to recover their individual EC2 and EBS resources. We will be following up here with the root cause of this event.

And, indeed, Amazon did follow up on the root cause of the problem.  Based on the post-mortem that has been reported in several venues, the root cause was a fault in commercial power.  And a generator.  And an electrical panel.  One view is that Amazon got very unlucky with power problems; another view is that they did not test their fail-over thoroughly enough.  I lean more toward the former view.

ICPSR didn't suffer any outages.  For example, our cloud-based replica was available to us the entire time.  We did receive notifications from Amazon that specific EBS volumes (basically virtual block devices that may be attached to a cloud-based machine) may have been corrupted and should be inspected.  Each notification named the specific volume.  Here's an example notification:

Dear ICPSR Technology,
Your volume may have experienced data inconsistency issues due to failures during the 6/14/2012 power failure in the US-EAST-1 region. To restore access to your data we have re-enabled IO but we recommend you validate consistency of your data with a tool such as fsck or chkdsk. For more information about impaired volumes see:
EBS Support

So this did create a bit of unscheduled work for the technology team because we had four affected volumes.

One was not attached to anything, and was not in use.  

One was attached to a machine we had recently retired.

But two were attached to a machine that stores an encrypted copy of our archival holdings.  The volumes are each 1TB and part of a multi-TB virtual RAID.  This makes for a very, very long-running fsck to inspect for problems.
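For anyone facing a similar inspection, here is a minimal sketch of how a read-only check might be scripted.  It is not what we actually ran.  It assumes a Linux machine where `fsck -n` asks the filesystem-specific checker to report problems without repairing them (the fsck man page notes this is not true of every checker), and the device name `/dev/md0` in the usage note is hypothetical.  The exit status is a sum of bit flags documented in the fsck(8) man page, so the sketch decodes it into readable conditions.

```python
import subprocess

# Bit flags from the fsck(8) man page; the exit status is their sum.
FSCK_FLAGS = {
    1: "filesystem errors corrected",
    2: "system should be rebooted",
    4: "filesystem errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status):
    """Return the list of conditions encoded in an fsck exit status."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in sorted(FSCK_FLAGS.items()) if status & bit]


def check_read_only(device):
    """Run `fsck -n` on the device and decode its exit status.

    Assumption: the filesystem-specific checker honors -n by reporting
    problems without attempting any repair.
    """
    result = subprocess.run(
        ["fsck", "-n", device], capture_output=True, text=True
    )
    return decode_fsck_status(result.returncode)
```

Calling `check_read_only("/dev/md0")` on an unmounted device would return a list such as the one produced by `decode_fsck_status(4)`.  On a multi-TB array the call will block for a long time, so it is probably best run inside something like `screen` or `nohup`.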

I'll have the conclusion on Friday.
