I use Google Alerts to keep tabs on a variety of people, places, and things of all sorts, and something interesting about "Amazon Web Services" hit my Gmail inbox today:
Looks like Amazon Web Services’ Elastic Compute service went down for an extended period this evening.
Now this was news to me, especially since we host both our production study search service and our replica delivery infrastructure in AWS EC2. Both the University of Michigan Network Operations Center (NOC) and the Merit NOC monitor our systems, testing availability every minute of every day. Whenever there is an outage, the on-call engineer gets a page (or many pages!). And I'm on call this week. :-)
So what really happened?
Here's a piece of the story from the AWS Service Health Dashboard:
7:33 PM PDT We wanted to give you a quick update. A lightning storm caused damage to a single Power Distribution Unit (PDU) in a single Availability Zone. While most instances were unaffected, a set of racks does not currently have power, so the instances on those racks are down. We have technicians on site, and we are working to replace the affected PDU. We do not yet have an ETA, but we expect to be able to recover the instances when we restore power. Besides these affected instances, all other instances, and all other Availability Zones, are operating normally. Users with affected instances can launch replacement instances in any of the US Region Availability Zones or wait until their instance(s) are restored.
So some instances in one of AWS's availability zones (they have three for the US alone) failed. That's not wonderful news, especially if one of the failed instances belongs to you, but it is hardly a failure of the entire EC2 service.
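To make the distinction concrete, here is a minimal sketch of how one might classify the scope of an outage from per-instance status. Everything in it is an assumption for illustration: the function name `outage_scope`, the zone names, and the sample data are hypothetical, and none of it reflects how AWS or our monitoring actually reports status.

```python
from collections import defaultdict


def outage_scope(instances):
    """Classify an outage from (zone, is_up) pairs.

    Returns one of:
      "none"          no instance is down
      "partial-zone"  some instances are down, but every zone still has one up
      "zone"          at least one zone has no running instances, others do
      "region"        no zone has a running instance
    """
    up = defaultdict(int)
    down = defaultdict(int)
    for zone, is_up in instances:
        if is_up:
            up[zone] += 1
        else:
            down[zone] += 1

    if not down:
        return "none"

    zones = set(up) | set(down)
    dead_zones = {z for z in zones if up[z] == 0}
    if dead_zones == zones:
        return "region"
    if dead_zones:
        return "zone"
    return "partial-zone"


# Hypothetical data: a few racks lose power in one zone, everything else is fine.
sample = [
    ("zone-a", True), ("zone-a", False), ("zone-a", False),
    ("zone-b", True), ("zone-b", True),
    ("zone-c", True),
]
print(outage_scope(sample))
```

With data shaped like the dashboard's description (some racks dark in a single zone), this lands in the "partial-zone" bucket, which is a long way from "the service went down."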
To me this is somewhat like ICPSR messing up an entry in a database, making one of our studies unavailable by mistake, and someone blogging that ICPSR's on-line delivery service went down.
Net net for me: As with any story with a sensational headline, one always has to read the body of the text to get the real story. And preferably, one should read the story from several sources to triangulate on the reality of the situation.