Sunday, July 26, 2009

With Every Cloud a Little Rain Must Fall

I really like the cloud. I really do.


But today at 4:09am I hated the cloud.

Here's why.


Jul 26 04:08:00 www dhclient: send_packet: Network is down
Jul 26 04:09:01 www dhclient: No DHCPOFFERS received.

This log snippet above shows the last bits of life for one of our instances, our cloud-based replica web server. This is the cloud version of a star going super-nova, or perhaps collapsing into a black hole. This message means that your instance is no longer on the network, and that means it isn't doing anything useful for you. It's still accumulating $0.10/hour or more of revenue for Amazon, of course.


The problem began during the previous afternoon when the instance went to renew its DHCP lease, as it does every eleven hours. (DHCP is a mechanism to auto-configure system attributes, and like many organizations, Amazon Web Services uses DHCP as the mechanism for virtual machines to get their network configuration).

Unfortunately for our instance, when it went to contact the DHCP server for a renewal, it never got an answer. This by itself isn't alarming; this sort of thing happens all of the time. The instance then got impatient as a five-year-old and starting bugging the DHCP server for a lease renewal. "Are we there yet?" "Are we there yet?" "Are we there yet?" Sadly the DHCP server never responded and after eleven hours of asking, "Are we there yet?" once per minute, our poor instance just gave up.

And that takes our story to just after 4am this morning. That's when the U-M Network Operations Center noticed that our replica was no longer on the network, and that's when their network management system generated a page to wake up the on-call.


I've learned that it's never a good idea to diagnose and trouble-shoot a problem while I'm extremely groggy, and so I took a little time to collect my wits before logging into the system as root and rebooting it. But reboot it I did.


Jul 26 04:26:59 www shutdown[13643]: shutting down for system reboot

And then a few minutes later, voila!, the replica was back online. If only I could fall back to sleep so quickly.

I really like the cloud. But since this is the second time we've experienced this problem in 2009 (and others have experienced it too), I really wish Amazon would fix this.

But until they do, I think I'm going to write a little utility that checks /var/log/messages to see if the instance is on the road to network meltdown. If it is, I'd rather know about it sooner rather than later so that I can reboot the instance during the day rather than the middle of the night.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.