Friday, June 8, 2012

May 2012 web availability

Web availability was good, but not great, in May 2012.

Five main episodes of 24 to 59 minutes account for most (192) of the 242 minutes of unavailability in May 2012.

On May 13 the production web server apparently lost power, and it took an additional reboot and some TLC to bring the system fully back online (33 minutes).

On May 18 the search index on our CCEERC web portal became corrupted, which left much of the site unusable for nearly an hour (49 minutes).

On May 22 we saw the first of two episodes in which the AJP proxy between Apache httpd and Apache Tomcat faulted. The proxy did not recover on its own and needed help from the technology team, resulting in a medium-duration outage (27 minutes). A similar fault occurred on the evening of May 31 (24 minutes).

On May 31 our production database server faulted, requiring a manual power-cycle, and also requiring the production web server to be rebooted (59 minutes).
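
For readers who want to check the arithmetic, the durations listed above can be tallied in a few lines of Python. This is only a rough sketch: the episode labels are shorthand of my own, and the availability figure assumes the month is measured around the clock for all 31 days, which this post does not actually state.

```python
# Outage durations in minutes, as reported in this post.
# The labels are shorthand for illustration, not official incident names.
episodes = {
    "May 13 web server power loss": 33,
    "May 18 search index corruption": 49,
    "May 22 AJP proxy fault": 27,
    "May 31 AJP proxy fault": 24,
    "May 31 database server fault": 59,
}

TOTAL_DOWNTIME_MIN = 242        # total unavailability reported for May 2012
MINUTES_IN_MAY = 31 * 24 * 60   # assumption: availability is measured 24x7

main_total = sum(episodes.values())
share = main_total / TOTAL_DOWNTIME_MIN
availability = 1 - TOTAL_DOWNTIME_MIN / MINUTES_IN_MAY

print(f"main episodes: {main_total} min ({share:.0%} of all downtime)")
print(f"availability for the month: {availability:.3%}")
```

Under that assumption the five episodes add up to 192 minutes, and the month works out to roughly 99.46% availability.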

My sense is that while we're in much better shape with regard to the instability caused by khugepaged, we are starting to see something a little amiss with the Apache proxy system. It isn't clear to us at this time whether the issue is faulty software or a suboptimal configuration on our part.

