Wednesday, April 11, 2012

March 2012 Web availability

March 2012 was not kind to us.

Clicking the image will display a full-size chart.  But please don't.  It is too ugly.

The main culprit in March was a continuing problem with the reliability of the production web server.  The environment - cooling, electricity, humidity - was fine, and the individual web applications were also fine, but something is not quite right with the kernel.  I think.  (If you would like to join the team as our new Senior Systems Architect and help solve the problem, see my post from last week.)

In March we saw multiple outages, each lasting over an hour.  The script always went something like this:

  1. Load average increases by 5000-10000%
  2. One web application stops responding and logging
  3. KERN.INFO error messages from khugepaged and jsvc appear in syslog
  4. Attempt to restart web application
  5. Fail
  6. Attempt to restart all web apps and their containers
  7. Fail
  8. Attempt to reboot machine
  9. Fail
  10. Optional:  Drive into office if weekend or early morning
  11. Attempt to cycle power on machine
  12. Mix of foul language and prayer
  13. Repeat step #12
  14. Success - machine is working again

Because we use the cloud rather than local, physical servers for many services, and because we have rarely needed to power-cycle a machine to fix a problem, we don't have remote power access set up.  We'd like to address that.  (If you would like to join the team as our new Senior Systems Architect and help solve the problem, see my post from last week.)
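
For a sense of scale on step 1 of the script above: a 5000% increase means the load average is 51 times its normal value, and 10000% means 101 times.  Here is a minimal Python sketch of that arithmetic, the kind of check a watchdog could run before things seize up.  The baseline value and the function names are made up for illustration; this is not something we actually run.

```python
import os


def load_increase_percent(current: float, baseline: float) -> float:
    """Return how far `current` is above `baseline`, as a percentage increase."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


def is_runaway(current: float, baseline: float,
               threshold_percent: float = 5000.0) -> bool:
    """True when the increase reaches the low end of the range seen in March."""
    return load_increase_percent(current, baseline) >= threshold_percent


if __name__ == "__main__":
    BASELINE = 1.0  # hypothetical "normal" 1-minute load for this machine
    try:
        one_minute, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        # os.getloadavg() is only available on Unix-like systems
        print("load average not available on this platform")
    else:
        increase = load_increase_percent(one_minute, BASELINE)
        status = "RUNAWAY" if is_runaway(one_minute, BASELINE) else "ok"
        print(f"load {one_minute:.2f} is {increase:.0f}% above baseline: {status}")
```

A check like this only helps if it can still run and report while the machine is struggling, so in practice it would probably belong on a separate monitoring host.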

So here's the plan to have a better April:

  1. Disable khugepaged, hoping this might stop the machine from seizing up
  2. Drive faster to the Perry Building, hoping this might get the power turned off and on sooner
  3. Hire the Senior Systems Architect, hoping that having a third pair of eyes on the problem might reveal its true cause and solution
  4. Mix of foul language and prayer, hoping it will ease the pain

And, more seriously, we have also updated a few apps (like Solr) to use local storage rather than NFS-mounted storage for their work, particularly apps that write heavily to the filesystem.  NFS seems to be part of the mystery too.
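
Both changes are easy to get wrong silently, so here is a rough Python sketch for checking them after the fact.  It rests on some assumptions: that khugepaged is switched off through the kernel's transparent hugepage setting in sysfs (the path below is the mainline one, and some distribution kernels use a different path), that /proc/mounts is available, and that the Solr data directory is where the sketch says it is.  The directory and the function names are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed mainline sysfs location; some distribution kernels use another path.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")
MOUNTS = Path("/proc/mounts")


def selected_option(sysfs_text: str) -> Optional[str]:
    """Return the bracketed choice from text like 'always madvise [never]'."""
    match = re.search(r"\[(\w+)\]", sysfs_text)
    return match.group(1) if match else None


def filesystem_type(path: str, mounts_text: str) -> Optional[str]:
    """Return the filesystem type of the deepest mount point containing `path`.

    `mounts_text` is expected in /proc/mounts format. Mount points with
    escaped characters (such as spaces written as \\040) are not decoded.
    """
    best_mount, best_type = "", None
    target = path.rstrip("/") + "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        covers = target.startswith(mount_point.rstrip("/") + "/")
        # ">=" lets a later mount on the same mount point shadow an earlier one
        if covers and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_nfs(fs_type: Optional[str]) -> bool:
    """True for 'nfs', 'nfs4' and similar type names."""
    return fs_type is not None and fs_type.startswith("nfs")


if __name__ == "__main__":
    SOLR_DATA_DIR = "/var/solr/data"  # hypothetical location, adjust to taste
    if THP_ENABLED.exists():
        print("transparent hugepages:", selected_option(THP_ENABLED.read_text()))
    else:
        print("no transparent hugepage setting found at", THP_ENABLED)
    if MOUNTS.exists():
        fs_type = filesystem_type(SOLR_DATA_DIR, MOUNTS.read_text())
        print(SOLR_DATA_DIR, "is on", fs_type, "(NFS!)" if is_nfs(fs_type) else "")
```

If the first line prints anything other than "never", or the second one mentions NFS, the plan above has not fully taken effect on that machine.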
