|The leftmost month above is July 2010 and the rightmost is June 2011. The vertical axis shows availability for each month as a percentage. Our goal is to meet or exceed 99% availability each month.|
The main cause of downtime throughout fiscal year 2011 was defects in software. As we have been retooling our technology environment from Perl and CGI scripts to Java applications, we have been making greater use of packages like Hibernate and Lucene. My sense is that we're relying more and more on open source middleware, and while that has the advantage of making it easier to develop software quickly, it also means that a problem in the underlying middleware can affect our overall availability. Some of this is due to buggy software; some is due to our learning curve on how to use the software properly; and some is due to getting our arms around the optimal configuration and operation of these packages.
The January 2011 availability level, our lowest of the year, was due largely to two problems. One was that we scheduled a maintenance window in our server room so that University of Michigan electricians could wire up a new "whole room" uninterruptible power supply, and this, of course, took our production web systems off-line. The other was that the regular synchronization process between our production systems and our cloud-based replica had failed in an unusual way that was difficult to detect at first. The database export/import had failed, but only partially, and that produced very odd behavior in our web portal. It took a significant amount of time to isolate the problem, and by the time we had a workaround deployed, the electricians had finished their work and the production systems were back on-line.
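
To put the 99% monthly goal in concrete terms, here is a minimal Python sketch of the arithmetic. The function names are hypothetical and it assumes availability is measured against every hour in the calendar month; it is not necessarily how the figures in the chart were calculated.

```python
import calendar


def hours_in_month(year: int, month: int) -> int:
    """Total clock hours in the given calendar month."""
    days = calendar.monthrange(year, month)[1]
    return days * 24


def allowed_downtime_hours(year: int, month: int, target: float = 0.99) -> float:
    """Downtime a month can absorb while still meeting the target.

    Assumption: the target is a fraction of all hours in the month.
    """
    return hours_in_month(year, month) * (1 - target)


def availability_percent(year: int, month: int, downtime_hours: float) -> float:
    """Availability for the month, given total downtime in hours."""
    total = hours_in_month(year, month)
    return 100 * (total - downtime_hours) / total


if __name__ == "__main__":
    # A 31-day month has 744 hours, so 1% of it is 7.44 hours.
    print(round(allowed_downtime_hours(2011, 1), 2))  # 7.44
```

In other words, under this assumption a single extended outage of more than about seven and a half hours is enough to miss the goal for a 31-day month.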