Wednesday, July 6, 2011

ICPSR web portal availability in 2010-2011

It's that time again:  the end of another fiscal year.  And that means it is also time for my annual summary of ICPSR web portal availability.

In the availability chart, the leftmost month is July 2010 and the rightmost is June 2011.  The vertical axis shows availability for each month as a percentage.  Our goal is to meet or exceed 99% availability each month.

All in all, it was a pretty good year for ICPSR's production web portal.  Our web portal hosts many different sites (ICPSR proper, NACJD, NACDA, SAMHDA, DSDR, CCEERC, the ICPSR Summer Program, and many more).  We were able to exceed 99.75% availability most months, and had only two months (January and June 2011) where our level was a bit lower.
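
To give a feel for what these percentages mean in practice, here is a minimal sketch that converts minutes of downtime into a monthly availability percentage, and a target percentage into a downtime budget.  The function names and the 30-day month are assumptions made for illustration; this is not necessarily how ICPSR computes its own figures.

```python
def availability_pct(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Return the percentage of the month the service was up."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def downtime_budget_minutes(target_pct: float, days_in_month: int = 30) -> float:
    """Return the most downtime (in minutes) that still meets target_pct."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - target_pct) / 100.0


if __name__ == "__main__":
    # Downtime allowed in a 30-day month at two availability levels.
    for target in (99.0, 99.75):
        print(target, downtime_budget_minutes(target))
    # Availability for a hypothetical month with two hours of downtime.
    print(round(availability_pct(120), 2))
```

Under these assumptions, a 99% target leaves room for about 432 minutes (roughly 7 hours) of downtime in a 30-day month, while 99.75% leaves only about 108 minutes.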

The main cause of downtime throughout fiscal year 2011 was defects in software.  As we have been retooling our technology environment from Perl and CGI scripts to Java applications, we have been making greater use of systems like Hibernate and Lucene.  My sense is that we're relying more and more on open source middleware, and while that has the advantage of making it easier to develop software quickly, it also means that a problem in the underlying middleware can affect our overall availability.  Some of this is due to buggy software; some is due to our learning curve in using the software properly; and some is due to getting our arms around the optimal configuration and operation of these packages.

The January 2011 availability level, our lowest of the year, was due largely to two problems.  One was that we scheduled a maintenance window in our server room so that University of Michigan electricians could wire up a new "whole room" uninterruptible power supply, and this, of course, took our production web systems off-line.  The other problem was that the regular synchronization process between our production systems and our cloud-based replica had failed in an unusual way that was difficult to detect at first.  The database export/import had failed, but only partially, and that produced very odd behavior in the replica's web portal.  It took a significant amount of time to isolate the problem, and by the time we had a workaround deployed, the electricians had finished their work and the production systems were back on-line.
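
One generic way to catch a partial export/import like this is to compare per-table row counts between the source and the replica after each synchronization.  The sketch below is only an illustration of that idea, not a description of our actual process: it uses SQLite from the Python standard library as a stand-in database, and the helper names (`make_db`, `table_row_counts`, `find_mismatches`) and the sample tables are hypothetical.

```python
import sqlite3


def make_db(rows_by_table):
    """Build an in-memory SQLite database (stand-in for a real one)."""
    conn = sqlite3.connect(":memory:")
    for table, n_rows in rows_by_table.items():
        conn.execute(f'CREATE TABLE "{table}" (id INTEGER)')
        conn.executemany(f'INSERT INTO "{table}" VALUES (?)',
                         [(i,) for i in range(n_rows)])
    return conn


def table_row_counts(conn):
    """Map each table name to its row count."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    return {t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
            for t in tables}


def find_mismatches(source, replica):
    """Return {table: (source_count, replica_count)} where counts differ.

    A table missing on one side is reported with a count of None.
    """
    src, rep = table_row_counts(source), table_row_counts(replica)
    return {t: (src.get(t), rep.get(t))
            for t in sorted(set(src) | set(rep))
            if src.get(t) != rep.get(t)}


if __name__ == "__main__":
    source = make_db({"studies": 3, "users": 2})
    # Simulate a partial import: one table arrived empty.
    replica = make_db({"studies": 3, "users": 0})
    print(find_mismatches(source, replica))
```

Row counts are a coarse check and will not catch rows that were imported with corrupted contents, but they are cheap to run after every synchronization and would flag a table that arrived empty or truncated.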
