Ick.
This is not a good trend.
Our overall availability (i.e., the share of time when all components are working properly) sank below 99.5% again in October. The main culprit was a nearly two-hour period on October 12, 2011, when a series of common alerts turned out to have an uncommon cause. The on-call systems engineer went through our usual series of steps to bring the service back online, and while the steps seemed to help at first, it soon became clear that the fix was only temporary and that more diagnostic work was necessary. The timing was also inopportune: many of us were in transit between the office and home (and then back to the office again).
We also had a problem with our search engine technology (Solr) late in the month, and that contributed another 46 minutes to our unavailability. (Other components were working fine, but search was not.)
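To put those numbers in context, here is a rough sketch of the downtime budget that a 99.5% target leaves in a 31-day month. It assumes availability is measured in minutes over the calendar month and rounds the "nearly two hour" incident up to 120 minutes; both are my assumptions for illustration, and the function names are hypothetical, not part of any monitoring tool we use.

```python
# Rough downtime-budget arithmetic for October (31 days).
# Assumption: availability is computed per minute over the calendar month.
MINUTES_IN_OCTOBER = 31 * 24 * 60  # 44,640 minutes


def downtime_budget_minutes(target_pct, total_minutes=MINUTES_IN_OCTOBER):
    """Minutes of downtime allowed while still meeting target_pct."""
    return total_minutes * (100 - target_pct) / 100


def availability_pct(downtime_minutes, total_minutes=MINUTES_IN_OCTOBER):
    """Availability percentage given total downtime in minutes."""
    return 100 * (1 - downtime_minutes / total_minutes)


budget = downtime_budget_minutes(99.5)

# Assumption: "nearly two hours" is rounded up to 120 minutes.
two_incidents = 120 + 46

print(f"Budget at 99.5%: {budget:.1f} minutes")
print(f"Two named incidents: {two_incidents} minutes")
print(f"Availability from those alone: {availability_pct(two_incidents):.2f}%")
```

Under these assumptions the two incidents described here use up most of the month's budget but not all of it, so the remaining shortfall would have come from smaller outages not covered in this post.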
My apologies to those of you who were trying to get some work done on our site last month and got bitten by either of these problems.