Monday, April 25, 2011

For crying out cloud it has been an interesting weekend

As some folks may know, ICPSR makes pretty extensive use of Amazon's cloud service, and so it has been interesting, and even a little painful, to watch the recent multi-day outage play out.

The timing of the outage wasn't that bad for us.  To the extent we use the Amazon cloud to perform research as part of our NIH "cloud grant," losing access over a long weekend cost us very little.  At worst we lost one day (Friday) and one afternoon (Thursday) when we wouldn't have been able to do everything we might have wanted.

We also got lucky that our production web site didn't suffer any problems - and wasn't scheduled for any maintenance - this weekend.  We run our replica system in Amazon's cloud, and had we lost the main site here at ICPSR, we would have been in very bad shape.

We also keep an encrypted copy of our holdings in Amazon's cloud, and over the past few days we haven't been able to keep it sync'd with master copies from here.  But since we have so many copies in so many different locations, this wasn't all that worrisome.  In fact, if we synchronized content weekly instead of daily, and if we always performed the synchronization on, say, Wednesdays, the outage would have been a non-issue for this purpose.  (But, of course, if the weekly synchronization were performed on Saturdays instead of Wednesdays, we'd be drifting even more out of sync.)

The primary pain point was with a production web site we run exclusively in the cloud.  For all practical purposes this site was off-line from Friday morning (EDT) until the middle of the day on Saturday.  That's not good.  But it's also the case that the service was never designed for 24 x 7 production, and so it's not surprising that it could suffer an extended outage.  Building in fault-tolerance costs more.

The biggest lesson for us is that we need to make sure that some of our oldest, most long-lived instances are moved to EBS-backed instances.  This is something that we've been meaning to do for some time, and this serves as a reminder of why it would be good to make it a higher priority.

Friday, April 22, 2011

March deposits at ICPSR

March 2011 deposits at ICPSR:

# of files    # of deposits    File format
9             3                text/plain; charset=iso-8859-1
5             3                text/plain; charset=unknown
231           18               text/plain; charset=us-ascii
1             1                text/x-c; charset=us-ascii

Nothing too unusual here.  Lots of plain-text, SPSS, SAS, PDF, and MS Office formats (like Word and Excel).  Still need to update our system so that we bucket purported C source code files (text/x-c) as plain-text instead.
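
Here is a rough sketch of what that bucketing fix might look like.  It is only an illustration: the OVERRIDES table and the bucket_mime_type function are made-up names, not part of our actual deposit system, and it assumes the reported type arrives as a string like "text/x-c; charset=us-ascii".

```python
# Hypothetical override table: reported media type -> bucket we report it under.
# Only text/x-c is listed because it is the only misfire mentioned in this post.
OVERRIDES = {
    "text/x-c": "text/plain",
}


def bucket_mime_type(reported: str) -> str:
    """Return the reporting bucket for a MIME string, keeping any parameters."""
    # Split "type/subtype; charset=..." into the media type and its parameters.
    media_type, _, params = reported.partition(";")
    media_type = media_type.strip().lower()
    params = params.strip()
    # Re-bucket known false positives; leave everything else alone.
    media_type = OVERRIDES.get(media_type, media_type)
    return f"{media_type}; {params}" if params else media_type
```

With this in place, "text/x-c; charset=us-ascii" would be counted alongside the other "text/plain; charset=us-ascii" files, and types without an override pass through unchanged.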

Wednesday, April 20, 2011

TRAC: B3.4: Demonstrating effective preservation planning

B3.4 Repository can provide evidence of the effectiveness of its preservation planning.

The repository should be able to demonstrate the continued preservation, including understandability, of its holdings over a number of years, given the age of the repository and its holdings.

This could be evaluated at a number of degrees and depends on the specificity of the designated community(ies). If a designated community is fairly broad, an auditor could represent the test subject in the evaluation. More specific designated communities could require significant efforts. If judgment must be exercised as to whether adequate efforts have been made, it must be justified in detail.

Evidence: Collection of appropriate preservation metadata; proof of usability of randomly selected digital objects held within the system; demonstrable track record for retaining usable digital objects over time.

There are a couple of different stories we might tell related to this TRAC requirement.

One would be the "proof is in the pudding" story where we might assert that, of course, ICPSR has effective preservation planning:  We still preserve and distribute content that we originally collected decades ago.  In fact, if you spend enough time wandering around the less traveled portions of our collections, you'll likely find content which is very old, but which is still reasonably useful.  (It might be lacking formats for the modern stats packages, and possibly even setups, but there will always be documentation.)  This speaks to the "track record" reference above.

Another possible answer is to point to the regular audits we perform each week, where we check the digital signature of each item on disk against the signature recorded in a database.  This allows us to spot problems quickly, which is a really important factor in managing them well.  This doesn't really prove usability per se, but it does demonstrate that the content (plain text data and its documentation) has not become corrupted.
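
To make the idea of such an audit concrete, here is a minimal sketch.  It assumes SHA-256 digests and uses a plain dictionary as a stand-in for the database; the names file_digest and audit are hypothetical, and this is not the code we actually run.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    hasher = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def audit(expected: dict, root: Path) -> dict:
    """Compare files under root against recorded digests.

    expected maps a relative path to its recorded hex digest (assumed layout).
    Returns {relative_path: problem} for every missing or altered file.
    """
    problems = {}
    for rel_path, recorded in expected.items():
        target = root / rel_path
        if not target.is_file():
            problems[rel_path] = "missing"
        elif file_digest(target) != recorded:
            problems[rel_path] = "digest mismatch"
    return problems
```

An empty result means every recorded item is present and unchanged; anything else is a short list of items to investigate.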

And yet another possible answer is to address the collection of appropriate preservation metadata.  Each item (file) has a unique ID (reference); a digital signature (fixity); co-location with its related content (context); and a MIME type (representation).  And for all of our newer content post-dating the deposit system, we have very good provenance information.  (Older stuff is more of a mixed bag when it comes to provenance information, especially prior to the 1990s.)
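
One way to picture the per-file metadata described above is as a small record with one field per element.  This is just a sketch: the class name, field names, and the missing_elements helper are all invented for illustration and do not reflect our real schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PreservationRecord:
    """Hypothetical per-file record; field names are illustrative only."""
    file_id: str                      # reference: unique ID
    digest: str                       # fixity: digital signature of the content
    location: str                     # context: where it sits with related content
    mime_type: str                    # representation: e.g. "text/plain; charset=us-ascii"
    provenance: Optional[str] = None  # may be absent for older content


def missing_elements(record: PreservationRecord) -> List[str]:
    """List which metadata elements are empty for this record."""
    elements = {
        "reference": record.file_id,
        "fixity": record.digest,
        "context": record.location,
        "representation": record.mime_type,
        "provenance": record.provenance,
    }
    return [name for name, value in elements.items() if not value]
```

Running missing_elements over older holdings would be one simple way to see where the provenance gaps are.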

Thursday, April 14, 2011

Is it April already?

Things are really humming at ICPSR these days.  I've been so busy with a new grant application, the recent Council meeting, a major infrastructure project, and more that I've been neglecting the blog.  I hope to take care of that oversight very soon, and plan to queue up several more posts tomorrow.  But for now...

I gave a talk at ICPSR yesterday on "the cloud."  I used Prezi for creating and playing the "slides," and you can find a link to a copy of it here.  And I will also embed it here:

I used the iPad 2 to play the presentation, and I think it worked pretty well.  The Prezi iPad app does not support video or audio at this time, and that meant I could not include a few clips I had planned to use.  If the Prezi guys can fix this up, I think Prezi and the iPad will be my new main platform for giving talks.