Saturday, March 26, 2011

TRAC: B3.3: Updating plans in response to monitoring

B3.3 Repository has mechanisms to change its preservation plans as a result of its monitoring activities.

The repository must demonstrate or describe how it reacts to information from monitoring, which sometimes requires a repository to change how it deals with the material it holds in unexpected ways. Plans as simple as migrating from format X to format Y when the registries show that format X is no longer supported are not sufficiently flexible—other events may have made format Y a bad choice. The repository must be prepared for changes in the external environment that may make its current plan (to migrate from X to Y in 10 years) a bad choice as the time to implement draws near. The repository should be able to show that it can revise long-range plans in light of changing circumstances.

Another possible response to information gathered by monitoring is for the repository to create additional Representation Information and/or PDI.

Evidence: Preservation planning policies tied to formal or informal technology watch(es); preservation planning or processes that are timed to shorter intervals (e.g., not more than five years); proof of frequent preservation planning/policy updates.



My team at ICPSR supports this function, but is not directly responsible for preservation planning, so this post will be brief. 

In my experience, ICPSR has a robust preservation planning function that engages regularly with a wide array of communities, and so keeps a close eye on changes in what those communities need and on changes necessitated by technology.

I suppose the best evidence for this might be the major change we undertook a few years ago to move all digital content from a variety of legacy media to "spinning disk."  This was a major project, of course, but it allowed ICPSR to stop worrying about the availability of legacy media readers, and about media that could fail without quick detection.

February Deposits at ICPSR

February 2011 deposits (and their file formats) at ICPSR:

# of files | # of deposits | File format
11application/msoffice
28423application/msword
114application/octet-stream
10425application/pdf
75application/vnd.ms-excel
2410application/x-sas
8028application/x-spss
264application/x-stata
21application/x-stuffit
22application/x-zip
196message/rfc8220117bit
153text/html
92text/plain; charset=iso-8859-1
22text/plain; charset=unknown
12221text/plain; charset=us-ascii
95text/rtf
11text/x-c; charset=iso-8859-1
227text/x-c; charset=us-ascii
11text/xml

Nothing too exciting this month. 

[ I thought I had posted this weeks ago, but clearly not. ]

ICPSR Machine Room maintenance

I don't see an announcement about this on the main web site, so I'll drop a short note here....

The University of Michigan Plant Operations department is performing some electrical work in our machine room starting at 8am this morning.  The work will last some unknown number of hours; my guess is 4-6.

We have just diverted traffic from our production web site over to the replica in the cloud.  Our most common services, such as search, download, and on-line analysis, will be available.  But services which require our web site to "remember" something -- such as data deposit and Summer Program registration -- are not enabled in the replica.

I'll post a follow-up once the work has completed, and will check back here periodically for comments.  If you see something that isn't working as you think it should, please comment here, and we'll take a look.

----

10:22am EDT: We're aware that study search is not working.  (Variable search and bibliography search are working.)  ICPSR recently updated the search to include the full text of many types of documents (codebooks, survey instruments, user guides, etc.), and the index location changed too.  This change was not reflected on the cloud replica, and the index there is now building.  Our apologies for the inconvenience this may cause.

ETA for conclusion of the electrical work is between 11am and noon EDT.

----

11:18am EDT: The power work has been completed.  We've restored service to the main site.

Friday, March 11, 2011

TRAC: B3.2: Tracking format obsolescence

B3.2 Repository has mechanisms in place for monitoring and notification when Representation Information (including formats) approaches obsolescence or is no longer viable.

For most repositories, the concern will be with the Representation Information (including formats) used to preserve information, which may include information on how to deal with a file format or software that can be used to render or process it. Sometimes the format needs to change because the repository can no longer deal with it. Sometimes the format is retained and the information about what software is needed to process it needs to change.

In all cases, the repository must show that it has some active mechanism to warn of impending obsolescence. Obsolescence is determined largely in terms of the knowledge base of the designated community(ies). This requirement ensures that the preserved information remains understandable and usable by the designated community(ies). If the mechanism depends on an external registry, the repository must demonstrate how it uses the information from that registry.

Evidence: Subscription to a format registry service; subscription to a technology watch service; percentage of at least one staff member dedicated to monitoring technological obsolescence issues.



ICPSR has two very different stories regarding this requirement.

In terms of content that ICPSR has created from deposited materials, the story is very simple.  The archival holdings consist of data in plain text format, and related documentation in both PDF and TIFF image format.  This content can be used to generate more researcher-friendly formats, such as a SAS Transport file, and it is this latter content that we make available to our community.

For this content, obsolescence is easy to track and manage:  the pool of content types is very homogeneous.

In terms of content that ICPSR has received from depositors (researchers, federal agencies, news organizations, survey research centers), the content is extremely heterogeneous.  Our strategy here is to keep the original content, but only preserve it at the bit-level.  We also normalize the content into more durable formats, such as plain text, and those receive what we call "full preservation."  For example, if a researcher sends us technical documentation about a dataset in WordPerfect format, we'll keep it, preserve it at the bit-level, and then be prepared to discard it one day if WordPerfect files become unreadable.  However, soon after we receive the WordPerfect file, a data manager at ICPSR will transform the content into plain text, and that will be much more durable (although possibly lossy).  And as part of our data processing work, this same data manager will produce an ICPSR original documentation file - based heavily on the original WordPerfect, of course - in PDF format (which will also be imaged as TIFF).
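Bit-level preservation, as described above, usually comes down to being able to show that a stored file is still byte-for-byte what was deposited. Here is a minimal sketch in Python, assuming SHA-256 checksums were recorded at deposit time. The function names and the choice of hash are illustrative assumptions, not a description of ICPSR's actual tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_fixity(recorded):
    """Given {path: digest recorded at deposit}, return paths that are missing or changed."""
    failures = []
    for path, expected in recorded.items():
        # A file fails if it has disappeared or its current digest differs.
        if not Path(path).is_file() or sha256_of(path) != expected:
            failures.append(path)
    return sorted(failures)
```

Running something like `failed_fixity` on a schedule says nothing about whether a WordPerfect file can still be opened. It only confirms that the bits have not changed, which is exactly the limited promise that bit-level preservation makes.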

Checking the current format types in the repository is a simple database query.
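As an illustration only, here is roughly what such a query might look like. The table name `files` and its columns (`deposit_id`, `mime_type`) are invented for this sketch, and SQLite stands in for whatever database actually holds the inventory.

```python
import sqlite3


def format_inventory(conn):
    """Return (mime_type, file_count, deposit_count) rows, most common format first."""
    query = """
        SELECT mime_type,
               COUNT(*)                   AS file_count,
               COUNT(DISTINCT deposit_id) AS deposit_count
        FROM files
        GROUP BY mime_type
        ORDER BY file_count DESC, mime_type
    """
    return conn.execute(query).fetchall()


# Tiny in-memory example with made-up rows (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (deposit_id INTEGER, mime_type TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [
        (1, "application/pdf"),
        (1, "application/pdf"),
        (2, "application/pdf"),
        (2, "text/plain"),
    ],
)
for row in format_inventory(conn):
    print(row)
```

The same two counts per format (files, and distinct deposits containing them) are what the monthly deposit tables on this blog report.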

My sense is that the main TODO item for ICPSR is to follow the guidance in the Evidence section of this TRAC requirement, and assign some fraction of two different staff to monitoring and managing older formats.  A technology person can create automated jobs to check the repository and broadcast information about risky formats, and can also help write tools and one-time scripts to assist in their migration to newer formats (if needed).  A content person can help identify which content is truly useful, and which has important intellectual content that should receive top priority.  Also, because this latter person will better understand the meaning of the content, s/he will be poised to know if the migrated content does indeed capture the intellectual properties of the original.
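The automated job mentioned above could start out very small. The sketch below assumes a hand-maintained watch list of formats considered at risk and an inventory of per-format file counts. Both the list contents and the function name are hypothetical; a real job would pull the inventory from the repository database and send the report by e-mail or similar.

```python
def risky_format_report(inventory, watch_list):
    """Return report lines for formats present in both the inventory and the watch list.

    inventory  -- dict mapping MIME type to number of files held
    watch_list -- dict mapping MIME type to a short reason it is considered at risk
    """
    lines = []
    for mime_type in sorted(inventory):
        if mime_type in watch_list and inventory[mime_type] > 0:
            lines.append(
                f"{mime_type}: {inventory[mime_type]} file(s) - {watch_list[mime_type]}"
            )
    return lines


# Made-up example data; the reason is a placeholder, not an assessment of the format.
inventory = {"application/x-stuffit": 2, "text/plain": 40, "application/pdf": 10}
watch_list = {"application/x-stuffit": "example reason: few current readers"}

for line in risky_format_report(inventory, watch_list):
    print(line)
```

Deciding what belongs on the watch list, and whether a migrated copy preserves what matters, is still the content person's job. The script only makes sure nobody forgets to look.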

Friday, March 4, 2011

TRAC: B3.1: Documented preservation strategies

B3.1 Repository has documented preservation strategies.

A repository or archiving system must have current, sound, and documented preservation strategies. These will typically address the degradation of storage media, the obsolescence of media drives, and the obsolescence of Representation Information (including formats), safeguards against accidental or intentional digital corruption. For example, if migration is the chosen approach to some of these issues, there also needs to be policy on what triggers a migration and what types of migration are expected for the solution of each preservation issue identified.

Evidence: Documentation identifying each preservation issue and the strategy for dealing with that issue.



ICPSR documents this thoroughly on the Digital Curation portion of its web site.