Friday, March 11, 2011

TRAC: B3.2: Tracking format obsolescence

B3.2 Repository has mechanisms in place for monitoring and notification when Representation Information (including formats) approaches obsolescence or is no longer viable.

For most repositories, the concern will be with the Representation Information (including formats) used to preserve information, which may include information on how to deal with a file format or software that can be used to render or process it. Sometimes the format needs to change because the repository can no longer deal with it. Sometimes the format is retained and the information about what software is needed to process it needs to change.

In all cases, the repository must show that it has some active mechanism to warn of impending obsolescence. Obsolescence is determined largely in terms of the knowledge base of the designated community(ies). This requirement ensures that the preserved information remains understandable and usable by the designated community(ies). If the mechanism depends on an external registry, the repository must demonstrate how it uses the information from that registry.

Evidence: Subscription to a format registry service; subscription to a technology watch service; percentage of at least one staff member dedicated to monitoring technological obsolescence issues.

ICPSR has two very different stories regarding this requirement.

In terms of content that ICPSR has created from deposited materials, the story is very simple.  The archival holdings consist of data in plain text format, and related documentation in both PDF and TIFF image formats.  This content can be used to generate more researcher-friendly formats, such as a SAS Transport file, and it is this latter content that we make available to our community.

For this content, obsolescence is easy to track and manage:  the pool of content types is very homogeneous.

In terms of content that ICPSR has received from depositors (researchers, federal agencies, news organizations, survey research centers), the content is extremely heterogeneous.  Our strategy here is to keep the original content, but only preserve it at the bit-level.  We also normalize the content into more durable formats, such as plain text, and those receive what we call "full preservation."  For example, if a researcher sends us technical documentation about a dataset in WordPerfect format, we'll keep it, preserve it at the bit-level, and then be prepared to discard it one day if WordPerfect files become unreadable.  However, soon after we receive the WordPerfect file, a data manager at ICPSR will transform the content into plain text, and that will be much more durable (although possibly lossy).  And as part of our data processing work, this same data manager will produce an ICPSR original documentation file - based heavily on the original WordPerfect, of course - in PDF format (which will also be imaged as TIFF).

Checking the current format types in the repository is a simple database query.
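As a rough illustration of what such a query might look like, here is a minimal sketch in Python using SQLite.  The table name (files), its columns (path, format), and the sample rows are all hypothetical, invented for this example; they are not ICPSR's actual schema or holdings.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical 'files' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, format TEXT NOT NULL)")
    # Made-up sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO files (path, format) VALUES (?, ?)",
        [
            ("study1/data.txt", "text/plain"),
            ("study2/data.txt", "text/plain"),
            ("study1/codebook.pdf", "application/pdf"),
            ("study1/codebook.tif", "image/tiff"),
            ("study1/original.wpd", "application/vnd.wordperfect"),
        ],
    )
    return conn


def format_counts(conn):
    """Return (format, file_count) rows, most common format first."""
    return conn.execute(
        "SELECT format, COUNT(*) AS n FROM files "
        "GROUP BY format ORDER BY n DESC, format"
    ).fetchall()


if __name__ == "__main__":
    for fmt, n in format_counts(build_demo_db()):
        print(f"{n:6d}  {fmt}")
```

The point is only that a single GROUP BY over a format column is enough to produce the inventory; a real repository would likely record more detail per file, such as format version.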

My sense is that the main TODO item for ICPSR is to follow the guidance in the Evidence section of this TRAC requirement, and assign some fraction of the time of two different staff members to monitoring and managing older formats.  A technology person can create automated jobs to check the repository and broadcast information about risky formats, and can also help write tools and one-time scripts to assist in their migration to newer formats (if needed).  A content person can help identify which content is truly useful, and which has important intellectual content that should receive top priority.  Also, because this latter person will better understand the meaning of the content, s/he will be poised to know if the migrated content does indeed capture the intellectual properties of the original.
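The automated check described above might be sketched along these lines in Python.  Everything here is an assumption made for illustration: the watch list is a hand-maintained dictionary with made-up entries (a real one might be fed by a format registry or technology watch service), and the "broadcast" step is reduced to building the text of a report that a scheduled job could then mail or post.

```python
# Hypothetical watch list: format identifier -> reason it is considered at risk.
# The entry is illustrative, not an assessment of any real format.
WATCH_LIST = {
    "application/vnd.wordperfect": "fewer tools can open this format",
}


def risky_formats(counts, watch_list):
    """Return one report line per at-risk format actually held.

    counts: iterable of (format, file_count) pairs, e.g. from a database query.
    watch_list: dict mapping format identifiers to a short reason.
    """
    return [
        f"{fmt}: {n} file(s), {watch_list[fmt]}"
        for fmt, n in counts
        if fmt in watch_list and n > 0
    ]


def build_report(counts, watch_list):
    """Assemble the text a scheduled job could send to staff."""
    lines = risky_formats(counts, watch_list)
    if not lines:
        return "No at-risk formats found."
    return "At-risk formats in the repository:\n" + "\n".join(lines)


if __name__ == "__main__":
    sample = [("text/plain", 2), ("application/vnd.wordperfect", 1)]
    print(build_report(sample, WATCH_LIST))
```

Deciding which formats belong on the watch list, and which flagged content deserves migration first, remains a judgment call for the content person rather than something the script can settle.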
