Friday, January 14, 2011

TRAC: B2.9: Collecting Preservation Metadata

B2.9 Repository acquires preservation metadata (i.e., PDI) for its associated Content Information.

Preservation metadata (PDI) is needed not only by the repository to help ensure the Content Information is not corrupted (Fixity) and is findable (Reference Information), but to help ensure the Content Information is adequately understandable by providing a historical perspective (Provenance Information) and by providing relationships to other information (Context Information). The extent of such information needs is best addressed by members of the designated community(ies). The PDI must be permanently associated with Content Information.

Evidence: Viewable records in local format registry (with persistent links to digital objects); local metadata registry(ies); database records that include Representation Information and a persistent link to relevant digital objects.

My sense is that ICPSR is in pretty good shape on this item.  We have an entire database schema devoted to our objects in Archival Storage, and we collect (and store) a great deal of information about content, starting when a depositor first gives it to us (e.g., an MD5 checksum, the identity of the depositor, the file format) and extending throughout the ingest process.  This information is readily available to internal staff through a no-frills but extremely useful database browser that someone on my team built several years ago.
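To make the deposit-time capture concrete, here is a minimal sketch of recording those three items (checksum, depositor, file format) for a single file. It is an illustration only, not ICPSR's actual code: the DepositRecord class, its field names, and the record_deposit function are hypothetical, and the format guess relies on the file extension alone, which a real system would likely replace with a proper format identification tool.

```python
import hashlib
import mimetypes
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class DepositRecord:
    """Hypothetical PDI record captured when a file is first received."""
    filename: str
    md5: str           # Fixity Information
    depositor: str     # Provenance Information
    file_format: str   # best-guess MIME type, from the extension only
    received_at: str   # Provenance Information (UTC, ISO 8601)


def md5_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_deposit(path: Path, depositor: str) -> DepositRecord:
    """Build a deposit-time record for one file (illustrative only)."""
    guessed, _ = mimetypes.guess_type(path.name)
    return DepositRecord(
        filename=path.name,
        md5=md5_of(path),
        depositor=depositor,
        file_format=guessed or "application/octet-stream",
        received_at=datetime.now(timezone.utc).isoformat(),
    )
```

Calling record_deposit(Path("MySurvey.xls"), "Jane Depositor") would return one record that could then be written to whatever table holds deposit metadata; the stored checksum can be recomputed later to confirm the file has not changed.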

One thing that we don't do today is annotate original content at the file level with information about its role; I think this would fall within the scope of Context Information above.  For instance, if someone deposits a file called MySurvey.xls, we do not add a comment that says, "Oh, this is survey data."  I've been participating in meetings where this particular issue is getting some discussion, and my expectation is that sometime this year we (in the IT shop) will need to build some simple extensions to existing systems to implement features like this.

One bit of Context Information that we do collect (at the end of ingest) is a mapping at the aggregate level between original submissions to ICPSR (i.e., deposits) and the content we ingest and deliver (i.e., studies).
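A deposit-to-study mapping of this kind is naturally many-to-many: one deposit can feed several studies, and one study can draw on several deposits. The sketch below shows one way such a mapping could be stored and queried. The table and column names (deposit, study, deposit_study) are assumptions made for illustration and do not describe our actual schema; SQLite is used only so the example is self-contained.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical mapping schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE deposit (
            deposit_id INTEGER PRIMARY KEY,
            depositor  TEXT NOT NULL
        );
        CREATE TABLE study (
            study_id INTEGER PRIMARY KEY,
            title    TEXT NOT NULL
        );
        -- Link table: one row per (deposit, study) pair.
        CREATE TABLE deposit_study (
            deposit_id INTEGER NOT NULL REFERENCES deposit(deposit_id),
            study_id   INTEGER NOT NULL REFERENCES study(study_id),
            PRIMARY KEY (deposit_id, study_id)
        );
        """
    )
    return conn


def link(conn: sqlite3.Connection, deposit_id: int, study_id: int) -> None:
    """Record that a deposit contributed content to a study."""
    conn.execute(
        "INSERT INTO deposit_study (deposit_id, study_id) VALUES (?, ?)",
        (deposit_id, study_id),
    )


def deposits_for_study(conn: sqlite3.Connection, study_id: int) -> list[int]:
    """Return the IDs of all deposits that fed a given study."""
    rows = conn.execute(
        "SELECT deposit_id FROM deposit_study "
        "WHERE study_id = ? ORDER BY deposit_id",
        (study_id,),
    )
    return [row[0] for row in rows]


def studies_for_deposit(conn: sqlite3.Connection, deposit_id: int) -> list[int]:
    """Return the IDs of all studies that drew on a given deposit."""
    rows = conn.execute(
        "SELECT study_id FROM deposit_study "
        "WHERE deposit_id = ? ORDER BY study_id",
        (deposit_id,),
    )
    return [row[0] for row in rows]
```

Keeping the mapping in its own link table means it can be filled in at the end of ingest, once the set of studies is known, without touching the deposit or study records themselves.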

