Friday, October 29, 2010

TRAC: B2.3: SIPs to AIPs

B2.3 Repository has a description of how AIPs are constructed from SIPs.

The repository must be able to show how the preserved object is constructed from the object initially submitted for preservation. In some cases, the AIP and SIP will be almost identical apart from packaging and location, and the repository need only state this. More commonly, complex transformations (e.g., data normalization) may be applied to objects during the ingest process, and a precise description of these actions (i.e., preservation metadata) may be necessary to ensure that the preserved object represents the information in the submitted object. The AIP construction description should include documentation that gives the provenance of the ingest process for each SIP to AIP transformation, typically consisting of an overview of general processing being applied to all such transformations, augmented with description of different classes of such processing and, when applicable, with special transformations that were needed.

Some repositories may need to produce these complex descriptions case by case, in which case diaries or logs of actions taken to produce each AIP will be needed. In these cases, documentation needs to be mapped to individual AIPs, and the mapping needs to be available for examination. Other repositories that can run a more production-line approach may have a description of how each class of incoming object is transformed to produce the AIP. It must be clear which definition applies to which AIP. If, to take a simple example, two separate processes each produce a TIFF file, it must be clear which process was applied to produce a particular TIFF file.

Evidence: Process description documents; documentation of SIP relationship to AIP; clear documentation of how AIPs are derived from SIPs; documentation of standard/process against which normalization occurs; documentation of normalization outcome and how outcome is different from SIP.

Note:  This particular area is under active discussion at ICPSR.  The commentary below describes current processes in place at ICPSR, but these processes are likely to change in the future.

Content enters archival storage at two different points in the lifecycle of a "data processing project" at ICPSR.

When a Deposit is signed, its payload of content (its files) enters archival storage for bit-level preservation.  The system generates a unique ID for each deposited file and also records each file's MIME type, digital signature (an MD5 hash), original name (sanitized to avoid SQL injection attacks and similar problems), and the date it was received.
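To make the per-file record concrete, here is a minimal sketch of what that capture step might look like. This is illustrative only, not ICPSR's actual code: the names `DepositFile`, `sanitize_name`, and `register_deposit_file` are invented for the example, and it uses Python's standard library for the MD5 signature and MIME-type guess.

```python
# Hypothetical sketch of per-file metadata capture at deposit time.
# DepositFile, sanitize_name, and register_deposit_file are illustrative
# names, not ICPSR's actual implementation.
import hashlib
import mimetypes
import re
import uuid
from dataclasses import dataclass
from datetime import date


def sanitize_name(name: str) -> str:
    """Replace characters that could enable SQL injection or path tricks."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


@dataclass
class DepositFile:
    file_id: str        # system-generated unique ID
    original_name: str  # original filename, sanitized
    mime_type: str      # guessed MIME type
    md5: str            # digital signature
    received: str       # date received, ISO format


def register_deposit_file(name: str, data: bytes) -> DepositFile:
    """Record the metadata kept for each file in a deposit's payload."""
    return DepositFile(
        file_id=uuid.uuid4().hex,
        original_name=sanitize_name(name),
        mime_type=mimetypes.guess_type(name)[0] or "application/octet-stream",
        md5=hashlib.md5(data).hexdigest(),
        received=date.today().isoformat(),
    )
```

The MD5 signature is what later lets the repository demonstrate bit-level fixity for the submitted object, independent of any transformations applied downstream.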

A cognizant archive manager assigns each new deposit to a data manager.  In the easiest, most trivial case, a data manager may package the submission for long-term preservation and release on the ICPSR web site with little added work.  The data manager packages the content into an ICPSR "study" object, collaborates with others at ICPSR to author descriptive and preservation metadata, and performs a series of quality control checks, some of which are automated.  Workflow tools record major milestones in the life of the project, and the data manager creates an explicit linkage between deposit and study for future reference.  The system also assigns a unique ID to each of these "processed" files and captures metadata such as digital signature and MIME type.
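The deposit-to-study linkage described above could be sketched as follows. Again this is a hypothetical illustration, not ICPSR's workflow tooling: `StudyLinkage` and `link_deposit` are invented names, and the point of the sketch is that the mapping is kept at the study level, not per file.

```python
# Hypothetical sketch of the aggregate-level deposit-to-study linkage.
# StudyLinkage and link_deposit are illustrative names, not ICPSR's code.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StudyLinkage:
    """Records which deposits were packaged into a given study.

    Note the granularity: deposits map to studies as wholes; no mapping
    from an individual deposit file to an individual study file is kept.
    """
    study_id: str
    deposit_ids: List[str] = field(default_factory=list)


def link_deposit(linkages: Dict[str, StudyLinkage],
                 study_id: str, deposit_id: str) -> None:
    """Create the explicit deposit-to-study linkage for future reference."""
    entry = linkages.setdefault(study_id, StudyLinkage(study_id))
    if deposit_id not in entry.deposit_ids:
        entry.deposit_ids.append(deposit_id)
```

This structure is what supports the strong aggregate-level documentation described below, while leaving file-level provenance implicit.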

Thus, at the aggregate level, ICPSR collects strong documentation mapping submission objects to archival objects, but the documentation is much weaker, and often absent, at the more detailed level of files.  For example, there is no explicit mapping between deposit file X and study file Y.
