Technology at ICPSR: TRAC: B2.6: Matching unique SIP IDs to AIP IDs

B2.6 If unique identifiers are associated with SIPs before ingest, the repository preserves the identifiers in a way that maintains a persistent association with the resultant archived object (e.g., AIP).

SIPs will not always contain unique identifiers when the repository receives them. But where they do, and particularly where those identifiers were widely known before the objects were ingested, it is important that they are either retained as is, or that some mechanism allows the original identifier to be transformed into one used by the repository.

For example, consider an archival repository whose SIPs consist of file collections from electronic document management systems (EDMS). Each incoming SIP will contain a unique identifier for each file within the EDMS, which may just be the pathname to the file. The repository cannot use these as they stand, since two different collections may contain files with the same pathname. The repository may generate unique identifiers by qualifying the original identifier in some way (e.g., refixing the pathname with a unique ID assigned to the SIP of which it was a part). Or it may simply generate new unique numeric identifiers for every file in each SIP. If it qualifies the original identifier, it must explain the scheme it uses. If it generates entirely new identifiers, it will probably need to maintain a mapping between original IDs and generated IDs, perhaps using object-level metadata.

Documentation must show the policy on handling the unique identification of SIP components as the objects to be preserved are ingested, preserved, and disseminated. Where special handling is required, this must be documented for each SIP as a part of the provenance information capture (see B2.3).

Evidence: Workflow documents and evidence of traceability (e.g., SIP identifier embedded in AIP, mapping table of SIP IDs to AIPs).

My sense is that ICPSR is in pretty good shape on this requirement. Here is a stab at the documentation.

ICPSR collects the original file name (sanitized to avoid problems via SQL injection and other attacks) for each item at the time of deposit. Each item also receives a unique ID in our deposit tracking system. And the deposit event itself also receives a unique ID which is also exposed to the depositor.

During the data curation process an ICPSR data manager will mint a new container-level object for access purposes (i.e., a study), and will fill that container with materials derived from the deposit. This activity can run the gamut from a one-to-one mapping between deposited files and released files to something far more complex. In any event ICPSR records and preserves the mapping between the aggregate-level objects, deposits and studies. ICPSR also has plans to record and preserve the mapping at the more fine-grained level of each file.

Technology at ICPSR

Friday, November 19, 2010

TRAC: B2.6: Matching unique SIP IDs to AIP IDs

No comments:

Post a Comment