Friday, June 3, 2011

TRAC: B4.4: Monitoring integrity

B4.4 Repository actively monitors integrity of archival objects (i.e., AIPs).

In OAIS terminology, this means that the repository must have Fixity Information for AIPs and must make some use of it. At present, most repositories deal with this at the level of individual information objects by using a checksum of some form, such as MD5. In this case, the repository must be able to demonstrate that the Fixity Information (checksums, and the information that ties them to AIPs) is stored separately or protected separately from the AIPs themselves, so that someone who can maliciously alter an AIP would not likely be able to alter the Fixity Information as well. A repository should have logs that show this check being applied and an explanation of how the two classes of information are kept separate.

AIP integrity also needs to be monitored at a higher level, ensuring that all AIPs that should exist actually do exist, and that the repository does not hold AIPs it is not meant to have. Checksum information alone cannot demonstrate this.

Evidence: Logs of fixity checks (e.g., checksums); documentation of how AIPs and Fixity information are kept separate.



ICPSR calculates a fingerprint (via MD5) on each object as it enters archival storage.  This fingerprint is stored in a relational database, and is keyed to the location of the object.  The object itself is stored in a conventional Linux filesystem.  The filesystem and the relational database's storage area are on physically different systems.
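
For illustration, a minimal Python sketch of this kind of ingest-time fingerprinting might look like the following. The MD5 computation mirrors what is described above; the sqlite3 database, the "fixity" table and column names, and the example path are stand-ins for this sketch only (ICPSR's actual relational database lives on separate hardware from the object filesystem, as noted).

    import hashlib
    import sqlite3

    def md5_fingerprint(path, chunk_size=1 << 20):
        """Compute the MD5 digest of a file, reading it in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def record_fingerprint(db, path):
        """Store the fingerprint keyed to the object's location."""
        db.execute(
            "INSERT INTO fixity (location, md5) VALUES (?, ?)",
            (path, md5_fingerprint(path)),
        )
        db.commit()

    # Register an object as it enters archival storage.
    # (Table, columns, and path are illustrative; in practice the database
    # would run on a different system than the object filesystem.)
    db = sqlite3.connect("fixity.db")
    db.execute("CREATE TABLE IF NOT EXISTS fixity (location TEXT PRIMARY KEY, md5 TEXT)")
    record_fingerprint(db, "/archive/objects/study-1234/data.dta")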

A weekly job runs through the list of objects in the database, and it compares the stored fingerprint to one it calculates (on the fly) for the object.  If there is a mismatch, the job reports an error.

Regardless of whether or not there is an error, the weekly job sends a status report.
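
A weekly check of this kind might look roughly like the sketch below: it walks the (assumed) fixity table, recomputes each object's MD5 on the fly, compares it to the stored value, and produces a status report whether or not any mismatches were found. Printing the report stands in for however the real job delivers it; none of the identifiers here come from ICPSR's actual code.

    import hashlib
    import sqlite3

    def md5_fingerprint(path, chunk_size=1 << 20):
        """Compute the MD5 digest of a file, reading it in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def weekly_fixity_check(db):
        """Recompute each object's MD5 and compare it to the stored value."""
        checked, errors = 0, []
        for location, stored_md5 in db.execute("SELECT location, md5 FROM fixity"):
            checked += 1
            try:
                current = md5_fingerprint(location)
            except OSError as exc:          # object missing or unreadable
                errors.append(f"{location}: {exc}")
                continue
            if current != stored_md5:
                errors.append(f"{location}: stored {stored_md5}, computed {current}")
        return checked, errors

    # A status report goes out regardless of whether errors were found.
    db = sqlite3.connect("fixity.db")
    checked, errors = weekly_fixity_check(db)
    report = [f"Fixity check: {checked} objects checked, {len(errors)} errors."]
    report.extend(errors)
    print("\n".join(report))   # in practice this would be mailed to the administrators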

In the case of an error, a systems administrator investigates almost immediately.  (We check fixity over the weekend, so a problem might not be investigated until Monday morning.)  The most common error we have seen so far has been due to errors in the ingest process.  A typical scenario is that a data manager submits an object for ingest, and the job fails for some reason, leaving the database and the object store out of sync.  A typical remedy is that a systems administrator isolates and corrects the fault (or the data manager corrects the content), and the job is resubmitted.  If any partial content (in either the database or the object store) has gotten stuck, the systems administrator clears it.  In the most typical case, the systems administrator discards the "stuck" content, and the data manager resubmits the objects for ingest.
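
Both the requirement's higher-level check (everything recorded actually exists, and nothing unrecorded is sitting in archival storage) and the "out of sync" ingest failures described above come down to reconciling the database against the filesystem. The sketch below is hypothetical rather than a description of ICPSR's tooling, and it assumes the same illustrative fixity table and an archive rooted at /archive/objects.

    import os
    import sqlite3

    def reconcile(db, archive_root):
        """Compare the fixity database against the archival filesystem."""
        recorded = {row[0] for row in db.execute("SELECT location FROM fixity")}
        on_disk = set()
        for dirpath, _dirnames, filenames in os.walk(archive_root):
            for name in filenames:
                on_disk.add(os.path.join(dirpath, name))
        missing = recorded - on_disk    # in the database but not on disk
        orphans = on_disk - recorded    # on disk but never registered
        return missing, orphans

    db = sqlite3.connect("fixity.db")
    missing, orphans = reconcile(db, "/archive/objects")
    for path in sorted(missing):
        print("MISSING:", path)
    for path in sorted(orphans):
        print("ORPHAN:", path)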
