Wednesday, October 28, 2009

TRAC: C1.5: Detecting corruption and loss

C1.5 Repository has effective mechanisms to detect bit corruption or loss.

The repository must detect data loss accurately to ensure that any losses fall within the tolerances established by policy (see A3.6). Data losses must be detected and detectable regardless of the source of the loss. This applies to all forms and scope of data corruption, including missing objects and corrupt or incorrect or imposter objects, corruption within an object, and copying errors during data migration or synchronization of copies. Ideally, the repository will demonstrate that it has all the AIPs it is supposed to have and no others, and that they and their metadata are uncorrupted.

The approach must be documented and justified and include mechanisms for mitigating such common hazards as hardware failure, human error, and malicious action. Repositories that use well-recognized mechanisms such as MD5 signatures need only recognize their effectiveness and role within the overall approach. But to the extent the repository relies on homegrown schemes, it must provide convincing justification that data loss and corruption are detected within the tolerances established by policy.

Data losses must be detected promptly enough that routine systemic sources of failure, such as hardware failures, are unlikely to accumulate and cause data loss beyond the tolerances established by the repository’s policy or specified in any relevant deposit agreement. For example, consider a repository that maintains a collection on identical primary and backup copies with no other data redundancy mechanism. If the media of the two copies have a measured failure rate of 1% per year and failures are independent, then there is a 0.01% chance that both copies will fail in the same year. If a repository’s policy limits loss to no more than 0.001% of the collection per year, with a goal of course of losing 0%, then the repository would need to confirm media integrity at least every 72 days to achieve an average time to recover of 36 days, or about one tenth of a year. This simplified example illustrates the kind of issues a repository should consider, but the objective is a comprehensive treatment of the sources of data loss and their real-world complexity. Any data that is (temporarily) lost should be recoverable from backups.
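The arithmetic in that example can be made concrete. This is a minimal sketch using only the numbers given above, under the same independence assumption:

```python
# Worked version of the two-copy example above (a sketch; all figures
# come from the example, and media failures are assumed independent).
annual_failure_rate = 0.01                        # each copy: 1% per year
both_fail_same_year = annual_failure_rate ** 2    # 0.0001, i.e. 0.01%

policy_limit = 0.00001                            # lose <= 0.001% per year
check_interval_days = 72
avg_exposure_years = (check_interval_days / 2) / 365  # ~36 days, ~0.1 year

# Data is lost only if the second copy fails while the first failure
# is still undetected and unrepaired:
expected_annual_loss = annual_failure_rate * (annual_failure_rate
                                              * avg_exposure_years)

print(expected_annual_loss)   # ~9.9e-06, just within the 0.001% limit
```

Checking less often than every 72 days would widen the exposure window and push the expected loss above the policy limit.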

Evidence: Documents that specify bit error detection and correction mechanisms used; risk analysis; error reports; threat analyses.

For each object in Archival Storage, ICPSR computes an MD5 hash. This "fingerprint" is then stored as metadata for each object.
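A fingerprint like this can be computed with Python's standard hashlib module. This is an illustrative sketch, not ICPSR's actual code, and the function name is mine:

```python
import hashlib

def md5_fingerprint(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading in 1 MB chunks
    so that large archival objects never need to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The hex digest is what gets stored alongside the object's other metadata for later comparison.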

Automated jobs "prowl" Archival Storage on a regular basis, computing the current MD5 hash for each object and comparing it to the stored version. Where the hashes differ, an exception is generated, and this information is reported to the appropriate staff for diagnosis and correction.
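A prowl job of this kind might be sketched as follows. This is hypothetical; the object IDs, paths, and stored-hash mapping are illustrative, not ICPSR's actual schema:

```python
def audit(objects, compute_hash):
    """Compare each object's current hash to its stored fingerprint.

    `objects` maps an object ID to a (path, stored_md5) pair;
    `compute_hash` takes a path and returns the current digest.
    Returns the IDs whose hashes differ -- the exceptions to report.
    """
    exceptions = []
    for obj_id, (path, stored_md5) in objects.items():
        if compute_hash(path) != stored_md5:
            exceptions.append(obj_id)
    return exceptions
```

In a real deployment the exception list would feed whatever reporting channel notifies staff for diagnosis and correction.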

In practice we see very few such exceptions, and the most common cause is a blend of human error and software that fails to handle the error gracefully.

Recovery is quick. If the problem was caused by human error and the file's ctime (change time) timestamp has changed, then any copies managed via rsync may also be damaged, and we instead need to fetch the original object from a different source (e.g., tape or a copy managed via SRB's Srsync). If the problem occurred without the ctime changing, then we also have the option of fetching a pristine copy from one of our rsync-managed copies.
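That decision rule might be sketched like this. The function and source names are hypothetical, and a real recovery procedure would also consult mtime and the audit logs:

```python
import os

def recovery_source(path, last_verified_ctime):
    """Choose a restore source per the rule above (hypothetical names).

    If the ctime changed since the last verified audit, rsync may have
    propagated the damage to the mirrored copies, so restore from an
    independent source (tape, or a copy managed via SRB's Srsync);
    otherwise an rsync-managed copy should still be intact.
    """
    if os.stat(path).st_ctime > last_verified_ctime:
        return "tape-or-srb"
    return "rsync-copy"
```

The key design point is that rsync-managed mirrors are only trustworthy when the damage could not have been replicated to them.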


  1. Thank you again for your blog, Bryan.

    What you describe as the ICPSR precautions seem to address well the problem of files on disks or tapes becoming corrupted because of hardware failure.

    But I wonder whether this sufficiently covers the problem of "data loss" in the eyes of the users. For instance, when a dataset for some reason is not indexed in a search database with metadata, the dataset effectively becomes 'lost'.

    Another point is malicious action. I am not a security expert, but as far as I understand it, the MD5 checksum is suitable for detecting hardware failure, but not for detecting malicious action. It is possible nowadays for a malicious person to generate a file with the same size and MD5 checksum but different content. Or someone who has access to the data and to the checksums can update both at the same time.

    These are just examples. A further risk analysis of "data loss" is needed.

  2. Thanks for the interesting comments.

    "Effective loss" v. "actual loss" is an interesting angle. In addition to the scenario you describe, one could also imagine a case where an object becomes lost in this fashion due to even the most prosaic causes, such as a typo in metadata tags (e.g., "crmie" v. "crime"), and so is hard (or impossible) to find.

    I'm also struggling to determine whether this sort of loss - perhaps described as Access loss v. Archival Storage loss? - is within the scope of TRAC. If an object is easily retrieved by its unique ID or through a crawl of Archival Storage, is it OK in the context of C1.6?

    The malicious action point is very well taken.

    Thanks again for commenting!

