Wednesday, November 4, 2009

TRAC: C1.6: Reporting and repairing loss

C1.6 Repository reports to its administration all incidents of data corruption or loss, and steps taken to repair/replace corrupt or lost data.

Having effective mechanisms to detect bit corruption and loss within a repository system is critical, but is only one important part of a larger process. As a whole, the repository must record, report, and repair as possible all violations of data integrity. This means the system should be able to notify system administrators of any logged problems. These incidents, recovery actions, and their results must be reported to administrators and should be available.

For example, the repository should document procedures to take when loss or corruption is detected, including standards for measuring the success of recoveries. Any actions taken to repair objects as part of these procedures must be recorded. The nature of this recording must be documented by the repository, and the information must be retrievable when required. This documentation plays a critical role in the measurement of the authenticity and integrity of the data held by the repository.

Evidence: Preservation metadata (e.g., PDI) records; comparison of error logs to reports to administration; escalation procedures related to data loss.

My sense is that this requirement is just about policy as it is process. Fortunately for our data holdings (but unfortunately for TRAC preparation), data loss or corruption is a very infrequent event, and therefore as one might expect, the set of policies and written processes is pretty small.

As a point of comparison, if we look at our policies and processes for handling loss with "working files" we will find a much richer set of policies and systems. We have established infrastructure (an EMC Network Attached Storage (NAS) storage applicance and associated Dell tape management solution); we have internal policies and processes that document how to retrieve lost content; we have external policies that describe which parts of the NAS are written to tape, and the schedule of tape backups; and, we exercise the system on a regular basis as people inadvertently delete or damage files with which they are working actively.

On the Archival Storage side - or even the Access side, where we also look for loss and corruption - the number of data loss or data corruption events is very, very low. Email reports come out on a regular basis, but they always (almost) say that everything is fine. And on that rare occasion where there is an issue, the remedy is quick.

Perhaps the right solution here is to use the small sample of issues that have arisen over the years as our baseline for writing up a process, and then posting that process on our internal web site. That would be easy to do. But then the concern is this: If a policy is used very, very infrequently, it is likely to fall into disrepair. It is also likely to become forgotten. Maybe the tool that examines for loss or corruption should also contain a link to the relevant policies and recovery processes?

What strategies have others used to address this TRAC requirement?

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.