Wednesday, October 14, 2009

TRAC: C1.3: Managing all objects

C1.3 Repository manages the number and location of copies of all digital objects.

The repository system must be able to identify the number of copies of all stored digital objects, and the location of each object and its copies. This applies to what are intended to be identical copies, not to versions or variants of objects. The location must be described such that the object can be located precisely, without ambiguity. It can be an absolute physical location or a logical location within a storage medium or a storage subsystem. One way to test this would be to look at a particular object and ask how many copies there are, what they are stored on, and where they are. A repository can have different policies for different classes of objects, depending on factors such as the producer, the information type, or its value. Some repositories may have only one copy (excluding backups) of everything, stored in one place, though this is definitely not recommended. There may be additional identification requirements if the data integrity mechanisms use alternative copies to replace failed copies.

Evidence: random retrieval tests; system test; location register/log of digital objects compared to the expected number and location of copies of particular objects.



Our story here is a mixed bag of successes and barriers.

For the master copy of any object we can easily and quickly specify its location. And for the second (tape) copy, we can also specify the location easily, as long as we're not too specific. For example, we can point to the tape library and say, "It's in there." And, of course, with a little more work, we can use our tape management system to point us to the specific tape, and to the location on that tape. Maintaining this information outside of the tape management system would be expensive, and it's not clear that there would be any real benefit.

The location of other copies can be derived easily, but those specific locations are not recorded in a database. For example, let's say that the master copy of every original deposit we have is stored in a filesystem hierarchy like /archival-storage/deposits/deposit-id/. And let's say that on a daily basis we synchronize that content via rsync to an off-site location, say, remote-location.icpsr.umich.edu:/archival-storage/deposits/deposit-id/. And let's also say that someone reviews the output of each daily rsync run and performs random spot-checks from time to time.
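To make the example concrete, here is a rough sketch of what such a nightly job might look like, using the hypothetical paths and host from above; this is an illustration, not our actual production script.

    #!/usr/bin/env python3
    # Sketch of the hypothetical nightly mirror job described above.
    # Paths, host, and log file are illustrative, not our real ones.
    import datetime
    import subprocess

    SRC = "/archival-storage/deposits/"
    DEST = "remote-location.icpsr.umich.edu:/archival-storage/deposits/"
    LOG = "/var/log/deposit-mirror.log"

    def mirror():
        """Run rsync and append its output to a log for the daily review."""
        result = subprocess.run(["rsync", "-av", SRC, DEST],
                                capture_output=True, text=True)
        with open(LOG, "a") as log:
            log.write("=== %s (rsync exit %d) ===\n"
                      % (datetime.datetime.now().isoformat(), result.returncode))
            log.write(result.stdout)
            log.write(result.stderr)
        return result.returncode

    if __name__ == "__main__":
        raise SystemExit(mirror())

The daily review then amounts to reading the log; anything other than a clean exit gets a closer look.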

In this scenario we might have a large degree of confidence that we could find a copy of any given deposit at that off-site location. We know it's there because rsync told us it put it there. But we don't have a central catalog that says that deposit #1234 is stored under /archival-storage/deposits/1234, on tape, and at remote-location.icpsr.umich.edu:/archival-storage/deposits/1234. One could build exactly such a catalog, of course, and then create the process to keep it up to date; perhaps nothing more than a wrapper around rsync that captures the output and updates the catalog. But would it have much value?

Probably not.
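Still, for concreteness, here is a sketch of what such a wrapper might look like, assuming a little SQLite table as the catalog; the schema, paths, and details are hypothetical.

    # Sketch of a wrapper that lets rsync update a copy catalog as a
    # side effect. The SQLite schema and paths are hypothetical.
    import sqlite3
    import subprocess

    DB = "/archival-storage/catalog.db"
    SRC = "/archival-storage/deposits/"
    REMOTE = "remote-location.icpsr.umich.edu"

    def sync_and_catalog():
        conn = sqlite3.connect(DB)
        conn.execute("""CREATE TABLE IF NOT EXISTS copies
                        (path TEXT, location TEXT, recorded TEXT)""")
        # --out-format=%n makes rsync print the name of each file it transfers.
        result = subprocess.run(
            ["rsync", "-a", "--out-format=%n", SRC, "%s:%s" % (REMOTE, SRC)],
            capture_output=True, text=True, check=True)
        for name in result.stdout.splitlines():
            conn.execute("INSERT INTO copies VALUES (?, ?, datetime('now'))",
                         (name, REMOTE))
        conn.commit()
        conn.close()

Cheap enough to build, but it would largely restate what the rsync rule already tells us.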

And so if we interpret the TRAC requirement to build a location register to mean that we should have a complete, enumerated list of each and every copy, then we don't do so well here. But if we interpret the requirement to mean that we can find a copy either by looking it up on a list (i.e., the catalog proper) or by applying a rule (i.e., if the master copy is in location x, then two other copies can be found by applying functions f(x) and g(x)), then we're doing pretty well after all.
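The rule-based interpretation is easy to make concrete in code. A sketch, using the hypothetical locations from this post:

    # Sketch of the rule-based reading: given the master location x,
    # simple functions derive where the other copies live. Locations
    # are the hypothetical ones used in this post.

    def master_location(deposit_id):
        return "/archival-storage/deposits/%s" % deposit_id

    def f(x):
        """Off-site rsync copy: the same path on the remote host."""
        return "remote-location.icpsr.umich.edu:%s" % x

    def g(x):
        """Tape copy: the tape management system resolves the exact
        tape and position at need."""
        return "tape-library:%s" % x

    x = master_location(1234)
    print(x)      # /archival-storage/deposits/1234
    print(f(x))   # remote-location.icpsr.umich.edu:/archival-storage/deposits/1234
    print(g(x))   # tape-library:/archival-storage/deposits/1234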

Limitations in storage systems also add complexity. For instance, I was once looking at Amazon's S3 as a possible location for items in archival storage. But S3 doesn't let me have objects bigger than 5GB, and since I sometimes have very large files, the record-keeping would be even more complicated: for an object with name X, you can find it in this S3 bucket, unless it is bigger than 5GB, in which case you need to look for N different objects and join them together. Ick.
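The splitting itself is mechanical enough; here is a sketch of one way to do it, with a made-up part-naming scheme (5GB was S3's per-object ceiling at the time):

    # Sketch: split an oversized file into parts that each fit under
    # S3's (then) 5GB per-object limit. Part naming is made up.
    import os

    CHUNK = 5 * 1024**3   # the per-object ceiling
    BLOCK = 64 * 1024**2  # copy in 64MB pieces to keep memory modest

    def split_for_s3(path):
        """Write path.part-0, path.part-1, ..., each at most CHUNK bytes,
        and return the part names; joining the parts in order
        reconstructs the original file."""
        parts = []
        n = 0
        with open(path, "rb") as src:
            while True:
                name = "%s.part-%d" % (path, n)
                written = 0
                with open(name, "wb") as dst:
                    while written < CHUNK:
                        data = src.read(min(BLOCK, CHUNK - written))
                        if not data:
                            break
                        dst.write(data)
                        written += len(data)
                if written == 0:
                    os.remove(name)  # nothing left; drop the empty part
                    break
                parts.append(name)
                n += 1
        return parts

The real cost is the extra record-keeping rule: object X is in the bucket, unless it was split, in which case look for X.part-0 through X.part-(N-1) and join them in order.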

2 comments:

  1. Thank you for sharing this with us.

    Just a few lines to let you know I am an interested reader.

    Good to know that you can find all your stuff in all those locations.

    What risks are you trying to mitigate by creating all the extra copies?

    How do you check the integrity of all the dispersed copies?

    A problem with rsync is that if one of your originals gets corrupted, the next day the rsync copy is also corrupted.

    with kind regards,

    Henk Koning

  2. Thanks for writing, Henk.

    Our Digital Preservation Officer, Nancy McGovern, recommends six or seven copies of each digital object, and so the extra copies are merely my attempt to close the gap between the number of copies we have maintained historically (2) and the number she suggests.

    We run a regular process that compares the stored MD5 hash for each file against one we calculate on the fly (a sketch of such a sweep appears after these comments). If there is a discrepancy, we know that the file has become corrupted. We've been doing this for a few years now, and I've only seen it happen once or twice, and always as a result of human error.

    And you're exactly right about the problem with rsync, particularly in the case of human error where the file gets corrupted AND its mtime is updated too: rsync sees a changed file and propagates the corruption. In this scenario we need to pull a copy from a non-rsync'd location (like tape). In the case of "bit rot," where the file is corrupted but the mtime doesn't change, rsync's default size-and-timestamp check leaves the remote copies alone, and so it would be possible to recover the master file from one of the rsync'd copies.

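A minimal sketch of the kind of fixity sweep described in the reply above, assuming the stored hashes live in a manifest of "md5 path" lines; the details of our real process differ.

    # Sketch of a fixity sweep: recompute each file's MD5 and compare
    # it with the stored value. The manifest format is hypothetical.
    import hashlib

    def file_md5(path, block=1024 * 1024):
        """Compute a file's MD5 without reading it all into memory."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(block), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_corrupted(manifest):
        """Yield paths whose current hash no longer matches the stored one."""
        with open(manifest) as m:
            for line in m:
                stored, path = line.split(None, 1)
                if file_md5(path.strip()) != stored:
                    yield path.strip()

    if __name__ == "__main__":
        for bad in find_corrupted("/archival-storage/manifest.md5"):
            print("CORRUPTED:", bad)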
