The repository system must be able to identify the number of copies of all stored digital objects, and the location of each object and their copies. This applies to what are intended to be identical copies, not versions of objects or copies. The location must be described such that the object can be located precisely, without ambiguity. It can be an absolute physical location or a logical location within a storage media or a storage subsystem. One way to test this would be to look at a particular object and ask how many copies there are, what they are stored on, and where they are. A repository can have different policies for different classes of objects, depending on factors such as the producer, the information type, or its value. Some repositories may have only one copy (excluding backups) of everything, stored in one place, though this is definitely not recommended. There may be additional identification requirements if the data integrity mechanisms use alternative copies to replace failed copies.
Evidence: random retrieval tests; system test; location register/log of digital objects compared to the expected number and location of copies of particular objects.
Our story here is a mixed bag of successes and barriers.
For the master copy of any object we can easily and quickly specify its location. And for the second (tape) copy, we too can easily specify the location as long as we're not too specific. For example, we can point to the tape library and say, "It's in there." And, of course, with a little more work, we can use our tape management system to point us to the specific tape, and the location on that tape. Maintaining this information outside of the tape management system would be expensive, and it's not clear if there would be any true benefit.
The location of other copies can be derived easily, but those specific locations are not recorded in a database. For example, let's say that the master copy of every original deposit we have is stored in a filesystem hierarchy like /archival-storage/deposits/deposit-id/
In this scenario we might have a large degree of confidence that we could find a copy of any given deposit on that off-site location. We know it's there because rsync told us it put it there. But we don't have a central catalog that says that deposit #1234 is stored under /archival-storage/deposits/1234, on tape, and at remote-site.icpsr.umich.edu/archival-storage/deposits/1234. One could build exactly such a catalog, of course, and then create the process to keep it up to date, but would it have much value? What if all we did was tell a wrapper around rsync to capture the output and update the catalog?
And so if we interpret the TRAC requirement to build a location register to mean that we should have a complete, enumerated list of each and every copy, then we don't do so well here. But if we interpret the requirement to mean that we can find a copy by looking on a list (i.e., the catalog proper) or look at a rule (i.e., if the master copy is in location x, then two other copies can be found by applying functions f(x) and g(x)), then we're doing pretty well after all.
Limitations in storage systems also add complexity. For instance, I was once looking at Amazon's S3 as a possible location for items in archival storage. But S3 doesn't let me have objects bigger than 5GB, and since I sometimes have very large files, this means that the record-keeping would be even more complicated. For an object with name X, you can find it in this S3 bucket, unless it is bigger than 5GB, in which case you need to look for N different objects and join them together. Ick.