Wednesday, October 21, 2009

TRAC: C1.4: Synchronizing objects

C1.4 Repository has mechanisms in place to ensure any/multiple copies of digital objects are synchronized.

If multiple copies exist, there has to be some way to ensure that intentional changes to an object are propagated to all copies of the object. There must be an element of timeliness to this. It must be possible to know when the synchronization has completed, and ideally to have some estimate beforehand as to how long it will take. Depending whether it is automated or requires manual action (such as the retrieval of copies from off-site storage), the time involved may be seconds or weeks. The duration itself is immaterial—what is important is that there is understanding of how long it will take. There must also be something that addresses what happens while the synchronization is in progress. This has an impact on disaster recovery: what happens if a disaster and an update coincide? If one copy of an object is altered and a disaster occurs while other copies are being updated, it is essential to be able to ensure later that the update is successfully propagated.

Evidence: Workflows; system analysis of how long it takes for copies to synchronize; procedures/documentation of operating procedures related to updates and copy synchronization; procedures/documentation related to whether changes lead to the creation of new copies and how those copies are propagated and/or linked to previous versions.

I think we have a good story to tell.

As new objects enter Archival Storage at ICPSR, they reside in a well-known, special-purpose location. Automated, regularly scheduled system jobs synchronize those objects with remote locations, using standard, established tools such as rsync as well as less common ones such as Srsync, one of the Storage Resource Broker (SRB) command-line utilities.
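To make the idea of a scheduled synchronization job concrete, here is a minimal sketch in Python of how one might wrap rsync and capture its output for a report. This is an illustration, not the actual ICPSR job: the source path, the remote host, and the function names are all hypothetical, and the only rsync options used are the standard `--archive`, `--itemize-changes`, and `--dry-run`.

```python
import subprocess

# Hypothetical locations; the post does not name the real paths or hosts.
SOURCE_DIR = "/archival-storage/"
REMOTE_DEST = "replica.example.org:/archival-storage/"


def build_rsync_command(source, dest, dry_run=False):
    """Build an incremental rsync invocation as an argument list."""
    # --archive preserves metadata and recurses; --itemize-changes lists
    # each change so the report shows what was actually transferred.
    command = ["rsync", "--archive", "--itemize-changes"]
    if dry_run:
        # Report what would change without copying anything.
        command.append("--dry-run")
    command.extend([source, dest])
    return command


def run_sync(source, dest, dry_run=False):
    """Run rsync and return (exit_code, captured_output) for a report."""
    result = subprocess.run(
        build_rsync_command(source, dest, dry_run),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr
```

Calling `run_sync(SOURCE_DIR, REMOTE_DEST, dry_run=True)` from a scheduler would return rsync's exit status together with its output. A non-zero status is the kind of signal that would call for follow-up.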

The output of these system jobs is captured and delivered nightly to a shared electronic mailbox. The mailbox is reviewed daily; this task belongs to the member of the ICPSR IT team who is currently on call. When a report is missing or indicates an error, the problem is escalated to someone who can diagnose and correct it. One common problem, for example, occurs when an object larger than 2GB enters Archival Storage and the SRB Srsync utility faults. (SRB limits objects to 2GB.) We then remove this object from the list of items to be synchronized with SRB.
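The 2GB exclusion described above could also be applied before the job runs, rather than after it faults. The sketch below shows one way to split a list of objects by size. It is only an illustration: the function name and the input format are hypothetical, and reading "2GB" as 2 * 1024**3 bytes is an assumption, since the exact byte limit is not stated here.

```python
# Assumption: "2GB" is read as 2 * 1024**3 bytes; the post does not say
# whether the SRB limit is binary or decimal.
SRB_SIZE_LIMIT_BYTES = 2 * 1024**3


def partition_for_srb(objects, limit=SRB_SIZE_LIMIT_BYTES):
    """Split (name, size_in_bytes) pairs into SRB-syncable and excluded names."""
    syncable, excluded = [], []
    for name, size in objects:
        if size > limit:
            # Larger than the limit: leave it out of the SRB sync list.
            excluded.append(name)
        else:
            syncable.append(name)
    return syncable, excluded
```

Objects in the excluded list would still be covered by the other synchronization tools; only the SRB copy skips them.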

Because the routine synchronization process is incremental, each run has a very short duration. A full synchronization of ALL content, however, would take on the order of days or even weeks. For example, we recently synchronized a copy of our Access holdings to a computing instance residing in Amazon's EC2 EU-West region, and we found it took approximately one week to copy about 500GB. As another example, we recently synchronized a copy of our Archival Storage (which is much larger than the Access collection) to a system that, like ICPSR and the University of Michigan, is connected to Internet2's Abilene network, and that took far less time.
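Since TRAC asks for some estimate of how long synchronization will take, the EC2 figures can be turned into a rough rate. Assuming 1GB = 10^9 bytes and treating "approximately one week" as exactly seven days (both assumptions), 500GB in a week works out to an average of roughly 6.6 Mbit/s. The helper functions below are hypothetical and only show the arithmetic.

```python
SECONDS_PER_DAY = 86_400


def average_throughput_mbit_s(total_bytes, days):
    """Average transfer rate in megabits per second over the whole copy."""
    return total_bytes * 8 / (days * SECONDS_PER_DAY) / 1e6


def estimated_days(total_bytes, throughput_mbit_s):
    """Days needed to copy total_bytes at a sustained rate in Mbit/s."""
    return total_bytes * 8 / (throughput_mbit_s * 1e6) / SECONDS_PER_DAY


# Assumptions: 500GB taken as 500 * 10**9 bytes, one week as 7 days.
ec2_rate = average_throughput_mbit_s(500e9, 7)
```

With a measured rate like `ec2_rate` in hand, `estimated_days` gives a first guess at how long a full resynchronization of a larger collection over the same path would take, which is the sort of advance estimate the requirement has in mind.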
