Monday, November 16, 2009

ICPSR Content and Availability



Legend:

Blue = Archival Storage
Yellow = Access Holdings
Green = both Archival Storage and Access Holdings
Red Outline = Web-delivered copy of Access Holdings





We're getting close to the one-year anniversary of the worst service outage in (recent?) ICPSR history. On Monday, December 28th, 2008 powerful winds howled through southeastern lower Michigan, knocking out power to many, many thousands of homes and businesses. One business that lost power was ICPSR.

No data was lost, and no equipment was damaged, but ICPSR's machine room went without power nearly until New Year's Day. In many ways we were lucky: The long outage happened during a time when most scholars and other data users are enjoying the holidays, and there was no physical damage to repair. The only "fix" was to power up the equipment once the building had power again.

However, this did serve as a catalyst for ICPSR to focus resources and money on its content delivery system, and therefore on its content replication story too. Some elements of the story below predate the 2008 winter storm, but many of the elements are relatively new.



ICPSR manages two collections of content: archival storage and access holdings.

Archival storage consists of any digital object that we intend to preserve. Examples include original deposits, normalized versions of those deposits, normalized versions of processed datasets, technical documentation in durable formats such as TIFF or plain text, metadata in DDI XML, and so on. If a particular study (collection of content) has been through ICPSR's pipeline process N different types, say due to updates or data resupplies, then there will be N different versions of the content in archival storage.

Access holdings consist of only the latest copy of an object, and often include formats that we do not preserve. For example, while we might preserve only a plain text version of a dataset, we might make the dataset available in contemporary formats such as SPSS, SAS, and Stata to make it easy for researchers to use. Anything in our access holdings would be available for download on our Web site, and therefore doesn't contain confidential or sensitive data. Much of the content, particularly more modern files, would have passed through a rigorous disclosure review process.

The primary location of ICPSR's archival storage is a EMC Celera NS501 Network Attached Storage device. In particular, a multi-TB filesystem created from our pool of SATA drives provides a home for all of our archival holdings.

ICPSR replicates its archival storage in three locations:
  1. San Diego Supercomputer Center (synchronized via the Storage Resource Broker)
  2. MATRIX - The Center for Humane Arts, Letters, & Science Online at Michigan State University (synchronized via rsync)
  3. A tape backup system at the University of Michigan (snapshots)
We are also working on adding a fourth replica at the H. W. Odum Institute for Research in Social Science at the University of North Carolina - Chapel Hill.

Some of our content stored at the San Diego Supercomputer Center - a snapshot in time from 2008 - is also replicated in the Chronopolis Digital Preservation Demonstration Project, and that gives us two additional copies of many objects.

An automated process compares the digital signature of each object in archival storage and compares it to a digital signature calculated "on the fly." If the signatures do not match, the object is flagged for further investigation.

The primary location for ICPSR's access holdings is also the EMC NAS. But in this case, the content is stored on a much smaller filesystem built from our pool of high-speed, FC disk drives.

ICPSR replicates its access holdings in five locations:
  1. San Diego Supercomputer Center (synchronized via the Storage Resource Broker)
  2. A tape backup system at the University of Michigan (snapshots)
  3. A file storage cloud hosted by the University of Michigan's Information Technology Services
  4. An Amazon Web Services (AWS) Elastic Computing Cloud (EC2) instance located in the EU region
  5. An Amazon Web Services (AWS) Elastic Computing Cloud (EC2) instance located in the US region
Only the last replica above contains the necessary software and support systems (e.g., an Oracle database system) to actually deliver ICPSR's content; all of the other systems contain a complete snapshot of our access holdings, but not the platform with which to deliver the content.

The AWS-hosted replica has been used twice so far in 2009. We performed a "lights out" test of the replica in mid-March, and we performed a "live" failover due to another power outage in May. In both cases the replica worked as expected, and the amount of downtime was reduced dramatically.

And, finally, our access holdings and our delivery platform are available on the ICPSR Web staging system. But because the purpose of this system is to stage and test new software and new Web content, this is very much an "emergency only" option for content delivery.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.