Technology at ICPSR: November 2009

Friday, November 20, 2009

ICPSR and the Cloud

Interest in the cloud is heating up on the University of Michigan campus. In the past day or two I've seen surveys asking campus IT leaders to comment on their exploratory interest in the cloud, and have also answered email queries about who is dabbling with the cloud.

ICPSR started exploring the Amazon cloud in late 2008, and by early 2009 we had set up our first production services: a stealth authoritative DNS server for our domains and a replica of our Web service infrastructure. We're primarily users of the Elastic Compute Cloud (EC2) and Simple Storage Service (S3). We're also looking at CloudFront, but to be honest, I'm not sure we generate the volume of traffic that would make it super useful to us.

Since then we've also launched an instance in Amazon's EU zone for disaster recovery purposes, and launched a handful of new sites on cloud instances rather than local hardware. Here's a complete roster of instances as of today:

Web server replica
Oracle database replica
CCERC Web replica
Solr search replica
Stealth authoritative DNS server (used in our DR process)
Teaching With Data NSDL Pathway production service
Teaching With Data NSDL Pathway staging server
LDAP server supporting TWD
SSDAN next generation production service
SSDAN next generation development and staging server
Content server for deep DR (located in the EU zone)

I think it's likely we'll move some of the replicas we run on behalf of the Minnesota Population Center to the cloud as well.

We've had great experiences with Amazon Web Services (AWS) so far: very low barrier to entry, and a very nice management tool in Elasticfox. The ongoing operations have been stable and secure, and the price is right. And while I'm not sure I'd rely solely on the cloud for my archival storage solution, using the cloud to stash away one additional copy is very attractive.

Thursday, November 19, 2009

The ICPSR Pipeline Process

After I arrived at ICPSR in 2002, one of the first things Myron asked me to do was to automate the "data processing" function at ICPSR. As I began exploring that business process, it became very clear to me that (1) the process was actually a series of six or more inter-related processes with different owners, inputs, and outputs, and (2) no single person at ICPSR had a crisp, clear understanding of the entire process. And so my dilemma: How to design and build a software system to facilitate a process which isn't documented, and isn't even well understood?

Fortunately a colleague of mine at UUNET agreed to join ICPSR: Cole Whiteman. Cole is very strong at analyzing and documenting business process, and has a particular gift for coaxing out the details and then rendering the system in an easy to understand, easy to read format. I've included the latest "Whiteman" above as a sample of his art.

Cole spent many months interviewing staff, drawing pictures, interviewing more staff, refining pictures, and so on, until he had a picture that both generated agreement - "Yep, that's the process we use!" - and demonstrated bottlenecks. Now the way was clear for automation.

Consequently, ICPSR has invested tremendous resources over the past few years building a collection of inter-connected systems that enable workflow at ICPSR. These workflow systems now form the core business process infrastructure of ICPSR, and give us the capability to support a very high level of business. When talking to my colleagues at other data archives, my sense is that ICPSR has a unique asset. Here's a thumbnail sketch of the systems.

Deposit Form - This system manages information from the time of its first arrival via upload, until the time the depositor signs the form, transferring custody to ICPSR. The form has the capacity to collect a lot of descriptive metadata at the start of the process, and also automatically generates appropriate preservation metadata upon custody (e.g., fingerprints for each file deposited).
Deposit Viewer - This might be more appropriately named the Deposit Manager since it not only lets ICPSR staff search, browse, and view metadata about deposits, it also enables staff to manage information about deposits. For example, this is the tool we use to assign a deposit to a data manager. We also use this tool to connect deposits to studies.
Metadata Editor - This is the primary environment for creating, revising, and managing descriptive and administrative metadata about a study. Abstracts, subject terms, titles, etc. are all available for management, along with built-in connections to ICPSR business rules that control or limit selections. The system also contains the business logic that controls quality assurance.
Hermes - Our automation tool for producing the ready-to-go formats we deliver on our web site, and variable-level DDI XML for digital preservation. This system takes an SPSS System file as its input, and produces a series of files as output, some of which end up on our Web site for download, and others of which enter our archival storage system.
Turnover - Data managers use this tool to perform quality assurance tests on content which is ready for ingest, and to queue content both for insertion into archival storage and for release on the ICPSR Web site. An accompanying web application enables our release management team to accept well-formed content, and to reject objects which aren't quite ready for ingest.

Wednesday, November 18, 2009

Good to Great

I recently finished reading Good to Great by Jim Collins. I used to read business-oriented books on a more regular basis when I was working for America Online's ANS Communications division and then UUNET, and it was nice returning to that style.

The subtitle of the book is Why Some Companies Make the Leap... and Others Don't. And while the book is focused on the business world, and a common metric of success such as exceeding the average return in the major stock markets, it would be a mistake to think that this book can't teach us about the not-for-profit world that ICPSR occupies.

One tenet of the story the book tells is that organizations often lose their focus, wander into the weeds, and then suffer failure, sometimes catastrophic failure. The successful companies figure out their core mission, keep it simple, and then slowly but inexorably gain momentum to dominate and win. For example, the book contrasts the story of Gillette and Warner-Lambert. While the former focused squarely on its core, Warner-Lambert flailed between different goals, eventually being swallowed up by Pfizer.

The book refers to this type of focus as the Hedgehog Concept and breaks it into three elements:

What you are deeply passionate about
What drives your economic engine
What you can be the best in the world at

My sense is that this is an important message, particularly for successful organizations. It's easy to grow heady with success and start chasing bigger and more diverse deals, losing focus on what led to success.

Another interesting element of successful organizations was their use of "stop-doing" lists. While all organizations keep track of their "to-do" lists, which get longer and longer and longer and ..., the highly successful organizations made conscious decisions about what to stop doing. This too resonates with me, and my experience is that if organizations don't make the hard decisions about what to stop doing, they end up spreading their resources too thinly, and then nothing gets done well.

A final interesting item I'll note here is how the budget process is described at highly successful organizations. It isn't an opportunity to ration income across a myriad of areas; rather it is an exercise to decide which areas are core and should be funded fully and completely, and which areas are not core, and should be funded not at all. Once again the root message is about focus.

There are many other very interesting observations from the research behind the book, and I'd recommend it to anyone who plays a leadership role at an organization.

Monday, November 16, 2009

ICPSR Content and Availability

Legend:

Blue = Archival Storage
Yellow = Access Holdings
Green = both Archival Storage and Access Holdings
Red Outline = Web-delivered copy of Access Holdings

We're getting close to the one-year anniversary of the worst service outage in (recent?) ICPSR history. On Monday, December 28th, 2008 powerful winds howled through southeastern lower Michigan, knocking out power to many, many thousands of homes and businesses. One business that lost power was ICPSR.

No data was lost, and no equipment was damaged, but ICPSR's machine room went without power nearly until New Year's Day. In many ways we were lucky: The long outage happened during a time when most scholars and other data users are enjoying the holidays, and there was no physical damage to repair. The only "fix" was to power up the equipment once the building had power again.

However, this did serve as a catalyst for ICPSR to focus resources and money on its content delivery system, and therefore on its content replication story too. Some elements of the story below predate the 2008 winter storm, but many of the elements are relatively new.

ICPSR manages two collections of content: archival storage and access holdings.

Archival storage consists of any digital object that we intend to preserve. Examples include original deposits, normalized versions of those deposits, normalized versions of processed datasets, technical documentation in durable formats such as TIFF or plain text, metadata in DDI XML, and so on. If a particular study (collection of content) has been through ICPSR's pipeline process N different times, say due to updates or data resupplies, then there will be N different versions of the content in archival storage.

Access holdings consist of only the latest copy of an object, and often include formats that we do not preserve. For example, while we might preserve only a plain text version of a dataset, we might make the dataset available in contemporary formats such as SPSS, SAS, and Stata to make it easy for researchers to use. Anything in our access holdings would be available for download on our Web site, and therefore doesn't contain confidential or sensitive data. Much of the content, particularly more modern files, would have passed through a rigorous disclosure review process.

The primary location of ICPSR's archival storage is an EMC Celera NS501 Network Attached Storage device. In particular, a multi-TB filesystem created from our pool of SATA drives provides a home for all of our archival holdings.

ICPSR replicates its archival storage in three locations:

San Diego Supercomputer Center (synchronized via the Storage Resource Broker)
MATRIX - The Center for Humane Arts, Letters, & Science Online at Michigan State University (synchronized via rsync)
A tape backup system at the University of Michigan (snapshots)

We are also working on adding a fourth replica at the H. W. Odum Institute for Research in Social Science at the University of North Carolina - Chapel Hill.

Some of our content stored at the San Diego Supercomputer Center - a snapshot in time from 2008 - is also replicated in the Chronopolis Digital Preservation Demonstration Project, and that gives us two additional copies of many objects.

An automated process compares the stored digital signature of each object in archival storage with a digital signature calculated "on the fly." If the signatures do not match, the object is flagged for further investigation.
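The comparison described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not ICPSR's actual tool: the choice of SHA-256, the `manifest` dictionary of stored signatures, and the function names `file_digest` and `audit` are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Compute a hex digest of a file, reading it in chunks to bound memory use."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def audit(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the relative paths that need investigation.

    `manifest` (hypothetical) maps a relative path to the digest recorded when
    the object entered storage. An object is flagged if it is missing or if the
    digest calculated now differs from the recorded one.
    """
    flagged = []
    for rel_path, stored in sorted(manifest.items()):
        target = root / rel_path
        if not target.is_file() or file_digest(target) != stored:
            flagged.append(rel_path)
    return flagged
```

An empty list from `audit` corresponds to the "everything is fine" case; anything else is the list of objects to look at more closely.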

The primary location for ICPSR's access holdings is also the EMC NAS. But in this case, the content is stored on a much smaller filesystem built from our pool of high-speed, FC disk drives.

ICPSR replicates its access holdings in five locations:

San Diego Supercomputer Center (synchronized via the Storage Resource Broker)
A tape backup system at the University of Michigan (snapshots)
A file storage cloud hosted by the University of Michigan's Information Technology Services
An Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance located in the EU region
An Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance located in the US region

Only the last replica above contains the necessary software and support systems (e.g., an Oracle database system) to actually deliver ICPSR's content; all of the other systems contain a complete snapshot of our access holdings, but not the platform with which to deliver the content.

The AWS-hosted replica has been used twice so far in 2009. We performed a "lights out" test of the replica in mid-March, and we performed a "live" failover due to another power outage in May. In both cases the replica worked as expected, and the amount of downtime was reduced dramatically.

And, finally, our access holdings and our delivery platform are available on the ICPSR Web staging system. But because the purpose of this system is to stage and test new software and new Web content, this is very much an "emergency only" option for content delivery.

Friday, November 13, 2009

TRAC: C1.7: Refreshing/migrating content

C1.7 Repository has defined processes for storage media and/or hardware change (e.g., refreshing, migration).

The repository should have triggers for initiating action and understanding of how long it will take for storage media migration, or refreshing — copying between media without reformatting the bitstream. Will it finish before the media is dead, for instance? Copying large quantities of data can take a long time and can affect other system performance. It is important that the process includes a check that the copying has happened correctly.

Repositories should also consider the obsolescence of any/all hardware components within the repository system as potential trigger events for migration. Increasingly, long-term, appropriate support for system hardware components is difficult to obtain, exposing repositories to risks and liabilities should they choose to continue to operate the hardware beyond the manufacturer or third-party support.

Evidence: Documentation of processes; policies related to hardware support, maintenance, and replacement; documentation of hardware manufacturers’ expected support life cycles.

ICPSR's archival storage consumes less than 6 TB of storage today. Over the past month we've made copies in other locations, and the time to copy it across a network is anywhere from a day to a week, depending upon the speed of the network. So that's much shorter than the lifespan of the media. :-)
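For a back-of-the-envelope check on that "day to a week" range, here is a small Python sketch. The two network rates are illustrative assumptions rather than measurements of our links, and the calculation ignores protocol overhead and disk speed, so real copies will take somewhat longer.

```python
def transfer_days(size_tb: float, throughput_mbit_s: float) -> float:
    """Days needed to move size_tb terabytes (10**12 bytes each) at a sustained rate."""
    bits = size_tb * 1e12 * 8                    # total payload in bits
    seconds = bits / (throughput_mbit_s * 1e6)   # sustained rate in bits per second
    return seconds / 86_400                      # seconds per day


# Assumed sustained rates: 100 Mbit/s and 1000 Mbit/s.
for rate in (100, 1000):
    print(f"{rate:>5} Mbit/s: {transfer_days(6, rate):.1f} days")
```

Under those assumptions, 6 TB takes about 5.6 days at 100 Mbit/s and about 0.6 days at 1000 Mbit/s, which brackets the range above.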

The master copy resides on an EMC Celera NAS. From time to time one of the SATA drives that underpins archival storage will fail, and the Celera will fail over to its hot spare, and make a phone call for EMC to schedule a replacement. And, so in some odd way, the media gets refreshed on an incremental basis slowly over time.

We bought our Celera in 2005, and my expectation is that we'll likely replace it with something else in 2010; 2011 at the very latest. And so it's timely to start thinking about a written procedure for moving the master copy of the content from the Celera to the next storage platform. I don't think it will be a complicated procedure, and putting it together might make for a good future post.

Friday, November 6, 2009

Back to the Fedora: Part 4

This is the final post in the series.

So far we have introduced a pair of Content Model objects: one for social science data, and one for social science data documentation. In this post we introduce a third Content Model object for social science: an aggregate level object that has some content of its own (descriptive metadata and preservation metadata), but serves largely to group together related objects.

The Content Model object is to the left. It must have two Datastreams: one for the descriptive metadata in DDI XML format, and one for preservation metadata in PREMIS XML format. Note that we may discover that we can use DDI for both purposes, and in that case, the PREMIS Datastream will drop out as a required element.
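The two-Datastream requirement is simple enough to express as a check. The sketch below is plain Python and does not call the Fedora API; the Datastream IDs `DDI` and `PREMIS` and the function name `missing_datastreams` are assumptions for illustration, not the identifiers used in our test repository.

```python
# Hypothetical Datastream IDs required by the aggregate Content Model.
REQUIRED_DATASTREAMS = frozenset({"DDI", "PREMIS"})


def missing_datastreams(datastream_ids, required=REQUIRED_DATASTREAMS):
    """Return the required Datastream IDs that an object lacks, sorted by name."""
    return sorted(set(required) - set(datastream_ids))


# An object carrying Dublin Core and DDI, but no PREMIS Datastream yet.
print(missing_datastreams(["DC", "DDI"]))
```

If DDI turns out to cover both purposes, dropping `PREMIS` from the required set is the only change the check would need.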

Like past posts, the image to the left is a link to the ICPSR Fedora test repository, and will return the "home page" for the Content Model object pictured.

To the right we have a Fedora data object which conforms to the Content Model above.

Like the Content Model image, this image is also a link to our Fedora test repository, and clicking it will navigate to the matching data object.

This object has one relationship asserted per member object. In this case we assert three hasMember relationships: one for the survey data object; one for the survey documentation object; and, one for the survey instrument object. These correspond to isMemberOf relationships asserted in those objects, and together they assert a series of bilateral relationships.
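One way to confirm that the relationships really are bilateral is to compare the two sets of assertions. The sketch below works on plain dictionaries rather than querying Fedora, and the object identifiers (`demo:study` and friends) and the function name `unreciprocated` are hypothetical.

```python
def unreciprocated(
    has_member: dict[str, set[str]],
    is_member_of: dict[str, set[str]],
) -> list[tuple[str, str]]:
    """Return (aggregate, member) pairs asserted on only one side.

    `has_member` maps an aggregate object to the members it claims;
    `is_member_of` maps a member object to the aggregates it claims.
    """
    forward = {(agg, m) for agg, members in has_member.items() for m in members}
    backward = {(agg, m) for m, aggs in is_member_of.items() for agg in aggs}
    # Symmetric difference: pairs present in exactly one direction.
    return sorted(forward ^ backward)


# Hypothetical identifiers for the aggregate and its three member objects.
has_member = {"demo:study": {"demo:data", "demo:doc", "demo:instrument"}}
is_member_of = {
    "demo:data": {"demo:study"},
    "demo:doc": {"demo:study"},
    "demo:instrument": {"demo:study"},
}
print(unreciprocated(has_member, is_member_of))
```

An empty result means every hasMember assertion has a matching isMemberOf assertion, and vice versa.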

The object contains the two required Datastreams. In this case the actual XML is somewhat stylized, and may not be "clean" XML. In particular the PREMIS Datastream is very much a work in progress here at ICPSR, and may bear little resemblance to high-quality PREMIS XML.

Thursday, November 5, 2009

SUMIT 2009 followup

This is a follow-up post to my short piece on SUMIT 09, the U-M IT security symposium.

The talk by Moxie Marlinspike was really, really good, and pretty scary. I found a copy of his presentation on the Black Hat site, and while you won't get his commentary by just looking through the deck, you'll definitely come to understand how weak many implementations of SSL are (were?), and how Moxie was able to exploit them. If you have traditionally felt pretty secure when using a web site via SSL, make heavy use of software with automated updates and downloads (like Mozilla), or think you can avoid problems by typing the URL into the address bar vs. clicking links on web pages, this will make you reconsider your position.

I also started poking around his web site, thoughtcrime.org, and highly recommend reading some of his stories. I've read all but a few, and most have been pretty interesting. Not at all techie stuff; just good reads.

Wednesday, November 4, 2009

TRAC: C1.6: Reporting and repairing loss

C1.6 Repository reports to its administration all incidents of data corruption or loss, and steps taken to repair/replace corrupt or lost data.

Having effective mechanisms to detect bit corruption and loss within a repository system is critical, but is only one important part of a larger process. As a whole, the repository must record, report, and repair as possible all violations of data integrity. This means the system should be able to notify system administrators of any logged problems. These incidents, recovery actions, and their results must be reported to administrators and should be available.

For example, the repository should document procedures to take when loss or corruption is detected, including standards for measuring the success of recoveries. Any actions taken to repair objects as part of these procedures must be recorded. The nature of this recording must be documented by the repository, and the information must be retrievable when required. This documentation plays a critical role in the measurement of the authenticity and integrity of the data held by the repository.

Evidence: Preservation metadata (e.g., PDI) records; comparison of error logs to reports to administration; escalation procedures related to data loss.

My sense is that this requirement is just as much about policy as it is about process. Fortunately for our data holdings (but unfortunately for TRAC preparation), data loss or corruption is a very infrequent event, and therefore as one might expect, the set of policies and written processes is pretty small.

As a point of comparison, if we look at our policies and processes for handling loss with "working files" we will find a much richer set of policies and systems. We have established infrastructure (an EMC Network Attached Storage (NAS) storage appliance and associated Dell tape management solution); we have internal policies and processes that document how to retrieve lost content; we have external policies that describe which parts of the NAS are written to tape, and the schedule of tape backups; and, we exercise the system on a regular basis as people inadvertently delete or damage files with which they are working actively.

On the Archival Storage side - or even the Access side, where we also look for loss and corruption - the number of data loss or data corruption events is very, very low. Email reports come out on a regular basis, but they (almost) always say that everything is fine. And on that rare occasion where there is an issue, the remedy is quick.

Perhaps the right solution here is to use the small sample of issues that have arisen over the years as our baseline for writing up a process, and then posting that process on our internal web site. That would be easy to do. But then the concern is this: If a policy is used very, very infrequently, it is likely to fall into disrepair. It is also likely to become forgotten. Maybe the tool that examines for loss or corruption should also contain a link to the relevant policies and recovery processes?

What strategies have others used to address this TRAC requirement?

Monday, November 2, 2009

Confidential Data and the Cloud

I have a new post on our NIH Challenge Grant project, but it's in our project blog rather than here.

So for you loyal readers who follow this blog, but not our Challenge Grant blog, here's the link: http://enclavecloud.blogspot.com/2009/11/high-level-system-architecture.html

I'll also be giving a talk on this at the Fall 2009 Coalition for Networked Information (CNI) Membership Meeting. If you're there, please drop by to say hello!