Wednesday, November 30, 2011

ICPSR and the cloud

How is ICPSR using "the cloud?"

I've been getting this question a lot lately, and it feels like it's time to put together a blog post that answers it.

From a functional standpoint, ICPSR is using the cloud for identity and authentication, content delivery, archival storage, and data producer relationship management.  And if I include services based at the University of Michigan, I might also include data curation and customer relationship management.

From a vendor standpoint, here's a roster of some of the organizations with whom we're doing business, and how their piece of the cloud helps us run our business.

A typical transaction on the ICPSR web site looks like this:  Search.  Select content for download.  Create an ICPSR-specific identity.  Authenticate using that identity.  Download the content.  Do not return to ICPSR for at least a year.

Given that the ICPSR-specific identities are weak (i.e., web site visitors create them by entering an arbitrary email address and password) and given that the identity is often used only once, it seemed like a good idea to eliminate the need to create such an identity.  We don't need strong identities, but we do need identities that would be available to anyone.  Technologies like OpenID, Facebook Connect, and the like seemed promising, but who wants to build infrastructure which talks to all of them?

Janrain does.

We use Janrain Engage as one part of our identity and authentication strategy.  Janrain acts as a third party between the content provider (ICPSR) and the identity providers.  And so when someone needs to log in to ICPSR's portal, they see a screen that looks something like this:

So there's no need to create an account and password at ICPSR.  And if someone does return later, they don't have to log in to our site if they've already logged in to their identity provider's site.  (This is Single Sign-On or SSO.)

We're hosting several web portals in Amazon's cloud.  We're using Amazon's Infrastructure as a Service (IaaS) to stand up Linux systems in the Amazon Elastic Compute Cloud (EC2) that are identical to the images we host locally.  We back the instances with Elastic Block Store (EBS) volumes so that the content persists when we need to terminate and restart a computing instance.

We also host a replica of our on-site delivery system in Amazon's cloud for disaster recovery (DR) purposes.  We find that we have the opportunity to "test" this replica at least once per year when ICPSR's headquarters loses power for several hours due to high winds, ice storms, or other acts of nature.

The Amazon service has been very reliable overall (despite a few highly publicized events), and certainly more reliable than our own on-site facilities.  We also like that we can scale resources up and down very quickly, and that we have clear costs associated with the infrastructure.  (Anyone at an institution of higher learning who has tried to calculate the actual cost of electricity used knows what I mean.)

I've posted many times about our relationship with DuraCloud, and how we're using it as a mechanism for storing archival copies in the cloud.  In many ways DuraCloud fulfills a role similar to that of Janrain Engage by providing a layer of abstraction between ICPSR's technical infrastructure and that of multiple service providers.  In this case we manage one vendor and one set of bills, but have the ability to store content in the cloud storage service of multiple providers (Amazon, Rackspace, Microsoft).
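To make the "layer of abstraction" idea concrete, here is a minimal sketch in Python.  It is not DuraCloud's API, and it is not our production code; every class and method name below (StorageProvider, InMemoryProvider, ArchiveStore) is made up for illustration, and the in-memory providers stand in for real cloud storage services.  The point is only that the caller talks to one interface while copies land with several providers behind it.

```python
import hashlib
from typing import Dict, List, Protocol


class StorageProvider(Protocol):
    """Hypothetical interface that every storage back end must satisfy."""

    name: str

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryProvider:
    """Stand-in for a real cloud storage service (illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class ArchiveStore:
    """Single point of contact that copies content to every provider."""

    def __init__(self, providers: List[StorageProvider]) -> None:
        self.providers = providers

    def store(self, key: str, data: bytes) -> str:
        # Record a checksum once, then write the same bytes everywhere.
        checksum = hashlib.sha256(data).hexdigest()
        for provider in self.providers:
            provider.put(key, data)
        return checksum

    def verify(self, key: str, checksum: str) -> Dict[str, bool]:
        # Re-read each copy and compare it against the recorded checksum.
        return {
            provider.name: hashlib.sha256(provider.get(key)).hexdigest() == checksum
            for provider in self.providers
        }


if __name__ == "__main__":
    archive = ArchiveStore([InMemoryProvider("provider-a"), InMemoryProvider("provider-b")])
    digest = archive.store("study-0001.zip", b"archival copy")
    print(archive.verify("study-0001.zip", digest))
```

Adding or dropping a storage provider in a design like this means changing the list passed to ArchiveStore, not the code that deposits or checks content, which is the same convenience we get from managing one vendor instead of several.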

The acquisitions team at ICPSR keeps an eye on grants funded by places like the National Science Foundation and the National Institutes of Health.  If a grant looks like it may be producing data, the team makes a note to contact the principal investigator (PI).  The goal is to have a conversation with the PI to see if there will indeed be data produced, and to see if it might be a good fit for ICPSR's holdings.  If so, we then try to convince the PI that depositing the content with ICPSR would be good for everyone (more data citations for the data producer; more re-use of the data by other researchers; etc.).

We had been using a home-built application to manage this content, but we found it to be a losing battle.  There was never enough money or time to build the types of relationship management reporting systems that the acquisitions team wanted.  And so rather than trying to build a better mousetrap, we decided to rent a better mousetrap by moving the content into a professional contact/customer relationship management (CRM) system.  Like Salesforce.

The University of Michigan central IT organization (ITS) also delivers a handful of services that I would consider "the cloud" even though they do not package and market them that way.  File storage, trouble ticketing, and Drupal hosting are all available from ITS, and they all look like cloud services to us because we pay for only what we use, we can scale them up and down reasonably quickly, and we do not have to deploy any local hardware or software to use them.

1 comment:

  1. It's also worth noting that ICPSR uses RegOnline for managing event registration activities.