Technology at ICPSR: April 2009

Friday, April 10, 2009

Fedora at ICPSR

We've been spending a lot of time this month at ICPSR working with the Fedora digital repository software. It looks like a lot of people have downloaded the software (the Fedora Commons web site says that there have been over 25,000 downloads), but the number of organizations listed as using it is still small by comparison: fewer than 200. I suspect that some of those listed organizations, like ICPSR, are using Fedora in sandbox environments rather than for production use.

For those not familiar with Fedora, the name is an acronym that stands for Flexible Extensible Digital Object Repository Architecture. It's software that was developed originally at Cornell in the late 1990s and is freely available today. After exploring the system for a few weeks now, we'd certainly agree that the name is apt.

One, this really is architecture-level software, sometimes called middleware. It isn't a finished content management system (CMS), such as Drupal or Plone, where one performs a quick install and then starts adding content. The software does come with a few demonstration objects already loaded, and a handful of basic mechanisms to browse and display the content, but ultimately one needs a "stack" on top of Fedora to really use the system in the way it was intended. This means building your own stack, or using one of the ready-made stacks such as Fez, Muradora, Islandora, and so on. In addition to test-driving Fedora, we've also been looking at the stacks.

Two, the system does force one to think seriously about the precise nature and shape of one's digital content. What are the atomic objects and what are the higher-level digital object molecules they form? What's actually a relationship between two objects versus an attribute of an object? What's the essential content that should be preserved, and what is merely a derivative? If one has a bunch of image-oriented content, there are lots of good examples available for how one might decide to organize the objects; if one has a bunch of social science data, the examples aren't as applicable (but are instructive nonetheless). This is indeed extremely flexible and extensible stuff.
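
To make those modeling questions a bit more concrete, here is a minimal sketch in Python. It is not Fedora's API or its actual object format; the class names, the "demo:" identifiers, and the "isPartOf" predicate are all made up for illustration. It only shows the three decisions in play: what is an attribute of an object, what is a relationship to another object, and which content is essential versus a derivative.

```python
from dataclasses import dataclass, field


@dataclass
class Datastream:
    """One piece of content inside an object (hypothetical, not Fedora's API)."""
    ds_id: str
    mime_type: str
    preserve: bool  # True for essential content, False for a regenerable derivative


@dataclass
class DigitalObject:
    """An 'atomic' object; identifiers and field names are assumptions."""
    pid: str
    attributes: dict = field(default_factory=dict)     # facts about this object alone
    datastreams: list = field(default_factory=list)    # the content it carries
    relationships: list = field(default_factory=list)  # (predicate, target pid) pairs

    def relate(self, predicate, target):
        # A relationship points at another object rather than storing a value.
        self.relationships.append((predicate, target.pid))

    def essential(self):
        # IDs of the datastreams that must be preserved.
        return [d.ds_id for d in self.datastreams if d.preserve]


# A study is the "molecule"; a data file is one of its "atoms".
study = DigitalObject("demo:study-1", attributes={"title": "Example Survey"})
data_file = DigitalObject(
    "demo:file-1",
    attributes={"format": "tab-delimited"},
    datastreams=[
        Datastream("DATA", "text/tab-separated-values", preserve=True),
        Datastream("PREVIEW", "text/html", preserve=False),
    ],
)
data_file.relate("isPartOf", study)
```

In this sketch the file's format is an attribute, its membership in the study is a relationship, and only the DATA datastream is marked as essential; the HTML preview could be thrown away and regenerated. Whether those are the right choices for social science data is exactly the kind of question the system forces one to answer.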

One small complaint I have is with the name. When Thorny Staples from the Fedora Commons visited us in the fall, he told us the story about how the Cornell folks got the name first, and how they reached a compromise with the Red Hat guys when they created their Fedora distribution of Linux. But the problem for folks like me is that it makes it very hard to find web pages and information about the Fedora repository software without some heroic efforts with search engines. For example, I use Google Alerts as a mechanism to collect information about items of interest, and my query for Fedora is longer than the rest of my Google Alerts combined. And it also turns out that there is a pretty popular college football coach who also has the name Fedora!

Up next: Short summaries of the three Fedora-related projects underway today.

Wednesday, April 1, 2009

Weather report at ICPSR? Cloudy

We've been really impressed with the services available from Amazon Web Services. Amazon makes a handful of very nice Firefox Add-ons available that make it very easy to start using their services. Elasticfox is the main one we've been using; it takes only a few moments to install it, configure it with one's AWS credentials, and then to start launching and managing virtual machines. Of course, Amazon also makes it very, very easy to start using their AWS service. All one needs to do is create an account and supply a credit card number.

In February we upgraded to the latest version of Autonomy (née Verity née Inktomi) Ultraseek for our study/web search. We installed the software on a virtual machine in Amazon's Elastic Compute Cloud (EC2), their "computer virtualization" service, so that we could make use of the search both from our primary web location in Ann Arbor and from our back-up web locations in EC2 and (soon) San Diego. (More on the back-up web locations in a later post.)

No doubt this'll jinx the search service, but since we moved to the EC2 platform the search has worked very well; no service interruptions, no downtime. And keeping the search and index in one location (Amazon's cloud) and the content it is indexing in a different location seems to work just fine.

In addition to the production study/web search capability and our back-up web location, we've also been using EC2 for ad hoc, one-off computing requests. In the past we would have purchased new hardware, or re-purposed under-powered desktop hardware, but now we just launch the right-sized virtual machine in the cloud, and we're ready to go in just minutes. And by making good use of Amazon's related products - Elastic Block Store (EBS) and Simple Storage Service (S3) - we also have a reasonable disaster recovery story to tell.