Technology at ICPSR: June 2011

Monday, June 27, 2011

Watching the Intercloud drift by

I really like this new graphic from DuraCloud

The term Intercloud doesn't get used as often as the term Internet does today, but that may change in a few more years.

Just as the Internet was a "network of networks," the Intercloud is supposed to be a "cloud of clouds." But what is that supposed to mean?

One possible "cloud of clouds" is what DuraSpace is doing with their DuraCloud service. ICPSR was a pilot tester of DuraCloud, and we will soon sign up as a customer for the newly available DuraCloud production service.

As the nice graphic from DuraCloud makes clear, moving a document into the DuraCloud "cloud" really places it into a collection of other "clouds" as well, making DuraCloud a "cloud of clouds." From the point of view of a customer like ICPSR, we view DuraCloud as a single location for content with a single interface, a single bill, and a single help desk. But behind the scenes DuraCloud makes use of other clouds for its storage, such as Amazon's Simple Storage Service (S3) and Rackspace.
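To make the "cloud of clouds" idea concrete, here is a minimal sketch of a single storage interface that fans a write out to several providers, any of which may itself be a cloud of clouds. The class and method names (InMemoryProvider, CloudOfClouds, put, get, locations) are hypothetical and are not DuraCloud's actual API; the providers are in-memory stand-ins, and the nesting of Chronopolis behind DuraCloud is shown only as an illustration of an arrangement that has been announced, not deployed.

```python
class InMemoryProvider:
    """Hypothetical stand-in for one storage cloud (e.g. S3 or Rackspace)."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def has(self, key):
        return key in self._objects

    def get(self, key):
        return self._objects[key]

    def locations(self, key):
        # A leaf provider holds at most one copy, under its own name.
        return [self.name] if self.has(key) else []


class CloudOfClouds:
    """One interface in front of many providers (hypothetical sketch)."""

    def __init__(self, name, providers):
        self.name = name
        self.providers = list(providers)

    def put(self, key, data):
        # The customer makes one call; every provider behind it gets a copy.
        for provider in self.providers:
            provider.put(key, data)

    def has(self, key):
        return any(provider.has(key) for provider in self.providers)

    def get(self, key):
        # Return the first available copy.
        for provider in self.providers:
            if provider.has(key):
                return provider.get(key)
        raise KeyError(key)

    def locations(self, key):
        # Flatten nested clouds into the list of leaf storage locations.
        found = []
        for provider in self.providers:
            found.extend(provider.locations(key))
        return found


if __name__ == "__main__":
    chronopolis = CloudOfClouds(
        "Chronopolis",
        [InMemoryProvider("SDSC"), InMemoryProvider("NCAR"), InMemoryProvider("UMIACS")],
    )
    duracloud = CloudOfClouds(
        "DuraCloud",
        [InMemoryProvider("S3"), InMemoryProvider("Rackspace"), chronopolis],
    )
    duracloud.put("example-object", b"example bytes")
    print(duracloud.locations("example-object"))
```

Because CloudOfClouds offers the same methods as a single provider, it can be nested to any depth without the customer seeing the difference, which is the point of the single interface.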

And, it is also easy to imagine future "cloud providers" sitting behind DuraCloud where the cloud provider is itself a "cloud of clouds." For example, the folks behind Chronopolis, a "cloud of clouds" itself with storage locations at the San Diego Supercomputer Center, the National Center for Atmospheric Research, and the University of Maryland's Institute for Advanced Computer Studies, have announced their intention to be one of the storage providers behind DuraCloud. And so by putting our content into DuraCloud, we may one day also be putting it into Chronopolis, which in turn means putting a copy into the storage clouds at SDSC, NCAR, and UMIACS.

Friday, June 24, 2011

TRAC: B5.2: Descriptive metadata and the AIP

B5.2 Repository captures or creates minimum descriptive metadata and ensures that it is associated with the archived object (i.e., AIP).

The repository has to show how it gets its required metadata. Does it require the producers to provide it (refusing a deposit that lacks it) or does it supply some metadata itself during ingest?

Associating the metadata with the object is important, though it does not require a one-to-one correspondence, and metadata need not necessarily be stored with the AIP. Hierarchical schemes of description allow some descriptive elements to be associated with many items. The association should be unbreakable—it must never be lost even if other associations are created.

Evidence: Descriptive metadata; persistent identifier/locator associated with AIP; system documentation and technical architecture; depositor agreements; metadata policy documentation, incorporating details of metadata requirements and a statement describing where responsibility for its procurement falls; process workflow documentation.

ICPSR requires very little metadata from producers. We do need the essentials: What does this two-digit number in columns 44 and 45 mean? But we do not require producers to provide other types of metadata that might be useful for building finding aids. The production and review of metadata is a primary output of our workflow.

Metadata tends to reside in two different places. Some of it is co-located with the data it documents, and people would find it in the codebook or in a DDI XML description of the study or dataset. In other cases the metadata resides in a different location, a relational database (to support dissemination) or in DDI XML files (to support archival storage).

ICPSR has many (maybe even all of) the items listed in the evidence section above. We have depositor agreements. We have (lots of) process workflow documentation. We have descriptive metadata.

Friday, June 17, 2011

TRAC: B5.1: Metadata for discovery and identification

B5.1 Repository articulates minimum metadata requirements to enable the designated community(ies) to discover and identify material of interest.

Retrieval metadata is distinct from metadata that describes what has been found. For example, in a library we might say that a book’s title is mandatory, but its publisher is not, because people generally search on the title.

A repository does not necessarily have to satisfy every possible request, but must be able to deal with the types of request that will come from a typical user from the designated community(ies). The minimum requirements must be articulated. The minimum may be nothing more than an identifier the designated community(ies) would know and use to request a deposited object.

Evidence: Descriptive metadata.

Have we got descriptive metadata? Oh, yeah!

One of the main work processes at ICPSR is creating and editing descriptive metadata. In addition to keeping archival snapshots in DDI format, we also expose the metadata via the home page for a study, the facets that appear in the search engine, and the technical documentation we make available for download.

Friday, June 10, 2011

TRAC: B4.5: Archival storage records

B4.5 Repository has contemporaneous records of actions and administration processes that are relevant to preservation (Archival Storage).

These records must be created on or about the time of the actions they refer to and are related to actions associated with archival storage. The records may be automated or may be written by individuals, depending on the nature of the actions described. Where community or international standards are used, such as PREMIS (2005), the repository must demonstrate that all relevant actions are carried through.

Evidence: Written documentation of decisions and/or action taken; preservation metadata logged, stored, and linked to pertinent digital objects.

If I were to map ICPSR's business processes to the OAIS reference model, the vast majority of the active work would take place in the Ingest area, and most of the resources of ICPSR would be devoted to that activity - moving content through its initial deposit and into Archival Storage and Dissemination.

The next major area that gets attention and resources would be Dissemination. A significant number of people prepare content for delivery on our web site, and ICPSR also employs several software developers and web designers who build (we hope!) better and better systems for making social science datasets and documentation available to the research community.

And so the number of actions and administrative processes that we apply to content in Archival Storage is relatively small. Except for the major migration in media described in earlier posts, items in Archival Storage don't tend to change much, and we don't tend to perform all that many functions on them, except for the regular fixity checks. Most of the actions, I believe, have been taken at the meta-level rather than the object level, such as standing up additional storage locations so that objects are stored in N + 1 locations rather than N.

Given that the actions are infrequent and apply more to the infrastructure comprising Archival Storage rather than the objects in Archival Storage, we've tended to document actions, decisions, and changes manually and in narrative rather than in some more automated way using a machine-actionable format, such as an XML schema like PREMIS. For instance, when we added our newest Archival Storage location, we described this in a post on our Intranet.
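For comparison, here is a minimal sketch of what a more machine-actionable record of the same kind of action might look like. This is not PREMIS and not what ICPSR does today; the StorageEvent class, its field names, and the JSON-lines output are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _now_utc():
    """Current time as an ISO 8601 string, so records are contemporaneous."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class StorageEvent:
    """Hypothetical record of one Archival Storage action (not a PREMIS event)."""

    event_type: str      # e.g. "add_storage_location"
    description: str     # the narrative we would otherwise post on the Intranet
    agent: str           # who or what performed the action
    timestamp: str = field(default_factory=_now_utc)


def to_log_line(event):
    """Serialize an event as one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(event), sort_keys=True)


if __name__ == "__main__":
    event = StorageEvent(
        event_type="add_storage_location",
        description="Added a new Archival Storage location.",
        agent="systems administrator",
    )
    print(to_log_line(event))
```

The narrative text survives in the description field, so a record like this could sit alongside the Intranet post rather than replace it.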

How have others tended to document such decisions and actions?

Wednesday, June 8, 2011

New technology for our Research Connections property

Most of the new technology is behind the scenes, but ICPSR deployed a major update to the content and metadata curation system that sits behind our Research Connections web site this Monday.

The old curation system used an older search technology that ICPSR never really embraced: Oracle Text Search. Our experience with this technology was very negative. My sense is that when we first began using the technology in 2005-2006, it was not stable, and had not been tested rigorously across the most common platforms. We found it difficult to open trouble tickets and cases with Oracle, and even when we were successful at that, we found that they were slow to provide a fix.

We replaced most of our Oracle Text Search in 2009 and 2010, moving to the Lucene search engine from the Apache project. Our experience there was very different, and somewhat ironic: the level of support and quality of software was much higher for "unowned" open source software than it was for a commercial product from a vendor. And now we have been able to replace the search we use in the curation system with Lucene too.

Moving to Lucene also allowed us to decommission a large corpus of kludges we had put in place to make Oracle Text Search work. For example, we found that the Oracle Text Search parser did not do a very good job indexing PDF-format documents; it would silently fail, and so our index was never complete and correct. So we built a system - it could have been designed by Rube Goldberg himself - which continually watched for new PDF documents to appear, converted them to text, updated the Oracle Text Search index, checked the index for correctness, and then moved the new index into production. And then it started the cycle again. No one will miss this piece of software.

Monday, June 6, 2011

One week later - Facebook + Google v. MyData

ICPSR rolled out a new option for logging into our web site on May 18. Instead of requiring web site visitors to create a MyData account (using an email address) and a password, we deployed a new service whereby web site visitors can use their existing Google ID or Facebook ID to log in. And, if they have already logged into Google (or Facebook), then there is no need to log in again to ICPSR. (This type of feature is often called SSO for Single Sign-On.)

I was curious to see how many people were taking advantage of this new service, and so I pulled a week's worth of information about newly created profiles. (We always create a profile for a new ICPSR web site user, regardless of whether s/he uses Facebook, Google, or MyData.) The specific time period covered was May 19, 2011 through mid-morning May 26, so just a bit over a week.

During that time there were nearly six hundred new profiles created, and a bit over two-thirds took advantage of the new SSO service. Here are the numbers: 399 new profiles were created by people who used either Google or Facebook to log in to ICPSR's web site, and 171 new profiles were created by people who set up a MyData account.
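As a quick check on "nearly six hundred" and "a bit over two-thirds," the arithmetic from the two counts above works out as follows (the variable names are mine):

```python
# Counts of new profiles, May 19 through mid-morning May 26, 2011.
sso_profiles = 399      # logged in with Google or Facebook
mydata_profiles = 171   # created a MyData account

total = sso_profiles + mydata_profiles
sso_percent = 100 * sso_profiles / total
mydata_percent = 100 * mydata_profiles / total

print(total)           # 570
print(sso_percent)     # 70.0
print(mydata_percent)  # 30.0
```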

I wasn't surprised to see more non-MyData profiles than MyData profiles, but I was surprised that the difference wasn't even more pronounced. My guess would have been that most people have either a Facebook or Google ID (or both), and that most people would rather use that ID for logging into a system than take the time to set up a MyData account, especially since we've seen that most web site visitors don't come back very often (or at all).

Maybe, though, this very transient usage is what leads some people to create the MyData account. Perhaps they view the account as one that is completely disposable, and they don't care if they don't remember the account or password? And perhaps they would prefer to keep their reusable identity (from Facebook or Google) separate?

Friday, June 3, 2011

TRAC: B4.4: Monitoring integrity

B4.4 Repository actively monitors integrity of archival objects (i.e., AIPs).

In OAIS terminology, this means that the repository must have Fixity Information for AIPs and must make some use of it. At present, most repositories deal with this at the level of individual information objects by using a checksum of some form, such as MD5. In this case, the repository must be able to demonstrate that the Fixity Information (checksums, and the information that ties them to AIPs) are stored separately or protected separately from the AIPs themselves, so that someone who can maliciously alter an AIP would not likely be able to alter the Fixity Information as well. A repository should have logs that show this check being applied and an explanation of how the two classes of information are kept separate.

AIP integrity also needs to be monitored at a higher level, ensuring that all AIPs that should exist actually do exist, and that the repository does not possess AIPs it is not meant to. Checksum information alone will not be able to demonstrate this.

Evidence: Logs of fixity checks (e.g., checksums); documentation of how AIPs and Fixity information are kept separate.

ICPSR calculates a fingerprint (via MD5) on each object as it enters archival storage. This fingerprint is stored in a relational database, and is keyed to the location of the object. The object itself is stored in a conventional Linux filesystem. The filesystem and the relational database's storage area are on physically different systems.

A weekly job runs through the list of objects in the database, and it compares the stored fingerprint to one it calculates (on the fly) for the object. If there is a mismatch, the job reports an error.

Regardless of whether or not there is an error, the weekly job sends a status report.
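The weekly job described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the function names are hypothetical, a plain dictionary stands in for the relational database table, and treating a missing file as its own kind of error is an assumption on my part.

```python
import hashlib
import os
import tempfile


def md5_fingerprint(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(stored):
    """Compare stored fingerprints with freshly calculated ones.

    `stored` maps object location -> fingerprint recorded at ingest
    (a stand-in for the database, which lives on a separate system).
    Returns a list of (location, problem) tuples; empty means no errors.
    """
    errors = []
    for location, expected in sorted(stored.items()):
        if not os.path.isfile(location):
            errors.append((location, "missing"))    # assumption: reported separately
        elif md5_fingerprint(location) != expected:
            errors.append((location, "mismatch"))
    return errors


def status_report(stored, errors):
    """Build the report that is sent whether or not there were errors."""
    lines = ["objects checked: %d" % len(stored), "errors: %d" % len(errors)]
    lines += ["%s: %s" % (problem, location) for location, problem in errors]
    return "\n".join(lines)


if __name__ == "__main__":
    # Demonstration with a throwaway file in place of a real archival object.
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "object.dat")
    with open(path, "wb") as handle:
        handle.write(b"archival content")
    stored = {path: md5_fingerprint(path)}   # recorded at ingest
    with open(path, "wb") as handle:
        handle.write(b"altered content")     # simulate a change on disk
    print(status_report(stored, check_fixity(stored)))
```

Reading the file in chunks keeps memory use flat even for large datasets, and building the report from the full list of objects means a clean run still produces a message.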

In the case of error, a systems administrator investigates almost immediately. (We check fixity over the weekend; a problem might not be investigated until Monday morning.) The most common error we have seen so far has been due to errors in the ingest process. A typical scenario is that a data manager submits an object for ingest, and the job fails for some reason, leaving the database and the object store out of sync. A typical remedy is that a systems administrator isolates and corrects the fault (or the data manager corrects the content), and the job is resubmitted. If any partial content (in either the database or the object store) has gotten stuck, the systems administrator clears it. Again, in the most typical case, the systems administrator would discard the "stuck" content, and the data manager would resubmit the objects for ingest.
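Finding content that is "stuck" on one side or the other amounts to comparing two inventories, which also speaks to the TRAC point that checksums alone cannot show that every AIP that should exist does exist. A minimal sketch, assuming we can list the locations known to the database and the locations present in the object store (the function name and the example paths are hypothetical):

```python
def reconcile(database_locations, store_locations):
    """Report locations that appear in only one of the two inventories."""
    in_database = set(database_locations)
    in_store = set(store_locations)
    return {
        # The database expects these objects, but the store does not have them.
        "missing_from_store": sorted(in_database - in_store),
        # The store holds these objects, but the database has no record of them.
        "missing_from_database": sorted(in_store - in_database),
    }


if __name__ == "__main__":
    database = ["/archive/a", "/archive/b", "/archive/c"]
    store = ["/archive/b", "/archive/c", "/archive/d"]
    print(reconcile(database, store))
```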

Wednesday, June 1, 2011

ICPSR NSF INTEROP project heads to Vancouver for IASSIST 2011

Here is a sneak preview of the poster we've put together for the IASSIST 2011 conference for our NSF INTEROP EAGER project investigating Fedora as a repository solution for social science datasets.

Mary Vardigan, Assistant Director of ICPSR, and Director of Collection Delivery, will be presenting the poster at the IASSIST 2011 conference's poster session on Thursday, June 2, 2011.

In the poster we've tried to capture the thinking that went into our object model for social science research data and documentation, and how we mapped that to Fedora's Content Model Architecture. We also highlight the other two deliverables of the project, a tool for generating FOXML-format objects for ingest into Fedora, and a suite of still-evolving services that may be applied to the objects.