Technology at ICPSR: January 2011

Friday, January 28, 2011

TRAC: B2.10: Processing deposits

B2.10 Repository has a documented process for testing understandability of the information content and bringing the information content up to the agreed level of understandability.

If Content Information or Preservation Description Information (PDI) is not directly usable by the current application tools of the designated community(ies), the repository needs to have a defined process for giving it usable form or for making additional Representation Information available (see B3.2).

Repositories that share the burden of ensuring that adequate metadata or documentation is captured or generated to meet a required degree of understandability can implement any number of procedures to address this requirement. Such repositories typically have a narrowly defined designated community, such as a particular science discipline.

Evidence: Retention of individuals with the discipline expertise; periodic assembly of designated or outside community members to evaluate and identify additional required metadata.

Disclaimer: I'm not sure that I fully understand this TRAC requirement, and my sense is that it is one of the few where I (as the "tech guy") might get a pass. But here goes....

This requirement seems to be asking the question: Are your customers and clients able to use the content you are making available? I think there are two different answers to this question.

One answer is an emphatic, Yes! At the aggregate level it seems clear that the content preserved and disseminated by ICPSR is useful to the community. If it weren't, we would presumably see a rapid erosion in the number of members and the number of datasets downloaded, and the general disuse of ICPSR as a resource for social science research. Why come get our stuff if it isn't useful to the community?

Another answer is the more equivocal, Probably. At the micro level of a particular study, it is easy to imagine that we have some content which is both accessed infrequently and where the metadata is somewhat sketchy. For example, imagine a study first processed by ICPSR in the 1970s, and which, for whatever reason, does not have modern "ready to go" formats or even modern setups for the common stat packages. It would still have data available (plain ASCII, maybe even in card format), and it would have some sort of associated codebook, even if only in plain text format. Clearly this type of content would be much less usable - perhaps even unusable - by our clients.

That scenario raises the question: Is there a problem to solve? If ICPSR wants to serve its membership well over the long term, what is the right strategy for handling content which may have little value to the community (at least today)?

Wednesday, January 26, 2011

Updated search capability

One of the new things we're releasing in February is an improved search on the ICPSR web site. The change isn't in the way that people use the search; rather it is in how we build the index.

The ICPSR "study search" index has always used the rich set of metadata that our data managers create during the data curation process. This contains the usual items one might expect to find in an index, such as the name of the researcher, the name of the study, subject terms and headings, etc.

Our new index uses this same rich metadata, but also makes use of the full text available in documents such as codebooks and survey instruments. Our preliminary findings have been very encouraging: in some cases, studies that would have been hard or impossible to find with a "metadata only" search appear high in the search results with the broader index.

For example, if you were to search using the terms "warfare" and "africa" in our current search, you would end up with this URL in your browser's address bar:

http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies?archive=ICPSR&q=warfare+africa

and three results.

However, with the new search, you will end up with over 30 results, including many studies that are in the World Military Expenditures and Arms Transfers set. You can take a sneak peek at the new capability by adding the string "&newSearch=true" to the end of the URL in the address bar. For instance, to do the search above using the new index, use this URL:

http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies?archive=ICPSR&q=warfare+africa&newSearch=true

and see the difference.
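
If you would rather build these URLs in a script than edit the address bar by hand, here is a minimal sketch. The function name study_search_url and its new_search flag are my own inventions for illustration; the only things taken from the site are the base URL and the archive, q, and newSearch parameters shown above.

```python
from urllib.parse import urlencode

# Base URL of the study search, as it appears in the address bar.
BASE = "http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies"


def study_search_url(terms, new_search=False):
    """Build a study-search URL for a list of search terms.

    Hypothetical helper: it only assembles the query string and
    does not contact the web site.
    """
    params = [("archive", "ICPSR"), ("q", " ".join(terms))]
    if new_search:
        # Opt in to the new full-text index.
        params.append(("newSearch", "true"))
    # urlencode encodes the space between terms as "+".
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(study_search_url(["warfare", "africa"]))
    print(study_search_url(["warfare", "africa"], new_search=True))
```

Run as a script, this prints the two URLs shown above, so you can paste each into a browser and compare the result counts.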

Saturday, January 15, 2011

ICPSR Machine Room Work - Redux

The machine room power work has been completed, and we've been bringing our web services back on-line. The ICPSR and Research Connections web sites are back in full service now.

ICPSR Machine Room Work

Well, it has been an interesting morning so far....

If you have been trying to use our web site this morning, then you know it got off to a slightly rocky start. It turns out that an automated database copy between the primary server at ICPSR and the replica in the cloud malfunctioned in a new and unusual way. And, of course, it happened this morning. So while static web pages loaded just fine, content pulled from the database did not work at first. Sorry about that, folks.

We're still having some challenges with the index behind our Research Connections site. Part of the team has been checking into that while the rest has been moving equipment in our machine room. When the metaphorical smoke clears later this afternoon, we'll have a major new supply of power in the machine room, and that will allow us to stand up a new UPS system later this month. And that will: (1) give us more headroom in our electrical service, (2) allow us to connect some new systems that do not have UPS, and (3) get rid of a dozen old UPS systems, some of which we own, and some of which are rentals.

It sounds like we should be able to start restoring systems before noon today. I'll publish a follow-up post once we're fully back in business.

Friday, January 14, 2011

TRAC: B2.9: Collecting Preservation Metadata

B2.9 Repository acquires preservation metadata (i.e., PDI) for its associated Content Information.

Preservation metadata (PDI) is needed not only by the repository to help ensure the Content Information is not corrupted (Fixity) and is findable (Reference Information), but to help ensure the Content Information is adequately understandable by providing a historical perspective (Provenance Information) and by providing relationships to other information (Context Information). The extent of such information needs is best addressed by members of the designated community(ies). The PDI must be permanently associated with Content Information.

Evidence: Viewable records in local format registry (with persistent links to digital objects); local metadata registry(ies); database records that include Representation Information and a persistent link to relevant digital objects.

My sense is that ICPSR is in pretty good shape on this item. We have an entire database schema devoted to our objects in Archival Storage, and we collect (and store) a great deal of information about content starting when a depositor first gives it to us (e.g., checksum via md5, identity of depositor, file format), and then extending throughout the ingest process. This content is readily available to internal staff through a no-frills but extremely useful database browser that someone on my team built several years ago.
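
To make the Fixity piece concrete, here is a minimal sketch of capturing an md5 checksum when a file arrives and checking it again later. This is not our production code or schema: the DepositRecord class, its field names, and the two functions are hypothetical and exist only to illustrate the idea.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class DepositRecord:
    """Hypothetical record of what is captured when a file is deposited."""
    filename: str
    depositor: str
    file_format: str
    md5: str


def md5_of_file(path, chunk_size=65536):
    """Return the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, record):
    """True if the file on disk still matches the checksum stored at deposit."""
    return md5_of_file(path) == record.md5
```

The idea is to call md5_of_file once at deposit time and store the result in the record; any later call to verify_fixity that returns False means the bytes on disk have changed since then.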

One thing that we don't do today is annotate original content at the file level with information about its role; I think this would fall within the scope of Context Information above. For instance, if someone deposits a file called MySurvey.xls, we do not add a comment that says, "Oh, this is survey data." I've been participating in meetings where this particular issue is getting some discussion, and my expectation is that sometime this year we (in the IT shop) will need to build some simple extensions to existing systems to implement features like this.

One bit of Context Information that we do collect (at the end of ingest) is a mapping at the aggregate level between original submissions to ICPSR (i.e., deposits) and the content we ingest and deliver (i.e., studies).

Thursday, January 13, 2011

December 2010 Deposits

# of files in this format	# of deposits containing this format	File format
3	3	application/msaccess
2	1	application/msoffice
45	14	application/msword
17	2	application/octet-stream
61	16	application/pdf
1	1	application/vnd.ms-excel
5	1	application/vnd.ms-powerpoint
119	7	application/x-sas
123	11	application/x-spss
105	4	application/x-stata
2	2	text/html
6	1	text/plain
18	2	text/plain; charset=unknown
445	8	text/plain; charset=us-ascii

December looks like a fairly typical month in terms of volume and formats: heavy on the usual stat packages and plain ASCII, and a fair number of files in PDF or one of the MS Office formats.
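
For anyone curious how the two count columns relate, here is a minimal sketch of the tally, assuming the input is a list of (deposit id, file format) pairs with one pair per deposited file. The function name, the input shape, and the sample rows are hypothetical; this is not the report query we actually run.

```python
from collections import Counter, defaultdict


def summarize_formats(files):
    """Summarize (deposit_id, file_format) pairs, one pair per file.

    Returns rows of (number of files, number of deposits, file format),
    sorted by file format.
    """
    file_counts = Counter()
    deposits_by_format = defaultdict(set)
    for deposit_id, file_format in files:
        file_counts[file_format] += 1
        # A set counts each deposit once, however many files it holds.
        deposits_by_format[file_format].add(deposit_id)
    rows = [(file_counts[f], len(deposits_by_format[f]), f) for f in file_counts]
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    # Made-up sample: two deposits, four files.
    sample = [
        ("deposit-1", "application/pdf"),
        ("deposit-1", "application/pdf"),
        ("deposit-2", "application/pdf"),
        ("deposit-2", "application/x-spss"),
    ]
    for n_files, n_deposits, file_format in summarize_formats(sample):
        print(f"{n_files}\t{n_deposits}\t{file_format}")
```

The first column counts files and the second counts distinct deposits, which is why a format like application/x-sas can show 119 files spread across only 7 deposits.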

Wednesday, January 12, 2011

ICPSR Power Outage

The University of Michigan is upgrading the electrical system in the ICPSR machine room, and the final part of the project takes place this Saturday (January 15, 2011). The U-M electricians will need to "de-energize" the machine room at 9am EST, and their work might take up to eight hours to complete.

My team will be on-site earlier that morning to shut down all of our key systems, and to transfer ICPSR's public-facing services to our replica in Amazon's computing cloud. People using ICPSR's web site may experience brief interruptions in service as we transition control from the machine room to the cloud. This will take place between 8am and 9am EST.

Once the U-M electricians have completed their work, we will transfer services back to our primary site at ICPSR, and web site visitors may experience brief interruptions during that time too.