Technology at ICPSR: December 2010

Friday, December 10, 2010

ICPSR Technology Resources

I have been working on some boilerplate text that we plug into grant and contract applications, and it felt like it might make for a mildly interesting blog post. We are also working on a diagram to accompany this narrative, and once that is put together, I'll pass it along too. Here is the text:

ICPSR operates an extensive computing environment. Our machine room contains dozens of multi-core Linux servers supporting the curation, preservation, and dissemination of social science research data and documentation. Over 120 desktop workstations access these servers and our EMC NAS storage system, which features over 50 terabytes of capacity. Oracle is our enterprise database management system, and we support all major statistical packages. We are connected to high-speed national and regional data networks via redundant multi-gigabit circuits, and the University of Michigan monitors our production computing systems 24 hours a day, seven days a week, notifying an on-call systems engineer in the event of system degradation.

ICPSR takes advantage of its position within the University of Michigan to gain access to a variety of systems and services, such as preferential software licensing terms, help desk and trouble ticket systems, network vulnerability scanning, desktop virtualization services, managed networks and firewalls, hosted content management systems, collaboration and courseware (Sakai) systems, and desktop workstation software and patch management systems.

ICPSR makes regular use of commercial computing clouds to provision elements of its cyberinfrastructure, particularly services that require high availability for delivering public-use content. ICPSR also has regular access to special-purpose, cloud-based infrastructure (Chronopolis, DuraCloud) to support its digital preservation mission.

Wednesday, December 8, 2010

DuraCloud pilot update - December 2010

I'm getting together with my DuraCloud pilot colleagues the morning before CNI starts. I just saw the agenda for the meeting, and it looks like it will be a fruitful and interesting set of conversations.

I also had a very nice chat with Carol Minton Morris from DuraSpace about our expectations for the DuraCloud pilot project and our experience to date. You can find her write-up of that conversation here on the DuraSpace blog.

Friday, December 3, 2010

TRAC: B2.8: Capturing Representation Information

B2.8 Repository records/registers Representation Information (including formats) ingested.

When international standards for the associated Representation Information are not available, the repository needs to capture such information and register it so that it is readily findable and reusable. Some of it may be incorporated into software. The Representation Information is critical to the ability to turn bits into usable information and must be permanently associated with the Content Information.

Evidence: Viewable records in local format registry (with persistent links to digital objects); local metadata registry(ies); database records that include Representation Information and a persistent link to relevant digital objects.

As noted in last week's post, we capture representation information both in IANA MIME type form and in a more human-readable form. We are also looking at adding another piece of representation information to the metadata that surrounds the files deposited at ICPSR: file type or file role.

In brief, the idea is to capture the high-level concept behind the role the file plays in research. For example, it may be nice to know that a given file is an Excel workbook, but it is also important to know whether the file contains data, documentation, a database of sorts, or some combination of things. An Excel file that contains nothing but columns of numbers and text might be normalized quite easily into a more durable format. An Excel file that contains nothing but text and images and descriptions of a data file might be converted to PDF/A or TIFF or some other format.

This idea has been used with the derived content that ICPSR produces for a very long time, but ICPSR is just now exploring the required changes in business process to do this for deposited files as well. More as this story develops....
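
As a rough illustration of the file role idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the FileRole values, the RepresentationInfo record, and the suggest_target helper are hypothetical names, not ICPSR's actual schema, and the "tab-delimited text" target is only a guess at what a "more durable format" might be for a data-only workbook. PDF/A is one of the options mentioned above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FileRole(Enum):
    """Hypothetical roles a deposited file might play in research."""
    DATA = "data"
    DOCUMENTATION = "documentation"
    DATABASE = "database"
    MIXED = "mixed"


@dataclass(frozen=True)
class RepresentationInfo:
    mime_type: str    # IANA MIME type, e.g. "application/vnd.ms-excel"
    description: str  # human-readable form, e.g. "Excel workbook"
    role: FileRole    # the proposed additional piece of metadata


# Assumed normalization targets, keyed on (MIME type, role).
# The same format maps to different targets depending on the role.
NORMALIZATION_TARGETS = {
    ("application/vnd.ms-excel", FileRole.DATA): "tab-delimited text",
    ("application/vnd.ms-excel", FileRole.DOCUMENTATION): "PDF/A",
}


def suggest_target(info: RepresentationInfo) -> Optional[str]:
    """Return an assumed normalization target, or None if there is no rule."""
    return NORMALIZATION_TARGETS.get((info.mime_type, info.role))


codebook = RepresentationInfo(
    "application/vnd.ms-excel", "Excel workbook", FileRole.DOCUMENTATION
)
print(suggest_target(codebook))
```

The point of the sketch is only that the MIME type alone cannot pick the target; the role is what separates the two Excel cases.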

Wednesday, December 1, 2010

ICPSR Deposits - November 2010

Here is a snapshot of the types of deposits and deposited files that ICPSR received in November 2010.

My sense is that this was a fairly typical month in terms of both volume and the types of files. We always get lots of documentation and related text in one of the common formats, such as PDF and MS Word. And most of the data arrives either in the format of one of the big stat packages, like SAS and SPSS, or in plain text.

# of files	# of deposits	File format
3	2	application/msoffice
120	14	application/msword
202	33	application/pdf
3	2	application/vnd.ms-excel
22	6	application/x-sas
40	13	application/x-spss
3	1	application/x-stata
2	2	message/rfc822 7bit
1	1	text/html
3	1	text/html; charset=utf-8
5	5	text/plain; charset=iso-8859-1
16	8	text/plain; charset=unknown
110	31	text/plain; charset=us-ascii
48	1	text/plain; charset=utf-8
1	1	text/rtf
1	1	text/x-c++; charset=us-ascii
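
For anyone curious how a table like this could be tallied, here is a minimal Python sketch. It assumes the input is a list of (deposit ID, file format) pairs, one per deposited file; the tally_formats function and the sample records are made up for illustration and are not ICPSR's actual reporting code or data.

```python
from collections import defaultdict


def tally_formats(records):
    """Count files and distinct deposits per file format.

    records: iterable of (deposit_id, file_format) pairs, one per file.
    Returns {file_format: (number_of_files, number_of_deposits)}.
    """
    files = defaultdict(int)
    deposits = defaultdict(set)
    for deposit_id, file_format in records:
        files[file_format] += 1
        deposits[file_format].add(deposit_id)
    return {fmt: (files[fmt], len(deposits[fmt])) for fmt in sorted(files)}


# Made-up sample: two deposits containing four files in total.
sample = [
    ("deposit-1", "application/pdf"),
    ("deposit-1", "application/pdf"),
    ("deposit-2", "application/pdf"),
    ("deposit-2", "application/x-spss"),
]

for fmt, (n_files, n_deposits) in tally_formats(sample).items():
    print(f"{n_files}\t{n_deposits}\t{fmt}")
```

Because a single deposit can contain files in several formats, the deposit counts in a table like this one will generally add up to more than the number of deposits received.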