Technology at ICPSR: September 2009

Wednesday, September 30, 2009

TRAC: C1.1: Well-supported core infrastructure

C1.1 Repository functions on well-supported operating systems and other core infrastructural software.

The requirement specifies “well-supported” as opposed to manufacturer-supported or other similar phrases. The level of support for these elements of the infrastructure must be appropriate to their uses; the repository must show that it understands where the risks lie. The degree of support required relates to the criticality of the subsystem involved. A repository may deliberately have an old system using out-of-date software to support some aspects of its ingest function. If this system fails, it may take some time to replace it, if it can be replaced at all. As long as its failure does not affect mission-critical functions, this is acceptable. Systems used for internal development may not be protected or supported to the same level as those for end-user service.

Evidence: Software inventory; system documentation; support contracts; use of strongly community supported software (i.e., Apache).

At the foundation of ICPSR's core technology infrastructure is Red Hat Linux. Linux, of course, is a very widely deployed open source variant of UNIX, and Red Hat is a world leader in supporting Linux. ICPSR had previously used proprietary operating systems, but moved all of its systems to Red Hat Linux in the past decade.

Moving up in the stack of our core technology infrastructure, we use several pieces of software from the Apache Software Foundation, a very large community of developers and users. In addition to the flagship HTTP Server, we also use the Tomcat servlet container for all of our Java-based web applications, the Solr search engine, which is built atop the Lucene Java search technology, and the Cocoon framework for rendering XML into other formats.

ICPSR also builds and maintains a suite of custom software for key business processes, such as our Data Deposit Form (ingest), our download system (access), and our data processing systems (data management, ingest, access). While this software is necessarily proprietary, it is written in common, modern software languages such as Perl and Java, which have wide support in the community.

Like many, many enterprises, ICPSR uses Oracle as its database system. Given the large installed base of Oracle across the world, ICPSR views this as a well-supported platform. We believe we can continue to use Oracle as long as the University of Michigan continues to make it freely available to us. Further, we make use of only the most basic elements of Oracle, and if required, it would be straightforward, but not insignificant, to migrate our content to any other relational database technology, such as PostgreSQL or MySQL. Our one highly customized use of Oracle, and therefore the one that carries the most risk, is the OracleText-based search engine for the Child Care site, which is scheduled to be replaced in early 2010.

And finally, we are actively migrating our own proprietary archival storage system to the Fedora system, which is supported by the newly created DuraSpace organization. Other than internally created systems, Fedora probably has the smallest community of support of any of our major technology systems, but because it is open source and gaining traction in the community, we believe its level of support will continue to grow over time. And further, because the underlying content resides in plain XML files, even with a sudden and catastrophic loss of Fedora, it would still be possible to migrate content to another system.
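
Because the content is plain XML, a fallback migration would need little more than a standard XML parser. As a rough sketch only: the snippet below walks a directory tree and reports the root element of every XML file it finds, which is the sort of first inventory one might take before moving content elsewhere. The directory layout, the `.xml` extension, and the function name `inventory` are hypothetical and are not meant to describe Fedora's actual on-disk structure.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def inventory(root_dir):
    """Yield (path, root_tag) for each .xml file under root_dir.

    Files that cannot be parsed are reported with a root_tag of None
    so that they can be inspected by hand.
    """
    for path in sorted(Path(root_dir).rglob("*.xml")):
        try:
            root_tag = ET.parse(str(path)).getroot().tag
        except ET.ParseError:
            root_tag = None
        yield path, root_tag
```

Calling `list(inventory("/some/export/directory"))` (a placeholder path) would give a quick picture of what is readable without any repository software in place.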

Monday, September 28, 2009

Trustworthy Repositories Audit & Certification: Criteria and Checklist:

We're working through the Center for Research Libraries (CRL) Trustworthy Repositories Audit & Certification (TRAC) Criteria and Checklist. My primary responsibility is to review Section C, which focuses on technology. Having looked through the TRAC document several times now, my sense is that ICPSR has a pretty good story to tell, and the main task is to take the time to tell the story. I thought it might be interesting to begin a series of blog posts, one per item from Section C in the TRAC checklist.

I have three goals for the posts:

Tell ICPSR's story for each item so that we have transparent evidence and documentation in each area
Share ICPSR's evidence and documentation for others who are working through the TRAC document, or who are thinking about starting; I think it's often easier to work with something that someone has already begun than to start from scratch
Solicit input from others who have completed TRAC certification: What have we missed?

There are ten items in Section C.1 and I'll see if I can't work through one item/week between now and the end of the calendar year. Look for the first item later this week.

And if you're working through TRAC yourself, I'd love to hear any advice you have (either as Comments here or via email).

More Grant News: IMLS National Leadership Grants

Micah Altman and Gary King of Harvard's IQSS put together a proposal to the IMLS to extend work on a LOCKSS-based archival replication system we started as part of our Library of Congress-funded Data-PASS project, and it looks like the IMLS will be funding it. A nice summary of recent grants is available on the IMLS web site, and also on the Library Journal web site. It wouldn't be right of me to mention the LOCKSS-based system without also mentioning the Roper Center and the Odum Institute as the other key players in building the initial prototype. It was a real team effort.

And while it isn't technology-oriented....

ICPSR picked up its own IMLS grant for rescuing "at risk" social science data. George Alter is the PI, and his project is another follow-on that leverages our Data-PASS partnership.

Friday, September 25, 2009

IASSIST 2010 Call For Papers

In case you haven't seen this already, here's the call for papers for the next IASSIST meeting:

IASSIST 2010

Social Data and Social Networking:
Connecting Social Science Communities across the Globe
1-4 June 2010
Ithaca, NY, USA

IASSIST 2010, the 36th Annual Conference of the International Association for Social Science Information Service and Technology (IASSIST) will be hosted by the Cornell Institute for Social and Economic Research (CISER) and Cornell University Library (CUL) and will be held at Cornell University, in Ithaca, New York, USA, on 1-4 June 2010.

The theme of this year's conference is Social Data and Social Networking: Connecting Social Science Communities across the Globe. Social science has begun to feel the impact of the dramatic shift in communication patterns globally, where social networking and other digital media trends are changing how social scientists study the world around them. This theme is intended to stimulate discussion about the impact of social networking on the creation, collection, sharing, storage, preservation, dissemination, confidentiality, licensing of, and access to data. Of particular interest is how social connectivity has facilitated multi-site and cross-national social science research.

A webform for submission of proposals will be available on the conference web site: http://ciser.cornell.edu/IASSIST/ from 12 October 2009.

Deadline for submission: 30 November 2009.

Notification of acceptance: 1 February 2010.

For more information about the conference, including travel and accommodation, see the attached PDF Call for Papers or visit the conference web site at:
http://ciser.cornell.edu/IASSIST/ .

IASSIST is an international organization of professionals working in and with information technology and data services to support research and teaching in the social sciences. Typical workplaces include data archives/libraries, statistical agencies, research centers, libraries, academic departments, government departments, and non-profit organizations, see http://www.iassistdata.org for further information.

As usual, I suspect that ICPSR will submit several presentations and papers for the conference. I think an update on either (or both) of the recent tech-oriented grants might be appropriate: using the cloud to deliver sensitive and under-utilized data, and using Fedora to preserve and deliver social science datasets and documentation.

Thursday, September 24, 2009

Designing Storage Architectures for Digital Preservation - Day Two

[ Due to a combination of my own stupidity and the way in which Blogger does (or doesn't!) do auto-save, many of my notes for the second day disappeared sometime between DCA and DTW. So a very abbreviated set of notes for Day Two. ]

The first session, Data Integrity, began at 9:00am with a series of vendor presentations.

Henry Newman, Instrumental: small market for digital preservation-quality systems. Disk density and transfer rate have outpaced reliability with disks. Still need tapes due to low power and high capacity and high reliability. HSM lacks broad market acceptance. Asserts that loss of a single bit is catastrophic. Mismatch between preservation requirements and main market requirements.
David Rosenthal, LOCKSS. Different usage patterns between content going into archives, and content retrieved on a regular basis. Designs and thinking need to take account of this.
Ray Clarke, Sun Microsystems. Draws distinction between backups v. archiving (preservation). Content growth exponential. Most data in archives is used infrequently. Asserts that tape "continues to make sense" for preservation: power, portability, etc. Humans introduce most errors, and so need to future-proof.
Mike Mott, IBM, spoke about how some loss in some contexts is acceptable. Need to hit the "utility" number. Shared some stories from the past about needing to solve error detection and correction problems in the end-to-end system, not just within each component.
Tim Harder, EMC, High-Assurance and Integrity Layer. "Law of Large Numbers is not on your side." Described approach similar to LOCKSS and DuraCloud. Use sampling to validate correctness of data.
Paul Rutherford, Isilon, failures = disk drive, controller/node, human. Do not trust storage. Do not trust yourself. Need to recover from failures fast enough. Overall system must be available in face of failure in large components. "RAID is dead." Not good enough. "We called it 'grid' before 'cloud'."

Tuesday, September 22, 2009

Designing Storage Architectures for Digital Preservation - Day One

This is the first day of a two-day workshop on storage architectures for digital preservation. The workshop is hosted by the Library of Congress at the Churchill Hotel in Washington, DC. There are eighty or so attendees, many from the LoC itself, but also many "tool makers" (Sun, EMC, Seagate, Cisco, etc.) and "data stewards" (ICPSR, MetaArchive, HathiTrust, etc.). My apologies in advance where I have misunderstood or misquoted a speaker below.

The workshop began at noon with a luncheon, followed by a brief Opening/Welcome session. This moved quickly to a 90-minute session, Storage for Digital Preservation: Panel of Case Studies from Users, which began at 1:15pm. There were eight seven-minute presentations:

Thomas Youkel, Library of Congress, cited some numbers about the amount of content ingested by the LoC, and the amount they project they will ingest during the next two years. One interesting figure is that the LoC ingested 24.7 TB during the week of June 2, 2009. He also described data integrity as a key challenge, and workflow, content management, and migration as secondary challenges.
David Minor, San Diego Supercomputer Center, gave a brief overview of the Chronopolis project: three partners (SDSC, NCAR, UMIACS), 50TB of storage at each node. SRB is the content transport system; BagIt is the content container; and, ACE is the content integrity system. ICPSR, the California Digital Library, the MetaArchive, and one other organization I missed are the content providers. Chronopolis Next Generation is seeking additional storage partners (nodes), migration tools, and connecting to other storage networks, such as the MetaArchive's private LOCKSS network.
Bill Robbins, Emory University, described how the MetaArchive was using an Amazon EC2 system as its central "properties server" to solve the (political? procedural?) issues of Emory serving as the "master node" for the MetaArchive. Bill had a good quote: "We're not cheap. We're 6x cheap, and that's not so cheap." Bill expressed general satisfaction with EC2, but wished the documentation was better.
Andy Maltz, Academy of Motion Picture Arts and Sciences, referenced the Digital Dilemma in his talk about the requirements his organization has for digital preservation solutions: (1) last 100 years; (2) survive benign neglect; (3) at least as good as photochemical; and, (4) cost less than $500/TB/year. Andy also referenced the phrase "Migration is broken" from a 2007 SNIA report. He cited some figures: a movie consumes 2-10PB of storage, and Hollywood produces about one movie/day. He finished with a brief description of StEM, an NDIIPP-sponsored project.
Laura Graham, Library of Congress, described the LoC's efforts to preserve websites. The Internet Archive does the crawling, and the Wayback Machine is the delivery mechanism. A system at the LoC acts as archival storage. Wish list includes fewer manual steps in the system, and less of a need to copy files around quite so much.
John Wilkin and Cory Snavely, HathiTrust (and the University of Michigan Library), gave a brief overview of HathiTrust. They're leveraging the OAIS reference model, plugging in modular solutions wherever possible. 185TB of storage today. Focus is on the "published record." Cory asserted that data stewards will need to be able to rely more and more on the storage solution (trust) in order to succeed in their missions.
Jane Mandelbaum, Library of Congress, was the proxy for a very brief overview of the DuraCloud effort from DuraSpace. DuraCloud is essentially a middle layer between a variety of cloud storage providers and data stewards.
Jimmy Lin, representing Cloudera, described Cloudera as the Red Hat for Hadoop. Jimmy went on to talk a bit about Hadoop, HDFS, and MapReduce, and how Cloudera might be a very attractive platform for connecting "compute" to storage.

The session concluded with a general conversation about key issues in digital preservation: trust; costs; not knowing what bits will be considered valuable up-front; and, how frequency of access is unknown. David Rosenthal had a good line: "You have to get used to the idea of losing stuff." There's no magic bullet that will keep lots of bits around for a long time without any loss.

My main comment during the session was that software and hardware and even power are not the big costs of digital preservation (at least at ICPSR); the big costs are people, and the processes that require people.

After a short break, we began the next session, Storage Products & Future Trends: Vendor Perspectives, at 3:15. Again, the format was a series of seven-minute presentations:

Art Pasquinelli, Sun Microsystems, spoke about the Sun PASIG, and described how researchers were looking to IT and libraries for their digital preservation needs, and how that was simply not working.
Mike Mott, IBM, asked how we define a "document" in a digital world, and thought that we would see the end of Moore's Law (in storage) by 2013 unless there was a new technological breakthrough. Mike also described a new paradigm in architecture: a river v. a building.
Dave Anderson, Seagate, spoke on how he expected some trends to end (approx. 40%/year increase in capacity + 20%/year increase in transfer rate); how solid state disk uptake has been slower than expected; and, how the change in disk form factor from the desktop (3.5") to the laptop (2.5") will shift the industry.
Tim Harder, EMC, described a new "compute + storage" solution called Atmos, and how they are betting big on x86 technology + virtualization, and off-the-shelf gear bundled with software. EMC has also founded a cloud division.
Paul Rutherford, Isilon, expects SATA, 3.5" form-factor disks, and block-level access to disappear, replaced by SAS, 2.5" form-factor disks, and file-level (or object-level) access. He said "I hate the cloud" and didn't think it would be used as the sole source for important data.
Kevin Ryan, Cisco Systems, gave a high-level overview of Cisco's "unified fabric" vision which sounds like it consists of a single, lossless, open pipe for all sorts of bits: network, data, NAS, SAN, etc.
Raymond Clarke, Sun Microsystems, gave a similar type of talk, but about the Sun Cloud which struck me as an umbrella term for an integrated solution using lots of Sun's open technologies: Solaris, Java, ZFS, MySQL, etc.

The session ended with another general conversation about trends. We then had a brief close-out (with homework!) to end the day about 4:45pm.

Monday, September 21, 2009

More Good News: ICPSR Wins NSF Grant

There's more good news for ICPSR on the funding and technology front. I learned last week that our NSF proposal to the EAGER (EArly-concept Grants for Exploratory Research) program was funded. In the proposal, Shared Digital Technologies for Data Curation, Preservation, and Access: A Proof of Concept, we describe a series of objects (Content Models, Service Definitions, Service Deployments) that we will build as an exemplar on how one might store social science datasets and documentation in a Fedora-type repository.

We have been exploring Fedora quite actively since the release of version 3, and this is a wonderful opportunity to move our work from the wings and into center stage. The majority of our efforts so far have been focused more on transferring our legacy content (the keepsakes) from file-based storage into object-based storage, but the work from this grant will be much more forward-looking. This will be a chance to renew work on the early content modeling work I reported on back in May.

My sense is that between this grant and the NIH Challenge Grant, we'll need to look for a reasonably senior systems analyst/developer before the end of the calendar year. (We may even need to add two.)

Sunday, September 20, 2009

The ICPSR Web Site and Security

"Is the ICPSR Web site secure?"

This question - in one form or another - crops up from time to time. Sometimes the question comes from a prospective depositor; sometimes from an ICPSR researcher submitting a grant application; sometimes from ICPSR's Council; etc.

The quick answer is "Yes," but often the person asking the question would like a bit more information, of course. ICPSR doesn't face the same sort of threats and level of attacks as the more high-profile targets (banks, sites of people or organizations who are embroiled in controversy, government sites), but we do see unsophisticated attacks on a regular basis. This post describes a few of the proactive steps we take to ensure the site is working well for our members, partners, and other clients.

(1) Open Source software. Wherever practical we like to use open source software. Despite some reports that using open source is risky, our own experience is that using open source software minimizes risk. The vast majority of security incidents that require action on our part (e.g., updating software) apply to proprietary software from commercial vendors.

(2) SSH and the ssh port. Very few people at ICPSR have shell access to the Web server, and those that do must use ssh to connect. Unlike ftp or telnet where passwords (and sessions) pass in the clear, ssh presents an encrypted channel between the point of access (e.g., a desktop Mac) and the Web server. We also deploy ssh service on a non-standard TCP/IP port number to screen out the thousands and thousands of daily login attempts from the "script kiddies."

(3) Vulnerability scans. As it does for most computers at the University of Michigan, our central IT Security Services (ITSS) organization scans our Web server quarterly for common vulnerabilities. Unlike most computers, however, our Web server also gets a monthly scan from ITSS using a more extensive list of vulnerabilities. Asking a third party to audit the system is really, really valuable and helps identify issues that might not otherwise make it onto our radar screen.

(4) Network and system monitoring. Another third-party that helps us out is the Merit Network Operations Center (NOC). (Merit is the research and education network in the State of Michigan, and was one of three organizations that built the original Internet - the NSFnet.) The NOC monitors all of our systems 24 x 7 x 365, paging us within three minutes of any serious problem. While this doesn't necessarily prevent a problem such as a denial of service attack, it does bring it to our attention immediately so that we can take corrective action, day or night.

(5) Reviewing system and application logs. We use a rotating on-call schedule to field alerts and pages from the NOC. In addition to enjoying the thrill of carrying a pager, cell phone, laptop, and broadband wireless card, the on-call also has the responsibility for reviewing key system and application logs during his/her tour of duty. For example, the on-call inspects the nightly logwatch reports for all sorts of possible problems.

(6) Replica web site. ICPSR maintains a replica of its Web delivery infrastructure in one of the public computing and storage clouds to guard against an extended problem with the web site. Like NOC monitoring above, this does not prevent a security problem, but it does make it easier for us to recover from an attack against ICPSR or the University of Michigan.

(7) Backups and archival storage. Like almost any organization with important digital content, ICPSR employs standard tape backup strategies for its "working data." Our storage appliance writes several TBs of content to a multi-drive, tape library system in a separate building, and we then remove tapes from that system on a regular basis, placing them in a third location. Unlike most organizations (except for libraries and archives), we also move selected materials into archival storage. Items in archival storage are tested on a regular basis for integrity against their digital fingerprints, and are replicated many times for durability. Copies are stored in many different locations.

(8) SANS certification. We have found it useful to ensure that at least one member of the team is always up-to-date with his/her SANS certification. SANS is a widely recognized source for security training and certification, and it has often been helpful to have someone on the team with both the expertise and the credentials one receives from SANS.

(9) Network management, router filters and firewalls. Operating a data network has no strategic value to ICPSR, and therefore rather than operate one as amateurs (or misdirecting the resources to acquire the expertise in it), we contract with the University of Michigan's central IT organization to manage our network. They monitor it 24 x 7 x 365 for problems; make sure that the router and switch software is patched and current; and also maintain a series of router filters that act like a firewall, screening out unwanted traffic. For example, one router filter prevents anyone from outside connecting to the sqlnet port on our database server. (Even if one can reach the sqlnet port, a valid login and password is still required, of course.)

(10) Version control. One of the biggest risks to systems is oneself. To keep our systems safe from our own mistakes, we make extensive use of version control. Key system configuration files get checked into and out of RCS; Web site content and software goes into our CVS repository.
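
The integrity testing mentioned in item (7), checking archived items against their digital fingerprints, can be sketched roughly as follows. This is a minimal illustration, not a description of ICPSR's actual tooling: the choice of SHA-256, the function names `fingerprint` and `audit`, and the idea of a manifest that maps file paths to expected digests are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def fingerprint(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(manifest):
    """Return the paths whose current digest differs from the manifest.

    `manifest` is a dict of {path: expected_hex_digest} (hypothetical
    structure). A missing file counts as a failure.
    """
    failures = []
    for path, expected in manifest.items():
        if not Path(path).is_file() or fingerprint(path) != expected:
            failures.append(path)
    return failures
```

Running `audit` on a schedule against each replica, and repairing any failed copy from a copy that still passes, is one common way to combine fixity checks with replication.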

Friday, September 18, 2009

ICPSR Receives NIH Challenge Grant

Felicia LeClere, the Director of our Data Sharing for Demographic Research project, and I collaborated on an NIH Challenge Grant proposal to explore cloud computing technologies as a mechanism to deliver restricted-access data. Despite fierce competition (there were nearly 18,000 proposals submitted), ICPSR learned recently that our proposal was accepted, and that the NIH will fund the project. Felicia is the PI for the grant, and will provide overall leadership, and my team will have the lead on execution and deliverables. In addition to our own colleagues at ICPSR, we're also working with partners at the RAND Corporation and at the University of Michigan Survey Research Center to test and evaluate the system. The title of the project is Exploring New Methods for Protecting and Distributing Confidential Research Data.

From the proposal:

In this project, the Inter University Consortium for Political and Social Research and partners at the Rand Corporation and the Survey Research Center at the University of Michigan will build and test a data storage and dissemination system for confidential data, which obviates the need for users to build and secure their own computing environments. Recent advances in public utility (or “cloud”) computing now makes it feasible to provision powerful, secure data analysis platforms on-demand. We will leverage these advances to build a system which collects “system configuration” information from analysts using a simple web interface, and then produces a custom computing environment for each confidential data contract holder. Each custom system will secure the data storage and usage environment in accordance with the confidentiality requirements of each data file. When the analysis has been completed, this custom system will be fed into a “virtual shredder” before final disposal. This prototype data dissemination system will be tested for (1) system functionality (i.e., does it remove the usual barriers to data access?); (2) storage and computing security (i.e., does it keep the data secure?); and (3) usability (i.e., is the entire system easier to use?). Contract holders of two major data systems (the Panel Study of Income Dynamics and the Los Angeles Family and Neighborhood Study) will be recruited to assess both the user interface and the analytic flexibility of the new customized computing environments.

This is a very exciting opportunity for ICPSR to continue its exploration and evaluation of public computing clouds for enabling research. If our test is successful, this may also be another delivery mechanism that we add to our upcoming Restricted-access data Contracting System (RCS), where researchers apply online.

I'll be working on the technology portion of the grant, of course, and so will Steve Burling, a member of the ICPSR technology team. Steve has been leading most of our cloud computing efforts over the past year, and has acquired a lot of experience with Amazon's services during that time. To complement what Steve brings to the table, we'll also be posting a fairly senior position: someone who brings solid expertise with Windows systems and who also has gained recent experience with one of the public computing clouds. That job will appear on the University of Michigan central job site, but I'll post a link to it here too once it goes live.

It will be very, very early in the project, but I'm hoping to talk about it in a preliminary way at the upcoming Coalition for Networked Information Fall 2009 Membership Meeting. I hope to see some of you there!

Wednesday, September 16, 2009

Off-shoring into the cloud

We've been using cloud services from Amazon Web Services - Elastic Compute Cloud, EC2; Elastic Block Storage, EBS; and Simple Storage Service, S3 - for nearly a year now. Our first use was to build a replica of our web server that can be pressed into service during an emergency. Since then we've also used the cloud to deploy our search engine technology, and to deploy prototype systems for new web sites we'll be launching.

We recently took our first step into the non-North American cloud when we launched an Amazon EC2 instance in their EU region. This instance isn't hosting a web site or web service, but rather is a first step in making a copy of our holdings "off shore" for disaster preparedness. We're only keeping a copy of the content we make available on the web site, and so we can avoid any issues about confidential or sensitive content.

Copying content into the EU region is significantly slower than copying into the US region in our experience. An rsync job took 10 days to copy our 400GB of downloadable content to our EU instance's attached EBS. If my back of the envelope math is correct, then this means we only averaged about 0.5MB/s (roughly 3.7Mb/s) during the copy.
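
The back-of-the-envelope figure is easy to check, and it is worth keeping megabytes (MB) and megabits (Mb) apart when doing so. The sketch below assumes decimal units (1GB = 10^9 bytes) and a copy that ran for exactly ten 24-hour days; the function name `average_rate` is only for illustration.

```python
def average_rate(total_bytes, seconds):
    """Return (megabytes per second, megabits per second) for a transfer."""
    bytes_per_second = total_bytes / seconds
    megabytes_per_second = bytes_per_second / 1e6
    megabits_per_second = bytes_per_second * 8 / 1e6
    return megabytes_per_second, megabits_per_second


GB = 1e9            # decimal gigabyte (assumption)
DAY = 24 * 60 * 60  # seconds in a day

mb_per_s, mbit_per_s = average_rate(400 * GB, 10 * DAY)
print(f"{mb_per_s:.2f} MB/s, {mbit_per_s:.2f} Mb/s")  # 0.46 MB/s, 3.70 Mb/s
```

So the copy averaged a little under half a megabyte per second, which works out to about 3.7 megabits per second.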

Tuesday, September 15, 2009

Library of Congress Meeting on Storage Architectures for Digital Collections

I'm heading to Washington, DC next week to participate in a meeting hosted by the Library of Congress. Laura Campbell, the Associate Librarian for Strategic Initiatives, describes it this way:

The meeting will bring together technical industry experts, IT professionals, digital collections and strategic planning staff, government specialists with an interest in preservation, and recognized authorities and practitioners of Digital Preservation. We would like to be able to make progress on identifying the areas that should matter to those who are responsible for the digital content and for those who are responsible for providing the services to manage the content. We are hoping that we can inform their decision-making in the future, and give them confidence and comfort that they are asking the right questions and can understand the answers. We believe these questions and answers are often common to organizations dealing with different types of content used for different purposes, so we expect that the topics will be of broad interest in the community.

This should be an interesting meeting, and is a nice complement to many related activities at ICPSR. For example, earlier this year I participated in Nancy McGovern's Digital Preservation Management workshop, which was a wonderful overview of the latest news and best practices, and helped fill in several gaps in my understanding of the OAIS reference model. (Nancy is also my colleague at ICPSR and serves as our Digital Preservation Officer.)

Saturday, September 12, 2009

ICPSR: Then and Now: Technology Human Resources: Part II

As I mentioned in the last post, the team faced two main challenges in 2002: How to grow its capacity for managing IT resources without adding more people; and, how to expand its capacity for delivering solutions, and becoming a true partner at ICPSR.

We addressed the first challenge by asking our administrative assistant to step into a technology support role. This transition was largely successful, but when the administrative assistant retired at the end of 2005, we refilled the position with someone who had already been working in the IT sector. We also made one new hire in this area, adding Asmat Noori as an assistant IT director with responsibility for operations. Asmat's team supports over twice as many systems as in 2002, and his introduction of tools such as Altiris and Wise has allowed us to keep the size of the team the same. That said, we're hoping to use a Challenge Grant to fund a new person who will lead ICPSR's adoption of cloud computing and cloud storage technologies.

We addressed the second challenge through a combination of writing grants and contracts, and recruiting software developers with expertise and experience in technologies such as Java. We also encouraged internal ICPSR businesses, such as the Summer Program, to directly fund portions of software developers' time when there is a need for sustained development and enhancement, such as the new Summer Program Portal.

This has been a very successful combination with new software developers joining the team in 2003 (to work on the Child Care and Early Education Research Center project), 2005 (to automate key data processing and data pipeline work flows at ICPSR), 2007 (one to build tools for the Minority Data Resource Center and one to build the Summer Program Portal), and 2008 (to build technology for the Quantitative Social Science Digital Library). Because the software development team had become so large and worked with so many partners across ICPSR, we also hired an assistant director for software development, Nathan Adams, in late 2008.

Cole Whiteman joined ICPSR in 2004, and brought his skills in process forensics to our data management activities. Cole later joined the Computer and Network Services team, and continues his work to analyze processes and build software systems. Cole built and supports systems for managing most of our metadata and the deposits that arrive via our on-line deposit system.

And Peter Joftis returned to his technology roots, re-joining the CNS team in 2009 after leading the Child Care and Early Education Research Center project for the past six years. Peter's current focus is on CCEERC content, and how best to store it in a Fedora repository.

And so the story ends, for now, in 2009 with an IT organization that looks very different.

As in 2002, it still manages and operates ICPSR's considerable technology infrastructure, and despite the growth in those assets, does so with just about the same number of people.

However, in 2009, the number of software developers has grown from two to eight, and the capacity to work in a broad array of technologies has increased dramatically. Also, its ability to work with stakeholders at ICPSR to analyze processes, design solutions, co-write grant applications, and deliver new products and services has grown even more.

Thursday, September 10, 2009

ICPSR: Then and Now: Technology Human Resources: Part I

When I joined ICPSR in 2002 the technology team - called Computer and Network Services - had eight people. In addition to myself, there was an administrative assistant, two software developers, three systems administrators, and one technology generalist who did a little bit of everything. Longtime ICPSR staffer Peter Joftis was also on the team, but soon left to lead ICPSR's Child Care and Early Education Research Center.

The team managed about 75 desktop workstations, a small number of servers, the local area network (LAN), a dozen or so printers, and no doubt a handful of other technology assets I've lost track of over time. With only two software developers the team spent most of its time delivering incremental changes to the web delivery system, and tending to core infrastructure, such as our database and web applications for managing information about the membership. At the time we were very much a classic IT shop.

People worked very hard, but we were on the margin of the business. And we were perceived that way. The IT people were the ones who fixed your PC when it had a virus. They patched computers when Microsoft announced yet another security flaw in Windows or Office. They took care of backups, and retrieved that file you deleted accidentally. These were valuable services to be sure, but they weren't core to the business. They don't think up solutions to our problems; they just implement the technology solutions we think up. We were not partners.

Because ICPSR is so clearly in the information business, but made almost no entrepreneurial investments in information technology, 2002 was a very dangerous time for the organization. But what to do?

When interviewing for the position it was clear that the team fell into two distinct functional subgroups. One group delivered those classic IT support functions, and the other group delivered new products and services.

It would be important for the first group to remain about the same size, but expand its capacity to support more of everything: more servers, more storage, more desktop workstations, more printers, etc. And so we would need to invest in tools and processes to build this capacity without adding significantly to the number of people on the team.

And it would be important for the second group to grow. A lot.

One, it would need to be a bigger team. The capacity to deliver new products and services, to explore new technologies, and to automate the many manual processes at ICPSR all needed to be expanded dramatically.

Two, it would need to be a more partner-oriented team. The team needed to expand its ability to work hand-in-hand with data processors, archive managers, and grant writers to understand key business problems and opportunities, and to recommend and build solutions to address those needs. Many of the software developers would need to become project managers and systems analysts.

And, three, it would need to expand its repertoire of technologies. The team had been working largely in CGI/Perl to build web applications, and Perl alone to build command-line utilities, and those were the appropriate, dominant technologies of the 1990s. But by 2002 there were many other technologies available, and the team needed to select the best, and build its collective muscle around them.

Next: The team evolves

Tuesday, September 8, 2009

ICPSR: Then and Now: Archival Storage

Archival Storage, the OAIS function responsible for storing and retrieving content, was built on DLT IV tapes at ICPSR in 2002. Files that we wanted to keep indefinitely were moved to a pair of DLT tapes; one copy was retained at ICPSR, and the other was stored at an off-site location in Ann Arbor, Michigan.

And, unfortunately, we also had a large number of tapes in older formats: IBM 3480 cartridge and 9-track. Again there were two copies, but in this case, both were off-site.

As you might expect with an off-line system such as this, it was very expensive to retrieve any item from Archival Storage. If the requestor was a little fuzzy about the exact item of interest, that added to the cost. There was no good way to browse the holdings, and retrieval time was measured in days, not minutes.

Today we've moved the master copy of each file from tape to disk, and we replicate each file off-site using a variety of techniques, such as rsync and the Storage Resource Broker Srsync utility. We also keep a copy on tape, but instead of DLT IV, we're using LTO-3 tape, which is ten times more dense. This gives us more copies in more locations, and a high degree of confidence that the copies are synchronized.
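For readers curious what "confidence that the copies are synchronized" might look like in practice, here is a minimal sketch of one common approach: compare a checksum of each file in the master copy against the same file in a replica. This is an illustration only, not a description of ICPSR's actual tooling; the function names (`file_digest`, `compare_replicas`) and the choice of SHA-256 are assumptions made for the example, and it presumes both copies are reachable as local directory trees.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_replicas(master: Path, replica: Path) -> dict:
    """Check every file under `master` against the same relative path in `replica`.

    Hypothetical helper: returns relative paths grouped as "ok",
    "missing" (absent from the replica), or "mismatched" (digests differ).
    """
    report = {"ok": [], "missing": [], "mismatched": []}
    for src in sorted(p for p in master.rglob("*") if p.is_file()):
        rel = src.relative_to(master)
        dst = replica / rel
        if not dst.is_file():
            report["missing"].append(rel.as_posix())
        elif file_digest(src) != file_digest(dst):
            report["mismatched"].append(rel.as_posix())
        else:
            report["ok"].append(rel.as_posix())
    return report
```

Calling `compare_replicas(Path("/archive/master"), Path("/archive/replica"))` (both paths made up for the example) would return the three lists; anything in "missing" or "mismatched" is a file to re-copy or investigate. Note that this sketch only checks in one direction, so extra files that exist only in the replica are not reported.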

The next step in Archival Storage is a move away from file-based solutions to object-based solutions. We've been evaluating Fedora as a possible storage platform for social science datasets and documentation, and the results are very promising so far.

Friday, September 4, 2009

ICPSR: Then and Now: Desktop Workstations

In many ways the world of desktop computing has changed very little over the past seven years at ICPSR. Each ICPSR staff member receives a workstation that includes a tower, display, keyboard, and mouse. In general we did not assign staff peripherals such as cameras, speakers, or external storage in 2002, and we still do not today.

The front face of the display is about the same size now as it was then, 17 inches. Some staff today may have a 20" widescreen display, but the overall square footage for display hasn't changed much. However, the display on a desktop workstation today is much lighter than it was in 2002 (LCD v. CRT), and greater screen resolution makes the display feel larger still. We also have more staff who are interested in funding an upgraded display such as dual monitors or a single very large display. So perhaps if the metric is area, things have changed very little, but if the metric is weight or number of pixels, things have changed quite a bit.

The appearance of the workstation itself has changed little. In 2002 each person had a tower-style workstation under the desk, and most people today have a machine with the same general shape. In 2002 most of the machines had a Dell nameplate, but today they are much more likely to have an H-P nameplate. The price difference wasn't a driver for us to move to H-P; the quality control of the Dell tower systems had really slipped in our experience.

The 2002 Dell was likely a GX 240 with 256MB of memory, and was running Windows 2000 or Windows XP. The 2009 H-P has 4GB of memory, and some sort of dual-core processor. It still runs Windows XP.

The 2009 workstation likely has a huge local disk - it's difficult to buy a machine with a small disk - that is largely unused (at least for business reasons). The disk will have a system partition which most staff can't modify, and which holds applications and the operating system. The rest of the disk forms a second very large partition, but since we don't back up desktop workstations, it shouldn't be used for most work processes other than as scratch space. Valuable work products are stored on our NAS, the central file server.

The next step for desktop workstations at ICPSR may be to begin using ultra-thin clients: machines that have little storage and processing power because all of the heavy lifting takes place in the cloud. In some ways this is a return to the X terminals that were so popular twenty years ago.

Wednesday, September 2, 2009

ICPSR: Then and Now: Servers

In 2002 ICPSR had two main systems - a pair of Sun E3500s with 4GB of memory. One machine served as our production web server, and the second did everything else: general-purpose computing for data processing, Oracle database service, file service (NFS and CIFS via samba), DNS service, etc. We also had a very small number of additional machines, such as a system for testing new web applications. All of the machines were built by Sun Microsystems, used Sun's SPARC processors, and ran Sun's operating system, Solaris. We entered into a maintenance contract with Sun in case either of the machines had a problem, and my recollection is that it ran around $15k/year to cover the two big machines plus a handful of external storage arrays. To Sun's credit they were very solid machines.

In 2009 ICPSR has more servers than I can describe easily in a blog post. We still have a pair of machines for delivering web content and general-purpose computing, but they were built by Dell, use Intel processors, and run Red Hat Linux. Today's machines have much more memory and many more processors, and they too have been solid. But we also have many smaller machines with very specific roles: delivering network services (DNS, DHCP, etc); operating our LOCKSS network; staging new web content; replicating services for the Minnesota Population Center; hosting MySQL and Oracle databases; and so on. And, of course, in 2009 Sun Microsystems is about to be swallowed by Oracle.

However, this proliferation of server computing systems has likely reached its apogee at ICPSR. With the rise of virtualization and particularly the rise of the cloud, we're much more likely to build future systems in Amazon's Elastic Compute Cloud (EC2) rather than building them on real (or virtual) machines at ICPSR. For every rack-mount server we have at ICPSR, we probably have one much smaller blade server, and for every blade server, we probably have one EC2 instance running in the cloud.

My sense is that we'll continue this trend, and that where practicable, we'll deploy new systems in a cloud environment rather than purchasing new hardware. In addition to Amazon's cloud offering, the University of Michigan is deploying its own virtualization service, and that will be an attractive choice for systems that consume a lot of network I/O. Amazon charges for network I/O, but U-M does not.

We may also replace several virtual machines in the cloud with an out-sourced service: we already use SalesForce.com as our platform for managing "data leads." It's easy to imagine us adopting OpenID via a service provider such as RPX, for example, rather than hosting our own service locally or in a cloud.