Technology at ICPSR: The ICPSR Pipeline Process

After arriving at ICPSR in 2002 one of the first things Myron asked me to do was to automate the "data processing" function at ICPSR. As I began exploring that business process, it became very clear to me that (1) the process was actually a series of six or more inter-related processes with different owners, inputs, and outputs, and (2) no single person at ICPSR had a crisp, clear understanding of the entire process. And so my dilemma: How to design and build a software system to facilitate a process which isn't documented, and isn't even well understood?

Fortunately a colleague of mine at UUNET agreed to join ICPSR: Cole Whiteman. Cole is very strong at analyzing and documenting business process, and has a particular gift for coaxing out the details and then rendering the system in an easy to understand, easy to read format. I've included the latest "Whiteman" above as a sample of his art.

Cole spent many months interviewing staff, drawing pictures, interviewing more staff, refining pictures, and so on, until he had a picture that both generated agreement - "Yep, that's the process we use!" - and demonstrated bottle-necks. Now the way was clear for automation.

Consequently, ICPSR has invested tremendous resources over the past few years building a collection of inter-connected systems that enable workflow at ICPSR. These workflow systems now form the core business process infrastructure of ICPSR, and give us the capability to support a very high-level of business. When talking to my colleagues at other data archives, my sense is that ICPSR has a unique asset. Here's a thumbnail sketch of the systems.

Deposit Form - This system manages information from the time of its first arrival via upload, until the time the depositor signs the form, transferring custody to ICPSR. The form has the capacity to collect a lot of descriptive metadata at the start of the process, and also automatically generates appropriate preservation metadata upon custody (e.g., fingerprints for each file deposited).
Deposit Viewer - This might be more appropriately named the Deposit Manager since it not only lets ICPSR staff search, browse, and view metadata about deposits, it also enables staff to manage information about deposits. For example, this is the tool we use to assign a deposit to a data manager. We also use this tool to connect deposits to studies.
Metadata Editor - This is the primary environment for creating, revising, and managing descriptive and administrative metadata about a study. Abstracts, subject terms, titles, etc. are all available for management, along with built-in connections to ICPSR business rules that control or limit selections. The system also contains the business logic that controls quality assurance.
Hermes - Our automation tool for producing the ready-to-go formats we deliver on our web site, and variable-level DDI XML for digital preservation. This system takes an SPSS System file as its input, and produces a series of files as output, some of which end up on our Web site for download, and others of which enter our archival storage system.
Turnover - Data managers use this tool to perform quality assurance tests on content which is ready for ingest, and to queue content both for insertion into archival storage and for release on the ICPSR Web site. An accompanying web application enables our release management team to accept well-formed content, and to reject objects which aren't quite ready for ingest.

Technology at ICPSR

Thursday, November 19, 2009

The ICPSR Pipeline Process

No comments:

Post a Comment