Technology at ICPSR: October 2010

Friday, October 29, 2010

TRAC: B2.3: SIPs to AIPs

B2.3 Repository has a description of how AIPs are constructed from SIPs.

The repository must be able to show how the preserved object is constructed from the object initially submitted for preservation. In some cases, the AIP and SIP will be almost identical apart from packaging and location, and the repository need only state this. More commonly, complex transformations (e.g., data normalization) may be applied to objects during the ingest process, and a precise description of these actions (i.e., preservation metadata) may be necessary to ensure that the preserved object represents the information in the submitted object. The AIP construction description should include documentation that gives the provenance of the ingest process for each SIP to AIP transformation, typically consisting of an overview of general processing being applied to all such transformations, augmented with description of different classes of such processing and, when applicable, with special transformations that were needed.

Some repositories may need to produce these complex descriptions case by case, in which case diaries or logs of actions taken to produce each AIP will be needed. In these cases, documentation needs to be mapped to individual AIPs, and the mapping needs to be available for examination. Other repositories that can run a more production-line approach may have a description for how each class of incoming object is transformed to produce the AIP. It must be clear which definition applies to which AIP. If, to take a simple example, two separate processes each produce a TIFF file, it must be clear which process was applied to produce a particular TIFF file.

Evidence: Process description documents; documentation of SIP relationship to AIP; clear documentation of how AIPs are derived from SIPs; documentation of standard/process against which normalization occurs; documentation of normalization outcome and how outcome is different from SIP.

Note: This particular area is under active discussion at ICPSR. The commentary below describes current processes in place at ICPSR, but these processes are likely to change in the future.

Content enters archival storage at two different points in the lifecycle of a "data processing project" at ICPSR.

When a Deposit is signed, its payload of content (its files) enters archival storage for bit-level preservation. The system generates a unique ID for each deposited file and also keeps track of each file's MIME type, checksum (an MD5 hash), original name (sanitized to avoid SQL injection attacks and similar problems), and the date it was received.
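To make the per-file bookkeeping concrete, here is a minimal sketch of what such an intake record could look like. It is not ICPSR's actual code: the names `DepositedFile`, `sanitize_name`, and `register_file` are hypothetical, the UUID is only one possible way to generate a unique ID, and the MIME type is guessed from the file name, which is an assumption about how detection might be done.

```python
import hashlib
import mimetypes
import re
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class DepositedFile:
    """Hypothetical intake record for one deposited file."""
    file_id: str        # unique ID generated at intake
    original_name: str  # sanitized form of the depositor's file name
    mime_type: str
    md5: str
    received: str       # ISO 8601 timestamp in UTC


def sanitize_name(name: str) -> str:
    """Drop directory components and replace characters outside a conservative set."""
    base = Path(name).name
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    return cleaned or "unnamed"


def md5_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hash the file in blocks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def register_file(path: Path, original_name: str) -> DepositedFile:
    """Build the intake record for a file that has already been written to disk."""
    safe_name = sanitize_name(original_name)
    mime_type, _ = mimetypes.guess_type(safe_name)  # assumption: guess from extension
    return DepositedFile(
        file_id=uuid.uuid4().hex,
        original_name=safe_name,
        mime_type=mime_type or "application/octet-stream",
        md5=md5_of(path),
        received=datetime.now(timezone.utc).isoformat(),
    )
```

Sanitizing the name is only a second line of defense in a sketch like this; parameterized queries remain the usual guard against SQL injection.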

A cognizant archive manager assigns each new deposit to a data manager. In the simplest case, a data manager may package the submission for long-term preservation and release on the ICPSR web site with little added work. The data manager packages the content into an ICPSR "study" object, collaborates with others at ICPSR to author descriptive and preservation metadata, and performs a series of quality control checks, some of which are automated. Workflow tools record major milestones in the life of the project, and the data manager creates an explicit linkage between deposit and study for future reference. The system also assigns a unique ID to each of these "processed" files, and captures metadata such as checksum and MIME type.

Thus, at the aggregate level, ICPSR collects strong documentation mapping submission objects to archival objects, but the documentation is much weaker, and often absent, at the more detailed level of files. For example, there is no explicit mapping between deposit file X and study file Y.

Monday, October 25, 2010

DuraCloud pilot update - October 2010

ICPSR's participation in the DuraCloud pilot is coming along nicely. While we were not one of the original pilot members, we were one of the "early adopters" in the second round of pilot users. We've been using DuraCloud pretty actively since the summer.

The collection we selected for pilot purposes is a subset of our archival storage that contains preservation copies of our public-use datasets and documentation. This is big but not too big (1TB or so), and contains a nice mix of format types, such as plain text, XML, PDF, TIFF, and more. At the time of this post, we have 72,134 files copied into DuraCloud.

I've been using their Java-based command-line utility called the synctool to synchronize some of our content with DuraCloud. I found it useful to wrap the utility in a small shell script so that I do not need to specify as many command-line arguments when I invoke it. I tend to use sixteen threads to synchronize content rather than the default three, and while that places a heavy load on our machine here, it leads to faster synchronization. The synctool assumes an interactive user, and has a very basic interface for checking status.

Overall I like the synctool but wish that it had an option that did not assume an interactive user; something I could run out of cron like I often do with rsync. Because the underlying storage platform (S3) limits the size of files, synctool is not able to copy some of our larger files. I wish synctool would "chunk up" the files into more manageable pieces, and sync them for me. This file size limitation is one reason I don't use raw S3 for storage; instead I like to spend a little more money and attach an Elastic Block Store volume (S3-backed) to a running instance, and then use the filesystem to hide the limitation. Then I can just use standard tools, like rsync, to copy very large files into the cloud.
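For what it's worth, the kind of "chunking" I have in mind is not complicated. The sketch below is not how synctool works and uses no DuraCloud or S3 API; it only splits a local file into numbered pieces and joins them again. The function names and the `.partNNNNN` naming scheme are made up for illustration.

```python
from pathlib import Path
from typing import List


def split_file(source: Path, dest_dir: Path, chunk_bytes: int) -> List[Path]:
    """Write source as numbered part files, each at most chunk_bytes long."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    dest_dir.mkdir(parents=True, exist_ok=True)
    parts: List[Path] = []
    with source.open("rb") as handle:
        index = 0
        while True:
            # One whole chunk is held in memory; pick chunk_bytes accordingly.
            data = handle.read(chunk_bytes)
            if not data:
                break
            part = dest_dir / f"{source.name}.part{index:05d}"
            part.write_bytes(data)
            parts.append(part)
            index += 1
    return parts


def join_files(parts: List[Path], target: Path) -> None:
    """Concatenate part files, in the order given, back into one file."""
    with target.open("wb") as out:
        for part in parts:
            out.write(part.read_bytes())
```

A real tool would also need to record the part order and a checksum of the whole file so that the reassembled copy can be verified.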

The DuraCloud folks have been great collaborators: extremely responsive, extremely helpful; just a joy to work with. They've told me about a pair of upcoming features that I'm keen to test.

One, their fixity service will be revamped in the 0.7 release. It'll have fewer options and features, but will be much easier to use. I'm eager to see how this compares to a low-tech approach I use for our archival storage: weekly filesystem scans + MD5 calculations compared to values stored in a database.
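The low-tech approach amounts to something like the following sketch. It is an illustration rather than our production script: it assumes a SQLite table named `fixity` with `path` and `md5` columns, which is a hypothetical schema, and it only reports on files the database already knows about.

```python
import hashlib
import sqlite3
from pathlib import Path
from typing import Dict


def md5_of(path: Path, block_size: int = 1 << 20) -> str:
    """Hash the file in blocks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def scan(root: Path, db: sqlite3.Connection) -> Dict[str, str]:
    """Compare each recorded file under root against its stored MD5.

    Returns a status of "ok", "mismatch", or "missing" for every recorded path.
    """
    report: Dict[str, str] = {}
    rows = db.execute("SELECT path, md5 FROM fixity").fetchall()
    for rel_path, expected in rows:
        candidate = root / rel_path
        if not candidate.is_file():
            report[rel_path] = "missing"
        elif md5_of(candidate) != expected:
            report[rel_path] = "mismatch"
        else:
            report[rel_path] = "ok"
    return report
```

Run weekly from cron, anything other than "ok" in the report is what gets a human's attention.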

Two, their replicate-on-demand service is coming, and ICPSR will be the first (I think) test case to replicate its content from S3 to Azure's storage service. I have not had the opportunity to use Microsoft's cloud services at all, and am looking forward to seeing how it performs.

Friday, October 22, 2010

TRAC: B2.2: Are our AIPs adequate to the job?

B2.2 Repository has a definition of each AIP (or class) that is adequate to fit long-term preservation needs.

In many cases, if the definitions required by B2.1 exist, this requirement is also satisfied, but it may also be necessary for the definitions to say something about the semantics or intended use of the AIPs if this could affect long-term preservation decisions. For example, say two repositories both only preserve digital still images, both using multi-image TIFF files as their preservation format. Repository 1 consists entirely of real-world photographic images intended for viewing by people and has a single definition covering all of its AIPs. (The definition may refer to a local or external definition of the TIFF format.) Repository 2 contains some images, such as medical x-rays, that are intended for computer analysis rather than viewing by the human eye, and other images that are like those in Repository 1. Repository 2 should perhaps define two classes of AIPs, even though it only uses one storage format for both. A future preservation action may depend on the intended use of the image; an action that changes the bit-depth of the image in a way that is not perceivable to the human eye may be satisfactory for real-world photographs but not for medical images, for example.

Evidence: Documentation that relates the AIP component’s contents to the related preservation needs of the repository, with enough detail for the repository's providers and consumers to be confident that the significant properties of AIPs will be preserved.

This item is somewhat difficult to discuss without a crisp set of definitions for AIPs. However, given the longevity of ICPSR as a digital repository, and given the track record (nearly fifty years) of preserving and delivering content, the empirical evidence would seem to indicate that the models and containers we are using for our content are a good fit for our long-term preservation needs.

In some ways ICPSR has a relatively easy task here since our content is pretty homogeneous (survey data and documentation) and we are able to normalize it into very durable formats like plain text and TIFF. And because our content is all "born digital" and delivered digitally, there are fewer opportunities for things to go really awry.

We also create a great deal of descriptive metadata that we bundle with our content, and our content is highly curated compared to, say, an enormous stream of data coming from highway sensors or satellites. In addition to making items easier to find and to use, this curation may also help keep them more durable as a side-effect.

As part of an NSF Interop/EAGER grant we're defining Fedora Objects for our most common content types, and for each object, we are also working through the specifications of the AIP. My sense is that this will help us formalize some of our current practices, and will help illuminate any gaps where we should be collecting and saving metadata, but aren't today. And that will help further inform the response to this TRAC item.

Wednesday, October 20, 2010

I'll Miss Swivel

A few years ago the ICPSR Director at the time (Myron Gutmann) told me about a new data visualization service he had come across: Swivel. I found the site, created an account, and started playing with the tools they made available to visualize data.

It seemed like a nice little service, but not much of a competitor for the type of clients that ICPSR typically serves. For one thing, all of the datasets needed to fit into Excel.

I did take the opportunity to create a public-use dataset of my own. It was just a little toy dataset that had one row for each year, and where the columns were the annual dues for our neighborhood association for that year, that same amount of money expressed in 1993 CPI dollars, and the maximum amount the dues could have been for that year ($200 in 1993 adjusted by the CPI). This made it easy to create graphs and images that showed how little the neighborhood dues had gone up over the years.
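The arithmetic behind the two derived columns is just a ratio of price index values. Here is a small sketch; the function names are mine, and any index values passed in (including the ones in the usage note) are placeholders rather than real CPI figures.

```python
def to_base_year_dollars(amount: float, cpi_year: float, cpi_base: float) -> float:
    """Express an amount paid in a given year in base-year (e.g., 1993) dollars."""
    return amount * cpi_base / cpi_year


def adjusted_cap(base_cap: float, cpi_year: float, cpi_base: float) -> float:
    """Scale a cap set in the base year (e.g., $200 in 1993) up to a later year."""
    return base_cap * cpi_year / cpi_base
```

For example, with a placeholder base-year index of 100.0 and a later-year index of 125.0, `adjusted_cap(200.0, 125.0, 100.0)` is a cap of 250 dollars for that later year, and dues of 250 dollars paid in that year work out to 200 dollars in base-year terms.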

However, Swivel is no more. Navigating to the home page of their web site just times out. I found a nice piece by Robert Kosara where he talks to the founders about what Swivel was, and where things went wrong. It is a short, interesting read: the punchline is that they just didn't have any customers.

I think I could probably create the same dataset at Zoho or on Google Docs, but neither one of those has quite the same nice set of features for visualizing the data as Swivel did.

Friday, October 15, 2010

TRAC: B2.1: AIPs

B2.1 Repository has an identifiable, written definition for each AIP or class of information preserved by the repository.

An AIP contains these key components: the primary data object to be preserved, its supporting Representation Information (format and meaning of the format elements), and the various categories of Preservation Description Information (PDI) that also need to be associated with the primary data object: Fixity, Provenance, Context, and Reference. There should be a definition of how these categories of information are bound together and/or related in such a way that they can always be found and managed within the archive.

It is merely necessary that definitions exist for each AIP, or class of AIP if there are many instances of the same type. Repositories that store a wide variety of object types may need a specific definition for each AIP they hold, but it is expected that most repositories will establish class descriptions that apply to many AIPs. It must be possible to determine which definition applies to which AIP.

While this requirement is primarily concerned with issues of identifying and binding key components of the AIP, B2.2 places more stringent conditions on the content of the key components to ensure that they are fit for the intended purpose. Separating the two criteria is important, particularly if a repository does not satisfy one of them. It is important to know whether some or all AIPs are not defined, or that the definitions exist but are not adequate.

Evidence: Documentation identifying each class of AIP and describing how each is implemented within the repository. Implementations may, for example, involve some combination of files, databases, and/or documents.

Does anyone have written definitions for their AIPs?

I found a preliminary design document at the Library of Congress via a Google search that had a very long, very complete description of a proposed AIP for image-type content. But in general it seems hard to find real world examples of AIPs that are in use at working archives. Perhaps they are out there, but published in such a way that makes it difficult to discover them?

Here is my strawman stab at defining an AIP for the bulk of ICPSR's content: social science research data and documentation. This is very much a work-in-progress and should not be read as any sort of official document. Here goes:

Definition of an Archival Information Package (AIP) for a Social Science Study

We define an AIP for a social science study as a list of files where each file has supporting representation information in the form of:

a role (data, codebook, survey instrument, etc)
a format (we use MIME type)

and has the following Preservation Description Information:

Provenance. We link processed studies to initial deposits at aggregation-level, and we also collect processing history in our internal Study Tracking System which records who performed actions on the content, and major milestones in its lifecycle at ICPSR.
Context. We store related content together in the filesystem, and a good deal of the context is embedded both in the name of each file and in a relational database. While not in production, we are evaluating the use of RDF/XML as a method for recording and exposing contextual information.
Reference. Each file has a unique ID.
Fixity. We use an MD5 hash at file-level to capture and check integrity.
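Expressed as a data structure, the strawman might look something like the sketch below. This is only an illustration of the definition above, not an ICPSR schema: the class names, field names, and the `files_missing_pdi` check are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ArchivalFile:
    file_id: str    # Reference: unique ID for the file
    role: str       # representation information: data, codebook, instrument, ...
    mime_type: str  # representation information: format
    md5: str        # Fixity: file-level MD5 hash
    path: str       # Context: where the file sits alongside related content


@dataclass
class StudyAIP:
    study_id: str
    deposit_ids: List[str] = field(default_factory=list)  # Provenance (aggregate level)
    files: List[ArchivalFile] = field(default_factory=list)

    def files_missing_pdi(self) -> List[str]:
        """Return the paths of files that lack any of the per-file fields."""
        incomplete = []
        for f in self.files:
            if not all([f.file_id, f.role, f.mime_type, f.md5]):
                incomplete.append(f.path)
        return incomplete
```

Note that provenance lives on the study rather than on each file, which mirrors the aggregation-level linkage described in the definition.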

So there's the strawman. To help guide my description of the PDI, I used these definitions from the Open Archival Information System (OAIS) specification:

– Provenance describes the source of the Content Information, who has had custody of it since its origination, and its history (including processing history).
– Context describes how the Content Information relates to other information outside the Information Package. For example, it would describe why the Content Information was produced, and it may include a description of how it relates to another Content Information object that is available.
– Reference provides one or more identifiers, or systems of identifiers, by which the Content Information may be uniquely identified. Examples include an ISBN number for a book, or a set of attributes that distinguish one instance of Content Information from another.
– Fixity provides a wrapper, or protective shield, that protects the Content Information from undocumented alteration. For example, it may involve a check sum over the Content Information of a digital Information Package.

Wednesday, October 13, 2010

Designing Storage Architectures for Digital Preservation - Day Two, Part Two

The final session of the conference featured six speakers.

Jimmy Lin (University of Maryland) is spending some time at Twitter, and described their technology stack: hardware, HDFS, Hadoop, and pig, which he described as the "perl/python of big data."
Mike Smorul (University of Maryland) gave an overview of their "time machine for the web" and the challenges of managing a web archive
John Johnson (Pacific Northwest National Laboratory) proposed that the scientific process has changed in that data produced by computation is now one of the drivers for creating and testing new theories
Leslie Johnston (Library of Congress) spoke briefly about an IBM emerging technology called "big sheets"
Dave Fellinger (DataDirect Networks) urged the audience not to "be afraid to count machine cycles" when analyzing storage systems for bottlenecks that increase service latency
Kevin Kambach (Oracle) finished the session with industry notes about large data

The day then concluded with two final talks. One was from Subodh Kulkarni (Imation), who gave an overview of storage technology from magnetic tape to hard disk, and the other was from David Rosenthal (LOCKSS), who gave an abbreviated version of his iPres talk, "How Green is Digital Preservation?" David mentioned a very interesting, large-scale, low-power computing and storage platform being produced by a company called SeaMicro.

Tuesday, October 12, 2010

Designing Storage Architectures for Digital Preservation - Day Two, Part One

The Library of Congress hosted a two-day meeting on September 27 and 28, 2010 to talk about technologies, strategies, and techniques for managing storage. Like the 2009 meeting, which I also attended, this meeting focused heavily on IT and the costs of the technology. This was another interesting and valuable meeting, but it always feels like we don't address the elephant in the room: the cost of all of the people who curate content, create metadata, manage collections, assess content, etc. This is the report from the second day of the conference.

The morning session of the second day of the conference featured six speakers, many from industry:

Micah Beck (University of Tennessee - Knoxville) made an argument for "lossy preservation" as a strategy for achieving "good enough" digital preservation in an imperfect world, and suggested that developing techniques for using damaged objects should be part of the archivists' toolkit.
Mike Vamdamme (Fujifilm) gave an overview of their StorageIQ product as a system to augment the reporting and metadata available from conventional tape-based backup and storage systems
Hal Woods (HP) spoke about StorageWorks
Mootaz Elnozahy (IBM) spoke about trends in reliable storage over the next 5-10 years, and predicted that power management requirements will stress hardware, causing MTBF to fall and the soft error rates of storage to rise.
Dave Anderson (Seagate) also spoke about near-term trends such as a shift to 3TB disks and 2.5" form-factor drives. He does not see solid state as a factor in the market at this time.
Mike Smorul (University of Maryland) gave a very brief overview of ACE.

The next session featured four more speakers:

Joe Zimm (EMC) was part of Data Domain before it was acquired by EMC, and spoke about EMC's block-level de-duplication technology.
Mike Davis (Dell) was part of Ocarina before it was acquired by Dell, and spoke about their technology for de-duplication.
Steve Vranyes (Symantec) opined that compression will play a more significant role than de-duplication in easing storage requirements for archives because the use case is very different.
Raghavendra Rao (Cisco) introduced Cisco's network layer de-duplicator. This seemed like an odd fit in some ways compared to the other products.

Up next - the final post in this series: the second half of Day Two.

Friday, October 8, 2010

TRAC: B1.8: Ingest records

B1.8 Repository has contemporaneous records of actions and administration processes that are relevant to preservation (Ingest: content acquisition).

These records must be created on or about the time of the actions they refer to and are related to actions taken during the Ingest: content acquisition process. The records may be automated or may be written by individuals, depending on the nature of the actions described. Where community or international standards are used, such as PREMIS (2005), the repository must demonstrate that all relevant actions are carried through.

Evidence: Written documentation of decisions and/or action taken; preservation metadata logged, stored, and linked to pertinent digital objects.

ICPSR's main business process is to take deposited materials (usually studies) and prepare them for preservation and dissemination (also as studies). We use two internal webapps to collect, display, and set milestones and records during this process.

One system is called the Deposit Viewer, although it might be more properly called the Deposit Manager. ICPSR staff use it to change the status of a deposit, to assign a deposit to a worker, to read or create metadata about the deposit, and to link deposits to studies. This system also allows staff (and sometimes requires them) to make comments in a diary associated with the deposit.

The other system is called the Study Tracking System, and like the Deposit Viewer, it collects milestones and diary entries during the ingest lifecycle.

The records are stored in a relational database. This ensures that the content is readily available to the large corpus of workflow tools we've created. We've been looking at PREMIS as a container for exposing these records to the outside world (where appropriate - like to an auditor perhaps), and for preserving them. I have a personal interest in PREMIS and took a shot at creating PREMIS XML for our ingest records. I'd be interested in comparing notes with others who have been working on mapping their internal ingest records to a schema like PREMIS.

Wednesday, October 6, 2010

WHEN ZOMBIES ATTACK!

This is my favorite fun paper that I've read this year.

When Zombies Attack!

Tuesday, October 5, 2010

NSF Social, Behavioral and Economic Directorate suggests ICPSR for data archiving

The much anticipated NSF guidelines on data management were released earlier this week. One highlight (especially for those of us working at ICPSR) is that the NSF explicitly recognizes ICPSR as a good option for archiving quantitative social science data.

The SBE Directorate supplements agency-wide guidelines with some of its own, and has this to say:

Quantitative Social and Economic Data Sets

For appropriate data sets, researchers should be prepared to place their data in fully cleaned and documented form in a data archive or library within one year after the expiration of an award. Before an award is made, investigators will be asked to specify in writing where they plan to deposit their data set(s). This may be the Inter-University Consortium for Political and Social Research (ICPSR) at the University of Michigan, but other public archives are also available. The investigator should consult with the program officer about the most appropriate archive for any particular data set.

Monday, October 4, 2010

Can an iPad replace a laptop on a business trip?

Walt Mossberg of the Wall Street Journal stole my idea!

Well, not really. But he did recently write a nice column about his experience taking an iPad on a "working vacation" rather than a laptop. I did the same thing for last week's trip to DC.

Like Mr Mossberg I wanted something that would let me keep in touch with the office, and that would help me pass some time at the airport and on the airplane. I did not need Office-style applications since I was not intending to work with spreadsheets or deliver a presentation.

And like Mr Mossberg I too found that having the iPad alone was just fine for everything I wanted to do. Safari gave me access to Gmail, the Kindle app gave me access to books to read, and a few other apps (e.g., Maps) filled in the rest of my needs. Had I needed to deliver a PowerPoint deck, however, I would have brought a way-way-too-slow HP Mini netbook instead.

And, also like Mr Mossberg, I too spoke with several folks in the airport - especially at the TSA checkpoints - about the iPad, giving it pretty rave reviews.

Friday, October 1, 2010

TRAC: B1.7: Formal acceptance of deposits

B1.7 Repository can demonstrate when preservation responsibility is formally accepted for the contents of the submitted data objects (i.e., SIPs).

A key component of a repository’s responsibility to gain sufficient control of digital objects is the point when the repository manages the bitstream. For some repositories this will occur when it first receives the SIP transformation, for others it may not occur until the ingested SIP is transformed into an AIP. At this point, the repository formally accepts preservation responsibility of digital objects from the depositor.

Repositories that report back to their depositors generally will mark this acceptance with some form of notification to the depositor. (This may depend on repository responsibilities as designated in the depositor agreement.) A repository may mark the transfer by sending a formal document, often a final signed copy of the transfer agreement, back to the depositor signifying the completion of the transformation from SIP to AIP process. Other approaches are equally acceptable. Brief daily updates may be generated by a repository that only provides annual formal transfer reports.

Evidence: Submission agreements/deposit agreements/deeds of gift; confirmation receipt sent back to producer.

My sense is that this requirement has two very different stories at ICPSR.

One story is pretty simple. When the depositor signs his/her deposit, custody transfers to ICPSR. We then work the deposit until we have a post-processed version suitable for digital preservation and versions suitable for delivery on the web site.

The other story is more complicated. The workflow allows one to un-sign a deposit. And so the custody of the object could transfer from the depositor to ICPSR (at the initial signing), and then back to the depositor (upon un-signing). This can even happen in a case where the deposit has been processed and released on the ICPSR web site. The workflow records this sort of action, and so it is well documented, but it does leave open a degenerate case where content is available on the web without the corresponding original submission in archival storage.
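Since the workflow records these actions, the degenerate case could in principle be detected by replaying a deposit's recorded events. The sketch below assumes a simplified event log with three hypothetical event names ("sign", "unsign", "release"); the real workflow records are richer than this.

```python
from typing import Iterable, Tuple


def custody_state(events: Iterable[str]) -> Tuple[str, bool]:
    """Replay workflow events in order; return (current custodian, released flag)."""
    custodian = "depositor"
    released = False
    for event in events:
        if event == "sign":
            custodian = "ICPSR"
        elif event == "unsign":
            custodian = "depositor"
        elif event == "release":
            released = True
        else:
            raise ValueError(f"unknown event: {event}")
    return custodian, released


def is_degenerate(events: Iterable[str]) -> bool:
    """True when content has been released but custody has returned to the depositor."""
    custodian, released = custody_state(events)
    return released and custodian == "depositor"
```

For instance, the sequence sign, release, unsign is flagged, while sign, release is not.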