Wednesday, November 24, 2010

TRAC: B2.7: Format registries

B2.7 Repository demonstrates that it has access to necessary tools and resources to establish authoritative semantic or technical context of the digital objects it contains (i.e., access to appropriate international Representation Information and format registries).

The Global Digital Format Registry (GDFR), the UK National Archives’ file format registry PRONOM, and the UK Digital Curation Centre’s Representation Information Registry are three emerging examples of potential international standards a repository might adopt. Whenever possible, the repository should use these types of standardized, authoritative information sources to identify and/or verify the Representation Information components of Content Information and PDI. This will reduce the long-term maintenance costs to the repository and improve quality control.

Most repositories will maintain format information locally to maintain their independent ability to verify formats or other technical or semantic details associated with each archival object. In these cases, the use of international format registries is not meant to replace local format registries but instead serve as a resource to verify or obtain independent, authoritative information about any and all file formats.

Evidence: Subscription or access to such registries; association of unique identifiers to format registries with digital objects.



The volume of content entering ICPSR is relatively low, perhaps 100 submissions per month.  In some extreme cases, such as with our Publication Related Archive, ICPSR staff spend relatively little time reviewing and normalizing content; it is released "as is" on the web site and receives the most modest level of digital preservation (bit-level only, unless the content happens to be something more durable, such as plain text).  In most cases, however, someone at ICPSR is opening each file, reading and reviewing documentation, scrubbing data for disclosure risk, recoding, and so on.  It is a very hands-on process.

Because of the low volumes and high touches, automated format detection is not at all essential, at least for the current business model.  Nonetheless we do use automated format detection, both for the files that we receive via our deposit system and for the derivative files we produce internally.  Our tool for doing this is the venerable UNIX command-line utility file.

Why?

The content that we tend to receive is a mix of documentation and data.  The documentation is often in PDF format, but sometimes arrives in common word processor formats like DOC and DOCX, and sometimes in less common word processor formats.  The data is often in a format produced by common statistical packages such as SAS, SPSS, and Stata.  We also get a nice mix of other file formats from a wide variety of business applications like Access, Excel, PowerPoint, and more.

We have found the vanilla file that ships with Red Hat Linux to be pretty good at most of the formats that show up on our doorstep.  We've extended the magic database that file consults so that it does a better job understanding a broader selection of stat package formats.  (file does OK, but not great, in this area.)  We have also extended the magic database and wrapped file in a helper tool -- we call the final product ifile, for improved file -- so that it does a better job identifying the newer Microsoft Office file formats like DOCX, XLSX, PPTX, and so on.
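Our ifile wrapper is not something I can share here, but the idea behind it is simple enough to sketch.  The Python below is an illustration of the approach rather than our production code; the OOXML sniffing heuristics and the MIME type mapping are my own assumptions for the sake of the example:

# Sketch of the "ifile" idea: lean on file(1) for most formats, but peek
# inside ZIP containers to tell DOCX/XLSX/PPTX apart.  Illustrative only.
import subprocess
import zipfile

OOXML_HINTS = {   # directory inside the ZIP -> assumed MIME type
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/":   "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/":  "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}

def identify(path):
    # Ask stock file(1) for a brief MIME type first, e.g. "application/zip".
    out = subprocess.check_output(["file", "-bi", path]).decode("ascii", "replace")
    mime = out.split(";")[0].strip()
    # The newer Office formats are really ZIP containers; peek inside for a better answer.
    if mime in ("application/zip", "application/octet-stream") and zipfile.is_zipfile(path):
        names = zipfile.ZipFile(path).namelist()
        for prefix, ooxml_mime in OOXML_HINTS.items():
            if any(n.startswith(prefix) for n in names):
                return ooxml_mime
    return mime

if __name__ == "__main__":
    import sys
    for f in sys.argv[1:]:
        print("%s\t%s" % (f, identify(f)))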

I would love to be able to use an "off the shelf" tool like jhove or droid to identify file formats, relying upon a global registry for formats.  There isn't much glamor in hacking the magic database.

However, my experience thus far with jhove, jhove2, and droid is that they just don't beat ifile (or even file) for the particular mix of content we tend to get.  Those packages are much more heavy-weight, and while they do a fabulous job on some formats (like PDF), they perform poorly or not at all with many of the formats we see on a regular basis.

As a test I took a week or two of the files most recently deposited at ICPSR, and I had file, ifile, and droid have a go at identifying them.  I originally had jhove2 in the mix as well, but I could not get it to run reliably.  (And my sense is that it may be using the same data source as droid for file identification anyway.)  Of the 249 files I examined, ifile got 'em all right, file misreported on 48 of them, and droid misreported 104 files.  And the number for droid gets worse if I ding it for reporting a file containing an email as text/plain rather than text/rfc822.
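A harness for this kind of comparison can be very simple.  The sketch below scores each command-line tool against a hand-checked manifest of expected MIME types; the manifest format and the ifile path are placeholders, and droid would need its own invocation and tally:

# Rough sketch of a format-identification bake-off: compare each tool's answer
# against a hand-checked manifest of "correct" MIME types.  Paths are placeholders.
import csv
import subprocess
from collections import defaultdict

TOOLS = {
    "file":  ["file", "-bi"],            # stock file(1), brief MIME output
    "ifile": ["/usr/local/bin/ifile"],   # our wrapper; the path is hypothetical
}

def identify(cmd, path):
    out = subprocess.check_output(cmd + [path]).decode("ascii", "replace")
    return out.split(";")[0].strip()     # drop any "; charset=..." suffix

def score(manifest_path):
    # Manifest columns (hand-checked ground truth): path,expected_mime
    misses = defaultdict(int)
    total = 0
    with open(manifest_path) as fh:
        for row in csv.DictReader(fh):
            total += 1
            for name, cmd in TOOLS.items():
                if identify(cmd, row["path"]) != row["expected_mime"]:
                    misses[name] += 1
    for name in sorted(TOOLS):
        print("%s: %d misreported out of %d" % (name, misses[name], total))

if __name__ == "__main__":
    score("deposited-files.csv")         # placeholder manifest name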

So in the end we're using our souped-up version of file for file format identification, and we're using IANA MIME types as our primary identifier for file format.  We also capture the more verbose human-readable output from our own ifile as well since it can be handy to have "SPSS System File MS Windows Release 10.0.5" rather than just application/x-spss.

Tuesday, November 23, 2010

The Cloud and Archival Storage

Price.  Availability.  Services.  Security.

These are the four parameters that I use when deciding where to store one of our archival storage copies.

For me the cloud is just another storage container.  Fundamentally it is no different from a physical storage location except in how it differs across these four dimensions.  In fact, I can conceptualize my "non-cloud" storage locations as storage-as-a-service providers, but where the provider is a lot more local than the big names in "cloud" today:

ICPSR Cloud:  This is the portion of the EMC NS-120 NAS that I use for a local copy of archival storage.  It is very expensive, with a reasonably high level of availability.  It provides very few services; if I want to perform a fixity check of the objects I have stored here, I have to create and schedule that myself.  (A sketch of what such a check amounts to appears at the end of this post.)  Because I have physical control over the ICPSR Cloud, I have an irrational belief that it is probably secure, even though I know that ICPSR isn't as physically secure as many other companies at which I have worked.  Certainly ICPSR does not make any statements or guarantees about ISO 27001 compliance.

UMich Cloud:  This is a multi-TB chunk of NFS file storage that I rent from the University of Michigan's Information Technology Services (ITS) organization.  They call it ITS Value Storage.  The price here is excellent, but the level of availability is just a hair lower.  I don't notice the lower level of availability most of the time, but I do perceive it when running long-lived, I/O-intensive applications.  Like my own cloud, this one has no services unless I deploy them myself.  Because I do not have physical control over the equipment, or even know exactly where the equipment is (beyond a given data center), it feels like there is less control.  ITS makes no promises about ISO 27001 compliance (or promises about other standards), but my sense is that their controls and physical security and IT management processes must be at least as good as mine.  After all, they are managing many, many TBs for many different university departments and organizations, including themselves.

Amazon Cloud:  This is a multi-TB chunk of Elastic Block Storage (EBS) that I rent from Amazon Web Services.  I use EBS rather than the Simple Storage Service (S3) because I want the semantics of a filesystem so that I don't have to worry about things like files that are large or that have funny characters in their names.  The price here is good, better than my EMC NAS, but not as good as the ITS Value Storage.  The availability is quite good overall, but, of course, the network throughput between ICPSR and AWS is nowhere near as good as intra-campus networking, and it is even worse for the AWS EU location.  The services are no better and no worse than my own cloud or the UMich cloud.  Like the ITS Value Storage service I have no control over the physical systems, and I know even less about their physical location.  Amazon says that it passed a SAS 70 audit, and recently received an ISO 27001 certification.  This seems to be a better security story than anyone else so far.

DuraCloud:  Unlike the other clouds, I'm not using this one for archival storage; it is still in a pilot phase.  The availability is similar to plain old AWS (which hosts the main DuraCloud service), and the price is still under discussion.  My expectation is that the level of security is no better (and no worse) than the underlying cloud provider(s), and so depending upon which storage provider one selects, one's mileage may vary.  However, the really interesting thing about DuraCloud is the idea of the services.  If DuraCloud can execute and deliver useful, robust services on top of basic storage clouds, that will be a true value-add, and will make this a very compelling platform for archival storage.

Chronopolis:  Like DuraCloud, this too is not in production yet, and is being groomed (I think) as a future for-fee, production service.  I don't have as much visibility here with regard to availability since I am not actively moving content in and out of Chronopolis; most of the action seems to be taking place under the hood between the storage partner locations.  My sense is that the level of security is probably similar to the UMich Cloud since the lead organization, the San Diego Supercomputer Center, is in the world of higher education, like UMich, but it may well be the case that they have a stronger security story to tell, and I just don't know it.  And like DuraCloud, my sense is that it will come down to services:  If Chronopolis builds great services that facilitate archival storage, that will make it an interesting choice.
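Speaking of the fixity checks one has to build oneself: here is a minimal sketch of what that amounts to, assuming a simple manifest of SHA-256 digests.  The manifest path and format are made up, and the scheduling (cron or otherwise) is left out:

# Do-it-yourself fixity check: recompute each file's digest and compare it
# against a stored manifest.  The manifest location and format are illustrative.
import hashlib
import os

MANIFEST = "/archival-storage/fixity-manifest.txt"   # lines of "<sha256>  <path>"

def sha256(path, blocksize=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(blocksize), b""):
            digest.update(block)
    return digest.hexdigest()

def check(manifest=MANIFEST):
    problems = []
    with open(manifest) as fh:
        for line in fh:
            expected, path = line.rstrip("\n").split("  ", 1)
            if not os.path.exists(path):
                problems.append((path, "missing"))
            elif sha256(path) != expected:
                problems.append((path, "digest mismatch"))
    return problems

if __name__ == "__main__":
    for path, problem in check():
        print("FIXITY PROBLEM: %s (%s)" % (path, problem))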

Monday, November 22, 2010

Fedora Object for overall deposit - example

So far we have seen objects for each of the files belonging to a single deposit, but we have not yet seen a parent-level or aggregate-level container for the deposit itself.  Most of the information we have collected has so far been at the file-level, but what about information that relates to the entire set of files?

For this purpose we'll use a separate Fedora Object.  Its RDF will assert a relationship to each of the file-level Fedora Objects, and, as we have already seen, each of them asserts a relationship to this object on the left.

At this time we are using a generic Fedora Object for the higher level deposit object, but it may make sense to create a separate Content Model for it if we know that it must always have a Datastream that contains a history of deposit-related actions.

For the example object we've created the object to the left.  (The object is also a hyperlink to our public Fedora repository.)

The Dublin Core Datastream is relatively empty.  Most of the content is captured in a PREMIS XML Datastream at the end.  At the time of this post the object contains PREMIS that was built by hand, and so the syntax may not be quite correct.  But even if the syntax is off, we think the concept is right.

The PREMIS is very basic.  An obvious enhancement would be to add stanzas capturing the terms to which the depositor agreed, which would help flesh out the Access Rights portion of the Preservation Description Information (PDI).
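To make the idea a bit more concrete, here is a sketch of the kind of deposit-level PREMIS event stanza we have in mind.  Like the hand-built Datastream itself, it may not be letter-perfect PREMIS; the identifier scheme and the event vocabulary are just examples:

# Sketch of a deposit-level PREMIS event.  Element names follow my reading of
# PREMIS 2.x; the identifier scheme and event types are illustrative only.
PREMIS_EVENT = """\
<premis:event xmlns:premis="info:lc/xmlns/premis-v2">
  <premis:eventIdentifier>
    <premis:eventIdentifierType>ICPSR</premis:eventIdentifierType>
    <premis:eventIdentifierValue>%(event_id)s</premis:eventIdentifierValue>
  </premis:eventIdentifier>
  <premis:eventType>%(event_type)s</premis:eventType>
  <premis:eventDateTime>%(when)s</premis:eventDateTime>
  <premis:eventDetail>%(detail)s</premis:eventDetail>
</premis:event>
"""

def deposit_event(event_id, event_type, when, detail):
    return PREMIS_EVENT % {"event_id": event_id, "event_type": event_type,
                           "when": when, "detail": detail}

if __name__ == "__main__":
    print(deposit_event("deposit-15868-event-1", "ingestion",
                        "2010-11-22T00:00:00", "Deposit received via the Data Deposit Form"))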

Next up I'll share our summary of how we think we will generate a Submission Information Package for each of the items in the deposit.

Friday, November 19, 2010

TRAC: B2.6: Matching unique SIP IDs to AIP IDs

B2.6 If unique identifiers are associated with SIPs before ingest, the repository preserves the identifiers in a way that maintains a persistent association with the resultant archived object (e.g., AIP).

SIPs will not always contain unique identifiers when the repository receives them. But where they do, and particularly where those identifiers were widely known before the objects were ingested, it is important that they are either retained as is, or that some mechanism allows the original identifier to be transformed into one used by the repository.

For example, consider an archival repository whose SIPs consist of file collections from electronic document management systems (EDMS). Each incoming SIP will contain a unique identifier for each file within the EDMS, which may just be the pathname to the file. The repository cannot use these as they stand, since two different collections may contain files with the same pathname. The repository may generate unique identifiers by qualifying the original identifier in some way (e.g., prefixing the pathname with a unique ID assigned to the SIP of which it was a part). Or it may simply generate new unique numeric identifiers for every file in each SIP. If it qualifies the original identifier, it must explain the scheme it uses. If it generates entirely new identifiers, it will probably need to maintain a mapping between original IDs and generated IDs, perhaps using object-level metadata.

Documentation must show the policy on handling the unique identification of SIP components as the objects to be preserved are ingested, preserved, and disseminated. Where special handling is required, this must be documented for each SIP as a part of the provenance information capture (see B2.3).

Evidence: Workflow documents and evidence of traceability (e.g., SIP identifier embedded in AIP, mapping table of SIP IDs to AIPs).



My sense is that ICPSR is in pretty good shape on this requirement.  Here is a stab at the documentation.

ICPSR collects the original file name (sanitized to avoid problems via SQL injection and other attacks) for each item at the time of deposit.  Each item also receives a unique ID in our deposit tracking system.  The deposit event itself also receives a unique ID, which is exposed to the depositor.

During the data curation process an ICPSR data manager will mint a new container-level object for access purposes (i.e., a study), and will fill that container with materials derived from the deposit.  This activity can run the gamut from a one-to-one mapping between deposited files and released files to something far more complex.  In any event ICPSR records and preserves the mapping between the aggregate-level objects (deposits and studies).  ICPSR also has plans to record and preserve the mapping at the more fine-grained level of each file.
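As an illustration, that mapping boils down to a couple of very simple tables.  The sketch below uses a made-up SQLite schema, not our actual tracking database, and the study number in the example row is invented:

# Sketch of the deposit-to-study mapping we record, plus the planned
# file-level mapping.  The schema and database name are illustrative only.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS deposit_study_map (
    deposit_id INTEGER NOT NULL,   -- unique ID from our deposit tracking system
    study_id   INTEGER NOT NULL    -- container-level object minted during curation
);
CREATE TABLE IF NOT EXISTS file_map (
    deposit_file_id  INTEGER NOT NULL,   -- unique ID assigned to each deposited file
    released_file_id INTEGER             -- NULL until a released file is derived from it
);
"""

if __name__ == "__main__":
    conn = sqlite3.connect("sip-aip-map.db")   # placeholder database
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO deposit_study_map VALUES (?, ?)", (15868, 99999))
    conn.commit()
    conn.close()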

Thursday, November 18, 2010

Fedora Objects for deposited files - example

We are using deposit #15868 as an example to illustrate the Fedora Objects we will use to store social science research data and documentation that has been deposited at ICPSR.  We chose this deposit since each of its files is readily available on a public web site and poses no risk of disclosure.  Each of the images below is also a hyperlink to the corresponding object in our public Fedora Commons repository.

The first file in the deposit contains the survey data.  We assign a unique ID (Reference) for the file that will not change.

In this case the survey data are in a format produced by the statistical analysis software called SAS, and our file format identification software has assigned it the MIME type of application/x-sas.  This content goes into its own Datastream (the last one on the left), and Fedora calculates a message digest to fingerprint the file (Fixity).

We note the original name of the file in the DC Datastream along with the identity of the depositor, the origin of the file, and the identity of the organization that created the file (Provenance).

 We capture its relationship to the higher-level deposit transaction via a relationship in the RELS-EXT Datastream, and we also later capture what role this file plays in the data curation lifecycle at ICPSR (Context).

Not captured or shown at this level are the terms to which the depositor agreed when transferring this content to ICPSR (Access Rights).  We will store those in the aggregate-level object.  Typically the depositor grants ICPSR non-exclusive rights to reproduce and publish the content, but this is not always the case.
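Pulling the file-level pieces above together, here is a rough sketch of how the content Datastream and its fixity digest might be stored through Fedora 3's REST interface.  The host, credentials, PID, Datastream ID, and filename are all placeholders, and the parameter names reflect my reading of the addDatastream call, so check the Fedora documentation rather than trusting them blindly:

# Sketch: store a deposited file as a managed Datastream and ask Fedora to
# record a checksum.  Host, credentials, PID, and filenames are placeholders.
import requests

FEDORA = "http://localhost:8080/fedora"    # placeholder host
AUTH = ("fedoraAdmin", "fedoraAdmin")      # placeholder credentials

def add_content_datastream(pid, ds_id, path, mime_type):
    params = {
        "controlGroup": "M",               # managed content
        "dsLabel": path.rsplit("/", 1)[-1],
        "mimeType": mime_type,             # e.g. application/x-sas from our ifile
        "checksumType": "SHA-1",           # ask Fedora to compute a fixity digest
    }
    with open(path, "rb") as fh:
        response = requests.post("%s/objects/%s/datastreams/%s" % (FEDORA, pid, ds_id),
                                 params=params, data=fh, auth=AUTH)
    response.raise_for_status()

if __name__ == "__main__":
    add_content_datastream("icpsr:15868-1", "CONTENT",
                           "survey-data.sas7bdat", "application/x-sas")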

Likewise, we capture similar information for the other three files in the deposit:


In the next blog post on this topic, I'll publish a description of the aggregate object to which these four assert an isPartOf relationship.  Once we have that object as well, we can begin talking about producing an OAIS Submission Information Package (SIP) for each object.

Wednesday, November 17, 2010

Fedora objects for deposits

Researchers and government agencies (and their proxies at ICPSR) use a web portal called the Data Deposit Form to transfer content to ICPSR.  The form contains many opportunities for a depositor to enter metadata about the transfer, but only a few are required:  the name of the depositor and a name for the deposit.

A deposit may have an arbitrary number of files, and those files may be uploaded individually or as a single "archive" file, such as a Zip or GNU Zip archive.  In a case where the depositor uploads an archive file, ICPSR unpacks it to extract the actual content.  And if the archive file contains an archive file, ICPSR systems continue unpacking recursively.
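The recursive unpacking is conceptually simple.  Here is a sketch in Python that handles just Zip and GNU Zip as an illustration; the real pipeline does more bookkeeping and sanity checking:

# Sketch of recursive unpacking: keep expanding Zip and GNU Zip files until
# only ordinary content files remain.  Illustrative only.
import gzip
import os
import shutil
import zipfile

def unpack(path, workdir):
    """Unpack one archive file into workdir; return True if it was an archive."""
    if zipfile.is_zipfile(path):
        zipfile.ZipFile(path).extractall(workdir)
        return True
    if path.endswith(".gz"):
        target = os.path.join(workdir, os.path.basename(path)[:-3])
        with gzip.open(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return True
    return False

def unpack_recursively(workdir):
    """Keep unpacking until no archive files remain anywhere under workdir."""
    again = True
    while again:
        again = False
        for root, _, files in os.walk(workdir):
            for name in files:
                path = os.path.join(root, name)
                if unpack(path, root):
                    os.remove(path)   # replace the archive with its contents
                    again = True

if __name__ == "__main__":
    unpack_recursively("/tmp/deposit-workdir")   # placeholder working directory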

Our intention is to put each of the deposited files (unpacked, if necessary) in its own Fedora object.  This object will be an off-the-shelf object without any special Content Model.  Here is an example:


(Note that all of the images are also hyperlinks to Fedora Objects in our public Fedora Commons repository.)

This is a standard Fedora Object, conforming only to the Content Model for all objects.

Each deposited file contains a unique ID captured in the PID, and the usual, minimal Fedora object properties.

We also enable the Audit Datastream to record any changes to the object, and use the DC (Dublin Core) Datastream to capture some of the metadata we collect via our Data Deposit Form.

We use a relationship expressed in the RELS-EXT Datastream to point to a parent-level object which is used to link the files within a single deposit and to capture any metadata which applies to the entire deposit, not just the individual files.
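Serialized, that relationship looks roughly like the output of the sketch below.  The PIDs are placeholders; the isPartOf predicate is the one we describe elsewhere in these posts:

# Sketch of the RELS-EXT Datastream tying a file-level object to its parent
# deposit-level object.  PIDs are placeholders.
RELS_EXT = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rel="info:fedora/fedora-system:def/relations-external#">
  <rdf:Description rdf:about="info:fedora/%(file_pid)s">
    <rel:isPartOf rdf:resource="info:fedora/%(deposit_pid)s"/>
  </rdf:Description>
</rdf:RDF>
"""

def rels_ext(file_pid, deposit_pid):
    return RELS_EXT % {"file_pid": file_pid, "deposit_pid": deposit_pid}

if __name__ == "__main__":
    print(rels_ext("icpsr:15868-1", "icpsr:deposit-15868"))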

The content is highly variable.  In addition to receiving survey data in plain text, ICPSR also receives data in a variety of proprietary formats (e.g., SAS) and related documentation in a wide array of formats (word processor output, plain text documents, and many others).

To illustrate the example further, we created a Fedora Object for each of the files found in one of our recent deposits.  We selected this deposit because the content is entirely public-use, and is readily available from a public web site.  The deposit is also a nice size (only four files).  To keep this blog post at a reasonable size, I'll save the example for tomorrow's post.

Tuesday, November 16, 2010

Fedora Content Model for Social Science Research Data - Redux

A group of us have been getting together once per week for the past month or two to revisit some of our earlier decisions about social science research data and how we intend to store it in Fedora.  (You can find the original content model by searching the blog for the tag 'eager' -- this work is supported by an NSF INTEROP EAGER grant.)

Our thinking about the type of Fedora objects that we would like to use has shifted from our first thoughts in 2009.  The original objects aimed to group related content within the same object, but in different Datastreams.  We are now thinking of using much simpler objects where the content forms one Datastream, and any related content is packed into separate objects of its own, linked together using the RDF syntax available in RELS-EXT.  If the file-level object has metadata which doesn't fit well into existing places, then we may create a second Datastream to collect it.  For example, if we want to record the preservation actions performed on the file/object, we think it makes sense to capture that in a PREMIS-format Datastream stored alongside the actual file/object content.

I'll kick off this continuing line of posts tomorrow with an example container for what we call a Deposit at ICPSR.  This is the container that a researcher, government agency, or even an ICPSR staffer uses to move content into the data curation systems of ICPSR.

Monday, November 15, 2010

Heading to CNI in December

I'll be giving a talk about what we're doing with Fedora for social science data at the Fall 2010 Coalition for Networked Information member meeting.  We've been working through a different content model architecture for data and its related documentation, and I'm intending to post some updated objects with links to our public Fedora repository next week.

We have also been spending some time talking about OAIS packages - DIPs, AIPs, and SIPs - for content that is deposited at ICPSR.  I think this too will make an interesting post, and I'll get something up about this before the end of the month.

Friday, November 12, 2010

TRAC: B2.5: AIP Naming Conventions

B2.5 Repository has and uses a naming convention that generates visible, persistent, unique identifiers for all archived objects (i.e., AIPs).

A repository needs to ensure that an accepted, standard naming convention is in place that identifies its materials uniquely and persistently for use both in and outside the repository. The “visibility” requirement here means “visible” to repository managers and auditors. It does not imply that these unique identifiers need to be visible to end users or that they serve as the primary means of access to digital objects.

Equally important is a system of reliable linking/resolution services in order to find the uniquely named object, no matter its physical location. This is so that actions relating to AIPs can be traced over time, over system changes, and over storage changes. Ideally, the unique ID lives as long as the AIP; if it does not, there must be traceability. The ID system must be seen to fit the repository’s current and foreseeable future requirements for things like numbers of objects. It must be possible to demonstrate that the identifiers are unique. Note that B2.1 requires that the components of an AIP be suitably bound and identified for long-term management, but places no restrictions on how AIPs are identified with files. Thus, in the general case, an AIP may be distributed over many files, or a single file may contain more than one AIP. Therefore identifiers and filenames may not necessarily correspond to each other.

Documentation must show how the persistent identifiers of the AIP and its components are assigned and maintained so as to be unique within the context of the repository. The documentation must also describe any processes used for changes to such identifiers. It must be possible to obtain a complete list of all such identifiers and do spot checks for duplications.

Evidence: Documentation describing naming convention and physical evidence of its application (e.g., logs).



ICPSR generates a unique ID for each file that we receive via a deposit, and a unique ID for each post-processed file that we create.  Files are stored in well-defined locations in archival storage, and between the location and filename (which also follows a set of standard conventions within ICPSR), one can identify considerable provenance information which is also replicated in a database.

Specifically, each file that has been deposited at ICPSR - if retained for preservation - has a unique ID stored in a database, and has a unique location in archival storage:  deposits/depositID/originalFilenameSanitized.  The root of the location varies depending upon the physical location of the copy in archival storage.  For example, a copy stored locally at ICPSR may have a URI root of file://nas.icpsr.umich.edu/archival-storage while a copy stored in an off-site archival location will have a different URI root.

Likewise, content that is produced by ICPSR staff - if retained for preservation - has a similar unique ID and unique location.
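Putting the convention together, resolving the location of any given archival copy is just string assembly.  In the sketch below the off-site root and the example filename are placeholders:

# Sketch: resolve the archival-storage URI for a deposited file in a given
# storage copy.  The off-site root and example filename are made up.
STORAGE_ROOTS = {
    "local":    "file://nas.icpsr.umich.edu/archival-storage",
    "off-site": "file://offsite.example.org/archival-storage",   # placeholder root
}

def archival_uri(copy, deposit_id, sanitized_filename):
    return "%s/deposits/%s/%s" % (STORAGE_ROOTS[copy], deposit_id, sanitized_filename)

if __name__ == "__main__":
    print(archival_uri("local", 15868, "survey_data_original.sas7bdat"))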

Tuesday, November 9, 2010

Just for fun: Conan O'Brien premiere cold open

I came across a link to this today in Kara Swisher's BoomTown blog.  It has such a nice connection to The Godfather, I can't help but share the video too.

Monday, November 8, 2010

TRAC: B2.4: SIPs need to become AIPs

B2.4 Repository can demonstrate that all submitted objects (i.e., SIPs) are either accepted as whole or part of an eventual archival object (i.e., AIP), or otherwise disposed of in a recorded fashion.

The timescale of this process will vary between repositories from seconds to many months, but SIPs must not remain in a limbo-like state forever. The accessioning procedures and the internal processing and audit logs should maintain records of all internal transformations of SIPs to demonstrate that they either become AIPs (or part of AIPs) or are disposed of. Appropriate descriptive information should also document the provenance of all digital objects.


Evidence: System processing files; disposal records; donor or depositor agreements/deeds of gift; provenance tracking system; system log files.



This TRAC requirement moves us away from the technology area and into business processes in other parts of ICPSR.  As such my critique comes more from the perspective of an informed outsider than that of the responsible party.

My sense is that ICPSR has a good review process in place such that deposits are tracked on a regular basis by our acquisitions staff.  If a deposit becomes stuck - which can happen for all sorts of different reasons, only some of which are under the control of ICPSR - the acquisitions team makes sure that it does not fall off the radar screen.

That said, it is certainly possible for ICPSR to receive an unsolicited deposit from a researcher, find some problems with the data or the documentation, and then run into barriers when working with the researcher to resolve the issues.  In this case a deposit can move very slowly through ICPSR's machinery, and may take many years to emerge.  However, even in an uncommon case such as this, we will have records that track and document the barriers so that there is formal institutional memory about the deposit.

Friday, November 5, 2010

University of Michigan CI Days


A few of us presented a poster and attended a recent symposium called CI Days at the University of Michigan.  This was one of several similar events that the NSF has been funding across the country.  The local event was hosted by an organization led by Dan Atkins, who is back at the U-M after serving as the Director of the Office of Cyberinfrastructure at the National Science Foundation.

The event started with an evening reception and poster session on how staff and researchers at the University of Michigan are using cyberinfrastructure in their work.  Our poster (shown above as a clickable image linking to the full-size poster) highlighted our work on our NIH Challenge Grant to study how one might use the cloud to more effectively share and protect confidential research data.  Kudos to Jenna Tyson at ICPSR who designed and built the final submission.

Most of the posters were from grad students, many of whom were in the College of Engineering.  Unlike most of those, ours had much, much less small-print text, and our experience was that this made it easier for people to stop by the poster, actually read the words, and then engage in a conversation.  People expressed a lot of interest in how we were using the cloud, and how we were intending to protect the data.

A day-long event featuring keynote speakers, CI users, and CI providers followed the poster session.  Jimmy Lin gave the opening keynote address.  I found Jimmy's talk to be both interesting and educational.  Jimmy proposed that computer scientists need a new paradigm for computing, something more abstract than the classic Von Neumann computing architecture.  The problem is that software developers and computer scientists spend too much time debugging race conditions and locking rather than inventing new and better algorithms and methods.  "The data center is the computer" was Jimmy's tag line.

I attended several breakout sessions, but the one with the best takeaway for me was Andy Caird's session on Flux, a developing bit of U-M CI for large-scale batch processing on a Linux cluster.  At ICPSR I don't often come across requests for significant computational resources like this, but the next time that I do, I'm going to contact Andy.

The final keynote was given by Larry Smarr.  To again grossly summarize a very interesting talk, just like I did with Jimmy's above, Larry's talk encouraged the audience to bridge the gap between the high speed networks that one finds in a machine room or within the control plane of a computing or storage cluster and the much slower networks that form our campus intra- and inter-connects.  Larry gave several compelling examples of where very high speed end-to-end network connections enabled capabilities and interactions well beyond the common videoconference and the downright prosaic teleconference.

The event wrapped up with a town hall meeting where the audience was invited to give feedback on the event.  I gave the organizers high marks for putting together a nice blend of presentations and informal networking time (not unlike CNI meetings), and suggested that at future events they invite speakers from off-campus who can present "success stories" of how they are using or deploying CI at their institution.