Friday, December 10, 2010

ICPSR Technology Resources

I have been working on some boilerplate text that we plug into grant and contract applications, and it felt like it might make for a mildly interesting blog post.  We are also working on a nice diagram to accompany this narrative, and once that is put together, I'll pass that along too.  Here it is:

ICPSR operates an extensive computing environment.  Our machine room contains dozens of multi-core Linux servers supporting the curation, preservation, and dissemination of social science research data and documentation.  Over 120 desktop workstations access these servers and our EMC NAS storage system, which features over 50 terabytes of capacity. Oracle is our enterprise database management system, and we support all major statistical packages. We are connected to high-speed national and regional data networks via redundant multi-gigabit circuits, and the University of Michigan monitors our production computing systems 24 hours/day, seven days per week, notifying an on-call systems engineer in the event of system degradation.
ICPSR takes advantage of its position within the University of Michigan to gain access to a variety of systems and services, such as preferential software licensing terms, help desk and trouble ticket systems, network vulnerability scanning, desktop virtualization services, managed networks and firewalls, hosted content management systems, collaboration and courseware (Sakai) systems, and desktop workstation software and patch management systems.
ICPSR makes regular use of commercial computing clouds to provision elements of its cyberinfrastructure, particularly services that require high availability for delivering public-use content.  ICPSR also has regular access to special-purpose, cloud-based infrastructure (Chronopolis, DuraCloud) to support its digital preservation mission.

Wednesday, December 8, 2010

DuraCloud pilot update - December 2010

I'm getting together with my DuraCloud pilot colleagues the morning before CNI starts.  I just saw the agenda for the meeting, and it looks like it will be a fruitful and interesting set of conversations.

I also had a very nice chat with Carol Minton Morris from Duraspace about our expectations from the DuraCloud pilot project, and our experience to date.  You can find her write-up of that conversation here on the DuraSpace blog.

Friday, December 3, 2010

TRAC: B2.8: Capturing Representation Information

B2.8 Repository records/registers Representation Information (including formats) ingested.

When international standards for the associated Representation Information are not available, the repository needs to capture such information and register it so that it is readily findable and reusable. Some of it may be incorporated into software. The Representation Information is critical to the ability to turn bits into usable information and must be permanently associated with the Content Information.

Evidence: Viewable records in local format registry (with persistent links to digital objects); local metadata registry(ies); database records that include Representation Information and a persistent link to relevant digital objects.



As noted in last week's post, we capture representation information in both IANA MIME type form and also in a more human-readable form. We are also looking at adding an additional piece of representation information to the metadata that surrounds the files deposited at ICPSR:  file type or file role.

In brief, the idea is to capture the high-level concept behind the role the file plays in research.  For example, it may be nice to know that a given file is an Excel workbook, but it is also important to know whether the file contains data, documentation, a database of sorts, or some combination of things.  An Excel file that contains nothing but columns of numbers and text might be normalized quite easily into a more durable format.  An Excel file that contains nothing but text and images and descriptions of a data file might be converted to PDF/A or TIFF or some other format.

This idea has been used with the derived content that ICPSR produces for a very long time, but ICPSR is just now exploring the required changes in business process to do this for deposited files as well.  More as this story develops....

Wednesday, December 1, 2010

ICPSR Deposits - November 2010

Here is a snapshot of the types of deposits and deposited files that ICPSR received in November 2010.

My sense is that this was a fairly typical month both in terms of volume and the types of files.  We always get lots of documentation and related text in one of the common formats, such as PDF and MS Word.  And the main sources of data come in one of the big stat packages, like SAS and SPSS, or in plain text.


# of files  # of deposits  File format
         3              2  application/msoffice
       120             14  application/msword
       202             33  application/pdf
         3              2  application/vnd.ms-excel
        22              6  application/x-sas
        40             13  application/x-spss
         3              1  application/x-stata
         2              2  message/rfc822 7bit
         1              1  text/html
         3              1  text/html; charset=utf-8
         5              5  text/plain; charset=iso-8859-1
        16              8  text/plain; charset=unknown
       110             31  text/plain; charset=us-ascii
        48              1  text/plain; charset=utf-8
         1              1  text/rtf
         1              1  text/x-c++; charset=us-ascii

Wednesday, November 24, 2010

TRAC: B2.7: Format registries

B2.7 Repository demonstrates that it has access to necessary tools and resources to establish authoritative semantic or technical context of the digital objects it contains (i.e., access to appropriate international Representation Information and format registries).

The Global Digital Format Registry (GDFR), the UK National Archives’ file format registry PRONOM, and the UK Digital Curation Centre’s Representation Information Registry are three emerging examples of potential international standards a repository might adopt. Whenever possible, the repository should use these types of standardized, authoritative information sources to identify and/or verify the Representation Information components of Content Information and PDI. This will reduce the long-term maintenance costs to the repository and improve quality control.

Most repositories will maintain format information locally to maintain their independent ability to verify formats or other technical or semantic details associated with each archival object. In these cases, the use of international format registries is not meant to replace local format registries but instead serve as a resource to verify or obtain independent, authoritative information about any and all file formats.

Evidence: Subscription or access to such registries; association of unique identifiers to format registries with digital objects.



The volume of content entering ICPSR is relatively low, perhaps 100 submissions per month.  In some extreme cases, such as with our Publication Related Archive, ICPSR staff spend relatively little time reviewing and normalizing content; it is released "as is" on the web site and receives the most modest level of digital preservation (bit-level only, unless the content happens to be something more durable, such as plain text).  However, in most cases, someone at ICPSR is opening each file, reading and reviewing documentation, scrubbing data for disclosure risk, recoding, etc.  It is a very hands-on process.

Because of the low volumes and high touches, automated format detection is not at all essential, at least for the current business model.  Nonetheless we do use automated format detection, both for the files that we receive via our deposit system and for the derivative files we produce internally.  And our tool for doing this is the venerable UNIX command-line utility file.

Why?

The content that we tend to receive is a mix of documentation and data.  The documentation is often in PDF format, but sometimes arrives in common word processor formats like DOC and DOCX, and sometimes in less common word processor formats.  The data is often in a format produced by common statistical packages such as SAS, SPSS, and Stata.  And we also get a nice mix of other file formats from a wide variety of business applications like Access, Excel, PowerPoint, and more.

We have found the vanilla file that ships with Red Hat Linux to be pretty good at most of the formats that show up on our doorstep.  We've extended the magic database that file consults so that it does a better job understanding a broader selection of stat package formats.  (file does OK, but not great in this area.)  We also have extended the magic database and wrapped file in a helper tool -- we call the final product ifile, for improved file -- so that it does a better job identifying the newer Microsoft Office file formats like DOCX, XLSX, PPTX, and so on.
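As a rough illustration of the kind of refinement a wrapper like this performs, here is a minimal sketch (not ICPSR's actual ifile; the OOXML directory markers and the fallback logic are assumptions): when plain file answers with a generic ZIP type, peek inside the container to see whether it is really one of the newer Office formats.

```python
import subprocess
import zipfile

# OOXML files are ZIP containers; the top-level directory inside the
# archive reveals which Office application produced them.
OOXML_MARKERS = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/":   "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/":  "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}

def file_mime(path):
    """Ask the command-line `file` utility for a MIME type."""
    return subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def identify(path, base=file_mime):
    """Return a MIME type for path, refining generic ZIP answers."""
    mime = base(path)
    if mime == "application/zip" and zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            names = zf.namelist()
        for prefix, ooxml_mime in OOXML_MARKERS.items():
            if any(n.startswith(prefix) for n in names):
                return ooxml_mime
    return mime
```

The `base` parameter is injectable so the refinement logic can be exercised without the `file` binary; a production wrapper would also carry the extended magic database itself.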

I would love to be able to use an "off the shelf" tool like jhove or droid to identify file formats, relying upon a global registry for formats.  There isn't much glamor in hacking the magic database.

However, my experience thus far with jhove, jhove2, and droid is that they just don't beat ifile (or even file) for the particular mix of content we tend to get.  Those packages are much more heavy-weight, and while they do a fabulous job on some formats (like PDF), they perform poorly or not at all with many of the formats we see on a regular basis.

As a test I took a week or two of the files most recently deposited at ICPSR, and I had file, ifile, and droid have a go at identifying them.  I originally had jhove2 in the mix as well, but I could not get it to run reliably.  (And my sense is that it may be using the same data source as droid for file identification anyway.)  Of the 249 files I examined, ifile got 'em all right, file misreported on 48 of them, and droid misreported 104 files.  And the number for droid gets worse if I ding it for reporting a file containing an email as text/plain rather than message/rfc822.

So in the end we're using our souped-up version of file for file format identification, and we're using IANA MIME types as our primary identifier for file format.  We also capture the more verbose human-readable output from our own ifile as well since it can be handy to have "SPSS System File MS Windows Release 10.0.5" rather than just application/x-spss.

Tuesday, November 23, 2010

The Cloud and Archival Storage

Price.  Availability.  Services.  Security.

These are the four parameters that I use when deciding where to store one of our archival storage copies.

For me the cloud is just another storage container.  Fundamentally it is no different from a physical storage location except along these four dimensions.  In fact, I can conceptualize my "non-cloud" storage locations as storage-as-a-service cloud providers, but where the provider is a lot more local than the big names in "cloud" today:

ICPSR Cloud:  This is the portion of the EMC NS-120 NAS that I use for a local copy of archival storage.  It is very expensive with a reasonably high level of availability.  It provides very few services; if I want to perform a fixity check of the objects I have stored here, I have to create and schedule that myself.  Because I have physical control over the ICPSR Cloud, I have an irrational belief that it is probably secure, even though I know that ICPSR isn't as physically secure as many other companies at which I have worked.  Certainly ICPSR does not make any statements or guarantees about ISO 27001 compliance.

UMich Cloud:  This is a multi-TB chunk of NFS file storage that I rent from the University of Michigan's Information Technology Services (ITS) organization.  They call it ITS Value Storage.  The price here is excellent, but the level of availability is just a hair lower.  I don't notice the lower level of availability most of the time, but I do perceive it when running long-lived, I/O-intensive applications.  Like my own cloud, this one has no services unless I deploy them myself.  Because I do not have physical control over the equipment, or even know exactly where the equipment is (beyond a given data center), it feels like there is less control.  ITS makes no promises about ISO 27001 compliance (or promises about other standards), but my sense is that their controls and physical security and IT management processes must be at least as good as mine.  After all, they are managing many, many TBs for many different university departments and organizations, including themselves.

Amazon Cloud:  This is a multi-TB chunk of Elastic Block Storage (EBS) that I rent from Amazon Web Services.  I use EBS rather than the Simple Storage Service (S3) because I want the semantics of a filesystem so that I don't have to worry about things like files that are large or that have funny characters in their names.  The price here is good, better than my EMC NAS, but not as good as the ITS Value Storage.  The availability is quite good overall, but, of course, the network throughput between ICPSR and AWS is nowhere near as good as intra-campus networking, and it is even worse for the AWS EU location.  The services are no better and no worse than my own cloud or the UMich cloud.  Like the ITS Value Storage service I have no control over the physical systems, and I know even less about their physical location.  Amazon says that it passed a SAS 70 audit, and recently received an ISO 27001 certification.  This seems to be a better security story than anyone else so far.

DuraCloud:  Unlike the other clouds, I'm not using this one for archival storage; it is still in a pilot phase.  The availability is similar to plain old AWS (which hosts the main DuraCloud service), and the price is still under discussion.  My expectation is that the level of security is no better (and no worse) than the underlying cloud provider(s), and so depending upon which storage provider one selects, one's mileage may vary.  However, the really interesting thing about DuraCloud is the idea of the services.  If DuraCloud can execute and deliver useful, robust services on top of basic storage clouds, that will be a true value-add, and will make this a very compelling platform for archival storage.

Chronopolis:  Like DuraCloud, this too is not in production yet, and is being groomed (I think) as a future for-fee, production service.  I don't have as much visibility here with regard to availability since I am not actively moving content in and out of Chronopolis; most of the action seems to be taking place under the hood between the storage partner locations.  My sense is that the level of security is probably similar to the UMich Cloud since the lead organization, the San Diego Supercomputer Center, is in the world of higher education, like UMich, but it may well be the case that they have a stronger security story to tell, and I just don't know it.  And like DuraCloud, my sense is that it will come down to services:  If Chronopolis builds great services that facilitate archival storage, that will make it an interesting choice.
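The do-it-yourself fixity checking mentioned in the ICPSR Cloud paragraph above might look something like this minimal sketch (SHA-256 over a directory tree; the scheduling and reporting a real deployment needs are left out):

```python
import hashlib
import os

def fixity_manifest(root):
    """Walk root and return {relative_path: sha256_hexdigest}."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MB chunks so large archival files don't
                # have to fit in memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(path, root)] = digest.hexdigest()
    return manifest

def verify(root, saved_manifest):
    """Return the paths whose current digest differs from the saved one."""
    current = fixity_manifest(root)
    return sorted(p for p, d in saved_manifest.items() if current.get(p) != d)
```

In practice one would store the manifest alongside (or apart from) the archival copy and run `verify` on a schedule, alerting on any non-empty result.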

Monday, November 22, 2010

Fedora Object for overall deposit - example

So far we have seen objects for each of the files belonging to a single deposit, but we have not yet seen a parent-level or aggregate-level container for the deposit itself.  Most of the information we have collected has so far been at the file-level, but what about information that relates to the entire set of files?

For this purpose we'll use a separate Fedora Object.  Its RDF will assert a relationship to each of the file-level Fedora Objects, and as we have already seen, each of them asserts a relationship to this object on the left.

At this time we are using a generic Fedora Object for the higher level deposit object, but it may make sense to create a separate Content Model for it if we know that it must always have a Datastream that contains a history of deposit-related actions.

For the example object we've created the object to the left.  (The object is also a hyperlink to our public Fedora repository.)

The Dublin Core Datastream is relatively empty.  Most of the content is captured in a PREMIS XML Datastream at the end.  At the time of this post the object contains PREMIS which has been built by hand, and so may not be quite correct.  But if the syntax isn't quite correct, we think that the concept is.

The PREMIS is very basic.  An obvious enhancement would be to add additional stanzas to capture the terms to which the depositor agreed to help flesh out the Access Rights portion of the Preservation Description Information (PDI).

Next up I'll share our summary of how we think we will generate a Submission Information Package for each of the items in the deposit.

Friday, November 19, 2010

TRAC: B2.6: Matching unique SIP IDs to AIP IDs

B2.6 If unique identifiers are associated with SIPs before ingest, the repository preserves the identifiers in a way that maintains a persistent association with the resultant archived object (e.g., AIP).

SIPs will not always contain unique identifiers when the repository receives them. But where they do, and particularly where those identifiers were widely known before the objects were ingested, it is important that they are either retained as is, or that some mechanism allows the original identifier to be transformed into one used by the repository.

For example, consider an archival repository whose SIPs consist of file collections from electronic document management systems (EDMS). Each incoming SIP will contain a unique identifier for each file within the EDMS, which may just be the pathname to the file. The repository cannot use these as they stand, since two different collections may contain files with the same pathname. The repository may generate unique identifiers by qualifying the original identifier in some way (e.g., prefixing the pathname with a unique ID assigned to the SIP of which it was a part). Or it may simply generate new unique numeric identifiers for every file in each SIP. If it qualifies the original identifier, it must explain the scheme it uses. If it generates entirely new identifiers, it will probably need to maintain a mapping between original IDs and generated IDs, perhaps using object-level metadata.

Documentation must show the policy on handling the unique identification of SIP components as the objects to be preserved are ingested, preserved, and disseminated. Where special handling is required, this must be documented for each SIP as a part of the provenance information capture (see B2.3).

Evidence: Workflow documents and evidence of traceability (e.g., SIP identifier embedded in AIP, mapping table of SIP IDs to AIPs).



My sense is that ICPSR is in pretty good shape on this requirement.  Here is a stab at the documentation.

ICPSR collects the original file name (sanitized to avoid problems via SQL injection and other attacks) for each item at the time of deposit.  Each item also receives a unique ID in our deposit tracking system.  And the deposit event itself also receives a unique ID which is also exposed to the depositor. 

During the data curation process an ICPSR data manager will mint a new container-level object for access purposes (i.e., a study), and will fill that container with materials derived from the deposit.  This activity can run the gamut from a one-to-one mapping between deposited files and released files to something far more complex.  In any event ICPSR records and preserves the mapping between the aggregate-level objects (deposits and studies).  ICPSR also has plans to record and preserve the mapping at the more fine-grained level of each file.
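A minimal sketch of the mapping described above, at both levels of granularity (ICPSR's actual database layout is not public, so every table, column, and ID here is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Aggregate level: which deposits fed which studies.
    CREATE TABLE deposit_study_map (
        deposit_id INTEGER NOT NULL,
        study_id   INTEGER NOT NULL,
        PRIMARY KEY (deposit_id, study_id)
    );
    -- Planned finer-grained level: which deposited file became
    -- which released file.
    CREATE TABLE file_map (
        deposit_file_id INTEGER NOT NULL,
        study_file_id   INTEGER NOT NULL
    );
""")

# A deposit can map to one study one-to-one, or to several studies.
conn.execute("INSERT INTO deposit_study_map VALUES (15868, 4652)")
rows = conn.execute(
    "SELECT study_id FROM deposit_study_map WHERE deposit_id = ?",
    (15868,),
).fetchall()
```

The composite primary key allows the many-to-many cases the post mentions, where one deposit feeds several studies (or vice versa).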

Thursday, November 18, 2010

Fedora Objects for deposited files - example

We are using deposit #15868 as an example to illustrate the Fedora Objects we will use to store social science research data and documentation that has been deposited at ICPSR.  We chose this deposit since each of the files is readily available on a public web site and poses no risk of disclosure.  Each of the images below is also a hyperlink to the corresponding object in our public Fedora Commons repository.

The first file in the deposit contains the survey data.  We assign a unique ID (Reference) for the file that will not change.

In this case the survey data are in a format produced by the statistical analysis software called SAS, and our file format identification software has assigned it the MIME type of application/x-sas.  This content goes into its own Datastream (last one on the left), and Fedora calculates a message digest to fingerprint the file (Fixity).

We note the original name of the file in the DC Datastream along with the identity of the depositor, the origin of the file, and the identity of the organization that created the file (Provenance).

We capture its relationship to the higher-level deposit transaction via a relationship in the RELS-EXT Datastream, and we also later capture what role this file plays in the data curation lifecycle at ICPSR (Context).

Not captured or shown at this level are the terms to which the depositor agreed when transferring this content to ICPSR (Access Rights).  We will store those in the aggregate-level object.  Typically the depositor grants ICPSR non-exclusive rights to reproduce and publish the content, but this is not always the case.

Likewise, we capture similar information for the other three files in the deposit:


In the next blog post on this topic, I'll publish a description of the aggregate object to which these four assert an isPartOf relationship.  Once we have that object as well, we can begin talking about producing an OAIS Submission Information Package (SIP) for each object.

Wednesday, November 17, 2010

Fedora objects for deposits

Researchers and government agencies (and their proxies at ICPSR) use a web portal called the Data Deposit Form to transfer content to ICPSR.  The form contains many opportunities for a depositor to enter metadata about the transfer, but only a few are required:  the name of the depositor and a name for the deposit.

A deposit may have an arbitrary number of files, and those files may be uploaded individually or as a single "archive" file, such as a Zip or GNU Zip archive.  In a case where the depositor uploads an archive file, ICPSR unpacks it to extract the actual content.  And if the archive file contains an archive file, ICPSR systems continue unpacking recursively.
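The recursive unpacking step can be sketched as follows (ZIP only, for brevity; a real implementation would also handle gzip and tar, would skip formats like DOCX that merely use ZIP as a container, and would harden extractall against path traversal):

```python
import os
import zipfile

def unpack_recursively(archive, dest):
    """Extract archive into dest, then keep unpacking any ZIPs found inside."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)  # assumes a trusted source; harden for hostile input
    for dirpath, _dirs, files in os.walk(dest):
        for name in files:
            inner = os.path.join(dirpath, name)
            if name.lower().endswith(".zip") and zipfile.is_zipfile(inner):
                # Unpack the nested archive into a sibling directory,
                # then drop the archive so only the content remains.
                unpack_recursively(inner, inner + ".unpacked")
                os.remove(inner)
```

Each nested archive is handled by a fresh recursive call, so arbitrarily deep nesting (an archive inside an archive inside an archive) is unpacked the same way.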

Our intention is to put each of the deposited files (unpacked, if necessary) in its own Fedora object.  This object will be an off-the-shelf object without any special Content Model.  Here is an example:


(Note that all of the images are also hyperlinks to Fedora Objects in our public Fedora Commons repository.)

This is a standard Fedora Object, conforming only to the Content Model for all objects.

Each deposited file contains a unique ID captured in the PID, and the usual, minimal Fedora object properties.

We also enable the Audit Datastream to record any changes to the object, and use the DC (Dublin Core) Datastream to capture some of the metadata we collect via our Data Deposit Form.

We use a relationship expressed in the RELS-EXT Datastream to point to a parent-level object which is used to link the files within a single deposit and to capture any metadata which applies to the entire deposit, not just the individual files.

The content is highly variable.  In addition to receiving survey data in plain text, ICPSR also receives data in a variety of proprietary formats (e.g., SAS) and related documentation in a wide array of formats (word processor output, plain text documents, and many others).

To illustrate the example further, we created a Fedora Object for each of the files found in one of our recent deposits.  We selected this deposit because the content is entirely public-use, and is readily available from a public web site.  The deposit is also a nice size (only four files).  To keep this blog post at a reasonable size, I'll save the example for tomorrow's post.

Tuesday, November 16, 2010

Fedora Content Model for Social Science Research Data - Redux

A group of us have been getting together once per week for the past month or two to revisit some of our earlier decisions about social science research data and how we intend to store it in Fedora.  (You can find the original content model by searching the blog for the tag 'eager' -- this work is supported by an NSF INTEROP EAGER grant.)

Our thinking about the type of Fedora objects that we would like to use has shifted from our first thoughts in 2009.  The original objects aimed to group related content within the same object, but in different Datastreams.  We are now thinking of using much simpler objects where the content forms one Datastream, and any related content is packed into separate objects of its own, linked together using the RDF syntax available in RELS-EXT.  If the file-level object has metadata which doesn't fit well into existing places, then we may create a second Datastream to collect it.  For example, if we want to record the preservation actions performed on the file/object, we think it makes sense to capture that in a PREMIS-format Datastream stored alongside the actual file/object content.

I'll kick off this continuing line of posts tomorrow with an example container for what we call a Deposit at ICPSR.  This is the container that a researcher, government agency, or even an ICPSR staffer uses to move content into the data curation systems of ICPSR.

Monday, November 15, 2010

Heading to CNI in December

I'll be giving a talk about what we're doing with Fedora for social science data at the Fall 2010 Coalition for Networked Information member meeting.  We've been working through a different content model architecture for data and its related documentation, and I'm intending to post some updated objects with links to our public Fedora repository next week.

We have also been spending some time talking about OAIS packages - DIPs, AIPs, and SIPs - for content that is deposited at ICPSR.  I think this too will make an interesting post, and I'll get something up about this before the end of the month.

Friday, November 12, 2010

TRAC: B2.5: AIP Naming Conventions

B2.5 Repository has and uses a naming convention that generates visible, persistent, unique identifiers for all archived objects (i.e., AIPs).

A repository needs to ensure that an accepted, standard naming convention is in place that identifies its materials uniquely and persistently for use both in and outside the repository. The “visibility” requirement here means “visible” to repository managers and auditors. It does not imply that these unique identifiers need to be visible to end users or that they serve as the primary means of access to digital objects.

Equally important is a system of reliable linking/resolution services in order to find the uniquely named object, no matter its physical location. This is so that actions relating to AIPs can be traced over time, over system changes, and over storage changes. Ideally, the unique ID lives as long as the AIP; if it does not, there must be traceability. The ID system must be seen to fit the repository’s current and foreseeable future requirements for things like numbers of objects. It must be possible to demonstrate that the identifiers are unique. Note that B2.1 requires that the components of an AIP be suitably bound and identified for long-term management, but places no restrictions on how AIPs are identified with files. Thus, in the general case, an AIP may be distributed over many files, or a single file may contain more than one AIP. Therefore identifiers and filenames may not necessarily correspond to each other.

Documentation must show how the persistent identifiers of the AIP and its components are assigned and maintained so as to be unique within the context of the repository. The documentation must also describe any processes used for changes to such identifiers. It must be possible to obtain a complete list of all such identifiers and do spot checks for duplications.

Evidence: Documentation describing naming convention and physical evidence of its application (e.g., logs).



ICPSR generates a unique ID for each file that we receive via a deposit, and a unique ID for each post-processed file that we create.  Files are stored in well-defined locations in archival storage, and between the location and filename (which also follows a set of standard conventions within ICPSR), one can identify considerable provenance information which is also replicated in a database.

Specifically, each file that has been deposited at ICPSR - if retained for preservation - has a unique ID stored in a database, and has a unique location in archival storage:  deposits/depositID/originalFilenameSanitized.  The root of the location varies depending upon the physical location of the copy in archival storage.  For example, a copy stored locally at ICPSR may have a URI root of file://nas.icpsr.umich.edu/archival-storage while a copy stored in an off-site archival location will have a different URI root.
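Putting the naming convention above together as a small sketch (the sanitization rule shown is an invented stand-in; the post does not specify ICPSR's actual rule):

```python
import re

def archival_uri(root, deposit_id, original_name):
    """Build the archival-storage location for a deposited file:
    <root>/deposits/<depositID>/<originalFilenameSanitized>.
    """
    # Invented sanitization rule: keep a conservative character set,
    # replacing everything else with underscores.
    safe = re.sub(r"[^A-Za-z0-9._-]", "_", original_name)
    return f"{root}/deposits/{deposit_id}/{safe}"

# The local-copy root from the text; an off-site copy would use a
# different URI root with the same deposits/... suffix.
uri = archival_uri("file://nas.icpsr.umich.edu/archival-storage",
                   15868, "survey data.sas")
```

Because only the root varies between storage locations, the same deposit ID and sanitized filename identify the object in every archival copy.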

Likewise, content that is produced by ICPSR staff - if retained for preservation - has a similar unique ID and unique location.

Tuesday, November 9, 2010

Just for fun: Conan O'Brien premiere cold open

I came across a link to this today in Kara Swisher's BoomTown blog.  It has such a nice connection to The Godfather, I can't help but share the video too.

Monday, November 8, 2010

TRAC: B2.4: SIPs need to become AIPs

B2.4 Repository can demonstrate that all submitted objects (i.e., SIPs) are either accepted as whole or part of an eventual archival object (i.e., AIP), or otherwise disposed of in a recorded fashion.

The timescale of this process will vary between repositories from seconds to many months, but SIPs must not remain in a limbo-like state forever. The accessioning procedures and the internal processing and audit logs should maintain records of all internal transformations of SIPs to demonstrate that they either become AIPs (or part of AIPs) or are disposed of. Appropriate descriptive information should also document the provenance of all digital objects.


Evidence: System processing files; disposal records; donor or depositor agreements/deeds of gift; provenance tracking system; system log files.



This TRAC requirement moves us away from the technology area and into business processes in other parts of ICPSR.  As such my critique comes more from a perspective of an informed outsider rather than the responsible party.

My sense is that ICPSR has a good review process in place such that deposits are tracked on a regular basis by our acquisitions staff.  If a deposit becomes stuck - which can happen for all sorts of different reasons, only some of which are under the control of ICPSR - the acquisitions team makes sure that it does not fall off the radar screen.

That said, it is certainly possible for ICPSR to receive an unsolicited deposit from a researcher, find some problems with the data or the documentation, and then run into barriers when working with the researcher to resolve the issues.  In this case a deposit can move very slowly through ICPSR's machinery, and may take many years to emerge.  However, even in an uncommon case such as this, we will have records that track and document the barriers so that there is formal institutional memory about the deposit.

Friday, November 5, 2010

University of Michigan CI Days


A few of us presented a poster and attended a recent symposium called CI Days at the University of Michigan.  This was one of several similar events that the NSF has been funding across the country.  The local event was hosted by an organization led by Dan Atkins, who is back at the U-M after serving as the Director of the Office of Cyberinfrastructure at the National Science Foundation.

The event started with an evening reception and poster session on how staff and researchers at the University of Michigan are using cyberinfrastructure in their work.  Our poster (shown above as a clickable image to the full-size poster) highlighted our work on our NIH Challenge Grant to study how one might use the cloud to more effectively share and protect confidential research data.  Kudos to Jenna Tyson at ICPSR who designed and built the final submission.

Most of the posters were from grad students, many of whom were in the College of Engineering.  In addition, ours had much, much less small-print text, and our experience was that this made it easier for people to stop by the poster, actually read the words, and then engage in a conversation.  People expressed a lot of interest in how we were using the cloud, and how we were intending to protect the data.

A day-long event featuring keynote speakers, CI users, and CI providers followed the poster session.  Jimmy Lin gave the opening keynote address.  I found Jimmy's talk to be both interesting and educational.  Jimmy proposed that computer scientists need a new paradigm for computing, something more abstract than the classic Von Neumann computing architecture.  The problem is that software developers and computer scientists spend too much time debugging race conditions and locking rather than inventing new and better algorithms and methods.  "The data center is the computer" was Jimmy's tag line.

I attended several breakout sessions, but the one with the best takeaway for me was Andy Caird's session on Flux, a developing bit of U-M CI for large-scale batch processing on a Linux cluster.  At ICPSR I don't often come across requests for significant computational resources like this, but the next time that I do, I'm going to contact Andy.

The final keynote was given by Larry Smarr.  To again grossly summarize a very interesting talk, just like I did with Jimmy's above, Larry's talk encouraged the audience to bridge the gap between the high speed networks that one finds in a machine room or within the control plane of a computing or storage cluster and the much slower networks that form our campus intra- and inter-connects.  Larry gave several compelling examples of where very high speed end-to-end network connections enabled capabilities and interactions well beyond the common videoconference and the downright prosaic teleconference.

The event wrapped up with a town hall meeting where the audience was invited to give feedback on the event.  I gave the organizers high marks for putting together a nice blend of presentations and informal networking time (not unlike CNI meetings), and suggested that at future events they invite speakers from off-campus who can present "success stories" of how they are using or deploying CI at their institution.

Friday, October 29, 2010

TRAC: B2.3: SIPs to AIPs

B2.3 Repository has a description of how AIPs are constructed from SIPs.

The repository must be able to show how the preserved object is constructed from the object initially submitted for preservation. In some cases, the AIP and SIP will be almost identical apart from packaging and location, and the repository need only state this. More commonly, complex transformations (e.g., data normalization) may be applied to objects during the ingest process, and a precise description of these actions (i.e., preservation metadata) may be necessary to ensure that the preserved object represents the information in the submitted object. The AIP construction description should include documentation that gives the provenance of the ingest process for each SIP to AIP transformation, typically consisting of an overview of general processing being applied to all such transformations, augmented with description of different classes of such processing and, when applicable, with special transformations that were needed.

Some repositories may need to produce these complex descriptions case by case, in which case diaries or logs of actions taken to produce each AIP will be needed. In these cases, documentation needs to be mapped to individual AIPs, and the mapping needs to be available for examination. Other repositories that can run a more production-line approach may have a description for how each class of incoming object is transformed to produce the AIP. It must be clear which definition applies to which AIP. If, to take a simple example, two separate processes each produce a TIFF file, it must be clear which process was applied to produce a particular TIFF file.

Evidence: Process description documents; documentation of SIP relationship to AIP; clear documentation of how AIPs are derived from SIPs; documentation of standard/process against which normalization occurs; documentation of normalization outcome and how outcome is different from SIP.



Note:  This particular area is under active discussion at ICPSR.  The commentary below describes current processes in place at ICPSR, but these processes are likely to change in the future.

Content enters archival storage at two different points in the lifecycle of a "data processing project" at ICPSR.

When a Deposit is signed, its payload of content (its files) enters archival storage for bit-level preservation.  The system generates a unique ID for each deposited file and also keeps track of each file's MIME type, digital signature (via MD5 hash), original name (sanitized to avoid SQL injection attacks and similar problems), and the date it was received.
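The per-file bookkeeping above can be sketched as a small shell routine.  The function name, ID scheme, and delimited output format are illustrative inventions, not ICPSR's actual implementation:

```shell
# Hypothetical sketch of the metadata captured per deposited file:
# unique ID, MIME type, MD5 signature, sanitized name, receipt date.
record_deposit_file() {
    f="$1"
    id="$(date +%s)-$$"                              # stand-in for a real unique ID
    mime=$(file --brief --mime-type "$f")            # MIME type detection
    md5=$(md5sum "$f" | awk '{print $1}')            # digital signature
    safe=$(basename "$f" | tr -cd 'A-Za-z0-9._-')    # strip risky characters from the name
    printf '%s|%s|%s|%s|%s\n' "$id" "$mime" "$md5" "$safe" "$(date +%F)"
}
```

In production, of course, these values would land in database columns rather than a delimited line.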

A cognizant archive manager assigns each new deposit to a data manager.  In the easiest, most trivial case, a data manager may package the submission for long-term preservation and release on the ICPSR web site with little added work.  The data manager packages the content into an ICPSR "study" object, collaborates with others at ICPSR to author descriptive and preservation metadata, and performs a series of quality control checks, some of which are automated.  Workflow tools record major milestones in the life of the project, and the data manager creates an explicit linkage between deposit and study for future reference.  The system also assigns a unique ID to each of these "processed" files, and captures metadata like digital signature, MIME type, etc.

Thus, at the aggregate level, ICPSR collects strong documentation mapping submission objects to archival objects, but the documentation is much weaker, and often absent, at the more detailed level of files.  For example, there is no explicit mapping between deposit file X and study file Y.

Monday, October 25, 2010

DuraCloud pilot update - October 2010

ICPSR's participation in the DuraCloud pilot is coming along nicely.  While we were not one of the original pilot members, we were one of the "early adopters" in the second round of pilot users.  We've been using DuraCloud pretty actively since the summer.

The collection we selected for pilot purposes is a subset of our archival storage that contains preservation copies of our public-use datasets and documentation.  This is big but not too big (1TB or so), and contains a nice mix of format types, such as plain text, XML, PDF, TIFF, and more.  At the time of this post, we have 72,134 files copied into DuraCloud.

I've been using their Java-based command-line utility called the synctool to synchronize some of our content with DuraCloud.  I found it useful to wrap the utility in a small shell script so that I do not need to specify as many command-line arguments when I invoke it.  I tend to use sixteen threads to synchronize content rather than the default three, and while that places a heavy load on our machine here, it leads to faster synchronization.  The synctool assumes an interactive user, and has a very basic interface for checking status.
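A wrapper along these lines is easy to sketch.  The flag names, environment variables, and paths below are assumptions for illustration, not synctool's documented interface:

```shell
# Hypothetical wrapper around the DuraCloud synctool: supplies default
# arguments so an invocation needs only the content directory.
# All flag names here are illustrative, not synctool's actual options.
duracloud_sync() {
    jar="${SYNCTOOL_JAR:-/usr/local/duracloud/synctool.jar}"
    host="${DURACLOUD_HOST:-example.duracloud.org}"
    threads="${THREADS:-16}"                 # default is 3; 16 syncs faster
    dir="${1:?usage: duracloud_sync <content-dir>}"
    cmd="java -jar $jar --host $host --username $DURACLOUD_USER \
--space-id $SPACE_ID --threads $threads --content-dir $dir"
    if [ -n "$DRY_RUN" ]; then
        echo "$cmd"                          # show the command without running it
    else
        $cmd
    fi
}
```

The point of the wrapper is simply that the thread count and credentials live in one place instead of being retyped at every invocation.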

Overall I like the synctool but wish that it had an option that did not assume an interactive user; something I could run out of cron like I often do with rsync.  Because the underlying storage platform (S3) limits the size of files, synctool is not able to copy some of our larger files.  I wish synctool would "chunk up" the files into more manageable pieces, and sync them for me.  One reason I don't use raw S3 for storage is because of this file size limitation; instead I like to spend a little more money and attach an Elastic Block Store volume (whose snapshots are stored in S3) to a running instance, and then use the filesystem to hide the limitation.  Then I can just use standard tools, like rsync, to copy very large files into the cloud.
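Pending a chunking feature in synctool, splitting oversized files by hand is straightforward.  This sketch assumes S3's 5 GB per-object cap of the time, and the `.chunk.` / `.manifest` naming is invented:

```shell
# Hypothetical work-around for a per-object size limit: split the file
# into fixed-size chunks and record checksums for later reassembly checks.
chunk_file() {
    f="$1"
    limit="${2:-5368709120}"                  # assumed 5 GB cap per object
    split -b "$limit" -d "$f" "$f.chunk."     # numeric suffixes preserve order
    md5sum "$f" "$f".chunk.* > "$f.manifest"  # verify after reassembly
}
# reassembly: cat FILE.chunk.* > FILE && md5sum -c FILE.manifest
```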

The DuraCloud folks have been great collaborators:  extremely responsive, extremely helpful; just a joy to work with.  They've told me about a pair of upcoming features that I'm keen to test.

One, their fixity service will be revamped in the 0.7 release.  It'll have fewer options and features, but will be much easier to use.  I'm eager to see how this compares to a low-tech approach I use for our archival storage: weekly filesystem scans + MD5 calculations compared to values stored in a database.
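The low-tech approach can be sketched in a few lines of shell; here a flat md5sum-format manifest stands in for the database of stored values:

```shell
# Sketch of a weekly fixity scan: recompute MD5s and compare against
# known-good values (a manifest file standing in for our database).
fixity_scan() {
    manifest="$1"
    if md5sum -c --quiet "$manifest"; then   # --quiet: report mismatches only
        echo "fixity OK"
    else
        echo "FIXITY FAILURE" >&2            # a real job would alert someone
        return 1
    fi
}
```

Run out of cron on a weekly schedule, something like this is cheap insurance against silent corruption.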

Two, their replicate-on-demand service is coming, and ICPSR will be the first (I think) test case to replicate its content from S3 to Azure's storage service.  I have not had the opportunity to use Microsoft's cloud services at all, and am looking forward to seeing how it performs.

Friday, October 22, 2010

TRAC: B2.2: Are our AIPs adequate to the job?

B2.2 Repository has a definition of each AIP (or class) that is adequate to fit long-term preservation needs.

In many cases, if the definitions required by B2.1 exist, this requirement is also satisfied, but it may also be necessary for the definitions to say something about the semantics or intended use of the AIPs if this could affect long-term preservation decisions. For example, say two repositories both only preserve digital still images, both using multi-image TIFF files as their preservation format. Repository 1 consists entirely of real-world photographic images intended for viewing by people and has a single definition covering all of its AIPs. (The definition may refer to a local or external definition of the TIFF format.) Repository 2 contains some images, such as medical x-rays, that are intended for computer analysis rather than viewing by the human eye, and other images that are like those in Repository 1. Repository 2 should perhaps define two classes of AIPs, even though it only uses one storage format for both. A future preservation action may depend on the intended use of the image—an action that changes the bit-depth of the image in a way that is not perceivable to the human eye may be satisfactory for real-world photographs but not for medical images, for example.

Evidence: Documentation that relates the AIP component’s contents to the related preservation needs of the repository, with enough detail for the repository's providers and consumers to be confident that the significant properties of AIPs will be preserved.



This item is somewhat difficult to discuss without a crisp set of definitions for AIPs.  However, given the longevity of ICPSR as a digital repository, and given the track record (nearly fifty years) of preserving and delivering content, the empirical evidence would seem to indicate that the models and containers we are using for our content are a good fit for our long-term preservation needs.

In some ways ICPSR has a relatively easy task here since our content is pretty homogeneous (survey data and documentation) and we are able to normalize it into very durable formats like plain text and TIFF.  And because our content is all "born digital" and delivered digitally, there are fewer opportunities for things to go really awry.

We also create a great deal of descriptive metadata that we bundle with our content, and our content is highly curated compared to, say, an enormous stream of data coming from highway sensors or satellites.  In addition to making items easier to find and use, this curation may also help keep them more durable as a side-effect.

As part of an NSF Interop/EAGER grant we're defining Fedora Objects for our most common content types, and for each object, we are also working through the specifications of the AIP.  My sense is that this will help us formalize some of our current practices, and will help illuminate any gaps where we should be collecting and saving metadata, but aren't today.  And that will help further inform the response to this TRAC item.

Wednesday, October 20, 2010

I'll Miss Swivel

A few years ago the ICPSR Director at the time (Myron Gutmann) told me about a new data visualization service he had come across:  Swivel.  I found the site, created an account, and started playing with the tools they made available to visualize data.

It seemed like a nice little service, but not much of a competitor for the type of clients that ICPSR typically serves.  For one thing, all of the datasets needed to fit into Excel.

I did take the opportunity to create a public-use dataset of my own.  It was just a little toy dataset that had one row for each year, and where the columns were the annual dues for our neighborhood association for that year, that same amount of money expressed in 1993 CPI dollars, and the maximum amount the dues could have been for that year ($200 in 1993 adjusted by the CPI).  This made it easy to create graphs and images that showed how little the neighborhood dues had gone up over the years.

However, Swivel is no more.  Navigating to the home page of their web site just times out.  I found a nice piece by Robert Kosara where he talks to the founders about what Swivel was, and where things went wrong.  It is a short, interesting read:  the punchline is that they just didn't have any customers.

I think I could probably create the same dataset at Zoho or on Google Docs, but neither one of those has quite the same nice set of features for visualizing the data as Swivel did.

Friday, October 15, 2010

TRAC: B2.1: AIPs

B2.1 Repository has an identifiable, written definition for each AIP or class of information preserved by the repository.

An AIP contains these key components: the primary data object to be preserved, its supporting Representation Information (format and meaning of the format elements), and the various categories of Preservation Description Information (PDI) that also need to be associated with the primary data object: Fixity, Provenance, Context, and Reference. There should be a definition of how these categories of information are bound together and/or related in such a way that they can always be found and managed within the archive.

It is merely necessary that definitions exist for each AIP, or class of AIP if there are many instances of the same type. Repositories that store a wide variety of object types may need a specific definition for each AIP they hold, but it is expected that most repositories will establish class descriptions that apply to many AIPs. It must be possible to determine which definition applies to which AIP.

While this requirement is primarily concerned with issues of identifying and binding key components of the AIP, B2.2 places more stringent conditions on the content of the key components to ensure that they are fit for the intended purpose. Separating the two criteria is important, particularly if a repository does not satisfy one of them. It is important to know whether some or all AIPs are not defined, or that the definitions exist but are not adequate.

Evidence: Documentation identifying each class of AIP and describing how each is implemented  within the repository. Implementations may, for example, involve some combination of files,  databases, and/or documents.



Does anyone have written definitions for their AIPs?

I found a preliminary design document at the Library of Congress via a Google search that had a very long, very complete description of a proposed AIP for image-type content.  But in general it seems hard to find real world examples of AIPs that are in use at working archives.  Perhaps they are out there, but published in such a way that makes it difficult to discover them?

Here is my strawman stab at defining an AIP for the bulk of ICPSR's content:  social science research data and documentation.  This is very much a work-in-progress and should not be read as any sort of official document.  Here goes:

Definition of an Archival Information Package (AIP) for a Social Science Study

We define an AIP for a social science study as a list of files where each file has supporting representation information in the form of:
  • a role (data, codebook, survey instrument, etc)
  • a format (we use MIME type)
 and has the following Preservation Description Information:
  • Provenance.  We link processed studies to initial deposits at aggregation-level, and we also collect processing history in our internal Study Tracking System which records who performed actions on the content, and major milestones in its lifecycle at ICPSR.
  • Context.  We store related content together in the filesystem, and a good deal of the context embedded in both the name of each file and in a relational database.  While not in production, we are evaluating the use of RDF/XML as a method for recording and exposing contextual information.
  • Reference.  Each file has a unique ID.
  • Fixity.  We use an MD5 hash at file-level to capture and check integrity.
So there's the strawman.  To help guide my description of the PDI, I used these definitions from the Open Archival Information System (OAIS) specification:
– Provenance describes the source of the Content Information, who has had custody of it since its origination, and its history (including processing history).
– Context describes how the Content Information relates to other information outside the Information Package. For example, it would describe why the Content Information was produced, and it may include a description of how it relates to another Content Information object that is available.
– Reference provides one or more identifiers, or systems of identifiers, by which the Content Information may be uniquely identified. Examples include an ISBN number for a book, or a set of attributes that distinguish one instance of Content Information from  another.
– Fixity provides a wrapper, or protective shield, that protects the Content Information from undocumented alteration. For example, it may involve a check sum over the Content Information of a digital Information Package.

Wednesday, October 13, 2010

Designing Storage Architectures for Digital Preservation - Day Two, Part Two

The final session of the conference featured six speakers.

  1. Jimmy Lin (University of Maryland) is spending some time at Twitter, and outlined their technology stack: hardware, HDFS, Hadoop, and Pig, which he called the "perl/python of big data."
  2. Mike Smorul (University of Maryland) gave an overview of their "time machine for the web" and the challenges of managing a web archive
  3. John Johnson (Pacific Northwest National Laboratory) proposed that the scientific process has changed in that data produced by computation is now one of the drivers for creating and testing new theories
  4. Leslie Johnston (Library of Congress) spoke briefly about an IBM emerging technology called "big sheets"
  5. Dave Fellinger (DataDirect Networks) urged the audience to "don't be afraid to count machine cycles" when analyzing storage systems for bottlenecks that increase service latency
  6. Kevin Kambach (Oracle) finished the session with industry notes about large data
The day then concluded with two final talks.  One was from Subodh Kulkarni (Imation) who gave an overview of storage technology from magnetic tape to hard disk, and the other was from David Rosenthal (LOCKSS) who gave an abbreviated version of his iPres talk, "How Green is Digital Preservation?"  David mentioned a very interesting, large-scale, low-power computing and storage platform being produced by a company called Seamicro.

Tuesday, October 12, 2010

Designing Storage Architectures for Digital Preservation - Day Two, Part One

The Library of Congress hosted a two-day meeting on September 27 and 28, 2010 to talk about technologies, strategies, and techniques for managing storage.  Like the 2009 meeting, which I also attended, the meeting was heavily focused on IT and the costs of the technology.  This was another interesting and valuable meeting, but it always feels like we don't address the elephant in the room:  the cost of all of the people who curate content, create metadata, manage collections, assess content, etc.  This is the report from the second day of the conference.

The morning session of the second day of the conference featured six speakers, many from industry:  
  1. Micah Beck (University of Tennessee - Knoxville) made an argument for "lossy preservation" as a strategy for achieving "good enough" digital preservation in an imperfect world, and suggested that developing techniques for using damaged objects should be part of the archivists' toolkit.
  2. Mike Vamdamme (Fujifilm) gave an overview of their StorageIQ product as a system to augment the reporting and metadata available from conventional tape-based backup and storage systems
  3. Hal Woods (HP) spoke about StorageWorks
  4. Mootaz Elnozahy (IBM) spoke about trends in reliable storage over the next 5-10 years, and predicted that power management requirements will stress hardware, increasing failure rates (i.e., lowering MTBF) and raising the soft error rates of storage.
  5. Dave Anderson (Seagate) also spoke about near-term trends such as a shift to 3TB disks and 2.5" form-factor drives.  He does not see solid state as a factor in the market at this time.
  6. Mike Smorul (University of Maryland) gave a very brief overview of ACE.
The next session featured four more speakers:
  1. Joe Zimm (EMC) was part of Data Domain before being acquired by EMC, and spoke about EMC's block-level de-duplication technology.
  2. Mike Davis (Dell) was part of Ocarina before being acquired by Dell, and spoke about their technology for de-duplication.
  3. Steve Vranyes (Symantec) opined that compression will play a more significant role than de-duplication in easing storage requirements for archives because the use case is very different.
  4. Raghavendra Rao (Cisco) introduced Cisco's network layer de-duplicator.  This seemed like an odd fit in some ways compared to the other products.
Up next - the final post in this series:  the second half of Day Two.

Friday, October 8, 2010

TRAC: B1.8: Ingest records

B1.8 Repository has contemporaneous records of actions and administration processes that are relevant to preservation (Ingest: content acquisition).

These records must be created on or about the time of the actions they refer to and are related to actions taken during the Ingest: content acquisition process. The records may be automated or may be written by individuals, depending on the nature of the actions described. Where community or international standards are used, such as PREMIS (2005), the repository must demonstrate that all relevant actions are carried through.

Evidence: Written documentation of decisions and/or action taken; preservation metadata logged, stored, and linked to pertinent digital objects.



ICPSR's main business process is to take deposited materials (usually studies) and prepare them for preservation and dissemination (also as studies).  We use two internal webapps to collect, display, and set milestones and records during this process.

One system is called the Deposit Viewer, although it might be more properly called the Deposit Manager.  ICPSR staff use it to change the status of a deposit, to assign a deposit to a worker, to read or create metadata about the deposit, and to link deposits to studies.  This system also allows (and sometimes requires) staff to make comments in a diary associated with the deposit.



The other system is called the Study Tracking System, and like the Deposit Viewer, it collects milestones and diary entries during the ingest lifecycle.

The records are stored in a relational database.  This ensures that the content is readily available to the large corpus of workflow tools we've created.  We've been looking at PREMIS as a container for exposing these records to the outside world (where appropriate - like to an auditor perhaps), and for preserving them.  I have a personal interest in PREMIS and took a shot at creating PREMIS XML for our ingest records.  I'd be interested in comparing notes with others who have been working on mapping their internal ingest records to a schema like PREMIS.
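For flavor, here is a hedged sketch of what one ingest milestone might look like when expressed as a PREMIS 2.0 event.  The identifier scheme and values are invented for illustration, not ICPSR's actual records:

```xml
<!-- Illustrative only: a Deposit Viewer milestone expressed as a PREMIS event -->
<premis:event xmlns:premis="info:lc/xmlns/premis-v2">
  <premis:eventIdentifier>
    <premis:eventIdentifierType>local</premis:eventIdentifierType>
    <premis:eventIdentifierValue>deposit-12345-signed</premis:eventIdentifierValue>
  </premis:eventIdentifier>
  <premis:eventType>ingestion</premis:eventType>
  <premis:eventDateTime>2010-10-08T09:30:00</premis:eventDateTime>
  <premis:eventDetail>Deposit signed and assigned to a data manager</premis:eventDetail>
</premis:event>
```

The appeal of a container like this is that the same record serves both an auditor reading it today and a future archive trying to reconstruct what we did.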

Wednesday, October 6, 2010

WHEN ZOMBIES ATTACK!

This is my favorite fun paper that I've read this year.

When Zombies Attack!

Tuesday, October 5, 2010

NSF Social, Behavioral and Economic Directorate suggests ICPSR for data archiving

The much anticipated NSF guidelines on data management were released earlier this week.  One highlight (especially for those of us working at ICPSR) is that the NSF explicitly recognizes ICPSR as a good option for archiving quantitative social science data.

The SBE Directorate supplements agency-wide guidelines with some of its own, and has this to say:


Quantitative Social and Economic Data Sets 
For appropriate data sets, researchers should be prepared to place their data in fully cleaned and documented form in a data archive or library within one year after the expiration of an award. Before an award is made, investigators will be asked to specify in writing where they plan to deposit their data set(s). This may be the Inter-University Consortium for Political and Social Research (ICPSR) at the University of Michigan, but other public archives are also available. The investigator should consult with the program officer about the most appropriate archive for any particular data set.

Monday, October 4, 2010

Can an iPad replace a laptop on a business trip?

Walt Mossberg of the Wall Street Journal stole my idea!

Well, not really.  But he did recently write a nice column about his experience taking an iPad on a "working vacation" rather than a laptop.  I did the same thing for last week's trip to DC.

Like Mr Mossberg I wanted something that would let me keep in touch with the office, and that would help me pass some time at the airport and on the airplane.  I did not need Office-style applications since I was not intending to work with spreadsheets or deliver a presentation.

And like Mr Mossberg I too found that having the iPad alone was just fine for everything I wanted to do.  Safari gave me access to Gmail, the Kindle app gave me access to books to read, and a few other apps (e.g., Maps) filled in the rest of my needs.  Had I needed to deliver a PowerPoint deck, however, I would have brought a way-way-too-slow HP Mini netbook instead.

And, also like Mr Mossberg, I too spoke with several folks in the airport - especially the TSA checkpoints - about the iPad, giving it pretty rave reviews.

Friday, October 1, 2010

TRAC: B1.7: Formal acceptance of deposits

B1.7 Repository can demonstrate when preservation responsibility is formally accepted for the contents of the submitted data objects (i.e., SIPs).

A key component of a repository’s responsibility to gain sufficient control of digital objects is the point when the repository manages the bitstream. For some repositories this will occur when it first receives the SIP transformation, for others it may not occur until the ingested SIP is transformed into an AIP. At this point, the repository formally accepts preservation responsibility of digital objects from the depositor.

Repositories that report back to their depositors generally will mark this acceptance with some form of notification to the depositor. (This may depend on repository responsibilities as designated in the depositor agreement.) A repository may mark the transfer by sending a formal document, often a final signed copy of the transfer agreement, back to the depositor signifying the completion of the transformation from SIP to AIP process. Other approaches are equally acceptable. Brief daily updates may be generated by a repository that only provides annual formal transfer reports.

Evidence: Submission agreements/deposit agreements/deeds of gift; confirmation receipt sent back to producer.



My sense is that this requirement has two very different stories at ICPSR.

One story is pretty simple.  When the depositor signs his/her deposit, custody transfers to ICPSR.  We then work the deposit until we have a post-processed version suitable for digital preservation and versions suitable for delivery on the web site.

The other story is more complicated.  The workflow allows one to un-sign a deposit.  And so the custody of the object could transfer from the depositor to ICPSR (at the initial signing), and then back to the depositor (upon un-signing).  This can even happen in a case where the deposit has been processed and released on the ICPSR web site.  The workflow records this sort of action, and so it is well documented, but does leave open a degenerate case where content is available on the web without the corresponding original submission in archival storage.

Thursday, September 30, 2010

Designing Storage Architectures for Digital Preservation - Day One, Part Two

The second session of the first day featured technologists from higher education who either operate large archives, or who build systems for operating an archive.

Cory Snavely (University of Michigan, Hathitrust) gave a brief overview of Hathitrust, a repository of digital content shared by many of the Big Ten schools and a few other partners.

Brad McLean (Duraspace) reported on DuraCloud and results from the initial pilot partners.  (ICPSR is part of the current pilot, but was not a member of the original, smaller pilot program.)  He noted these concerns about using the cloud for digital preservation:
  1. Some services (such as Amazon's S3) have limits on the size of objects (files)
  2. Bandwidth limits on a per-server basis can impede function and performance
  3. Large files are troublesome
  4. Performance across the cloud can vary widely
  5. (File) naming matters; some storage services limit the type of characters in a name
Brad reiterated a comment made by several others:  A standard for checksums would be good to have.

Matt Schulz (MetaArchive) updated us on the MetaArchive, including a current partnership with Chronopolis.

David Minor (San Diego Supercomputer Center) updated us on the Chronopolis project.  David noted that SDSC is reimplementing its data center, and described three levels of storage in its future architecture:
  1. High-performance storage for scratch content
  2. Traditional filesystem storage
  3. Archival storage
The follow-on discussion included conversations about the right type of interface to access content in archival storage (POSIX, RESTful, object-oriented, etc); the trade-off between using long-lived media and systems for digital preservation v. taking advantage of advances in technology by using short-lived media and systems; and, David Rosenthal reminded everyone that we "... cannot test large systems for zero media failures."

I'll write-up my notes from Day Two early next week.

Wednesday, September 29, 2010

Designing Storage Architectures for Digital Preservation - Day One, Part One

I attended an event on Monday and Tuesday of last week that was hosted by the Library of Congress: Designing Storage Architectures for Digital Preservation.  I also attended the event last year, and so this was my second time attending.

Like last time there were many speakers, each giving a five minute presentation.  Unlike a TED talk where the presentation materials are built specifically to fit well within five minutes, many speakers had conventional slide decks, and raced through them quickly.  Those tended to be the weaker talks since the scope of the material was far too broad for the time allotted.  After a series of presentations there would be group discussion for 15-30 minutes which ran the gamut from interesting and provocative observations to chasing down rabbit holes.

I know the LoC will post complete information about the event, but here is my abbreviated version.  I've tried to hit what I considered to be the highlights, and so the reader should know that this report isn't complete.

The session opened with a video arguing that the Internet gives us more opportunity to innovate because it lowers the barrier for one's "hunches" to "collide" with another's, and that innovation occurs when two or more good ideas come together.  Henry Newman then gave a framing overview for the meeting that included these interesting points:

  1. IT components are changing/improving at different rates; for example, processors are getting faster more quickly than buses are getting faster
  2. The preservation community and the IT community use different language to talk about archival storage
  3. Preservation TCO is not well understood
  4. The consumer market is driving the storage industry, not the enterprise market
The first of two sessions featured "heavy users" who spoke about some of the challenges they faced.  The speakers included Ian Soboroff (NIST), Mark Phillips (University of North Texas), Andy Maltz (Academy of Motion Picture Arts and Sciences), Ethan Miller (University of California - Santa Cruz), Arcot "Raja" Rajasekar (San Diego Supercomputer Center), Tom Garnett (Biodiversity Heritage Library), Barbara Taranto (New York Public Library), Martin Kalfatovic (Smithsonian Institution), and Tab Butler (Major League Baseball Network).  Highlights of their presentations and the follow-on discussion:
  • Experienced recent sea change where it was no longer possible to forecast storage needs whatsoever
  • "Archival storage... whatever that is."
  • Pergamum tome technology looks very interesting for smart, low-power storage
  • iRODS main components:  data server cloud, metadata catalog, and the rule engine
  • "Open access is a form of preservation."
  • If one needs N amount of space for one copy of archival storage, one also needs 2 x N or 3 x N for the ingest process
  • The "long now"
  • The MLB Network data archive will consume 9000 LTO-4 tapes for storage in 2010.
  • "Digital preservation sounds like hoarding."
  • "After our content was indexed by Google, usage went up 10x."
  • Data recovery from corrupted files is a digital preservation concern.
  • Forensics of a format migration is an effective tool for finding problems in a repository.
Next:  the second session of Day One.

Friday, September 24, 2010

TRAC: B1.6: Communicating with depositors


B1.6 Repository provides producer/depositor with appropriate responses at predefined points during the ingest processes.

Based on the initial processing plan and agreement between the repository and the producer/depositor, the repository must provide the producer/depositor with progress reports at specific, predetermined points throughout the ingest process. Responses can include initial ingest receipts, or receipts that confirm that the AIP has been created and stored. Repository responses can range from nothing at all to predetermined, periodic reports of the ingest completeness and correctness, error reports and any final transfer of custody document. Depositors can request further information on an ad hoc basis when the previously agreed upon reports are insufficient.

Evidence: Submission agreements/deposit agreements/deeds of gift; workflow documentation; standard operating procedures; evidence of reporting back.



ICPSR updates the depositor during the ingest process at two main points.

One, after the deposit is signed, ICPSR generates an inventory of the deposited content, and communicates this via email. This gives the depositor the opportunity to identify any content that was uploaded unintentionally or that may have become corrupted. These inventory reports are generated automatically by the deposit system.

Two, if deposited material is later used to produce an ICPSR study, the depositor is notified when that study is made available on ICPSR's web site, and when a normalized version of the content is moved into archival storage.
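The automated inventory report described above can be approximated with a short script.  This is a hypothetical sketch, not ICPSR's actual deposit system; the directory-walk logic and the choice of MD5 are assumptions:

```python
import hashlib
import os

def deposit_inventory(deposit_dir: str) -> list[tuple[str, int, str]]:
    """Walk a deposit directory and return (relative path, size, md5)
    for each file -- the kind of inventory a depositor could use to
    spot unintended or corrupted uploads."""
    rows = []
    for root, _dirs, files in os.walk(deposit_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            rows.append((os.path.relpath(path, deposit_dir),
                         os.path.getsize(path), digest))
    return rows
```

Emailing the resulting rows to the depositor gives both parties a fixed point of reference for what was actually received.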

Thursday, September 23, 2010

ICPSR web site will be unavailable briefly on Monday, September 27, 2010

We've scheduled some maintenance around noon (EDT) on Monday, September 27, 2010.  We normally like to perform this type of work during off-hours, but this particular task is likely to be short-lived (10-15 minutes) and is best performed when we have our full team of software developers available.

My apologies in advance for any inconvenience.

P3P - Platform for Privacy Preferences Project

I came across an interesting paper on P3P (the Platform for Privacy Preferences Project), a W3C standard for expressing the privacy policies of a web site. The paper is from the CMU CyLab, and can be found here (PDF format).

A primary user of P3P is the Internet Explorer browser. It uses a "short form" of the policy to make decisions about whether a web site meets the security criteria one may set in the browser. Since most people never bother to configure different security levels for different sites, in practice any P3P descriptions that match "Medium security" will pass the security check.
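For readers unfamiliar with the "short form," it is a compact policy sent in an HTTP response header, e.g. `P3P: CP="CAO PSA OUR"`, where each token abbreviates a statement from the full policy. A minimal sketch of extracting those tokens (the parsing code is illustrative, not from the paper):

```python
import re

def parse_compact_policy(header_value: str) -> list[str]:
    """Extract the token list from a P3P header's CP attribute,
    e.g. 'CP="CAO PSA OUR"' -> ['CAO', 'PSA', 'OUR']."""
    match = re.search(r'CP\s*=\s*"([^"]*)"', header_value)
    return match.group(1).split() if match else []

print(parse_compact_policy('CP="CAO PSA OUR"'))  # ['CAO', 'PSA', 'OUR']
```

A browser like IE compares these tokens against its configured security level; a site that emits a bogus but well-formed CP string can therefore pass the check while collecting data anyway, which is exactly the abuse the paper documents.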

The brief summary of the paper is that many of the top sites do not use P3P. Or, if they do use it, they make mistakes in their policies that confuse browsers. Worse still, some sites seem to use P3P to actively trick browsers into thinking the site gathers no private information when in fact it does.

The paper is long, but many pages are part of an appendix. The main section of the paper is relatively short, well written, and is an interesting read.

Wednesday, September 22, 2010

DuraCloud fixity service testing

Our DuraCloud pilot test is going well. We have uploaded a test collection of nearly 70k files, representing that portion of our archival content that contains public-use datasets. (The datasets are public-use, but our licensing terms restrict access to some of these to our member institutions.)

To the left you can see a snapshot from the DurAdmin webapp that one uses to manage content. I've been using this webapp to view content, check progress, and download files. I've been using a command-line utility called synctool to copy content from ICPSR into DuraCloud and keep it synchronized.

The image to the left is the right-side panel from the Services tab of the DurAdmin webapp. I've deployed the Fixity service, and am using it to check the bit-level integrity of the content.

I started the service earlier this morning, and it still has quite a bit of work left to do. The processing-status line shows that the service has started and has checked about 4,300 of the files so far.
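Conceptually, what the Fixity service is doing is simple: recompute each file's checksum and compare it against the value recorded at ingest. A local equivalent might look like the sketch below (this is my own illustration, not DuraCloud's implementation; the use of MD5 is an assumption based on what the webapp reports):

```python
import hashlib

def verify_fixity(path: str, expected_md5: str) -> bool:
    """Recompute a file's MD5 in 1 MB chunks and compare it against
    the checksum recorded at ingest; a mismatch signals bit-level
    damage somewhere between upload and now."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5
```

Running this over ~70k files is exactly why the service takes a while: every byte of the collection has to be read back from storage.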

Monday, September 20, 2010

ICPSR increases security of its web transactions

We'll be making a small, but important, configuration change on our web server this week. For a long time we've allowed so-called "weak" ciphers to be used with HTTP connections over SSL (aka HTTPS). This was good for visitors with very old browsers; so old that they did not support stronger SSL ciphers. But it is bad news for the rest of us running more recent software, since it allows a connection to fall back to less robust encryption when exchanging content via HTTPS.
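The change amounts to restricting the cipher list the server will accept during the TLS handshake. The same idea can be illustrated from the client side with Python's `ssl` module; the cipher string below is an example of the style of restriction, not our exact server configuration:

```python
import ssl

# Build a context that refuses export-grade, null, and other "weak"
# cipher suites -- analogous to tightening a web server's cipher
# suite directive so old browsers can no longer negotiate them.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.set_ciphers("HIGH:!aNULL:!eNULL:!EXPORT:!LOW:!MD5")
enabled = [c["name"] for c in ctx.get_ciphers()]
print(len(enabled), "strong cipher suites remain enabled")
```

Clients that only support the excluded suites simply fail the handshake, which is the trade-off we accepted when deciding to drop weak-cipher support.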

We've been running this newer configuration for many months on a web server we use for staging new content. The many browsers and platforms we use to test new web pages and software work well with this configuration, and so we've decided to move it into the production environment.

Wikipedia has a nice page that describes the technical details behind the various ciphers that are used with SSL (and its successor TLS).

Friday, September 17, 2010

TRAC: B1.5: Gaining control of deposits

B1.5 Repository obtains sufficient physical control over the digital objects to preserve them.

The repository must obtain complete control of the bits of the digital objects conveyed with each SIP. For example, some SIPs may only reference digital objects and in such cases the repository must get the referenced digital objects if they constitute part of the object that the repository has committed to conserve. This will not always be the case: scholarly papers in a repository may contain references to other papers that are held in a different repository, or not held anywhere at all, and harvested Web sites may contain references to material in the same site or different sites that the repository has chosen not to capture or was unable to capture.

Evidence: Submission agreements/deposit agreements/deeds of gift; workflow documents; system log files from the system performing ingest procedures; logs of files captured during Web harvesting.



This requirement is fairly straightforward for ICPSR, given the type of content that we collect and curate. We gain complete control of the entire deposit, including both the research data and the documentation. The deposit may also contain closely related materials, such as the questionnaire that was used to collect the data.

That said, a deposit may be related to other objects outside the scope of ICPSR, such as publications related to the data. In this case ICPSR is not expecting to find such content in the deposit, nor would we tend to curate it even if it was present.
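One way to make "complete control of the bits" operational is to verify that every object a SIP's manifest references is physically present in the transfer. A hypothetical check (the manifest-as-path-list format is invented for illustration and is not ICPSR's actual ingest tooling):

```python
import os

def missing_objects(manifest_paths: list[str], deposit_dir: str) -> list[str]:
    """Return manifest entries that are referenced but not physically
    present in the deposit -- i.e., objects the repository does not
    yet control and must either obtain or explicitly exclude."""
    return [p for p in manifest_paths
            if not os.path.isfile(os.path.join(deposit_dir, p))]
```

A non-empty result would prompt either a follow-up request to the depositor or a documented decision that the referenced object is out of scope, as with external publications.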

Tuesday, September 14, 2010

Amazon introduces new "micro" instances


ICPSR is taking advantage of a new "micro"-sized virtual machine offered by Amazon Web Services (AWS). Amazon describes the new instance this way:
Micro Instance 613 MB of memory, up to 2 ECUs (for short periodic bursts), EBS storage only, 32-bit or 64-bit platform
This looked like a good fit for the "stealth" DNS server that we run in Amazon's cloud, and so we converted it from a Small Instance - Reserved ($350 for a three year term + $0.03/hour) to a Micro Instance - Reserved ($82 for a three year term + $0.007/hour).
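Back-of-the-envelope, the switch cuts the three-year cost by roughly three-quarters, assuming the instance runs continuously (26,280 hours over three years, ignoring leap days):

```python
HOURS_3YR = 3 * 365 * 24  # 26,280 hours of continuous operation

def three_year_cost(upfront: float, hourly: float) -> float:
    """Total cost of a reserved instance running nonstop for 3 years."""
    return upfront + hourly * HOURS_3YR

small = three_year_cost(350.00, 0.03)   # ~$1,138.40
micro = three_year_cost(82.00, 0.007)   # ~$265.96
print(f"small: ${small:.2f}, micro: ${micro:.2f}, "
      f"savings: {100 * (1 - micro / small):.0f}%")
```

For a lightly loaded service like a stealth DNS server, the smaller memory and burst-only CPU of the micro instance are an easy trade for that saving.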

We have other lightly used instances running in Amazon's cloud, and we'll likely convert them over too.