Technology at ICPSR: February 2011

Friday, February 25, 2011

TRAC: B2.13: Documenting preservation actions

B2.13 Repository has contemporaneous records of actions and administration processes that are relevant to preservation (AIP creation).

These records must be created on or about the time of the actions they refer to and are related to actions associated with AIP creation. The records may be automated or may be written by individuals, depending on the nature of the actions described. Where community or international standards are used, such as PREMIS (2005), the repository must demonstrate that all relevant actions are carried through.

Evidence: Written documentation of decisions and/or action taken; preservation metadata logged, stored, and linked to pertinent digital objects.

ICPSR manages the record of administrative actions in three different ways.

At the micro level, actions performed on individual files, or collections of files, are recorded in an internal workflow system called the Study Tracking System. This records key events and milestones as they occur, and also has a mechanism for free-form annotations (diary entries).

Also at the micro level, ICPSR data managers often produce a consolidated list of commands (syntax files) that documents how we have transformed the content we received. In an idealized case one could take an original survey data file, apply the syntax, and the output would be the file we preserve and distribute.

At a more macro level ICPSR records important events and documents their results in an internal content management system. (We're currently using Drupal for this, but any content management system would be suitable for this purpose.) This system would capture information such as the deployment of a new archival storage location.

Friday, February 18, 2011

TRAC: B2.12: Providing an audit capability

B2.12 Repository provides an independent mechanism for audit of the integrity of the repository collection/content.

In general, it is likely that a repository that meets all the previous criteria will satisfy this one without needing to demonstrate anything more. As a separate requirement, it demonstrates the importance of being able to audit the integrity of the collection as a whole.

For example, if a repository claims to have all e-mail sent or received by The Yoyodyne Corporation between 1985 and 2005, it has been required to show that:

The content it holds came from Yoyodyne’s e-mail servers.
It is all correctly transformed into a preservation format.
Each monthly SIP of e-mail has been correctly preserved, including original unique identifiers such as Message-IDs.

However it may still have no way of showing whether this really represents all of Yoyodyne’s email. For example, if there is a three-day period with no messages in the repository, is this because Yoyodyne was shut down for those three days, or was the e-mail lost before the SIP was constructed? This case could be resolved by the repository amending its description of the collection, but other cases may not be so straightforward.

A familiar mechanism from the world of traditional materials in libraries and archives is an accessions or acquisitions register that is independent of other catalog metadata. A repository should be able to show, for each item in its accessions register, which AIP(s) contain content from that item. Alternatively, it may need to show that there is no AIP for an item, either because ingest is still in progress, or because the item was rejected for some reason. Conversely, any AIP should be able to be related to an entry in the acquisitions register.

Evidence: Documentation provided for B2.1 through B2.6; documented agreements negotiated between the producer and the repository (see B 1.1-B1.9); logs of material received and associated action (receipt, action, etc.) dates; logs of periodic checks.

ICPSR meets this requirement by maintaining an accession register, which is a very long (and always growing) list of the files that we preserve. A weekly automated job uses this list as input and checks whether each item is still available in archival storage and whether the item is intact (i.e., its digital signature has not changed).
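To make the shape of that weekly job concrete, here is a minimal sketch in Python. It is not ICPSR's actual code. It assumes the register can be represented as a mapping from a relative file path to a stored SHA-256 checksum (the post does not say which algorithm or register format is used), and the names `sha256_of` and `audit` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(register, storage_root):
    """Check every register entry against archival storage.

    `register` is assumed to map a relative path to its expected digest.
    Returns a list of (relative_path, problem) tuples; empty means all is well.
    """
    problems = []
    for relative_path, expected_digest in register.items():
        target = Path(storage_root) / relative_path
        if not target.is_file():
            # The item is no longer available in archival storage.
            problems.append((relative_path, "missing"))
        elif sha256_of(target) != expected_digest:
            # The item is present but its content has changed.
            problems.append((relative_path, "fixity mismatch"))
    return problems
```

A scheduler such as cron could run `audit` once a week and log or mail the returned list, which would also supply the "logs of periodic checks" that the evidence statement asks for.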

Friday, February 4, 2011

TRAC: B2.11: AIP Verification

B2.11 Repository verifies each AIP for completeness and correctness at the point it is generated.

If the repository has a standard process to verify SIPs for either or both completeness and correctness and a demonstrably correct process for transforming SIPs into AIPs, then it simply needs to demonstrate that the initial checks were carried out successfully and that the transformation process was carried out without indicating errors. Repositories that must create unique processes for many of their AIPs will also need to generate unique methods for validating the completeness and correctness of AIPs. This may include performing tests of some sort on the content of the AIP that can be compared with tests on the SIP. Such tests might be simple (counting the number of records in a file, or performing some simple statistical measure such as calculating the brightness histogram of an original and preserved image), but they might be complex or contain some subjective elements.

Documentation should describe how completeness and correctness of SIPs and AIPs are ensured, starting with ensuring receipt from the producer and continuing through AIP creation and supporting long-term preservation. Example approaches include the use of checksums, testing that checksums are still correct at various points during ingest and preservation, logs that such checks have been made, and any special tests that may be required for a particular SIP/AIP instance or class.

Evidence: Description of the procedure that verifies completeness and correctness; logs of the procedure.

A few of my earlier posts have described the deposit system at ICPSR, and so with this post I would like to focus on the AIP. My sense is that the ICPSR package that is closest to the AIP is what ICPSR insiders would call "the turnover directory."

At least half of the ICPSR staff fall into a category called "data processors" or "data managers." These are the folks who take the deposits we receive and turn them into content that we can preserve and content that we deliver on our web site. They work in different teams, and are funded through a variety of mechanisms - membership dues, long-standing contracts with federal agencies, and even inter-agency agreements. Some of them work on a large number of collections each year, and others work on a very small number of collections. But all of them perform a series of work processes that end with a collection of content loaded into a single directory. This is the turnover directory.

Once the content has been pulled into this single location, the data manager runs a tool which performs a broad variety of jobs, but which boils down to two essential tasks: conformance checking and ingest.

The conformance checking is at the heart of the TRAC requirement. This is where the content that we are about to ingest goes through a variety of checks; these checks implement (in software) a laundry list of business rules and requirements which are documented on ICPSR's intranet, and are managed by a committee.
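One way to picture "business rules implemented in software" is as a list of small check functions that each inspect the turnover directory and report failures. The sketch below is only an illustration under that assumption. The two rules shown (no empty files, at least one PDF) are hypothetical examples, not ICPSR's documented rules, and the function names are made up for this post.

```python
from pathlib import Path


def no_empty_files(directory):
    """Hypothetical rule: every file in the directory must contain data."""
    return [f"{p.name}: file is empty"
            for p in sorted(Path(directory).iterdir())
            if p.is_file() and p.stat().st_size == 0]


def has_documentation(directory):
    """Hypothetical rule: at least one PDF must accompany the data."""
    has_pdf = any(p.suffix.lower() == ".pdf" for p in Path(directory).iterdir())
    return [] if has_pdf else ["no PDF documentation found"]


# The rule list is plain data, so a committee decision to add or retire
# a rule becomes a one-line change here.
RULES = [no_empty_files, has_documentation]


def check_conformance(directory, rules=RULES):
    """Run every rule; an empty result means nothing blocks ingest."""
    failures = []
    for rule in rules:
        failures.extend(rule(directory))
    return failures
```

Keeping each rule as its own function makes it straightforward to log which checks ran and what they reported, which is the kind of evidence B2.11 asks for.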

In addition to the explicit data management checks, the system also records critical preservation metadata such as fixity, provenance, and context information.

Wednesday, February 2, 2011

January 2011 deposits at ICPSR

First, the brief snapshot:

# of files	# of deposits	File format
4	2	application/msaccess
19	8	application/msword
11	3	application/octet-stream
162	23	application/pdf
4	3	application/vnd.ms-excel
4	4	application/x-sas
123	26	application/x-spss
1	1	application/x-stata
6	1	image/tiff
2	2	text/plain; charset=iso-8859-1
2	2	text/plain; charset=unknown
74	15	text/plain; charset=us-ascii
1	1	text/plain; charset=utf-8
6	4	text/rtf
1	1	text/x-c++; charset=us-ascii
4	2	text/x-c; charset=unknown
7	4	text/x-c; charset=us-ascii

In most ways this was a pretty typical month; most of the content coming into ICPSR continues to be survey data in either plain text or stat package format, and the accompanying documentation is a mix of PDF and word processing formats.

The last three rows, where we purportedly received C and C++ source code, are almost certainly wrong; our automated content identification service is based on a locally modified version of the Unix file utility, and we've found that file is a little too quick to peg things as C source code. No doubt this is due to the environment in which file was created and developed, but it is a hard problem to fix in a sustainable, general way. Do we hack up our local magic database even more? Maybe even eliminating the entries for C and C++ source code? Or do we post-process the output, transforming entries like text/x-c++ to text/plain? Or do we maintain a separate version of our improved file utility just for incoming deposits?

I think in the long run we might decide to live with the problem, using the output of our identification service as more of a recommendation, and we'll rely on the team of data managers to change those recommendations that don't match reality.
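For what it's worth, the post-processing option combines easily with the "recommendation" idea. The sketch below is a rough illustration, not something we run: it assumes the identification service returns strings shaped like the ones in the table above, and `SUSPECT_TYPES` and `recommend_type` are hypothetical names. It keeps the raw answer alongside the recommendation so a data manager can still see what file reported.

```python
# Types that our identification service is assumed to over-report.
SUSPECT_TYPES = {"text/x-c", "text/x-c++"}


def recommend_type(reported):
    """Return (recommended, reported) for a string like 'text/x-c; charset=us-ascii'.

    Suspect source-code types are downgraded to text/plain; any
    parameters such as charset are carried over unchanged.
    """
    media_type, sep, params = reported.partition(";")
    if media_type.strip().lower() in SUSPECT_TYPES:
        recommended = "text/plain" + (";" + params if sep else "")
    else:
        recommended = reported
    return recommended, reported
```

With this, `recommend_type("text/x-c++; charset=us-ascii")` recommends `text/plain; charset=us-ascii`, while a type such as `application/pdf` passes through untouched.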