Friday, September 9, 2011

TRAC: B6.10: Linking DIPs to AIPs

B6.10 Repository enables the dissemination of authentic copies of the original or objects traceable to originals.

Part of trusted archival management deals with the authenticity of the objects that are disseminated. A repository’s users must be confident that they have an authentic copy of the original object, or that it is traceable in some auditable way to the original object. This distinction is made because objects are not always disseminated in the same way, or in the same groupings, as they are deposited. A database may have subsets of its rows, columns, and tables disseminated so that the phrase “authentic copy” has little meaning. Ingest and preservation actions may change the formats of files, or may group and split the original objects deposited.

The distinction between authentic copies and traceable objects can also be important when transformation processes are applied. For instance, a repository that stores digital audio from radio broadcasts may disseminate derived text constructed by automated voice recognition from the digital audio stream. The derived text may be imperfect but still useful to many users, though it is not an authentic copy of the original audio. Producing an authentic copy would mean either disseminating the original audio stream or having a human verify and correct the transcript against the stored audio.

This requirement ensures that ingest, preservation, and transformation actions do not lose information that would support an auditable trail between the original deposited object and the eventual disseminated object. For compliance, the chain of authenticity need only reach as far back as ingest, though some communities, such as those dealing with legal records, may require chains of authenticity that reach back further.

A repository should be able to demonstrate the processes to construct the DIP from the relevant AIP(s). This is a key part of establishing that DIPs reflect the content of AIPs, and hence of original material, in a trustworthy and consistent fashion. DIPs may simply be a copy of AIPs, or may result from a simple format transformation of an AIP. But in other cases, they may be derived in complex ways from a large set of AIPs. A user may request a DIP consisting of the title pages from all e-books published in a given period, for instance, which will require these to be extracted from many different AIPs. A repository that allows requests for such complex DIPs will need to put more effort into demonstrating how it meets this requirement than a repository that only allows requests for DIPs that correspond to an entire AIP.

A repository is not required to show that every DIP it provides can be verified as authentic at a later date; it must show that it can do this, when required, at the time the DIP is produced. The level of authentication is to be determined by the designated community(ies). This requirement is meant to enable high levels of authentication, not to impose it on all copies, since it may be an expensive process.

Evidence: System design documents; work instructions (if DIPs involve manual processing); process walkthroughs; production of a sample authenticated copy; documentation of community requirements for authentication.



ICPSR has a long and interesting history in the context of this TRAC requirement.

I would assert that for ICPSR's first few decades of existence it considered itself more of a data library than a digital repository.  My sense is that there are not strong bonds between what one might call an AIP and a DIP from those early days.

Things seemed to change a bit in the 1990s, and I see evidence that the organization started to distinguish the items we received (acquisitions is the nomenclature used locally) from the items we produced (turnovers is the nomenclature we still use today).  Content began falling into a simple hierarchy, with acquisitions kept in one place and turnovers kept in another.

Connections were still pretty loose in the 90s, and one has to infer certain relationships.  Content was identified in the aggregate rather than at the individual file level, and the identity of the person who "owned" or managed the collection figures prominently in the naming conventions.  If the earlier times were the digital Dark Ages at ICPSR in terms of digital preservation practice, the 90s were the Middle Ages.  Better, but still not modern.

When then-Director Myron Gutmann asked my team to automate much of the workflow in the mid-2000s (I came to ICPSR late in 2002), we began building stronger connections between the content we received and the content we produced.  This was necessary because a lot of information that had been captured only on paper, or in people's heads, now needed to live in databases and programs.  Two people - Peggy Overcashier and Cole Whiteman - deserve most of the credit for this automation, but it was a considerable team effort that involved many different parts of my team and ICPSR as a whole.  To keep the metaphor going, perhaps 2006 was the Renaissance of digital preservation practice.  And, not coincidentally, this was also the time that Nancy McGovern joined ICPSR as our Digital Preservation Officer.

My sense is that we now have good connections between AIP-like objects and DIP-like objects, but only at the aggregate level.  Even today we do not have crisply defined AIPs and DIPs, and we do not have the relationships recorded at the file level.
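To make the gap concrete, a file-level linkage might look something like the sketch below. This is a hypothetical data model, not ICPSR's actual schema; the identifiers, field names, and process labels are all illustrative assumptions.

```python
# Hypothetical record of file-level AIP-to-DIP relationships.
# The schema and all identifiers here are illustrative, not ICPSR's
# actual data model.
from dataclasses import dataclass

@dataclass(frozen=True)
class DerivationLink:
    aip_id: str    # identifier of the archival package
    aip_file: str  # source file within the AIP
    dip_file: str  # file disseminated to users
    process: str   # transformation that produced the DIP file

links = [
    DerivationLink("ICPSR-00001-v1", "data0001.txt",
                   "da00001.por", "ascii-to-spss-portable"),
    DerivationLink("ICPSR-00001-v1", "data0001.txt",
                   "da00001.dta", "ascii-to-stata"),
]

# With file-level links recorded, every disseminated file can be traced
# back to its source in the AIP:
sources = {l.dip_file: (l.aip_id, l.aip_file) for l in links}
print(sources["da00001.dta"])  # ('ICPSR-00001-v1', 'data0001.txt')
```

Aggregate-level linkage, by contrast, would record only that the study's DIP came from the study's AIP, with nothing auditable at the granularity of individual files.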

This is due to two main problems that we hope to address in a new project code-named FLAME.  (This will be the subject of many future posts.)

One problem is that all of our DIPs are made by humans, at the same time as the AIPs.  A future workflow should support the automatic generation of DIPs from AIPs, which would allow us, for example, to update many of our DIPs automatically in response to new versions of SAS, SPSS, and Stata.
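The shape of such a workflow might be sketched as follows. This is a minimal illustration under assumed structures: the manifest format, the `generate_dip` function, and the target-format list are all hypothetical, and a real system would invoke actual format converters rather than just naming output files.

```python
# Minimal sketch of automated DIP generation from an AIP manifest.
# All structures and names are hypothetical, for illustration only.
TARGET_FORMATS = ["sas", "spss", "stata"]

def generate_dip(aip_manifest, formats=TARGET_FORMATS):
    """Derive a DIP manifest from an AIP manifest, recording provenance
    back to each source file so the DIP stays traceable to the AIP."""
    dip = {"derived_from": aip_manifest["aip_id"], "files": []}
    for src in aip_manifest["files"]:
        for fmt in formats:
            dip["files"].append({
                "name": f"{src['name']}.{fmt}",
                "format": fmt,
                "source_file": src["name"],       # file-level link to the AIP
                "source_checksum": src["sha256"], # fixity of the source
            })
    return dip

aip = {"aip_id": "ICPSR-00001-v1",
       "files": [{"name": "data0001", "sha256": "ab12cd34"}]}
dip = generate_dip(aip)
print(dip["files"][0]["name"])  # data0001.sas
```

The point of the sketch is that when a statistical package releases a new version, regenerating the affected DIPs becomes a matter of re-running the pipeline over the stored AIPs rather than repeating manual work.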

The other problem is that when we automated systems in the mid-2000s we didn't really fix the processes.  We made things faster, and we made things less error-prone, but we did not address some of the fundamental quirks in ICPSR's primary business processes.  Changing these processes from their current "study"-centric view of the universe to one that is more "file"-centric (or "object"-centric) will be the next big challenge ahead.  Stay tuned for details on this as we launch FLAME.
