Thursday, March 11, 2010

TRAC: B1.1: What we preserve

B1.1 Repository identifies properties it will preserve for digital objects.

This process begins in general with the repository’s mission statement and may be further specified in pre-accessioning agreements with producers or depositors (e.g., producer-archive agreements) and made very specific in deposit or transfer agreements for specific digital objects and their related documentation.

For example, one repository may only commit to preserving the textual content of a document and not its exact appearance on a screen. Another may wish to preserve the exact appearance and layout of textual documents, while others may choose to normalize the data during the ingest process.

Evidence: Mission statement; submission agreements/deposit agreements/deeds of gift; workflow and policy documents, including written definition of properties as agreed in the deposit agreement/deed of gift; written processing procedures; documentation of properties to be preserved.



My sense is that this TRAC requirement is more about policy than technology, but for the purposes of this post I'll comment on it from a status quo, technological standpoint. That should give the reader a good sense of how things work today, while leaving open the possibility (likelihood?) that ICPSR may change its mission statement, submission agreements, or other policies down the road, in which case the underlying technology will change too.


The short version of the story is that ICPSR captures two essential items from each submission: the data (normalized into plain ASCII characters) and the technical documentation (normalized into DDI XML, PDF, and TIFF images). We keep the content in these formats until we need to migrate to a different format.

In our world of survey research data there is no requirement to preserve original formats (such as a SAS Portable file), look and feel (such as original technical documentation written in WordPerfect), or even low-level data format (such as numeric data coded in punched card format). We can take everything, pull out the essential intellectual content, and then normalize that content into quite durable formats.

That said, we may take these durable formats, such as DDI XML metadata and plain ASCII data, and feed them into tools that produce dissemination formats that are easier for our designated community to use. This includes common stat package formats, and also includes formats that feed into our on-line analysis system. However, these dissemination formats are not preserved, and they contain no essential properties not also contained in the preservation formats.

Having thought about this particular TRAC requirement over the past week, I suspect we're missing one important piece of information for each of our data files: the character set. In practice almost all of our holdings are numeric data, and so our implicit character set ("us-ascii") is the correct one to use, but this seems like a good area for being more explicit. Recording the character set explicitly would allow us to handle data files that contain non-numeric, non-US-ASCII characters in a more durable fashion.
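To make the idea concrete, here is a minimal sketch of how a character set could be recorded for each data file. It is an illustration only and not how ICPSR's ingest actually works: the function names (`detect_charset`, `charset_record`) and the shape of the record are hypothetical, and the check is deliberately conservative. It only distinguishes pure 7-bit US-ASCII, valid UTF-8, and "unknown", leaving anything else for a person to identify.

```python
def detect_charset(raw: bytes) -> str:
    """Classify raw file content as 'us-ascii', 'utf-8', or 'unknown'.

    Hypothetical helper: a file that fails both checks needs a human
    to determine its real encoding (e.g. Latin-1 or an EBCDIC variant).
    """
    try:
        raw.decode("ascii")      # succeeds only if every byte is < 0x80
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")      # strict decoding; invalid sequences raise
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"


def charset_record(filename: str, raw: bytes) -> dict:
    """Build a small metadata record (hypothetical layout) for one data file."""
    return {"file": filename, "charset": detect_charset(raw)}
```

A typical numeric data file would come back as "us-ascii", which matches the implicit assumption we make today; the value of the record is that the assumption is written down next to the file rather than living in our heads.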
