Technology at ICPSR: TRAC: B2.7: Format registries

B2.7 Repository demonstrates that it has access to necessary tools and resources to establish authoritative semantic or technical context of the digital objects it contains (i.e., access to appropriate international Representation Information and format registries).

The Global Digital Format Registry (GDFR), the UK National Archives’ file format registry PRONOM, and the UK Digital Curation Centre’s Representation Information Registry are three emerging examples of potential international standards a repository might adopt. Whenever possible, the repository should use these types of standardized, authoritative information sources to identify and/or verify the Representation Information components of Content Information and PDI. This will reduce the long-term maintenance costs to the repository and improve quality control.

Most repositories will maintain format information locally to maintain their independent ability to verify formats or other technical or semantic details associated with each archival object. In these cases, the use of international format registries is not meant to replace local format registries but instead serve as a resource to verify or obtain independent, authoritative information about any and all file formats.

Evidence: Subscription or access to such registries; association of unique identifiers to format registries with digital objects.

The volume of content entering ICPSR is relatively low, perhaps 100 submissions per month. In some extreme cases, such as with our Publication Related Archive, ICPSR staff spend relatively little time reviewing and normalizing content. It is released "as is" on the web site and gets the most modest level of digital preservation (bit-level only, unless the content happens to be something more durable, such as plain text). However, in most cases, someone at ICPSR is opening each file, reading and reviewing documentation, scrubbing data for disclosure risk, recoding, and so on. It is a very hands-on process.

Because of the low volumes and high touches, automated format detection is not at all essential, at least for the current business model. Nonetheless, we do use automated format detection both for the files that we receive via our deposit system and for the derivative files we produce internally. And our tool for doing this is the venerable UNIX command-line utility file.

Why?

The content that we tend to receive is a mix of documentation and data. The documentation is often in PDF format, but sometimes arrives in common word processor formats like DOC and DOCX, and sometimes less common word processor formats. The data is often in a format produced by common statistical packages such as SAS, SPSS, and Stata. And we also get a nice mix of other file formats from a wide variety of business applications like Access, Excel, PowerPoint, and more.

We have found the vanilla file that ships with Red Hat Linux to be pretty good at most of the formats that show up on our doorstep. We've extended the magic database that file consults so that it does a better job understanding a broader selection of stat package formats. (file does OK, but not great, in this area.) We have also extended the magic database and wrapped file in a helper tool -- we call the final product ifile, for improved file -- so that it does a better job identifying the newer Microsoft Office file formats like DOCX, XLSX, PPTX, and so on.
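
To illustrate why the newer Office formats need extra help: DOCX, XLSX, and PPTX files are ZIP containers, so a tool that only looks at the leading bytes may report nothing more specific than a ZIP archive. The following is a minimal sketch of how a wrapper could refine that answer by looking inside the container. It is not the ifile implementation, which is not shown in this post; the function name `refine_zip_type` and the directory-prefix heuristic are assumptions made for illustration.

```python
import io
import zipfile

# Hypothetical lookup table: top-level directory inside the ZIP container
# mapped to the corresponding Office Open XML MIME type.
OOXML_TYPES = {
    "word/": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def refine_zip_type(source):
    """Return an OOXML MIME type for a ZIP container, or None if it is not OOXML.

    `source` may be a filesystem path or a seekable binary file-like object.
    """
    if not zipfile.is_zipfile(source):
        return None
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()
    # OOXML packages carry a content-types manifest at the top level.
    if "[Content_Types].xml" not in names:
        return None
    for prefix, mime_type in OOXML_TYPES.items():
        if any(name.startswith(prefix) for name in names):
            return mime_type
    return None
```

A wrapper along these lines would call the refinement step only when the first-pass identification comes back as a generic ZIP archive, and keep the first-pass answer whenever the function returns None.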

I would love to be able to use an "off the shelf" tool like jhove or droid to identify file formats, relying upon a global registry for formats. There isn't much glamor in hacking the magic database.

However, my experience thus far with jhove, jhove2, and droid is that they just don't beat ifile (or even file) for the particular mix of content we tend to get. Those packages are much more heavy-weight, and while they do a fabulous job on some formats (like PDF), they perform poorly or not at all with many of the formats we see on a regular basis.

As a test I took a week or two of the files most recently deposited at ICPSR, and I had file, ifile, and droid have a go at identifying them. I originally had jhove2 in the mix as well, but I could not get it to run reliably. (And my sense is that it may be using the same data source as droid for file identification anyway.) Of the 249 files I examined, ifile got 'em all right, file misreported on 48 of them, and droid misreported 104 files. And the number for droid gets worse if I ding it for reporting a file containing an email as text/plain rather than message/rfc822.
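
A comparison like this boils down to checking each tool's answer against a hand-verified label for every file. Here is a minimal sketch of that tally, assuming the results have already been collected into dictionaries; the function name `count_misreports` and the data layout are hypothetical, not the scripts used for the test above.

```python
def count_misreports(expected, reported_by_tool):
    """Count, per tool, the files labeled differently from `expected`.

    expected:         dict of filename -> hand-verified label
    reported_by_tool: dict of tool name -> dict of filename -> reported label
    A file the tool gave no answer for counts as a misreport.
    """
    return {
        tool: sum(1 for name, label in expected.items()
                  if reported.get(name) != label)
        for tool, reported in reported_by_tool.items()
    }


# Toy input for illustration only; these are not the deposited files.
expected = {"a.sav": "application/x-spss", "b.pdf": "application/pdf"}
reported = {
    "tool_one": {"a.sav": "application/x-spss", "b.pdf": "application/pdf"},
    "tool_two": {"a.sav": "application/octet-stream", "b.pdf": "application/pdf"},
}
misreports = count_misreports(expected, reported)
```

Whether a coarser answer counts as wrong is a scoring decision made in the labels themselves, which is why the tally compares labels for exact equality.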

So in the end we're using our souped-up version of file for file format identification, and we're using IANA MIME types as our primary identifier for file format. We also capture the more verbose human-readable output from ifile, since it can be handy to have "SPSS System File MS Windows Release 10.0.5" rather than just application/x-spss.
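
Capturing both identifiers can be sketched with the stock file utility, which prints a MIME type when given `--mime-type` and its usual description otherwise, with `--brief` suppressing the filename prefix. The sketch below assumes a file-compatible command on the PATH; the function name `describe` and its `file_cmd` and `run` parameters are hypothetical, and an ifile-style wrapper is assumed to accept the same flags.

```python
import subprocess


def describe(path, file_cmd="file", run=subprocess.run):
    """Return (mime_type, description) for `path`.

    `file_cmd` names a file-compatible executable; `run` is injectable so
    the function can be exercised without the executable installed.
    """
    def ask(*flags):
        result = run(
            [file_cmd, "--brief", *flags, path],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    # First call asks for the MIME type, second for the verbose description.
    return ask("--mime-type"), ask()
```

Storing the pair keeps the MIME type as the primary key for the format while preserving the detail (package, platform, release) that the MIME type alone drops.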

1 comment:

Anonymous, January 13, 2011 at 5:14 PM
I don't think I've commented before: I think your blog is really interesting & useful for anyone interested in digital repositories, esp. entries like this, on your experience with tools like droid & jhove, & on the TRAC requirements

Wednesday, November 24, 2010
