B2.7 Repository demonstrates that it has access to necessary tools and resources to establish authoritative semantic or technical context of the digital objects it contains (i.e., access to appropriate international Representation Information and format registries).
The Global Digital Format Registry (GDFR), the UK National Archives’ file format registry PRONOM, and the UK Digital Curation Centre’s Representation Information Registry are three emerging examples of potential international standards a repository might adopt. Whenever possible, the repository should use these types of standardized, authoritative information sources to identify and/or verify the Representation Information components of Content Information and PDI. This will reduce the long-term maintenance costs to the repository and improve quality control.
Most repositories will maintain format information locally to preserve their independent ability to verify formats or other technical or semantic details associated with each archival object. In these cases, international format registries are not meant to replace local format registries, but instead to serve as a resource for verifying or obtaining independent, authoritative information about any and all file formats.
Evidence: Subscription or access to such registries; association of unique identifiers to format registries with digital objects.
The volume of content entering ICPSR is relatively low, perhaps 100 submissions per month. In some extreme cases, such as our Publication Related Archive, ICPSR staff spend relatively little time reviewing and normalizing content; it is released "as is" on the web site and receives only the most modest level of digital preservation (bit-level only, unless the content happens to be something more durable, such as plain text). In most cases, however, someone at ICPSR is opening each file, reading and reviewing documentation, scrubbing data for disclosure risk, recoding, and so on. It is a very hands-on process.
Because of the low volumes and high touches, automated format detection is not at all essential, at least for the current business model. Nonetheless, we do use automated format detection, both for the files that we receive via our deposit system and for the derivative files we produce internally. And our tool for doing this is the venerable UNIX command-line utility file.
Why?
The content that we tend to receive is a mix of documentation and data. The documentation is often in PDF format, but sometimes arrives in common word processor formats like DOC and DOCX, and sometimes in less common word processor formats. The data is often in a format produced by common statistical packages such as SAS, SPSS, and Stata. We also get a nice mix of other file formats from a wide variety of business applications like Access, Excel, PowerPoint, and more.
We have found the vanilla file that ships with Red Hat Linux to be pretty good at most of the formats that show up on our doorstep. We've extended the magic database that file consults so that it does a better job understanding a broader selection of stat package formats. (file does OK, but not great, in this area.) We have also extended the magic database further and wrapped file in a helper tool -- we call the final product ifile, for improved file -- so that it does a better job identifying the newer Microsoft Office file formats like DOCX, XLSX, PPTX, and so on.
I would love to be able to use an "off the shelf" tool like jhove or droid to identify file formats, relying upon a global registry for formats. There isn't much glamor in hacking the magic database.
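For the curious, the hacking in question mostly means adding entries like the following to the magic source (an illustrative fragment, not our actual additions; the "$FL2" signature at the start of SPSS system files is real, and the !:mime line attaches the MIME type file will report):

```
# Illustrative magic(5) fragment: SPSS system files begin with "$FL2"
0	string	$FL2	SPSS System File
!:mime	application/x-spss
```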
However, my experience thus far with jhove, jhove2, and droid is that they just don't beat ifile (or even file) for the particular mix of content we tend to get. Those packages are much more heavyweight, and while they do a fabulous job on some formats (like PDF), they perform poorly or not at all with many of the formats we see on a regular basis.
As a test I took a week or two of the files most recently deposited at ICPSR, and I had file, ifile, and droid have a go at identifying them. I originally had jhove2 in the mix as well, but I could not get it to run reliably. (And my sense is that it may be using the same data source as droid for file identification anyway.) Of the 249 files I examined, ifile got 'em all right, file misreported on 48 of them, and droid misreported 104 files. And the number for droid gets worse if I ding it for reporting a file containing an email as text/plain rather than text/rfc822.
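The scorecard logic behind those numbers is nothing fancy; something like this captures it (a toy sketch with hypothetical file names, not the actual test harness):

```python
# Toy sketch of the scorecard: run an identifier over files whose
# formats are known and count the misreports.
def misreports(identify, expected):
    """expected maps a file path to its correct MIME type."""
    return sum(1 for path, mime in expected.items() if identify(path) != mime)

# Hypothetical usage, with a naive identifier standing in for file/ifile/droid:
truth = {"study1.sav": "application/x-spss", "codebook.pdf": "application/pdf"}
naive = lambda path: "application/pdf"   # calls everything a PDF
print(misreports(naive, truth))          # -> 1 (the SPSS file is misreported)
```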
So in the end we're using our souped-up version of file for file format identification, and we're using IANA MIME types as our primary identifier for file format. We also capture the more verbose human-readable output from our own ifile, since it can be handy to have "SPSS System File MS Windows Release 10.0.5" rather than just application/x-spss.