Technology at ICPSR: TRAC: B1.4: Checking completeness and correctness

B1.4 Repository’s ingest process verifies each submitted object (i.e., SIP) for completeness and correctness as specified in B1.2.

Information collected during the ingest process must be compared with information from some other source—the producer or the repository’s own expectations—to verify the correctness of the data transfer and ingest process. The extent to which a repository can determine correctness will depend on what it knows about the SIP and what tools are available for verifying correctness. It can mean simply checking that file formats are what they claim to be (TIFF files are valid TIFF format, for instance), or can imply checking the content. This might involve human checking in some cases, such as confirming that the description of a picture matches the image.

Repositories should have established procedures for handling incomplete SIPs. These can range from rejecting the transfer, to suspending processing until the missing information is received, to simply reporting the errors. Similarly, the definition of “completeness” should be appropriate to a repository’s activities. If an inventory of files was provided by a producer as part of pre-ingest negotiations, one would expect checks to be carried out against that inventory. But for some activities such as Web harvesting, “complete” may simply mean “whatever we could capture in the harvest session.” Whatever checks are carried out must be consistent with the repository’s own documented definition and understanding of completeness and correctness.

Evidence: Appropriate policy documents and system log files from system performing ingest procedure; formal or informal acquisitions register of files received during the transfer and ingest process; workflow, documentation of standard operating procedures, detailed procedures; definition of completeness and correctness, probably incorporated in policy documents.

For this post I am going to focus on the files that a depositor uploads to our web site. The other elements of the deposit, which are largely metadata, are collected automatically by the deposit system. The only thing the depositor must provide is a title or name for the deposit.

As mentioned in an earlier post on deposits, we collect several pieces of information about each file. One item we collect is the MIME type. We do this with the UNIX file utility, but with a magic database that we have expanded to include information about common statistical packages and Microsoft Office file formats. For example, a vanilla, out-of-the-box version of file will report a DOCX format file as a Zip file, and while that is correct on some level, it isn't the best description of the file.
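To illustrate why the out-of-the-box answer is only partly right: a DOCX file really is a Zip container, so telling the two apart means looking inside it. The sketch below is not our production code and does not touch the magic database at all. It is a stand-in written in Python that makes the same distinction by checking for the member names that Office Open XML packages carry. The function name `refine_zip_type` and the fallback to `application/octet-stream` are assumptions made for the example.

```python
import io
import zipfile

# Main-part paths that identify the common Office Open XML package types.
OOXML_MARKERS = {
    "word/document.xml":
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "xl/workbook.xml":
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ppt/presentation.xml":
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def refine_zip_type(data: bytes) -> str:
    """Return a more specific MIME type for Zip-based Office files.

    Hypothetical helper: anything that is not a Zip container is reported
    as application/octet-stream, because this sketch only covers Zip.
    """
    buf = io.BytesIO(data)
    if not zipfile.is_zipfile(buf):
        return "application/octet-stream"
    with zipfile.ZipFile(buf) as zf:
        names = set(zf.namelist())
    # OOXML packages always carry a [Content_Types].xml member at the root.
    if "[Content_Types].xml" in names:
        for marker, mime in OOXML_MARKERS.items():
            if marker in names:
                return mime
    return "application/zip"
```

A plain Zip archive still comes back as `application/zip`, so the refinement only changes the answer when the container is recognizably an Office document.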

After making this inventory, our system generates an auto-reply to the depositor enumerating the files and what we think they are. Note that we do not assign any higher-level purpose (e.g., this is a data file; this is a codebook) programmatically.
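The inventory and auto-reply step might look roughly like the following. This is a minimal sketch under stated assumptions, not a description of the deposit system itself: it guesses MIME types from file extensions with Python's `mimetypes` module as a stand-in for the file utility, and the function names `build_inventory` and `format_reply` and the layout of the reply text are hypothetical.

```python
import mimetypes
from pathlib import Path


def build_inventory(deposit_dir):
    """List each uploaded file with its size and a guessed MIME type."""
    base = Path(deposit_dir)
    rows = []
    for path in sorted(base.rglob("*")):
        if path.is_file():
            # Stand-in for the file utility: this guesses from the
            # extension only and does not inspect the file contents.
            mime, _encoding = mimetypes.guess_type(path.name)
            rows.append(
                (str(path.relative_to(base)), path.stat().st_size, mime or "unknown")
            )
    return rows


def format_reply(title, rows):
    """Render the inventory as plain text for the depositor to review."""
    lines = [f"Deposit: {title}", f"Files received: {len(rows)}", ""]
    for name, size, mime in rows:
        lines.append(f"{name}\t{size} bytes\t{mime}")
    return "\n".join(lines)
```

The reply only states what each file appears to be. It says nothing about the role the file plays in the deposit, in keeping with the point above.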

Assuming that the depositor does not find anything amiss with our inventory, a data manager will pick up the deposit, and will start preparing the materials for preservation at ICPSR, and for distribution on our web site.

A couple of things we are NOT doing, but which might be valuable in the future:

One, in addition to using file to report MIME types, we might also run a tool like JHOVE to inspect the correctness of the file formats. We do get a fair number of PDF format documents, and JHOVE does a nice job with those. However, we also get a lot of plain text documents, stats files, and MS Office files, and JHOVE doesn't do a very nice job with those. I've wondered if we might be able to write a small grant where ICPSR would promise to build plug-ins for JHOVE for the stats package formats.

Two, in addition to reporting file formats to our depositors, we might also report a checksum for each file so that they would have the opportunity to verify that the deposit arrived without error.
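Since we are not doing this today, the following is only a sketch of what per-file checksum reporting could look like. The choice of SHA-256 and the names `sha256_of` and `checksum_report` are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_report(deposit_dir):
    """Map each file's path (relative to the deposit) to its SHA-256 digest."""
    base = Path(deposit_dir)
    return {
        str(p.relative_to(base)): sha256_of(p)
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }
```

A depositor could run the same hash over their local copies and compare the two sets of values. A mismatch would point to a file that changed in transit.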

Technology at ICPSR

Friday, September 10, 2010
