Wednesday, February 2, 2011

January 2011 deposits at ICPSR

First, the brief snapshot:

# of files   # of deposits   File format
4            2               application/msaccess
19           8               application/msword
11           3               application/octet-stream
162          23              application/pdf
4            3               application/vnd.ms-excel
4            4               application/x-sas
123          26              application/x-spss
1            1               application/x-stata
6            1               image/tiff
2            2               text/plain; charset=iso-8859-1
2            2               text/plain; charset=unknown
74           15              text/plain; charset=us-ascii
1            1               text/plain; charset=utf-8
6            4               text/rtf
1            1               text/x-c++; charset=us-ascii
4            2               text/x-c; charset=unknown
7            4               text/x-c; charset=us-ascii


In most ways this was a pretty typical month. Most of the content coming into ICPSR continues to be survey data in either plain text or statistical package format, and the accompanying documentation is a mix of PDF and word processing formats.

The last three rows, where we purportedly received C and C++ source code, are almost certainly wrong. Our automated content identification service is based on a locally modified version of the Unix file utility, and we've found that file is a little too quick to peg things as C source code. No doubt this is due to the environment in which file was created and developed, but it is a hard problem to fix in a sustainable, general way. Do we hack up our local magic database even more, maybe even eliminating the entries for C and C++ source code? Or do we post-process the output, transforming entries like text/x-c++ to text/plain? Or do we maintain a separate version of our improved file utility just for incoming deposits?
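To make the post-processing option concrete, here is a minimal sketch in Python. It is not what our identification service actually does; the names SUSPECT_TYPES and normalize_mime are made up for illustration, and it assumes the service hands back a MIME string shaped like the entries in the table above (a type, optionally followed by a charset parameter).

```python
# Hypothetical list of MIME types that file tends to over-report for
# plain-text deposits. This set is an assumption for illustration only.
SUSPECT_TYPES = {"text/x-c", "text/x-c++"}


def normalize_mime(reported: str) -> str:
    """Rewrite suspect source-code types to text/plain, keeping any parameters.

    `reported` is a string such as "text/x-c++; charset=us-ascii".
    Anything not in SUSPECT_TYPES is returned unchanged.
    """
    # Split the bare type from its parameters (for example the charset).
    base, sep, params = reported.partition(";")
    if base.strip().lower() in SUSPECT_TYPES:
        return "text/plain" + sep + params
    return reported


if __name__ == "__main__":
    for value in ("text/x-c++; charset=us-ascii",
                  "text/x-c; charset=unknown",
                  "application/pdf"):
        print(value, "->", normalize_mime(value))
```

The appeal of this approach is that the magic database stays untouched. The cost is that a deposit that really does contain C source would be relabeled as plain text too, so someone would still need a way to override the result.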

I think in the long run we might decide to live with the problem, treating the output of our identification service as more of a recommendation and relying on the team of data managers to change those recommendations that don't match reality.
