
Wednesday, February 2, 2011

January 2011 deposits at ICPSR

First, the brief snapshot:

# of files   # of deposits   File format
2            2               text/plain; charset=iso-8859-1
2            2               text/plain; charset=unknown
74           15              text/plain; charset=us-ascii
1            1               text/plain; charset=utf-8
1            1               text/x-c++; charset=us-ascii
4            2               text/x-c; charset=unknown
7            4               text/x-c; charset=us-ascii

In most ways this was a pretty typical month: most of the content coming into ICPSR continues to be survey data in either plain text or statistical package formats, and the accompanying documentation is a mix of PDF and word processing formats.

The last three rows, which claim we received C and C++ source code, are almost certainly wrong. Our automated content identification service is based on a locally modified version of the Unix file utility, and we've found that file is a little too quick to peg plain text as C source code. No doubt this reflects the environment in which file was created and developed, but it is a hard problem to fix in a sustainable, general way. Do we hack up our local magic database even more, maybe even eliminating the entries for C and C++ source code? Or do we post-process the output, transforming entries like text/x-c++ to text/plain? Or do we maintain a separate version of our improved file utility just for incoming deposits?
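To make the post-processing option concrete, here is a minimal sketch in Python. It is not what our service actually does; the function name normalize_mime and the set of "suspect" types are assumptions made up for illustration. The idea is simply to rewrite the media type while leaving the charset parameter alone.

```python
# Hypothetical set of media types that the identifier tends to over-report
# for incoming deposits (an assumption for this sketch, not a real config).
SUSPECT_SOURCE_TYPES = {"text/x-c", "text/x-c++"}


def normalize_mime(reported: str) -> str:
    """Map suspect source-code types to text/plain, keeping any parameters."""
    # Split "text/x-c++; charset=us-ascii" into the media type and the rest.
    media_type, sep, params = reported.partition(";")
    if media_type.strip().lower() in SUSPECT_SOURCE_TYPES:
        return "text/plain" + sep + params
    # Anything else passes through untouched.
    return reported


if __name__ == "__main__":
    for value in ("text/x-c++; charset=us-ascii", "text/plain; charset=utf-8"):
        print(value, "->", normalize_mime(value))
```

The obvious drawback is that a deposit which really does contain C source code would be relabeled too, which is part of why none of these options feels like a clean fix.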

I think in the long run we might decide to live with the problem, treating the output of our identification service as more of a recommendation and relying on our team of data managers to change the recommendations that don't match reality.
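One way to picture the "recommendation" approach is a record that keeps the service's guess and a data manager's correction side by side. This is only a sketch under assumptions: the FormatRecord class, its field names, and the example filename are hypothetical and do not describe our actual systems.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormatRecord:
    """Hypothetical record pairing an automated guess with a manual override."""
    filename: str
    recommended: str                 # what the identification service reported
    override: Optional[str] = None   # set by a data manager, if at all

    @property
    def effective(self) -> str:
        # The human correction wins whenever one has been recorded.
        return self.override if self.override is not None else self.recommended


if __name__ == "__main__":
    record = FormatRecord("codebook.txt", "text/x-c; charset=us-ascii")
    record.override = "text/plain; charset=us-ascii"
    print(record.effective)
```

Keeping the original recommendation around, rather than overwriting it, would also leave a trail of where the service guessed wrong.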
