|# of files||# of deposits||File format|
In most ways this was a pretty typical month: most of the content coming into ICPSR continues to be survey data in either plain text or statistical package formats, and the accompanying documentation is a mix of PDF and word processing formats.
The last three rows, where we purportedly received C and C++ source code, are almost certainly wrong. Our automated content identification service is based on a locally modified version of the Unix file utility, and we've found that file is a little too quick to peg things as C source code. No doubt this is due to the environment in which file was created and developed, but it is a hard problem to fix in a sustainable, general way. Do we hack up our local magic database even more, maybe even eliminating the entries for C and C++ source code? Or do we post-process the output, transforming entries like text/x-c++ to text/plain? Or do we maintain a separate version of our improved file utility just for incoming deposits?
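To make the post-processing option concrete, here is a minimal sketch in Python. It is not our actual service code: the function name, the set of suspect types, and the inclusion of text/x-c alongside text/x-c++ are all assumptions made for illustration.

```python
# Hypothetical post-processing step (illustrative only, not the real service).
# MIME types that file tends to over-report for plain-text data are
# downgraded to text/plain; everything else passes through unchanged.
SUSPECT_SOURCE_TYPES = {"text/x-c", "text/x-c++"}  # assumed list


def normalize_mime_type(reported: str) -> str:
    """Map suspect source-code MIME types to text/plain.

    Any parameters after the type (for example "; charset=us-ascii")
    are preserved as reported.
    """
    base, sep, params = reported.partition(";")
    if base.strip().lower() in SUSPECT_SOURCE_TYPES:
        return "text/plain" + sep + params
    return reported
```

The obvious cost of this approach is that a genuine deposit of C or C++ source code would also be relabeled as plain text, so it trades one kind of misidentification for another.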
I think in the long run we might decide to live with the problem, using the output of our identification service as more of a recommendation, and we'll rely on the team of data managers to change those recommendations that don't match reality.
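One way to picture the "recommendation plus human override" idea is a record that keeps both values side by side. This is only a sketch under assumed names (FormatIdentification, recommended, override, effective); it does not describe how our systems actually store this information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormatIdentification:
    """Hypothetical record pairing a machine guess with a manual correction."""

    filename: str
    recommended: str                # what the identification service reported
    override: Optional[str] = None  # set by a data manager, if at all

    @property
    def effective(self) -> str:
        # The data manager's decision wins whenever one has been recorded.
        return self.override if self.override is not None else self.recommended


# Example: the service guesses C++ source, a data manager corrects it.
record = FormatIdentification("survey_data.txt", recommended="text/x-c++")
record.override = "text/plain"
```

Keeping the original recommendation around, rather than overwriting it, would also leave a trail showing how often the service's guess needed correcting.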