# of files | # of deposits | File format |
4 | 2 | application/msaccess |
19 | 8 | application/msword |
11 | 3 | application/octet-stream |
162 | 23 | application/pdf |
4 | 3 | application/vnd.ms-excel |
4 | 4 | application/x-sas |
123 | 26 | application/x-spss |
1 | 1 | application/x-stata |
6 | 1 | image/tiff |
2 | 2 | text/plain; charset=iso-8859-1 |
2 | 2 | text/plain; charset=unknown |
74 | 15 | text/plain; charset=us-ascii |
1 | 1 | text/plain; charset=utf-8 |
6 | 4 | text/rtf |
1 | 1 | text/x-c++; charset=us-ascii |
4 | 2 | text/x-c; charset=unknown |
7 | 4 | text/x-c; charset=us-ascii |
In most ways this was a pretty typical month; most of the content coming into ICPSR continues to be survey data in either plain text or a statistical package format (SPSS, SAS, Stata), and the accompanying documentation is a mix of PDF and word processing formats.
The last three rows, which claim that we received C and C++ source code, are almost certainly wrong; our automated content identification service is based on a locally modified version of the Unix file utility, and we've found that file is a little too quick to peg things as C source code. No doubt this is due to the environment in which file was created and developed, but it is a hard problem to fix in a sustainable, general way. Do we hack up our local magic database even more? Maybe even eliminating the entries for C and C++ source code? Or do we post-process the output, transforming entries like text/x-c++ to text/plain? Or do we maintain a separate version of our improved file utility just for incoming deposits?
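To make the post-processing option concrete, here is a minimal Python sketch of what such a step might look like. It is not how our identification service actually works; the function name normalize_mime and the set SUSPECT_SOURCE_TYPES are hypothetical, and the only assumption is that the service hands back strings shaped like the ones in the table above.

```python
# Hypothetical list of types that file tends to over-report for deposits.
# These two come from the table above; a real list would need local tuning.
SUSPECT_SOURCE_TYPES = {"text/x-c", "text/x-c++"}


def normalize_mime(identified: str) -> str:
    """Rewrite suspect source-code types to text/plain.

    Any parameters after the first semicolon (such as charset) are
    carried over unchanged.
    """
    media_type, sep, params = identified.partition(";")
    if media_type.strip().lower() in SUSPECT_SOURCE_TYPES:
        return "text/plain" + sep + params
    return identified


if __name__ == "__main__":
    samples = (
        "text/x-c++; charset=us-ascii",
        "text/x-c; charset=unknown",
        "application/pdf",
    )
    for value in samples:
        print(value, "->", normalize_mime(value))
```

The appeal of this approach is that the magic database stays untouched; the cost is that a deposit that really does contain C source code would be relabeled as plain text too.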
I think in the long run we might decide to live with the problem, treating the output of our identification service as more of a recommendation, and relying on the team of data managers to change any recommendation that doesn't match reality.