# of files | # of deposits | File format |
1 | 1 | application/msaccess |
2 | 1 | application/msoffice |
130 | 22 | application/msword |
104 | 6 | application/octet-stream |
715 | 29 | application/pdf |
30 | 10 | application/vnd.ms-excel |
6 | 2 | application/vnd.ms-powerpoint |
1 | 1 | application/x-dosexec |
1 | 1 | application/x-empty |
23 | 7 | application/x-sas |
67 | 12 | application/x-spss |
14 | 7 | application/x-stata |
4 | 3 | application/x-zip |
6 | 2 | image/jpeg |
6 | 3 | message/rfc822 7bit |
34 | 6 | text/html |
5 | 3 | text/plain; charset=iso-8859-1 |
8 | 4 | text/plain; charset=unknown |
420 | 28 | text/plain; charset=us-ascii |
1 | 1 | text/plain; charset=utf-8 |
17 | 2 | text/rtf |
5 | 2 | text/x-c; charset=unknown |
7 | 1 | text/x-c; charset=us-ascii |
113 | 2 | text/xml |
2 | 1 | very short file (no magic) |
Lots of the usual kinds of stuff in August; maybe even a bit more than one would expect given the time of year.
There are the usual mistakes made by our file identity service; we're going to look at replacing or augmenting the current system (the UNIX file utility with a greatly expanded localmagic database, plus a wrapper that inspects the file extension) with something else. We've spent just a tiny amount of time tinkering with Tika from the Apache project, and it looks promising. This might even grow into a web service that we would share with others.
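For readers who haven't seen this kind of setup, here is a rough Python sketch of the general idea: ask the file utility for a content-based MIME type, then fall back on the file extension when the answer is too generic. This is not our production code. The extension map, the list of "generic" types, and all of the function names are made up for illustration; only the `file` flags (`--brief`, `--mime-type`, `--magic-file`) are real.

```python
import subprocess
from pathlib import Path

# Hypothetical extension overrides, consulted only when the content-based
# guess is generic. A real table would be much larger.
EXTENSION_OVERRIDES = {
    ".sas": "application/x-sas",
    ".sps": "application/x-spss",
    ".sav": "application/x-spss",
    ".por": "application/x-spss",
    ".do": "application/x-stata",
    ".dta": "application/x-stata",
}

# Assumption: these are the answers we do not trust on their own.
GENERIC_TYPES = {"application/octet-stream", "text/plain", "text/x-c"}


def sniff_with_file(path, magic_file=None):
    """Return the MIME type reported by the UNIX file utility."""
    cmd = ["file", "--brief", "--mime-type"]
    if magic_file:
        # Point file at a custom (e.g. locally expanded) magic database.
        cmd += ["--magic-file", magic_file]
    cmd.append(str(path))
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def refine_by_extension(path, sniffed):
    """Replace a generic content-based guess with an extension-based one."""
    base_type = sniffed.split(";")[0].strip()  # drop any "; charset=..." part
    if base_type in GENERIC_TYPES:
        return EXTENSION_OVERRIDES.get(Path(path).suffix.lower(), sniffed)
    return sniffed


def identify(path, magic_file=None):
    """Content sniffing first, extension inspection second."""
    return refine_by_extension(path, sniff_with_file(path, magic_file))
```

Keeping the extension step as a separate function means it can be exercised without the file utility installed, for example `refine_by_extension("setup.sas", "text/plain; charset=us-ascii")`.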
A couple of unusual items merit closer inspection too, such as the purported DOS executable and a bunch of (basically unrecognized) bitstreams.
Hey Brian,
Interesting. Hadn't been tracking Tika -- interesting. I assume you're also tracking JHOVE2 (jhove2.org) and FiTS (http://code.google.com/p/fits/)
best,
Micah
Hi, Micah. I've looked at JHOVE2, but haven't been satisfied with its reliability or its capabilities.
I did a test-drive of several packages back in Q4 of 2010 (http://techaticpsr.blogspot.com/2010/11/trac-b27-format-registries.html) but couldn't find anything that worked well for ICPSR. I wish I could...