Wednesday, September 14, 2011

August 2011 deposits

Time for the monthly deposit statistics:


# of files  # of deposits  File format
         1              1  application/msaccess
         2              1  application/msoffice
       130             22  application/msword
       104              6  application/octet-stream
       715             29  application/pdf
        30             10  application/vnd.ms-excel
         6              2  application/vnd.ms-powerpoint
         1              1  application/x-dosexec
         1              1  application/x-empty
        23              7  application/x-sas
        67             12  application/x-spss
        14              7  application/x-stata
         4              3  application/x-zip
         6              2  image/jpeg
         6              3  message/rfc8220117bit
        34              6  text/html
         5              3  text/plain; charset=iso-8859-1
         8              4  text/plain; charset=unknown
       420             28  text/plain; charset=us-ascii
         1              1  text/plain; charset=utf-8
        17              2  text/rtf
         5              2  text/x-c; charset=unknown
         7              1  text/x-c; charset=us-ascii
       113              2  text/xml
         2              1  very short file (no magic)

Lots of the usual kinds of stuff in August; maybe even a bit more than one would expect given the time of year.
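For anyone curious about the difference between the two count columns, here is a rough sketch of how a tally like this could be produced. It is not our production code; the `tally` function and the `(deposit_id, mime_type)` record shape are made up for illustration. The idea is simply that every file counts once toward its format, while a deposit counts once per format no matter how many files of that format it contains.

```python
from collections import defaultdict


def tally(records):
    """Summarize file identifications per format.

    `records` is an iterable of (deposit_id, mime_type) pairs, one pair
    per deposited file. Both names are hypothetical, for illustration.
    Returns a sorted list of (mime_type, n_files, n_deposits) tuples.
    """
    files = defaultdict(int)      # mime_type -> number of files
    deposits = defaultdict(set)   # mime_type -> distinct deposit ids
    for deposit_id, mime_type in records:
        files[mime_type] += 1
        deposits[mime_type].add(deposit_id)
    return sorted((mime, files[mime], len(deposits[mime])) for mime in files)
```

So three PDFs spread over two deposits would show up as 3 files and 2 deposits on the application/pdf row.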

There are the usual mistakes made by our file identity service; we're going to look at replacing or augmenting the current system (the UNIX file utility with a greatly expanded localmagic database, plus a wrapper that inspects the file extension) with something else. We've spent just a tiny amount of time tinkering with Apache Tika, and it looks promising. This might even grow into a web service that we would share with others.
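To make the "file utility plus extension wrapper" idea concrete, here is a minimal sketch of that general approach. It is an illustration only, not our actual service: the `identify` function, the `EXTENSION_OVERRIDES` mapping, and the `GENERIC_TYPES` set are all assumptions made for the example. The only real external piece is the `file` command with its `--brief` and `--mime-type` options.

```python
import subprocess
from pathlib import Path

# Hypothetical overrides: extensions trusted when the magic-based guess
# is too generic. The real mapping would be site-specific.
EXTENSION_OVERRIDES = {
    ".sas": "application/x-sas",
    ".sps": "application/x-spss",
    ".do": "application/x-stata",
}

# Hypothetical list of answers considered too vague to accept as-is.
GENERIC_TYPES = {"application/octet-stream", "text/plain", "text/x-c"}


def magic_mime_type(path):
    """Ask the UNIX file utility for a MIME type (requires `file` on PATH)."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def identify(path, detect=magic_mime_type):
    """Return the magic-based type, falling back on the file extension
    only when the magic-based answer is one of the generic types."""
    mime = detect(path)
    if mime in GENERIC_TYPES:
        return EXTENSION_OVERRIDES.get(Path(path).suffix.lower(), mime)
    return mime
```

The `detect` argument is there so a different detector (Tika, say) could be swapped in without touching the extension logic, and so the wrapper can be exercised without the `file` binary installed.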

A couple of unusual items merit closer inspection too, such as the purported DOS executable and a bunch of (basically unrecognized) bitstreams.


2 comments:

  1. Hey Brian,

    Interesting. I hadn't been tracking Tika. I assume you're also tracking JHOVE2 (jhove2.org) and FITS (http://code.google.com/p/fits/)?

    best,

    Micah

  2. Hi, Micah. I've looked at JHOVE2, but haven't been satisfied with its reliability or its capabilities.

    I did a test-drive of several packages back in Q4 of 2010 (http://techaticpsr.blogspot.com/2010/11/trac-b27-format-registries.html) but couldn't find anything that worked well for ICPSR. I wish I could...
