Technology at ICPSR: January 2011 deposits at ICPSR

Wednesday, February 2, 2011

January 2011 deposits at ICPSR

First, the brief snapshot:

# of files	# of deposits	File format
4	2	application/msaccess
19	8	application/msword
11	3	application/octet-stream
162	23	application/pdf
4	3	application/vnd.ms-excel
4	4	application/x-sas
123	26	application/x-spss
1	1	application/x-stata
6	1	image/tiff
2	2	text/plain; charset=iso-8859-1
2	2	text/plain; charset=unknown
74	15	text/plain; charset=us-ascii
1	1	text/plain; charset=utf-8
6	4	text/rtf
1	1	text/x-c++; charset=us-ascii
4	2	text/x-c; charset=unknown
7	4	text/x-c; charset=us-ascii

In most ways this was a pretty typical month: most of the content coming into ICPSR continues to be survey data in either plain text or statistical package formats, and the accompanying documentation is a mix of PDF and word processing formats.

The last three rows, where we purportedly received C and C++ source code, are almost certainly wrong. Our automated content identification service is based on a locally modified version of the Unix file utility, and we've found that file is a little too quick to peg things as C source code. No doubt this is due to the environment in which file was created and developed, but it is a hard problem to fix in a sustainable, general way. Do we hack up our local magic database even more, maybe even eliminating the entries for C and C++ source code? Or do we post-process the output, transforming entries like text/x-c++ to text/plain? Or do we maintain a separate version of our improved file utility just for incoming deposits?
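To make the post-processing option a bit more concrete, here is a minimal sketch in Python. The names normalize_mime and SUSPECT_TYPES are made up for illustration, and this is not how the identification service actually works; it only assumes the labels look like the ones in the table above.

```python
# Hypothetical post-processing step (illustration only): demote MIME labels
# that the identification step tends to get wrong to plain text.

# Assumption: these are the only labels we distrust for incoming deposits.
SUSPECT_TYPES = {"text/x-c", "text/x-c++"}


def normalize_mime(label: str) -> str:
    """Rewrite a suspect label to text/plain, keeping any parameters.

    For example, 'text/x-c; charset=us-ascii' becomes
    'text/plain; charset=us-ascii'. Other labels pass through unchanged.
    """
    # Split 'type/subtype; charset=...' into the media type and the rest.
    media_type, sep, params = label.partition(";")
    if media_type.strip().lower() in SUSPECT_TYPES:
        return "text/plain" + sep + params
    return label


if __name__ == "__main__":
    for label in ("text/x-c++; charset=us-ascii", "application/pdf"):
        print(label, "->", normalize_mime(label))
```

The obvious cost of this approach is that a deposit that really does contain C source code would be relabeled too, which is part of why none of these options feels like a clean fix.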

I think in the long run we might decide to live with the problem, using the output of our identification service as more of a recommendation, and we'll rely on the team of data managers to change those recommendations that don't match reality.
