Monday, April 2, 2012

FLAME update

We have three different tracks running on FLAME.

One track is conducting an analysis of the business requirements ICPSR has for what we are calling our "self-archived" collection.  This is a collection of material best represented today by our Publication-Related Archive, a set of materials that receives very, very little scrutiny between time of deposit and time of release on the web site.  We are imagining a future world where the quantity of "self-archived" materials increases dramatically over today's volumes, driven by NIH and NSF requirements to share and manage data.

I see the following questions generating the most discussion on this track:  How much disclosure review is necessary before releasing the content publicly?  Should the depositor have "edit" access to the metadata?  If so, should it be moderated or completely open?  How much "touch" does ICPSR need to have on these materials?

Another track is working on a crisp, concrete definition of what it means to "normalize" a system file from SAS, SPSS, or Stata.  ICPSR has long said that our approach is to "normalize" such files, producing plain ASCII data and set-ups, but what does that really mean?  And is that really possible?

I see the following questions generating the most discussion on this track:  Is ASCII the right thing, or ought it be a Unicode character set?  Are set-ups the right documentation or should it be DDI XML?  If we choose the former, is it sufficient to produce set-ups compatible with the original content type (e.g., SAS setups for a SAS file)?  What about precision?  Length of variable names?  Question text?  Is it possible to normalize without loss, and if not, how much loss is acceptable?  Can a computer do this without human intervention 99% of the time?
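To make the "plain ASCII data and set-ups" target concrete, here is a minimal sketch of what a normalizer might emit once a statistical package file has already been read by some converter. The variable names, widths, and data below are invented for illustration, and real set-ups would also carry labels, missing-value codes, and more; this is not ICPSR's actual specification.

```python
# Toy "normalization" output: fixed-width ASCII data plus an SPSS-style
# DATA LIST setup describing the column layout. Assumes the data and
# variable metadata were already extracted from the SAS/SPSS/Stata file.

variables = [  # (name, column width) -- hypothetical layout
    ("CASEID", 4),
    ("AGE", 3),
    ("INCOME", 7),
]
rows = [(1, 34, 52000), (2, 41, 61500)]

def ascii_data(rows, variables):
    """Render each row as one fixed-width ASCII record."""
    return "\n".join(
        "".join(str(value).rjust(width) for value, (_, width) in zip(row, variables))
        for row in rows
    )

def spss_setup(variables, datafile="study.dat"):
    """Render a minimal SPSS DATA LIST setup matching the layout."""
    col, specs = 1, []
    for name, width in variables:
        specs.append(f"  {name} {col}-{col + width - 1}")
        col += width
    return f"DATA LIST FILE='{datafile}' FIXED /\n" + "\n".join(specs) + " .\n"

print(ascii_data(rows, variables))
print(spss_setup(variables))
```

Even this toy version surfaces the questions above: the column widths silently cap numeric precision, and nothing in a fixed-width record carries question text or long variable names, which is where DDI XML would have room that classic set-ups lack.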

And the last track is working on a matrix that maps a set of parameters (inputs) to a resulting preservation commitment and set of actions.  For example, if one has a file that contains "documentation" (the type of content) in XML format in the UTF-8 character set (the format of the file), then perhaps the preservation commitment is "full preservation."

The key questions here, I believe, will be around what the right list of parameters is.  And if any of the parameters uses a controlled vocabulary, what's in the CV?  And what exactly does it mean to have a "full preservation" commitment?  What's involved beyond just keeping the bits around, which is presumably all one does with "bit-level preservation"?
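In its simplest form, the matrix described above is a lookup from a tuple of parameters to a commitment, with a conservative fallback for combinations we don't recognize. The parameter names and table entries below are invented for illustration; working out the real list of parameters and vocabularies is exactly what this track is for.

```python
# Toy parameter-to-commitment matrix: key on (content type, file format)
# and fall back to bit-level preservation for anything unrecognized.
# Entries are hypothetical examples, not ICPSR policy.

COMMITMENTS = {
    ("documentation", "XML/UTF-8"): "full preservation",
    ("data", "ASCII fixed-width"): "full preservation",
    ("data", "proprietary binary"): "bit-level preservation",
}

def commitment(content_type, file_format):
    """Look up the preservation commitment for a file's parameters."""
    return COMMITMENTS.get((content_type, file_format), "bit-level preservation")

print(commitment("documentation", "XML/UTF-8"))  # full preservation
print(commitment("video", "AVI"))                # falls back to bit-level
```

Note that the fallback encodes a policy choice: an unknown combination gets the weakest commitment rather than an error, which is one way to answer the "what's in the CV?" question conservatively.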
