Technology at ICPSR: May 2009

Thursday, May 7, 2009

Two models per study or one?

In an earlier post this week, I proposed a way to store a simply shaped study in Fedora, using ICPSR 8475, the American National Election Survey (cumulative file, 1948-2004), as the example. The method used one Content Model for the study and two separate Content Models for the main content: one for the data file and one for the technical documentation. This translated into a pair of Fedora objects that would look something like this:

In this case we have a single data object which conforms to two different Fedora Content Models. The datasetCM requires a text/plain datastream to hold the plain ASCII data, and the docsetCM requires a text/xml datastream to hold the dataset-level DDI-format XML. Our simple data object above has both of those datastreams, and so conforms to both Content Models.
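To make "conforms to" concrete, here is a minimal sketch in Python. It is not how Fedora itself validates objects. It simply treats each Content Model as a set of required datastreams, and the dictionary layout and function name are my own assumptions. The datastream IDs DATA and DOCS are the ones used in the earlier post.

```python
# Hypothetical stand-ins for the two Content Models: each maps a required
# datastream ID to the MIME type that datastream must carry.
CONTENT_MODELS = {
    "datasetCM": {"DATA": "text/plain"},
    "docsetCM": {"DOCS": "text/xml"},
}


def conforms_to(datastreams, model):
    """Return True if every datastream the model requires is present
    with the expected MIME type."""
    return all(datastreams.get(dsid) == mime for dsid, mime in model.items())


# Datastreams of the simple data object (ID -> MIME type). Only DATA and
# DOCS matter for the check; the DC entry is here for illustration.
anes_object = {
    "DC": "text/xml",
    "DATA": "text/plain",
    "DOCS": "text/xml",
}

# Names of every Content Model the object satisfies.
matching = [
    name for name, model in CONTENT_MODELS.items()
    if conforms_to(anes_object, model)
]
print(matching)
```

An object holding only a DATA datastream would satisfy datasetCM but not docsetCM, which is the situation the more complex study below runs into.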

Using two separate Content Models is probably over-engineering for this study, but it makes sense for a more complex one. Let's take a look at the Intensive Community Study, ICPSR 6849.

This study features many pairs of data files that share a common codebook. In fact, all but one of the data files are part of just such a pair. In this case bundling the documentation directly with the dataset could lead to duplication in the repository, or the creation of special-case rules in software, and so instead we separate the content into three separate data objects: ICS data, ISR data, and the shared documentation:

We also have a parent-level data object, the study, which serves as a container for these three data objects, plus many other similar trios, each made up of a pair of data files and their shared codebook:

In this case the study data object would actually have many more values stored in its RELS-EXT datastream than are shown above.
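To get a feel for how quickly those RELS-EXT values add up, here is a rough sketch that lists the membership relationships the study object would carry. Everything except the study number is hypothetical: the member naming scheme, the trio names, and the use of a plain "hasMember" label for the predicate are assumptions made for illustration.

```python
def study_members(study_pid, trio_names):
    """Return (subject, predicate, object) triples linking a study to the
    three objects in each trio: ICS data, ISR data, and shared docs."""
    triples = []
    for name in trio_names:
        for part in ("ICS-data", "ISR-data", "docs"):
            # Hypothetical member identifier, e.g. ICPSR-6849-trioA-docs.
            member_pid = f"{study_pid}-{name}-{part}"
            triples.append((study_pid, "hasMember", member_pid))
    return triples


# Three made-up trios already give nine membership statements.
triples = study_members("ICPSR-6849", ["trioA", "trioB", "trioC"])
print(len(triples))
```

Each shared codebook appears once per trio, so the documentation is stored a single time rather than once per data file.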

Tuesday, May 5, 2009

Content modeling social science data with Fedora

Here's a first cut at how one might model social science data content (like what we have at ICPSR) in Fedora.

My sense is that the "dataset" is the atomic object of interest. Heading down to the variable level feels too fine-grained for a system like Fedora. And so this led to creating two pretty simple Content Model objects:

Dataset - this has one datastream, DATA, and the MIME-type is text/plain
Docset - this has one datastream, DOCS, and the MIME-type is text/xml, and would hold DDI-format metadata at the dataset-level

I thought I would also need a Content Model for the basic unit of dissemination we use:

Study - this has one datastream, DOCS, and the MIME-type is text/xml, and would hold DDI-format metadata at the study-level

A simple study at ICPSR, say 8475, which has one data file and one documentation file, might then consist of a pair of Fedora data objects. I'll list the persistent ID (PID) first, followed by a list of datastreams in parentheses.

ICPSR-8475 (DC, RELS-EXT, AUDIT, DOCS)

The RELS-EXT datastream would express relationships to show that it contains two members, a Data object following the Dataset Content Model, and a Data object following the Docset Content Model. The RELS-EXT datastream would also assert a hasModel relationship to the Study Content Model. We also need a mechanism for storing access controls and license terms, but I'm still learning about the XACML stuff that might be a good way to do this. It may also make sense to have a datastream for DDI 2.x metadata and one for DDI 3.x metadata rather than just a single one.

Likewise, we would also have:

ICPSR-ANES-1948-2004 (DC, RELS-EXT, AUDIT, DATA, DOCS)

The RELS-EXT datastream would express a memberOf relationship to ICPSR-8475, a hasModel relationship to the dataset Content Model, and a hasModel relationship to the docset Content Model. As with the study data object above, I still need to sort out the mechanism for storing access controls and license terms, and whether two datastreams rather than one would be appropriate for DDI.
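As a rough illustration, the RELS-EXT content for this object could be generated along these lines. This is a sketch under assumptions rather than a tested ingest recipe: the function name is mine, the Content Model identifiers datasetCM and docsetCM are placeholders, and the object identifiers are copied from this post as written. The namespaces are the ones Fedora uses for its relationship and model ontologies, where the membership predicate is spelled isMemberOf.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL = "info:fedora/fedora-system:def/relations-external#"
MODEL = "info:fedora/fedora-system:def/model#"


def build_rels_ext(pid, member_of, models):
    """Build RDF/XML stating that `pid` is a member of `member_of`
    and has each Content Model listed in `models`."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("rel", REL)
    ET.register_namespace("fedora-model", MODEL)

    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF}}}Description",
        {f"{{{RDF}}}about": f"info:fedora/{pid}"},
    )
    # Membership in the parent study object.
    ET.SubElement(
        desc, f"{{{REL}}}isMemberOf",
        {f"{{{RDF}}}resource": f"info:fedora/{member_of}"},
    )
    # One hasModel assertion per Content Model.
    for model_pid in models:
        ET.SubElement(
            desc, f"{{{MODEL}}}hasModel",
            {f"{{{RDF}}}resource": f"info:fedora/{model_pid}"},
        )
    return ET.tostring(root, encoding="unicode")


# Placeholder Content Model identifiers; real ones would be Fedora PIDs.
rels_ext_xml = build_rels_ext(
    "ICPSR-ANES-1948-2004", "ICPSR-8475", ["datasetCM", "docsetCM"]
)
print(rels_ext_xml)
```

The study object's RELS-EXT would be built the same way, with a single hasModel assertion pointing at the Study Content Model.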

This then assumes that we might create the dissemination formats, like SPSS Export, on the fly since there isn't a datastream for that stuff. That might work, or we could add additional datastreams to the dataset that would point to the dissemination formats. We might want these to use the Externally Referenced Content control group if we consider the dissemination formats to be somewhat ephemeral; the alternative would be Managed Content if we wanted to be able to manage the content in Fedora, perhaps allowing one to roll back to previous versions.
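The choice between the two control groups can be written down as a small rule. The sketch below is only that: the class, the function, and the "SPSS" datastream ID are hypothetical, and the single-letter codes E and M are the ones Fedora uses for Externally Referenced Content and Managed Content.

```python
from dataclasses import dataclass

CONTROL_GROUPS = {
    "E": "Externally Referenced Content",
    "M": "Managed Content",
}


@dataclass
class DatastreamPlan:
    dsid: str           # datastream ID, e.g. a hypothetical "SPSS"
    control_group: str  # "E" or "M"


def plan_dissemination(dsid, ephemeral):
    """Pick E for formats we treat as ephemeral and can regenerate,
    M for formats we want Fedora to store and version."""
    return DatastreamPlan(dsid, "E" if ephemeral else "M")


plan = plan_dissemination("SPSS", ephemeral=True)
print(plan.dsid, CONTROL_GROUPS[plan.control_group])
```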

Next: So why two different Content Models, one for data and one for documentation, rather than just a single one?