Thursday, May 7, 2009

Two models per study or one?

In an earlier post this week, I proposed a method for storing a study with a simple shape in Fedora: ICPSR 8475, the American National Election Studies cumulative file (1948-2004). This method used one Content Model for the study, plus two separate Content Models for the main content: one for the data file and one for the technical documentation. This translated into a pair of Fedora objects that would look something like this:



In this case we have a single data object which conforms to two different Fedora Content Models. The datasetCM requires a text/plain datastream to hold the plain ASCII data, and the docsetCM requires a text/xml datastream to hold the dataset-level DDI-format XML. Our simple data object above has both of those datastreams, and so conforms to both Content Models.
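The conformance idea can be sketched in a few lines of Python. This is only an illustration, not Fedora's API: the `ContentModel` and `DataObject` classes, the `conforms_to` helper, the `demo:` PIDs, and the datastream IDs `DATA` and `DDI` are all hypothetical, and the rule is reduced to "the object carries a datastream of each required MIME type."

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContentModel:
    """Hypothetical stand-in for a Fedora Content Model."""
    pid: str
    required_mime_types: frozenset


@dataclass
class DataObject:
    """Hypothetical stand-in for a Fedora data object."""
    pid: str
    # Maps datastream ID -> MIME type of that datastream.
    datastreams: dict = field(default_factory=dict)


def conforms_to(obj: DataObject, model: ContentModel) -> bool:
    """True if the object has a datastream of every MIME type the model requires."""
    present = set(obj.datastreams.values())
    return model.required_mime_types <= present


# The two Content Models described above (PIDs are made up).
dataset_cm = ContentModel("demo:datasetCM", frozenset({"text/plain"}))
docset_cm = ContentModel("demo:docsetCM", frozenset({"text/xml"}))

# One data object carrying both the ASCII data and the DDI XML.
simple_object = DataObject(
    "demo:8475-data",
    {"DATA": "text/plain", "DDI": "text/xml"},
)

# An object holding only documentation, for contrast.
docs_only = DataObject("demo:docs-only", {"DDI": "text/xml"})

print(conforms_to(simple_object, dataset_cm))  # True
print(conforms_to(simple_object, docset_cm))   # True
print(conforms_to(docs_only, dataset_cm))      # False
```

The single object passes both checks, which is all that "conforms to both Content Models" means here; the documentation-only object passes just one.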

Using two separate Content Models is probably over-engineering for this study, but it makes sense for a more complex one. Let's take a look at the Intensive Community Study, ICPSR 6849.

This study features many pairs of data files that share a common codebook. In fact, all but one of the data files are part of just such a pair. Bundling the documentation directly with each dataset would either duplicate the codebook in the repository or require special-case rules in software, so instead we split the content into three data objects: ICS data, ISR data, and the shared documentation:




We also have a parent-level data object, the study, which serves as a container for these three data objects, plus many other similar trios, each made up of a pair of data files and their shared codebook:


In this case the study data object would actually have many more values stored in its RELS-EXT datastream than are shown above.
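To give a feel for how quickly those RELS-EXT values add up, here is a rough Python sketch that builds the study's relationships as plain (subject, predicate, object) triples. The PIDs, the `trio` and `study_rels_ext` helpers, and the count of three trios are all made up for illustration. Using `hasMember` from the study to each child is also an assumption; the relationships could just as well point the other way, with `isMemberOf` on each child.

```python
# Predicate URIs from Fedora's model and relationship ontologies.
HAS_MODEL = "info:fedora/fedora-system:def/model#hasModel"
HAS_MEMBER = "info:fedora/fedora-system:def/relations-external#hasMember"


def fedora_uri(pid: str) -> str:
    """Turn a PID such as 'demo:6849' into an info:fedora URI."""
    return f"info:fedora/{pid}"


def trio(prefix: str, n: int) -> list:
    """Hypothetical PIDs for one ICS data / ISR data / shared docs trio."""
    return [f"{prefix}-ics-{n}", f"{prefix}-isr-{n}", f"{prefix}-docs-{n}"]


def study_rels_ext(study_pid: str, model_pid: str, member_pids: list) -> list:
    """Build the triples the study's RELS-EXT would hold under these assumptions."""
    subject = fedora_uri(study_pid)
    triples = [(subject, HAS_MODEL, fedora_uri(model_pid))]
    triples += [(subject, HAS_MEMBER, fedora_uri(pid)) for pid in member_pids]
    return triples


# Three trios is an arbitrary placeholder, not the real count for ICPSR 6849.
members = [pid for n in range(1, 4) for pid in trio("demo:6849", n)]
triples = study_rels_ext("demo:6849", "demo:studyCM", members)

for s, p, o in triples:
    print(s, p, o)
```

Each trio adds three member relationships to the study, so the number of values in RELS-EXT grows linearly with the number of data-file pairs.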
