Technology at ICPSR: May 2011

Tuesday, May 31, 2011

Technical solutions to non-technical problems

I was reading Ars Technica the other day and came across an article they ran on the PROTECT IP bill. In brief, the bill aims to improve the protection of intellectual property by using the domain name system (DNS) to re-route traffic away from the bad guys.

I didn't finish reading the article, but I did read a white paper cited by the article. Its authors are experts in securing and operating the DNS, and they are quite opposed to the bill. The authors point out (correctly) that while protecting intellectual property is a fine idea, going about it by (effectively) breaking the DNS within the United States is a very silly idea.

The DNS is the large, successful, distributed system that maps names (like www.icpsr.umich.edu) to numeric IP addresses (like 141.211.146.80). This is handy since people can often remember a DNS-style name more easily than a "dotted quad" set of numbers. The stability, performance, and correctness of the DNS are among the things that make the Internet work.

The idea in the bill is that the US government would be able to order service providers to fiddle with the answers returned by DNS servers. And so, say, if ICPSR started stealing intellectual property to make available on our web site, the government could arrange to have the answer to the question "What's the IP address of www.icpsr.umich.edu?" changed from 141.211.146.80 to the IP address of a government web site that would explain to the user that they had been redirected, and that ICPSR is a bunch of intellectual property stealing bad guys.

That should work, right?

Unless.... people really do want to get to that stolen intellectual property. And so instead of typing www.icpsr.umich.edu into their browser, they type the IP address. Or they add an entry to their "hosts file" to map the name to the IP address (instead of relying on the DNS). Or someone releases some malware that fiddles with their hosts file or their registry (if they run Windows). Or, in this case, ICPSR moves its DNS service outside the borders of the US. Or....

So lots of ways to work around the "solution."
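
To make the hosts-file workaround concrete, here is a minimal Python sketch of how name resolution usually consults local entries before asking the DNS. It is a toy model built on dictionaries rather than real lookups, and the redirect address 192.0.2.1 is a placeholder from the documentation address range, not anything named in the bill.

```python
# Toy model of name resolution: local hosts entries win over DNS answers.
# All data here is illustrative; no real network lookups are made.

REAL_ADDRESS = "141.211.146.80"   # address quoted in the post
REDIRECT_ADDRESS = "192.0.2.1"    # hypothetical "you were redirected" site

# What a service provider's DNS server might return after being ordered
# to alter its answer for this name.
filtered_dns = {"www.icpsr.umich.edu": REDIRECT_ADDRESS}


def resolve(name, hosts_file, dns):
    """Return the address for name, checking the hosts file before the DNS."""
    if name in hosts_file:
        return hosts_file[name]
    return dns.get(name)


# A user relying on the filtered DNS gets redirected...
redirected = resolve("www.icpsr.umich.edu", {}, filtered_dns)

# ...but a single hosts-file entry undoes the filter.
via_hosts = resolve(
    "www.icpsr.umich.edu",
    {"www.icpsr.umich.edu": REAL_ADDRESS},
    filtered_dns,
)

print(redirected, via_hosts)
```

Typing the IP address straight into the browser skips the `resolve` step entirely, which is why that workaround is even simpler.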

It feels like the government is trying to design a technical solution to a non-technical problem.

I'm shocked. Shocked.

This is probably one of the best examples of how good intentions can still lead people to do the wrong thing. The very wrong thing.

Monday, May 30, 2011

Vaults of Heaven: Visions of Byzantium

We made a family trip to one of the University of Michigan's smaller museums, a little gem called the Kelsey Museum of Archaeology. Even though they added a large new wing for exhibition space, the entire museum is still quite small, and is perfect for smaller children (at least those old enough to be interested in going to a museum).

We wanted to be sure to check out a special exhibit that ended on May 27, 2011 called Vaults of Heaven: Visions of Byzantium. While the museum's permanent collection contains artifacts from the ancient world (Egypt, Greece, Rome), this exhibit featured relatively recent items from between the sixth and fifteenth centuries. There's an image from the exhibit on their web site, and I've created a link to it to the left.

One thing that struck me about the items in the exhibit was the similarities -- and the differences! -- between how a museum like the Kelsey preserves objects and makes them available for access, and how a place like ICPSR does it. Another reminder about how the digital world and the physical world are very, very different.

For example, one item available to view at the Kelsey was a small piece of pottery with an image of a person on it. No doubt it is very fragile, and it probably needs to be kept in a very safe, climate-controlled location when it isn't on exhibit. There was a card next to the object that contained some (all?) of the information the museum had collected about it: When it was likely made, what it was used for, and the identity of the person in the image. (St Simeon the Stylite.) This information also needs to be preserved and made accessible, but it would not need to be kept in the same type of storage as the object. In fact, it may well be the case that the metadata in this case is kept in digital format, and only printed out on a card for access purposes. And so one of the key preservation tasks would seem to be maintaining a reliable, bi-directional, long-lived link between the metadata and the object. If the link breaks, then the task of finding the object (or re-discovering what it is) becomes very, very difficult.

In the digital world of ICPSR, we face some of the same issues (climate controlled storage for objects, purchasing and managing storage space, linking metadata to objects), but my sense is that we have a much easier time of it when it comes to linking metadata to objects.

For one thing, both our objects and our metadata are built from the same stuff - bits - and so keeping them in the same type of storage is easy and makes sense. (And it sure is much easier to make copies of bits than centuries-old pieces of pottery.)

Also, because our stuff resides in the digital world and tends to be kept in a file, there's the filename that one can use to help identify the object, even in the absence of metadata. And so if I have a digital object without metadata, I still have the filename (and the content, of course) to help me identify it.

And, for some types of files, like PDF, one can bundle a great deal of the metadata inside the file itself. This creates a very close coupling between the object and its description. This type of close coupling is also available via some of the stat packages, but becomes less useful if the file may only be read successfully with proprietary software.

Friday, May 27, 2011

TRAC: B4.3: Preserving Content Information

B4.3 Repository preserves the Content Information of archival objects (i.e., AIPs).

The repository must be able to demonstrate that the AIPs faithfully reflect what was captured during ingest and that any subsequent or future planned transformations will continue to preserve that aspect of the repository’s holdings.

This requirement assumes that the repository has a policy specifying that AIPs cannot be deleted at any time. This particularly simple and robust implementation preserves links between what was originally ingested, as well as new versions that have been transformed or changed in any way. Depending upon implementation, these newer objects may be completely new AIPs or merely updated AIPs. Either way, persistent links between the ingested object and the AIP should be maintained.

Evidence: Policy documents specifying treatment of AIPs and whether they may ever be deleted; ability to demonstrate the chain of AIPs for any particular digital object or group of objects ingested; workflow procedure documentation.

ICPSR has a pretty good story to tell here.

AIPs are available to staff as read-only content. Only system administrators have write access to AIPs, and we try to limit the opportunity for accidental deletion or corruption as much as possible. This means that AIPs are stored in a single container, and that container isn't particularly accessible. The read-only access is granted by a special-purpose web application that runs inside ICPSR's Secure Data Environment (which should be the topic of a future post).

ICPSR's core business process generates new AIPs, and these never overwrite an existing AIP. It is rather like software revision control where new versions just keep getting created. Although in the case of our AIPs we are not storing merely the changes between version N and N + 1; we are storing the complete version of each.
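
The "never overwrite, always store the complete version" idea can be sketched in a few lines of Python. This is only an illustration of the approach described above, not ICPSR's actual system: the `AppendOnlyStore` class, its method names, and the `study-0001` identifier are all invented for the example.

```python
import hashlib


class AppendOnlyStore:
    """Hypothetical in-memory store: every save adds a new full version."""

    def __init__(self):
        self._versions = {}  # object id -> list of complete byte strings

    def add_version(self, object_id, content):
        """Store a complete copy as the next version; return its number."""
        history = self._versions.setdefault(object_id, [])
        history.append(bytes(content))
        return len(history)  # versions are numbered from 1

    def get(self, object_id, version):
        """Return the full content of one version."""
        if version < 1:
            raise IndexError("versions are numbered from 1")
        return self._versions[object_id][version - 1]

    def chain(self, object_id):
        """Return (version, SHA-256 digest) pairs, oldest first."""
        return [
            (number, hashlib.sha256(data).hexdigest())
            for number, data in enumerate(self._versions[object_id], start=1)
        ]


store = AppendOnlyStore()
first = store.add_version("study-0001", b"original deposit")
second = store.add_version("study-0001", b"normalized deposit")

# The earlier version is still there, untouched, after the second save.
print(store.get("study-0001", 1))
print(len(store.chain("study-0001")))
```

Because there is no update or delete method at all, the chain of versions for any object can always be listed, which is the kind of evidence B4.3 asks for.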

We have extensive workflow procedure documentation. A gent named Cole Whiteman is the author (artist?) for almost all of the documents, and they are the products of many hours of conversation and whiteboarding with the ICPSR staff responsible for producing the content that ends up in AIPs.

Wednesday, May 25, 2011

April 2011 deposits at ICPSR

April deposits at ICPSR:

# of files	# of deposits	File format
88	13	application/msword
111	20	application/pdf
11	4	application/vnd.ms-excel
15	6	application/x-sas
27	11	application/x-spss
3	2	application/x-stata
1	1	application/x-zip
2	2	message/rfc822 7bit
1	1	text/html
1	1	text/html; charset=us-ascii
3	3	text/plain; charset=unknown
113	17	text/plain; charset=us-ascii
1	1	text/rtf

Nothing too exciting this month. Lots of the usual types of documentation formats (PDF, Word, plain text), and lots of the usual types of data (SAS, SPSS, Stata, and Excel).
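
A table like this can be produced from a list of (deposit, MIME type) pairs, one pair per file. The Python sketch below is a guess at the shape of such a tally, not the reporting code ICPSR actually runs; the deposit identifiers and sample records are invented for illustration.

```python
from collections import defaultdict


def tally_formats(records):
    """Count files and distinct deposits per MIME type.

    records is an iterable of (deposit_id, mime_type) pairs, one per file.
    Returns {mime_type: (file_count, deposit_count)}.
    """
    files = defaultdict(int)
    deposits = defaultdict(set)
    for deposit_id, mime_type in records:
        files[mime_type] += 1
        deposits[mime_type].add(deposit_id)
    return {mime: (files[mime], len(deposits[mime])) for mime in files}


# Invented sample: four files across two deposits.
sample = [
    ("deposit-A", "application/pdf"),
    ("deposit-A", "application/pdf"),
    ("deposit-B", "application/pdf"),
    ("deposit-B", "application/x-spss"),
]
summary = tally_formats(sample)

# Print in the same column order as the table: files, deposits, format.
for mime in sorted(summary):
    file_count, deposit_count = summary[mime]
    print(f"{file_count}\t{deposit_count}\t{mime}")
```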

Monday, May 23, 2011

Disaster Preparedness Planning - Zombie Apocalypse

Now this is the kind of disaster planning that I could really embrace:

Uncle Sam wants YOU to be prepared for a zombie apocalypse.

The Centers for Disease Control and Prevention, known best for stamping out health threats like Ebola and E. coli, is now advising people how to prepare for a zombie invasion.

A blog post by Assistant Surgeon General Ali Khan instructs readers to stock up on food and water, not to mention first aid supplies (“Although you’re a goner if a zombie bites you, you can use these supplies to treat basic cuts and lacerations that you might get during a tornado or hurricane,” the agency says).

From the WSJ Health Blog

Whenever we do this sort of thing, it's always "What if we lose power?" or "What if there is a really heavy snow?"

Harummph.

I say, "What if flesh-eating zombies attack, and they are looking for a side order of social science datasets?"

Friday, May 20, 2011

TRAC: B4.2: Storage and migration strategies

B4.2 Repository implements/responds to strategies for archival object (i.e., AIP) storage and migration.

At least two aspects of the strategy must be acted upon: that which pertains to how AIPs are currently stored (including physical requirements, media requirements, location of copies, formats and metadata) and that which may require AIP migration of any form. For example, AIP migrations that result in transformations of content need to be tracked to allow subsequent users to understand the repository’s processing implications.

If a repository has not yet needed to carry out any sort of preservation strategy on AIP(s), it must demonstrate that its policy has not required it yet.

Evidence: Institutional technology and standards watch; demonstration of objects on which a preservation strategy has been performed; demonstration of appropriate preservation metadata for digital objects.

Perhaps the biggest AIP migration so far at ICPSR has been the move from magnetic tape as the storage medium to "spinning disk." Here's part of the story.

In 2005 ICPSR leased and managed three off-site storage locations. Two of the locations were small, and contained older magnetic tape formats, such as IBM 3480 cartridge. One of the locations was quite large ("the warehouse"), and that location had an assortment of tapes (IBM cartridge, 9-track, and more modern DLT) and paper. The paper was a mix of old copies of content ICPSR used to distribute via post (e.g., codebooks from the 1980s and earlier) and backup material related to the born-digital content (e.g., letters from researchers about their datasets).

A member of my team (Asmat Noori, who manages IT operations) led an effort with four goals:

Move digital content from tape to disk
Discard old distribution paper content (the old codebooks)
Transfer archival paper to Iron Mountain for safe-keeping
Close the three off-site locations

Asmat's team consisted of a handful of full-time ICPSR staff, a few student temps, and part of a software developer's time to build tracking systems that matched boxes of content to Iron Mountain locations and old media to new.

The team worked through two projects simultaneously. The first was to clear out all of the superfluous paper from the warehouse location. There was an enormous quantity of paper to discard, and we worked with a local company to recycle as much as we could. The second was to retrieve the magnetic tapes in batches so that they could be copied to disk, verified, and then discarded. For each tape we captured its table of contents, and the content on the tape. We performed essential sanity-checking on the restored content, and for each file, recorded the source (e.g., file X came from tape Y). The entire process took a bit under two years.

After moving the digital content to disk, we started two new activities: weekly fixity checks of each item, and daily copying of content to remote locations. These new activities gave us more copies, in more locations across the United States, with greater confidence that each copy is in good order.
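
For readers unfamiliar with fixity checking, here is a minimal Python sketch of the idea: record a digest for each file, then periodically recompute it and flag anything that no longer matches. The choice of SHA-256, the manifest layout, and the function names are assumptions made for this example; the post does not say which algorithm or tooling ICPSR uses.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest):
    """Compare each file against its recorded digest.

    manifest maps a file path to the digest recorded earlier.
    Returns the sorted list of paths that are missing or have changed.
    """
    failures = []
    for path, expected in manifest.items():
        if not os.path.exists(path) or sha256_of(path) != expected:
            failures.append(path)
    return sorted(failures)


# Demonstration with a temporary file standing in for an archived item.
with tempfile.TemporaryDirectory() as workdir:
    item = os.path.join(workdir, "item.dat")
    with open(item, "wb") as handle:
        handle.write(b"content restored from tape")
    manifest = {item: sha256_of(item)}
    clean_run = check_fixity(manifest)      # nothing has changed yet

    with open(item, "ab") as handle:        # simulate silent corruption
        handle.write(b"!")
    damaged_run = check_fixity(manifest)
    damaged_is_item = damaged_run == [item]
```

With copies in several locations, a failure reported at one site can be repaired from a copy that still passes the check at another.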

The printed material still gets some use, but not that often. Storing the material with Iron Mountain, and paying for occasional retrieval, costs much, much less than operating a warehouse and paying a staff member (even a part-time temp) to move content between a warehouse and ICPSR.

And there's the evidence of implementing a storage and migration strategy.

Thursday, May 19, 2011

Google and Facebook join MyData as one of the ways to login to ICPSR

After an unexpected delay we finally rolled out two new ways to authenticate to the ICPSR web site. In addition to using the original ICPSR MyData account, web site visitors may now use either their Facebook account or Google ID to log in to the ICPSR web site.

We wanted to make this type of service available because it is clear that most of our web site visitors who download data use it just once, or only infrequently. And so for every researcher or scholar that uses ICPSR on a regular basis, there are two casual users who just want to download a dataset as quickly and easily as possible.

For this latter audience MyData doesn't make sense. Why create a new account with a new password for something that may never be used again?

Now, people who would like to use their Facebook account or Google ID to identify themselves may do so. And, better still, if the person is already logged into either service, they do not have to log in again to ICPSR. (This is often called SSO for Single Sign-On.)

We also looked at supporting the OpenID service, but our experience was that OpenIDs were a little too confusing for people. The idea that one's identity is a URL rather than an email address is less common, and it felt too foreign to most people.

We hope the new login options are useful, and make it even easier to use ICPSR.

Tuesday, May 10, 2011

ICPSR has been busy chatting with our good friends at DuraSpace over the past month or two. We have been an active member of their DuraCloud pilot. This is a hosted service for content and services where the big cloud providers deliver the compute and the storage, and DuraSpace delivers the software and services. Our main use has been as a supplement to our archival storage solution.

The project looks like it will soon finish the jump from "research pilot" to "production service." We participated in a webinar yesterday which introduced the updated management console. It looks good, but we did volunteer one feature request: a "financial administrator" role. The idea is that a login assigned to this role would have read access to invoices and financial statements, but not have any access to the content, services, etc. This is a role we would love to have with the Amazon AWS IAM stuff, but the Amazon guys still haven't identified the monthly bill as one of the system elements that would benefit from such a role. (And so that means that someone like me has to navigate through the management console to grab billing information each month, then save it, upload it into the absolutely ghastly UMich financial systems, and ....)
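
The requested role amounts to a small access-control rule. The Python sketch below shows one way to express it; the role table, resource names, and `is_allowed` function are invented for illustration and do not describe how DuraCloud or AWS IAM actually model permissions.

```python
# Hypothetical role table; role and resource names are invented here.
ROLE_PERMISSIONS = {
    "administrator": {
        ("content", "read"), ("content", "write"),
        ("services", "read"), ("services", "write"),
        ("billing", "read"),
    },
    # Read access to invoices and statements, and nothing else.
    "financial administrator": {
        ("billing", "read"),
    },
}


def is_allowed(role, resource, action):
    """Return True if the role grants the action on the resource."""
    return (resource, action) in ROLE_PERMISSIONS.get(role, set())


can_see_invoices = is_allowed("financial administrator", "billing", "read")
can_see_content = is_allowed("financial administrator", "content", "read")
print(can_see_invoices, can_see_content)
```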

DuraCloud is a nice fit for ICPSR since it gives us a single management interface for syncing content to multiple cloud providers (Amazon and Rackspace today, but Microsoft down the road), and for invoking preservation-oriented services over the content, such as fixity checking.

You can find more info about DuraCloud on their web site, and a nice little piece they wrote about ICPSR too.

Wednesday, May 4, 2011

TRAC: B4.1: Employing documented preservation strategies

B4.1 Repository employs documented preservation strategies.

Documented preservation strategies include evidence of planning for strategies not yet employed against the repository’s digital objects. A repository is likely to employ multiple strategies. Different strategies may be employed by class (type) of digital object, and/or multiple strategies may be employed on a single object class. This will depend upon local repository policies and practices, though any such strategy decisions should be documented and should be based on sound community practice.

Minimally, documentation of preservation strategies must be included in repository policies and practices. Good repository practice also requires that preservation strategies employed against digital objects are recorded in the object’s preservation metadata. (See also B3.3.)

Evidence: Documentation of strategies and their appropriateness to repository objects; evidence of application (e.g., in preservation metadata); see B3.3.

ICPSR tends to use a single preservation strategy (normalization) for its content, perhaps because it is relatively uniform - survey data and documentation. There's a nice explanation of this strategy on the web site here, which also defines the term.

I wasn't able to find a document which mapped a content type to a specific, normalized format, and so to make our documentation a bit more complete, I'll offer such a table here:

File type	Specific strategy
Survey data (original format varies, but is often Excel, SAS, Stata, or SPSS)	Normalize to plain character data (ASCII) + "setup" files, one for each of the major statistical analysis packages
Study-level and dataset-level metadata	DDI v2 XML
Technical documentation	Both PDF and TIFF; would like to transition to PDF/A where possible
Other textual artifacts, such as a user guide or questionnaire	Plain text or PDF or TIFF

If and when ICPSR really jumps into new types of content, such as video or still images, clearly those types of content will need different strategies.
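
The table above can also be read as a simple lookup from content type to strategy. The Python sketch below transcribes it that way; the dictionary keys are shorthand labels invented for this example, and failing loudly on an unknown type is a design choice of the sketch rather than documented ICPSR behavior.

```python
# Strategy table transcribed from the post; the dictionary keys are
# shorthand labels invented for this sketch.
NORMALIZATION_STRATEGY = {
    "survey data": 'plain character data (ASCII) + "setup" files',
    "metadata": "DDI v2 XML",
    "technical documentation": "PDF and TIFF",
    "other text": "plain text, PDF, or TIFF",
}


def strategy_for(content_type):
    """Return the documented strategy, or fail loudly if none exists."""
    try:
        return NORMALIZATION_STRATEGY[content_type]
    except KeyError:
        raise LookupError(
            f"no preservation strategy documented for {content_type!r}"
        ) from None


metadata_strategy = strategy_for("metadata")

# A content type with no entry, such as video, is reported as a gap.
try:
    strategy_for("video")
    video_documented = True
except LookupError:
    video_documented = False
```

Raising an error for an undocumented type, rather than falling back to a default, makes a missing strategy visible as soon as a new kind of content arrives.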