Thursday, September 30, 2010

Designing Storage Architectures for Digital Preservation - Day One, Part Two

The second session of the first day featured technologists from higher education who either operate large archives, or who build systems for operating an archive.

Cory Snavely (University of Michigan, Hathitrust) gave a brief overview of Hathitrust, a repository of digital content shared by many of the Big Ten schools and a few other partners.

Brad McLean (Duraspace) reported on DuraCloud and results from the initial pilot partners.  (ICPSR is part of the current pilot, but was not a member of the original, smaller pilot program.)  He noted these concerns about using the cloud for digital preservation:
  1. Some services (such as Amazon's S3) have limits on the size of objects (files)
  2. Bandwidth limits on a per-server basis can impede function and performance
  3. Large files are troublesome
  4. Performance across the cloud can vary widely
  5. (File) naming matters; some storage services limit the type of characters in a name
Brad reiterated a comment made by several others:  A standard for checksums would be good to have.
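
Two of these concerns (object size and file naming) are easy to check before an upload ever starts. Here is a minimal sketch of such a pre-flight check. The size cap and the "safe" character set below are assumptions chosen for illustration, not the published limits of any particular storage service, and the function name is made up.

```python
import re

# Hypothetical limits for illustration; real services publish their own.
MAX_OBJECT_BYTES = 5 * 1024**3                  # assumed per-object size cap
SAFE_NAME = re.compile(r"^[A-Za-z0-9._/-]+$")   # assumed "safe" character set


def preflight(files, max_bytes=MAX_OBJECT_BYTES):
    """Return (name, problem) pairs for files likely to cause trouble.

    `files` is an iterable of (name, size_in_bytes) pairs.
    """
    problems = []
    for name, size in files:
        if size > max_bytes:
            problems.append((name, "exceeds object size limit"))
        if not SAFE_NAME.match(name):
            problems.append((name, "name contains characters outside the safe set"))
    return problems
```

Running this over a deposit's file list before syncing would surface the oversized or oddly named files up front rather than partway through a long transfer.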

Matt Schulz (MetaArchive) updated us on the MetaArchive, including a current partnership with Chronopolis.

David Minor (San Diego Supercomputer Center) updated us on the Chronopolis project.  David noted that SDSC is reimplementing its data center, and described three levels of storage in its future architecture:
  1. High-performance storage for scratch content
  2. Traditional filesystem storage
  3. Archival storage
The follow-on discussion covered the right type of interface for accessing content in archival storage (POSIX, RESTful, object-oriented, etc.) and the trade-off between using long-lived media and systems for digital preservation vs. taking advantage of advances in technology by using short-lived media and systems.  David Rosenthal also reminded everyone that we "... cannot test large systems for zero media failures."

I'll write up my notes from Day Two early next week.

Wednesday, September 29, 2010

Designing Storage Architectures for Digital Preservation - Day One, Part One

I attended an event on Monday and Tuesday of last week that was hosted by the Library of Congress: Designing Storage Architectures for Digital Preservation.  I also attended the event last year, and so this was my second time attending.

Like last time, there were many speakers, each giving a five-minute presentation.  Unlike a TED talk, where the presentation materials are built specifically to fit well within five minutes, many speakers had conventional slide decks and raced through them.  Those tended to be the weaker talks since the scope of the material was far too broad for the time allotted.  After a series of presentations there would be group discussion for 15-30 minutes, which ran the gamut from interesting and provocative observations to chasing down rabbit holes.

I know the LoC will post complete information about the event, but here is my abbreviated version.  I've tried to hit what I considered to be the highlights, and so the reader should know that this report isn't complete.

The session opened with a video that argued that the Internet gives us more opportunity to innovate since it lowers the barrier for one's "hunches" to "collide" with those of another, and that innovation occurs when two or more good ideas come together.  Henry Newman then gave a framing overview for the meeting that included these interesting points:

  1. IT components are changing/improving at different rates; for example, processors are getting faster more quickly than buses are getting faster
  2. The preservation community and the IT community use different language to talk about archival storage
  3. Preservation TCO is not well understood
  4. The consumer market is driving the storage industry, not the enterprise market
The first of two sessions featured "heavy users" who spoke about some of the challenges they faced.  The speakers included Ian Soborhoff (NIST), Mark Phillips (University of North Texas), Andy Maltz (Academy of Motion Picture Arts and Sciences), Ethan Miller (University of California - Santa Cruz), Arcot "Raja" Rajasekar (San Diego Supercomputer Center), Tom Garnett (Biodiversity Heritage Library), Barbara Taranto (New York Public Library), Martin Kalfatovic (Smithsonian Institution), and Tab Butler (Major League Baseball Network).  Highlights of their presentations and the follow-on discussion:
  • Experienced recent sea change where it was no longer possible to forecast storage needs whatsoever
  • "Archival storage... whatever that is."
  • Pergamum tome technology looks very interesting for smart, low-power storage
  • iRODS main components:  data server cloud, metadata catalog, and the rule engine
  • "Open access is a form of preservation."
  • If one needs N amount of space for one copy of archival storage, one also needs 2 x N or 3 x N for the ingest process
  • The "long now"
  • The MLB Network data archive will consume 9000 LTO-4 tapes for storage in 2010.
  • "Digital preservation sounds like hoarding."
  • "After our content was indexed by Google, usage went up 10x."
  • Data recovery from corrupted files is a digital preservation concern.
  • Forensics of a format migration is an effective tool for finding problems in a repository.
Next:  the second session of Day One.

Friday, September 24, 2010

TRAC: B1.6: Communicating with depositors

B1.6 Repository provides producer/depositor with appropriate responses at predefined points during the ingest processes.

Based on the initial processing plan and agreement between the repository and the producer/depositor, the repository must provide the producer/depositor with progress reports at specific, predetermined points throughout the ingest process. Responses can include initial ingest receipts, or receipts that confirm that the AIP has been created and stored. Repository responses can range from nothing at all to predetermined, periodic reports of the ingest completeness and correctness, error reports and any final transfer of custody document. Depositors can request further information on an ad hoc basis when the previously agreed upon reports are insufficient.

Evidence: Submission agreements/deposit agreements/deeds of gift; workflow documentation; standard operating procedures; evidence of reporting back.

ICPSR updates the depositor during the ingest process at two main points.

One, after the deposit is signed, ICPSR generates an inventory of the deposited content, and communicates this via email. This gives the depositor the opportunity to identify any content that was uploaded unintentionally or that may have become corrupted. These inventory reports are generated automatically by the deposit system.

Two, if deposited material is later used to produce an ICPSR study, the depositor is notified when that study is made available on ICPSR's web site, and when a normalized version of the content is moved into archival storage.

Thursday, September 23, 2010

ICPSR web site will be unavailable briefly on Monday, September 27, 2010

We've scheduled some maintenance around noon (EDT) on Monday, September 27, 2010.  We normally like to perform this type of work during off-hours, but this particular task is likely to be short-lived (10-15 minutes) and is best performed when we have our full team of software developers available.

My apologies in advance for any inconvenience.

P3P - Platform for Privacy Preferences Project

I came across an interesting paper on P3P (P3P is the Platform for Privacy Preferences Project) which is a W3C standard for expressing the privacy policies of a web site. The paper is from the CMU CyLab, and can be found here (PDF format).

A primary user of P3P is the Internet Explorer browser. It uses a "short form" of the policy to make decisions about whether a web site meets the security criteria one may set in the browser. Since most people never bother to configure different security levels for different sites, in practice any P3P descriptions that match "Medium security" will pass the security check.

The brief summary of the paper is that many of the top sites do not use P3P. Or, if they do use it, they make mistakes in the policy which will confuse browsers. And worse still, some sites seem to use P3P to actively trick browsers into thinking the site gathers no private information when it in fact does.

The paper is long, but many pages are part of an appendix. The main section of the paper is relatively short, well written, and is an interesting read.

Wednesday, September 22, 2010

DuraCloud fixity service testing

Our DuraCloud pilot test is going well. We have uploaded a test collection of nearly 70k files, representing that portion of our archival content that contains public-use datasets. (The datasets are public-use, but our licensing terms restrict access to some of these to our member institutions.)

To the left you can see a snapshot from the DurAdmin webapp that one uses to manage content. I've been using this webapp to view content, check progress, and download files. I've been using a command-line utility called synctool for copying content from ICPSR into DuraSpace, and keeping it synchronized.

The image to the left is the right-side panel from the Services tab of the DurAdmin webapp. I've deployed the Fixity service, and am using it to check the bit-level integrity of the content.

I started the service earlier this morning, and it still has quite a bit of work left to do. The processing-status line shows that the service has started, and that it has checked about 4300 of the files so far.
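
For readers new to the term, a fixity check boils down to recomputing a checksum for each file and comparing it with the value recorded earlier. The sketch below shows that general idea only; it is not how the DuraCloud Fixity service is implemented, the function names are made up, and SHA-256 is an assumed choice of algorithm. Content is held in memory to keep the example self-contained; a real check would stream files from storage.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def check_fixity(stored, recorded):
    """Compare current checksums against previously recorded ones.

    `stored` maps a file name to its current bytes; `recorded` maps a file
    name to the checksum saved at ingest. Returns (name, problem) pairs for
    files that are missing or whose contents no longer match.
    """
    failures = []
    for name, expected in sorted(recorded.items()):
        if name not in stored:
            failures.append((name, "missing"))
        elif sha256_of(stored[name]) != expected:
            failures.append((name, "checksum mismatch"))
    return failures
```

An empty result means every recorded file is present and bit-for-bit unchanged.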

Monday, September 20, 2010

ICPSR increases security of its web transactions

We'll be making a small, but important, configuration change on our web server this week. For a long time we've allowed so-called "weak" ciphers to be used with HTTP connections over SSL (aka HTTPS). This was good for web site visitors who had very old browsers; so old that the browser did not support stronger SSL ciphers. But it is bad news for most of us who are running more recent software since it would allow one to use less robust encryption when exchanging content via HTTPS.

We've been running this newer configuration for many months on a web server we use for staging new content. The many browsers and platforms we use to test new web pages and software work well with this configuration, and so we've decided to move it into the production environment.
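
To give a feel for what "disallowing weak ciphers" means in practice, here is a small sketch using Python's standard ssl module. It is only an analogy: it is not our web server's configuration, and the cipher string is an assumed example written in OpenSSL's cipher-list syntax. The function names are made up as well.

```python
import ssl

# Assumed cipher string for illustration: keep OpenSSL's "HIGH" group and
# explicitly exclude anonymous, null, MD5, RC4, and 3DES suites.
STRONG_ONLY = "HIGH:!aNULL:!eNULL:!MD5:!RC4:!3DES"


def strong_server_context():
    """Build a server-side TLS context that refuses weak cipher suites."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.set_ciphers(STRONG_ONLY)
    return ctx


def enabled_cipher_names(ctx):
    """List the names of the cipher suites the context will offer."""
    return [cipher["name"] for cipher in ctx.get_ciphers()]
```

Printing enabled_cipher_names(strong_server_context()) shows which suites survive the filter on a given machine; the exact list depends on the local OpenSSL build.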

Wikipedia has a nice page that describes the technical details behind the various ciphers that are used with SSL (and its successor TLS).

Friday, September 17, 2010

TRAC: B1.5: Gaining control of deposits

B1.5 Repository obtains sufficient physical control over the digital objects to preserve them.

The repository must obtain complete control of the bits of the digital objects conveyed with each SIP. For example, some SIPs may only reference digital objects and in such cases the repository must get the referenced digital objects if they constitute part of the object that the repository has committed to conserve. This will not always be the case: scholarly papers in a repository may contain references to other papers that are held in a different repository, or not held anywhere at all, and harvested Web sites may contain references to material in the same site or different sites that the repository has chosen not to capture or was unable to capture.

Evidence: Submission agreements/deposit agreements/deeds of gift; workflow documents; system log files from the system performing ingest procedures; logs of files captured during Web harvesting.

This requirement is fairly straightforward for ICPSR, given the type of content that we collect and curate. We gain complete control of the entire deposit, including both the research data and documentation. The deposit may also contain core related materials like the questionnaire that was used to collect the data.

That said, a deposit may be related to other objects outside the scope of ICPSR, such as publications related to the data. In this case ICPSR is not expecting to find such content in the deposit, nor would we tend to curate it even if it were present.

Tuesday, September 14, 2010

Amazon introduces new "micro" instances

ICPSR is taking advantage of a new "micro"-sized virtual machine offered by Amazon Web Services (AWS). Amazon describes the new instance this way:
Micro Instance 613 MB of memory, up to 2 ECUs (for short periodic bursts), EBS storage only, 32-bit or 64-bit platform
This looked like a good fit for the "stealth" DNS server that we run in Amazon's cloud, and so we converted it from a Small Instance - Reserved ($350 for a three year term + $0.03/hour) to a Micro Instance - Reserved ($82 for a three year term + $0.007/hour).
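
The arithmetic behind that decision is simple enough to sketch. The prices are the ones quoted above; the assumption (mine) is that the instance runs every hour of the three-year term, with 365-day years.

```python
HOURS_PER_TERM = 3 * 365 * 24  # assumes continuous use and 365-day years


def reserved_cost(upfront, hourly, hours=HOURS_PER_TERM):
    """Total cost in dollars of a reserved instance over the term."""
    return round(upfront + hourly * hours, 2)


small = reserved_cost(350, 0.03)    # Small Instance - Reserved
micro = reserved_cost(82, 0.007)    # Micro Instance - Reserved
savings = round(small - micro, 2)
```

Under those assumptions the Small instance comes to about $1,138 over three years and the Micro to about $266, a difference of roughly $872 for a single always-on server.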

We have other lightly used instances running in Amazon's cloud, and we'll likely convert them over too.

Monday, September 13, 2010

A snapshot in time: top downloads in August 2010

This is a list of the 25 most-downloaded datasets at ICPSR in August. Note the heavy interest in datasets from our demography archive, including a lot of interest in the public-use datasets belonging to Add Health.

Rank  Downloads  Study and dataset
1 130 ICPSR Study 25383: American National Election Study, 2008: Pre- and Post-Election Survey Dataset 1: American National Election Study, 2008: Pre- and Post-Election Survey
2 124 ICPSR Study 9323: Comparative Project on Class Structure and Class Consciousness: Core and Country-Specific Files Dataset 11: User's Guide, Vol. II - Supplementary Codes
3 116 DSDR Study 21600: National Longitudinal Study of Adolescent Health (Add Health), 1994-2002 Dataset 1: Wave 1, Public Use Data
4 116 SAMHDA Study 26701: National Survey on Drug Use and Health, 2008 Dataset 1: National Survey on Drug Use and Health, 2008
5 113 DSDR Study 21741: Chinese Household Income Project, 2002 Dataset 1: Urban Individual Income, Consumption, and Employment Data
6 109 DSDR Study 21741: Chinese Household Income Project, 2002 Dataset 2: Urban Household Income, Consumption, and Employment Data
7 105 DSDR Study 21741: Chinese Household Income Project, 2002 Dataset 3: Urban Individual Annual Income Data (Survey Appendix)
8 96 DSDR Study 21741: Chinese Household Income Project, 2002 Dataset 7: Rural Household Income, Consumption, Employment, Social Network, Quality of Life, and Village Affairs Data
9 93 DSDR Study 21741: Chinese Household Income Project, 2002 Dataset 4: Urban Household Assets, Expenditure, Income, and Conditions Data (Survey Appendix)
10 92 DSDR Study 21600: National Longitudinal Study of Adolescent Health (Add Health), 1994-2002 Dataset 2: Wave 1, Grand Sample Weights, Public Use Data
11 92 DSDR Study 21741: Chinese Household Income Project, 2002 Dataset 6: Rural Individual Income, Consumption, and Employment Data
12 91 DSDR Study 21600: National Longitudinal Study of Adolescent Health (Add Health), 1994-2002 Dataset 10: Wave 3, Public In-Home Questionnaire, Section 24 Data
13 90 DSDR Study 21600: National Longitudinal Study of Adolescent Health (Add Health), 1994-2002 Dataset 3: Wave 2, Public Use Data
14 90 DSDR Study 21600: National Longitudinal Study of Adolescent Health (Add Health), 1994-2002 Dataset 5: Wave 3, Public In-Home Questionnaire, Section 17 Data
15 89 DSDR Study 21741: Chinese Household Income Project, 2002 Dataset 9: Rural-Urban Migrant Individual Data
16 89 DSDR Study 21741: Chinese Household Income Project, 2002 Dataset 10: Rural-Urban Migrant Household Data
17 88 DSDR Study 21600: National Longitudinal Study of Adolescent Health (Add Health), 1994-2002 Dataset 7: Wave 3, Public In-Home Questionnaire, Section 19 Data
18 88 DSDR Study 21600: National Longitudinal Study of Adolescent Health (Add Health), 1994-2002 Dataset 8: Wave 3, Public In-Home Questionnaire, Section 22 Data
19 88 DSDR Study 21741: Chinese Household Income Project, 2002 Dataset 5: Village Administrative Data
20 88 DSDR Study 21741: Chinese Household Income Project, 2002 Dataset 8: Rural School-Age Children Data
21 87 DSDR Study 21600: National Longitudinal Study of Adolescent Health (Add Health), 1994-2002 Dataset 4: Wave 3, Peabody Picture Vocabulary Test Score Data, Public Use
22 87 DSDR Study 21600: National Longitudinal Study of Adolescent Health (Add Health), 1994-2002 Dataset 14: Wave 3, Public Use Education Data
23 87 DSDR Study 21600: National Longitudinal Study of Adolescent Health (Add Health), 1994-2002 Dataset 18: Network Variables Data, Public Use
24 86 DSDR Study 21600: National Longitudinal Study of Adolescent Health (Add Health), 1994-2002 Dataset 12: Wave 3, Public In-Home Questionnaire
25 85 DSDR Study 21600: National Longitudinal Study of Adolescent Health (Add Health), 1994-2002 Dataset 13: Wave 3, Public Use Grand Sample Weights

Friday, September 10, 2010

TRAC: B1.4: Checking completeness and correctness

B1.4 Repository’s ingest process verifies each submitted object (i.e., SIP) for completeness and correctness as specified in B1.2.

Information collected during the ingest process must be compared with information from some other source—the producer or the repository’s own expectations—to verify the correctness of the data transfer and ingest process. The extent to which a repository can determine correctness will depend on what it knows about the SIP and what tools are available for verifying correctness. It can mean simply checking that file formats are what they claim to be (TIFF files are valid TIFF format, for instance), or can imply checking the content. This might involve human checking in some cases, such as confirming that the description of a picture matches the image.

Repositories should have established procedures for handling incomplete SIPs. These can range from rejecting the transfer, to suspending processing until the missing information is received, to simply reporting the errors. Similarly, the definition of “completeness” should be appropriate to a repository’s activities. If an inventory of files was provided by a producer as part of pre-ingest negotiations, one would expect checks to be carried out against that inventory. But for some activities such as Web harvesting, “complete” may simply mean “whatever we could capture in the harvest session.” Whatever checks are carried out must be consistent with the repository’s own documented definition and understanding of completeness and correctness.

Evidence: Appropriate policy documents and system log files from system performing ingest procedure; formal or informal acquisitions register of files received during the transfer and ingest process; workflow, documentation of standard operating procedures, detailed procedures; definition of completeness and correctness, probably incorporated in policy documents.

For this post I am going to focus on the files that a depositor uploads to our web site. The other elements of the deposit, which are largely metadata, are collected automatically by the deposit system. It's only a title or name for the deposit that the depositor must provide.

As mentioned in an earlier post on deposits, we collect several pieces of information about each file. One item we collect is the MIME type. We do this using the UNIX file utility, with a magic database that we have expanded to include information about common statistical packages and Microsoft Office file formats. For example, a vanilla, out-of-the-box version of file will report a DOCX format file as a Zip file, and while that is correct on some level, it isn't the best description of the file.

After making this inventory, our system generates an auto-reply to the depositor enumerating the files and what we think they are. Note that we do not assign any higher level purpose (e.g., this is a data file; this is a codebook) programmatically.
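
The DOCX case is a good illustration of why "Zip file" is technically right but unhelpful. The sketch below is not our approach (we extend the magic database that file reads); it is an assumed alternative that opens the Zip package and looks for the members a Word document contains. The function name and the fallback MIME type are choices made for this example.

```python
import io
import zipfile

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def refine_zip_type(data: bytes) -> str:
    """Return a more specific MIME type for Zip-based formats when possible."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return "application/octet-stream"
    with zipfile.ZipFile(buffer) as archive:
        names = set(archive.namelist())
    # A Word document is a Zip package that contains these two members.
    if "[Content_Types].xml" in names and "word/document.xml" in names:
        return DOCX_MIME
    return "application/zip"
```

The same pattern would extend to other Zip-based Office formats by checking for their characteristic members.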

Assuming that the depositor does not find anything amiss with our inventory, a data manager will pick up the deposit, and will start preparing the materials for preservation at ICPSR, and for distribution on our web site.

A couple of things we are NOT doing, but which might be valuable in the future....

One, in addition to using file to report MIME types, we might also run a tool like JHOVE to inspect the correctness of the file formats. We do get a fair number of PDF format documents, and JHOVE does a nice job with those. However, we also get a lot of plain text documents, stats files, and MS Office files, and JHOVE doesn't do a very nice job with those. I've wondered if we might be able to write a small grant where ICPSR would promise to build plug-ins for JHOVE for the stats package formats.

Two, in addition to reporting file formats to our depositors, we might also report checksums for each file so that they would have the opportunity to inspect that the deposit went without error.

Thursday, September 9, 2010

Storage Architectures for Digital Collections 2010

I'm heading to DC for the Library of Congress's 2010 edition of their Storage Architectures for Digital Collections event. Should be another interesting meeting, and I'll be sure to include a brief summary after the meeting. (The LoC also produces a very nice, detailed summary, but it usually appears several weeks after the event.)

Wednesday, September 8, 2010

A snapshot in time: deposited files in August 2010

A typical month (for the summer) of deposits at ICPSR by file format type:

# files   File format                         # of deposits
      1   text/plain; charset=iso-8859-1      1
     24   text/plain; charset=unknown         4
    423   text/plain; charset=us-ascii        18
      8   text/x-c; charset=us-ascii          2
      1   text/x-mail; charset=us-ascii       1

Lots of plain text and PDF. Plenty of files in the typical stat packages. A handful of Microsoft Office formats.

Our main tool for automagically calculating MIME types is file, and almost certainly the files it identified as C program text are actually just plain text, or maybe a setup file.

Tuesday, September 7, 2010

Post #100 for techaticpsr and a goodbye

Well, well, well, post number one hundred. The first big milestone post in a blog.

The big news this week is also some sad news here at ICPSR. Our colleague Felicia LeClere has left ICPSR for new opportunities. Felicia was at ICPSR for over five years, and I came across this item in our archive of announcements:
We are pleased to announce that Dr. Felicia LeClere will join ICPSR and the Michigan Population Studies Center as an Associate Research Scientist and Director of the Data Sharing for Demographic Research project (DSDR). She brings a strong record of research accomplishments coupled with extensive experience in data collection, data processing, and project management. Dr. LeClere will lead the project part time during the summer and will begin a full-time appointment on September 1, 2005. Dr. Felicia LeClere is currently Associate Research Professor in the Department of Sociology at the University of Notre Dame, where she also directs the Laboratory for Social Research. She received her Ph.D. in 1990 from Pennsylvania State University in the fields of demography and rural sociology. Prior to Notre Dame, Dr. LeClere held appointments at the National Center for Health Statistics and the U.S. Department of Agriculture. She has received numerous awards for her achievements, has a strong record in obtaining grants, and has published extensively in her field.
We're going to miss Felicia a lot at ICPSR; she was a strong advocate for improving the experience of using ICPSR through improved technology and new services. Several members of my team had worked closely with Felicia during her tenure here. Some of the most recent tech projects I've blogged about here - like the Restricted Contracting System and our experiment in making restricted-use data available via a cloud-based platform - are among the projects Felicia championed. And unlike most of the archive projects at ICPSR, Felicia's projects routinely contained a line-item in the budget to support a fraction of a software developer who could then provide sustained, individual support to her projects, such as building custom tools for managing complex datasets, or developing customized interfaces for delivering data or browsing content.

I've told several people (including Felicia) that she's leaving behind a stronger ICPSR than she joined back in 2005, and one of the major reasons it is stronger is the leadership and vision she provided.

Friday, September 3, 2010

TRAC: B1.3: Authenticating the source

B1.3 Repository has mechanisms to authenticate the source of all materials.

The repository’s written standard operating procedures and actual practices must ensure the digital objects are obtained from the expected source, that the appropriate provenance has been maintained, and that the objects are the expected objects. Confirmation can use various means including, but not limited to, digital processing and data verification and validation, and through exchange of appropriate instrument of ownership (e.g., submission agreements/deposit agreement/deed of gift).

Evidence: Submission agreements/deposit agreements/deeds of gift; workflow documents; evidence of appropriate technological measures; logs from procedures and authentications.

ICPSR's Deposit Form is the mechanism that ensures proper documentation and workflow for deposited materials. It ensures that we have a signature from the data depositor, and keeps track of what was deposited, who did it, and when it was done.

One feature of the Deposit Form is that it auto-emails the depositor with a list of the files it thinks have been deposited, and each file includes a one-line description regarding the file format (e.g., SPSS Portable File, PDF document, version 1.6, or Microsoft Word). If the deposited file is a Zip package or a tarball, we pull apart the bundle first, and then examine the individual files before generating this report.
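
As a rough sketch of the "pull apart the bundle first" step, the snippet below lists the regular files inside a Zip package and renders them as plain-text lines for an email. It is an illustration rather than our deposit system's code: the function names and the report layout are made up, and it reports sizes rather than format descriptions to stay self-contained.

```python
import io
import zipfile


def inventory_bundle(data: bytes):
    """List (name, size_in_bytes) for every regular file inside a Zip bundle."""
    with zipfile.ZipFile(io.BytesIO(data)) as bundle:
        return [
            (info.filename, info.file_size)
            for info in bundle.infolist()
            if not info.is_dir()
        ]


def format_report(entries):
    """Render the inventory as plain-text lines suitable for an email body."""
    return "\n".join(f"{name} ({size} bytes)" for name, size in entries)
```

Directory entries are skipped so the depositor sees only the files themselves.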

The person who signs the deposit must have a valid MyData account, and we use this to track and record identity. Of course, this same information is also helpful during the data management process if ICPSR needs to contact the data producer or researcher.

Thursday, September 2, 2010

Teaching With Data portal

We've launched a new version of our Teaching With Data portal.

TwD is one of the National Science Digital Library (NSDL) pathways. Each pathway has a domain-specific focus or interest, and ours is using quantitative social science research data as an element of teaching.

Jane Wang is the lead software developer on TwD, and she has been building the site using a Fedora stack called Muradora. One of the design requirements from the research team was that the site had to be based on Fedora, and rather than build a complete portal from scratch, Jane selected the stack that would be the best fit for our development environment.

Our TwD site is also interesting in that it is the first portal that we've deployed completely in the cloud; there is no local hardware at ICPSR supporting the project (except for Jane's laptop). All of the content and software resides in Amazon's cloud.

Another interesting aspect of this release is that we're using OpenID for authentication. This means that instead of needing to create yet another login and password to remember, visitors may instead use an existing login and password from one of many OpenID-compliant sources, such as Google, AOL and Yahoo. We're using a third-party service called RPX from JanRain to enable this feature. I'm expecting that we'll enable OpenID and other authentication systems (such as Facebook Connect) on the main ICPSR web site in the near future.

Of course, the site has other new features as well, and I'd like to encourage you to have a look by clicking the image above.

Wednesday, September 1, 2010

ICPSR launches the Restricted Contract System portal

The ICPSR Restricted Contract System (RCS) portal is officially open for business. We launched the new portal late in August, making the National Survey of Parents and Youth available through the system.

The portal is the researcher-facing piece of the system, and we use it to guide the researcher through the contract process of applying for access to restricted-use data. The system is highly configurable, which allows us to collect information through traditional web forms and document uploads (e.g., proof of IRB approval, where required).

One key innovation with the new system is an attempt to make the IT security portion of the process as painless as possible. Historically ICPSR and other data providers have required researchers to submit detailed IT security plans for protecting the data, a process which often required a great deal of labor, but which did not actually measure anything about security. In the RCS we've replaced the IT security plan with three new components.

One, we pull questions from our "Question Bank" that are tailored to the specific IT environment of the researcher (e.g., Windows machine connected to the Internet) and to a specific person: the researcher or the researcher's IT person. For example, one question might ask the researcher to confirm that s/he will lock the office door when the data are unattended. And another question might ask the IT person to confirm that the data will be kept in a place where they will not be backed up to tape for disaster recovery purposes.

Two, we ask the researcher to install and run an audit utility which inspects the computer for common security problems. The software does NOT require administrative access for installation or to run, and we limit its checking to a small number of essential areas, such as checking to see if a screen saver with password has been enabled.

Three, we also partner with the University of Michigan to run a remote vulnerability scan of the computer(s), looking for common problems which can be exploited remotely by attackers.

If the questions are answered appropriately, and if the audit and scan do not reveal any problems, then the researcher has completed the IT security portion of the process, and no written IT security plan is required. (We do, however, give researchers the option of writing an IT security plan if they would rather not submit to the scan and audit.)
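
The first component, the Question Bank, can be pictured as a list of questions tagged by IT environment and by who should answer. The sketch below is purely illustrative: the class, the function, and the environment labels are made up for this example and are not taken from the RCS software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    text: str
    environment: str   # hypothetical label, e.g. "windows-networked"
    respondent: str    # "researcher" or "it-staff"


def select_questions(bank, environment, respondent):
    """Return the question texts that apply to this environment and person."""
    return [
        q.text
        for q in bank
        if q.environment == environment and q.respondent == respondent
    ]


# A tiny example bank built from the two questions described above.
BANK = [
    Question("Will you lock the office door when the data are unattended?",
             "windows-networked", "researcher"),
    Question("Will the data be excluded from tape backups?",
             "windows-networked", "it-staff"),
]
```

Calling select_questions(BANK, "windows-networked", "it-staff") would return only the backup question, which is the kind of tailoring the portal aims for.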

The goal of the new portal is to lower the barrier for accessing restricted-use data while still collecting enough information to ensure that the data will be safe.

The complete RCS suite of software also includes internal utilities to automate the contract administration process, such as generating reminder emails about contract renewal.