Tuesday, September 22, 2009

Designing Storage Architectures for Digital Preservation - Day One

This is the first day of a two-day workshop on storage architectures for digital preservation. The workshop is hosted by the Library of Congress at the Churchill Hotel in Washington, DC. There are about eighty attendees, many from the LoC itself, along with many "tool makers" (Sun, EMC, Seagate, Cisco, etc.) and "data stewards" (ICPSR, MetaArchive, HathiTrust, etc.). My apologies in advance where I have misunderstood or misquoted a speaker below.

The workshop began at noon with a luncheon, followed by a brief Opening/Welcome session. This moved quickly to a 90-minute session, Storage for Digital Preservation: Panel of Case Studies from Users, which began at 1:15pm. There were eight seven-minute presentations:
  1. Thomas Youkel, Library of Congress, cited some numbers about the amount of content ingested by the LoC, and the amount they project they will ingest during the next two years. One interesting figure is that the LoC ingested 24.7 TB during the week of June 2, 2009. He also described data integrity as a key challenge, and workflow, content management, and migration as secondary challenges.
  2. David Minor, San Diego Supercomputer Center, gave a brief overview of the Chronopolis project: three partners (SDSC, NCAR, UMIACS), 50TB of storage at each node. SRB is the content transport system; BagIt is the content container; and, ACE is the content integrity system. ICPSR, the California Digital Library, the MetaArchive, and one other organization I missed are the content providers. Chronopolis Next Generation is seeking additional storage partners (nodes), migration tools, and connections to other storage networks, such as the MetaArchive's private LOCKSS network.
  3. Bill Robbins, Emory University, described how the MetaArchive was using an Amazon EC2 system as its central "properties server" to solve the (political? procedural?) issues of Emory serving as the "master node" for the MetaArchive. Bill had a good quote: "We're not cheap. We're 6x cheap, and that's not so cheap." Bill expressed general satisfaction with EC2, but wished the documentation was better.
  4. Andy Maltz, Academy of Motion Picture Arts and Sciences, referenced The Digital Dilemma in his talk about the requirements his organization has for digital preservation solutions: (1) last 100 years; (2) survive benign neglect; (3) at least as good as photochemical; and, (4) cost less than $500/TB/year. Andy also referenced the phrase "Migration is broken" from a 2007 SNIA report. He cited some figures: a movie consumes 2-10PB of storage, and Hollywood produces about one movie/day. He finished with a brief description of StEM, an NDIIPP-sponsored project.
  5. Laura Graham, Library of Congress, described the LoC's efforts to preserve websites. The Internet Archive does the crawling, and the Wayback Machine is the delivery mechanism. A system at the LoC acts as archival storage. The wish list includes fewer manual steps in the system and less need to copy files around.
  6. John Wilkin and Corey Snavey, HathiTrust (and the University of Michigan Library), gave a brief overview of HathiTrust. They're leveraging the OAIS reference model, plugging in modular solutions wherever possible. They have 185TB of storage today. The focus is on the "published record." Corey asserted that data stewards will need to be able to rely more and more on the storage solution (trust) in order to succeed in their missions.
  7. Jane Mandelbaum, Library of Congress, was the proxy for a very brief overview of the DuraCloud effort from DuraSpace. DuraCloud is essentially a middle layer between a variety of cloud storage providers and data stewards.
  8. Jimmy Lin, representing Cloudera, described Cloudera as the RedHat for Hadoop. Jimmy went on to talk a bit about Hadoop, HDFS, and MapReduce, and how Cloudera might be a very attractive platform for connecting "compute" to storage.
The session concluded with a general conversation about key issues in digital preservation: trust; costs; not knowing what bits will be considered valuable up-front; and, how frequency of access is unknown. David Rosenthal had a good line: "You have to get used to the idea of losing stuff." There's no magic bullet that will keep lots of bits around for a long time without any loss.
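Since data integrity came up repeatedly in this session, here is a minimal sketch of what a basic fixity check might look like. It assumes a plain-text manifest with one `<sha256> <relative path>` line per file, similar in spirit to a BagIt manifest. The function names are my own, hypothetical ones, and this is not a description of how ACE or any speaker's system actually works.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root: Path, manifest: Path) -> None:
    """Record one '<digest>  <relative path>' line per file under root.

    The manifest should live outside root so it does not list itself.
    """
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        lines.append(f"{sha256_of(path)}  {rel}")
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_manifest(root: Path, manifest: Path) -> List[str]:
    """Return the relative paths that are missing or whose digest changed."""
    damaged = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel = line.split(maxsplit=1)
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel)
    return damaged
```

A check like this only detects loss; it does nothing to repair it, which is why the replicated copies described by the Chronopolis and MetaArchive speakers matter.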

My main comment during the session was that software and hardware and even power are not the big costs of digital preservation (at least at ICPSR); the big costs are people, and the processes that require people.
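For a sense of scale on the raw storage side, Andy Maltz's figures can be combined into a rough upper bound. This is back-of-the-envelope arithmetic only: it assumes decimal units (1 PB = 1000 TB), treats his $500/TB/year requirement as a ceiling rather than an actual price, and ignores any change in storage prices over time.

```python
TB_PER_PB = 1000                # assumption: decimal units, 1 PB = 1000 TB
COST_CEILING_PER_TB_YEAR = 500  # dollars, from requirement (4) in Andy's talk


def annual_cost_ceiling(petabytes: float) -> float:
    """Dollars per year to store `petabytes` at the $500/TB/year ceiling."""
    return petabytes * TB_PER_PB * COST_CEILING_PER_TB_YEAR


# The talk cited 2-10PB per movie, so evaluate both ends of that range.
for pb in (2, 10):
    print(f"{pb} PB -> ${annual_cost_ceiling(pb):,.0f} per year at the ceiling")
```

At those figures, a single movie at the ceiling price would run between $1 million and $5 million per year, before counting any of the people costs.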

After a short break, we began the next session, Storage Products & Future Trends: Vendor Perspectives, at 3:15. Again, the format was a series of seven-minute presentations:
  1. Art Pasquinelli, Sun Microsystems, spoke about the Sun PASIG, and described how researchers were looking to IT and libraries for their digital preservation needs, and how that was simply not working.
  2. Mike Mott, IBM, asked how we define a "document" in a digital world, and thought that we would see the end of Moore's Law (in storage) by 2013 unless there was a new technological breakthrough. Mike also described a new paradigm in architecture: a river v. a building.
  3. Dave Anderson, Seagate, spoke on how he expected some trends to end (an approx 40%/year increase in capacity + a 20%/year increase in transfer rate); how solid state disk uptake has been slower than expected; and, how the change in disk form factor from the desktop (3.5") to the laptop (2.5") will shift the industry.
  4. Tim Harder, EMC, described a new "compute + storage" solution called Atmos, and how they are betting big on x86 technology + virtualization, and off-the-shelf gear bundled with software. EMC has also founded a cloud division.
  5. Paul Rutherford, Isilon, expects SATA, 3.5" form-factor disks, and block-level access to disappear, replaced by SAS, 2.5" form-factor disks, and file-level (or object-level) access. He said "I hate the cloud" and didn't think it would be used as the sole source for important data.
  6. Kevin Ryan, Cisco Systems, gave a high-level overview of Cisco's "unified fabric" vision, which sounds like it consists of a single, lossless, open pipe for all sorts of bits: network, data, NAS, SAN, etc.
  7. Raymond Clarke, Sun Microsystems, gave a similar type of talk, but about the Sun Cloud, which struck me as an umbrella term for an integrated solution using lots of Sun's open technologies: Solaris, Java, ZFS, MySQL, etc.
The session ended with another general conversation about trends. We then had a brief close-out (with homework!) to end the day about 4:45pm.
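The growth rates Dave Anderson cited are easier to appreciate when compounded. The sketch below assumes the approximate 40%/year (capacity) and 20%/year (transfer rate) figures hold steady, which is exactly what he expects to stop being true, and the function names are my own. It shows how long each quantity takes to double, and how the time needed to read an entire disk grows when capacity outpaces transfer rate.

```python
import math

CAPACITY_GROWTH = 0.40  # approx annual capacity increase cited in the talk
TRANSFER_GROWTH = 0.20  # approx annual transfer-rate increase cited in the talk


def doubling_time_years(annual_growth: float) -> float:
    """Years for a quantity to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def full_read_time_factor(years: float) -> float:
    """How many times longer a full-disk read takes after `years`.

    Read time is capacity divided by transfer rate, so it grows by the
    ratio of the two growth factors each year.
    """
    return ((1 + CAPACITY_GROWTH) / (1 + TRANSFER_GROWTH)) ** years


print(f"capacity doubles every {doubling_time_years(CAPACITY_GROWTH):.2f} years")
print(f"transfer rate doubles every {doubling_time_years(TRANSFER_GROWTH):.2f} years")
print(f"full-disk read takes {full_read_time_factor(5):.2f}x longer after 5 years")
```

Under these assumptions capacity doubles roughly every two years while transfer rate takes nearly four, so reading (or integrity-checking, or migrating) a full disk takes a little over twice as long after five years.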


  1. This is really interesting stuff. Not having much of a travel budget this year, I was pleased to find this set of 3 webcasts from a presentation on Aug. 4, 2009 on Technical demonstration of integrated preservation infrastructure prototype. This is a presentation they did for a group at NSF.

  2. We've been using a related technology, Storage Resource Broker (SRB), to copy content first into SDSC's grid, and then into Chronopolis. I haven't had the opportunity to get my hands dirty with iRODS yet, but I'm hoping to use it in a small demonstration project with Jon Crabtree at the Odum Institute.

  3. Where can I find the presentations or audio/video recordings?

  4. The Library said they'd have materials (PowerPoint decks + summaries from the three note-takers) on-line at on Monday. I don't have a more exact URL yet.

    Lots of good stuff on that site.

  5. Still no link from the LoC, but my guess is that when the materials are available on-line, the link will be on this page:

  6. The LoC has posted the materials:

