Friday, September 30, 2011

Designing Storage Architectures for Digital Preservation

This is the second part of a two-part post about the 2011 Designing Storage Architectures for Digital Preservation meeting hosted by the Library of Congress.  The first part can be found in this post.


The second day began with a second session on Power-aware Storage Technologies.

Tim Murphy (SeaMicro) spoke about his lower-power server offering, noting that "Google spends 30% of its operating expenses on power" and how it "costs more to power equipment than to buy it."  Dale Wickizer (NetApp) gave a talk on how Big Data is now driving enterprise IT rather than enterprise applications or decision support.  Ethan Miller (Pure Storage) described his lower-power, flash-based storage hardware, and how a combination of de-dupe and compression makes its cost comparable to enterprise hard disk drives (HDD).  Dave Anderson (Seagate) spoke about HDD security and how new technology aimed at encryption may make sense for digital preservation applications too.

The theme of the next session was New Innovative Storage Technologies.

David Schissel (General Atomics) presented an overview of their enhanced version of the old Storage Resource Broker (SRB) technology, which they call Nirvana.  Bob [did not catch his full name] (Nimbus Data) described his flash-based storage array, which applies the same techniques as conventional disk-based storage arrays, but with flash media instead.  John Topp (Splunk) described his product, which struck me as a giant indexer and aggregator of log file content.  Sam Thompson (IBM) spoke about BigSheets, which layers a spreadsheet metaphor on top of technologies like Nutch, MapReduce, etc.

This theme continued into the next session.

Chad Thibodeau (Cleversafe) described his technology for authenticating to cloud storage in a more secure manner by distributing credentials across a series of systems.  Jacob Farmer (Cambridge Computer) proposed adding middleware between content management systems and raw storage to make it easier to manage and migrate content.  R B Hooks (Oracle) presented an overview of trends in storage technology, and noted that the consumer market, not IT, will drive flash technology.  Marshall Presser (Greenplum) spoke about I/O considerations in data analytics.

The day ended with two closing talks.

Ethan Miller (UC Santa Cruz this time) spoke about the need to conduct research into how archival storage systems are actually used.  He described results from a pair of initial studies.  In the first, access was nearly non-existent, except for a one-day period where Google crawled the storage, and this one day accounted for 70% of the access during the entire time period of the study.   In the second, 99% of the access was fixity checking.  [I think this is how ICPSR archival storage would look.] 

David Rosenthal (LOCKSS Project, Stanford University) presented a still-evolving model of how one computes the long-term storage costs of digital preservation.  The idea is that this model could be used to answer questions about whether to buy or rent (cloud) storage, when to upgrade technologies, and so on.  You can find the full description of the model on David's blog.
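
For the curious, here is a toy sketch of the kind of buy-versus-rent comparison such a model is meant to answer.  This is emphatically not David's model: the prices, discount rate, replacement cycle, and price-decline rates below are invented for illustration only.

    # Toy buy-vs-rent storage cost comparison (NOT Rosenthal's model).
    # All prices, rates, and lifetimes below are invented for illustration.

    def npv(cash_flows, discount_rate):
        """Net present value of a list of yearly costs (year 0 first)."""
        return sum(cost / (1 + discount_rate) ** year
                   for year, cost in enumerate(cash_flows))

    def local_storage_costs(tb, years, price_per_tb=50.0, refresh_every=4,
                            yearly_ops_per_tb=10.0, price_decline=0.25):
        """Buy hardware up front, re-buy on each refresh at a lower $/TB."""
        costs = []
        price = price_per_tb
        for year in range(years):
            cost = tb * yearly_ops_per_tb
            if year % refresh_every == 0:
                cost += tb * price
            price *= (1 - price_decline)   # media gets cheaper over time
            costs.append(cost)
        return costs

    def cloud_storage_costs(tb, years, monthly_per_gb=0.14, price_drop=0.10):
        """Pay as you go; assume the provider cuts prices a bit each year."""
        costs = []
        monthly = monthly_per_gb
        for _ in range(years):
            costs.append(tb * 1000 * monthly * 12)
            monthly *= (1 - price_drop)
        return costs

    if __name__ == "__main__":
        rate, tb, years = 0.03, 100, 12
        print("local:", round(npv(local_storage_costs(tb, years), rate)))
        print("cloud:", round(npv(cloud_storage_costs(tb, years), rate)))

Even a toy like this makes the trade-off visible: the answer swings on the discount rate, the refresh cycle, and how fast each side's prices fall, which is exactly why a careful model is worth building.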



Wednesday, September 28, 2011

Designing Storage Architectures for Digital Preservation

I attended the 2011 edition of the Library of Congress' Designing Storage Architectures for Digital Preservation meeting (link to 2010 meeting).  Like previous events, this meeting was scheduled over two days, and featured attendees and speakers from industry, higher education, and the US government.  This post will summarize the first day of the meeting, and I'll post a summary of the second day later this week.

The meeting was held in the ballroom of The Fairfax on Embassy Row in Washington, DC.  About 100 people attended the event which began at noon on Monday, September 26, 2011.  As at past meetings the first hour was devoted to registration and a buffet lunch.

The program began at 1:00pm with a welcome from Martha Anderson, who leads the National Digital Information Infrastructure and Preservation Program (NDIIPP) program for the Library of Congress (LC).  She noted that since its inception in 2000, the program has funded 70 projects spread across 200 organizations, and that it is valuable for people to be able to step out of the office for a short time to step back, see the big picture in digital preservation, and get fresh perspectives.  She described how change is the driving force in digital preservation, and characterized one big change as a shift from indexing content to processing content.

Two "stage setting" presentations followed.

Carl Watts (LC) described a massive migration underway at the LC where 500TB of content was moving from one storage platform to another.  Henry Newman (Instrumental) described challenges facing the digital preservation community:  data growth is greatly exceeding growth in hardware speed and capacity; POSIX has not changed in many years; nomenclature is not used consistently between digital preservation practitioners and vendors; and, the total cost of ownership for digital preservation is not well understood.

The theme of the first session was Case Studies from Community Storage Users and Providers.

Scott Rife (LC) described the video processing routine used at the LC Packard Campus, which handles over 1 million videos and 7 million audio files, 7TB/day of content, and 2GB/s of disk access.  Jim Snyder (LC) also spoke about the Packard Campus, noting that he is trying to "engineer for centuries," where one generation hands off to the next.  Cory Snavely (University of Michigan) and Trevor Owens (LC) gave an overview of the National Digital Stewardship Alliance (NDSA), and a more detailed report of what has been happening in the Infrastructure Working Group [note - I am a member of that WG], including preliminary results from a survey of members.  Highlights: 87% of respondents intend to keep content indefinitely; 76% anticipate an infrastructure change within three years; 72% want to host content themselves; 50% want to outsource hosting (!); 57% are using or considering use of "the cloud"; and 60% intend to work through the TRAC process.

Steve Abrams (California Digital Library) spoke about a "neighborhood watch" metaphor for assessing digital preservation success.  Tab Butler (Major League Baseball Network) updated the audience on the staggering amount of video he manages (2500 hours of HD video each week, with multiple copies/versions of many of those hours).  Barbara Taranto (New York Public Library) described a migration where the content doesn't move; only its address changes (in a Fedora repository).  Cory Snavely (HathiTrust this time) gave a second talk, updating the audience on text searching at HathiTrust; the short version is that more memory delivers better performance.  Andrew Woods (DuraSpace) described some of the challenges his team has faced building storage services across disparate cloud storage providers.

The theme of the next session was Power-aware Storage Technologies.

Hal Woods (HP) forecast a shift to solid state drives (SSD) in the next 2-4 years, and speculated that tape might outlive hard disk drives (HDD).  Bob Fernander (Pivot3) described video as the "new baseline" for content, and warned that we need to stop building Heathkit-style solutions to problems.  Dave Fellinger (DataDirect Networks) advised that the building blocks of digital preservation solutions needed to be bigger, and that building with right-sized blocks would make it easier to solve problems.  Mark Flournoy (STEC) gave a very nice overview of different SSD market segments, costs, and performance metrics.

Each session included a lengthy question, answer, and comment period, and sometimes lively debate among the audience.

The first day wrapped up a bit after 5pm.

Tuesday, September 27, 2011

Top Ten: No more rubbish meetings!


Several years ago, Deb Mitchell, the Director of the Australian Social Science Data Archive, visited ICPSR during one of our Council sessions.  A bunch of us were bemoaning the number of meetings we attended, and how so many of them were so ill-focused.  We felt that many of the meetings lacked a clear purpose or goal, had no agenda, and often included too many people (but often lacked the actual key stakeholders!).  At the end of the conversation, Deb exclaimed:
No more rubbish meetings!
And that was our mantra for the rest of the month.

And so with that same spirit in mind, I present my top ten list of how to avoid the dreaded "rubbish meeting."
  1. The meeting must have a goal.  Example meeting goals are: we share information, we make a decision, or we discuss an issue that requires some conversation.  Each goal has a different output, of course.
  2. The meeting should end when the goal is reached.
  3. The stakeholders MUST be at the meeting; the meeting cannot be productive without them.
  4. Send the goal (or the agenda - which is a roadmap of how to reach the goal) far enough in advance of the meeting so that any necessary research can be completed.
  5. If meeting participants will need to review documents in order to achieve the meeting goal, the documents must be sent well ahead of the meeting.
  6. Come to the meeting prepared.
  7. Summarize the decisions reached (if decision-making was the goal) at the end of the meeting.  This sometimes takes the form of listing the action items.  ("We decided that X will do Y...")
  8. Size the meeting appropriately.  If the goal is to brainstorm the requirements of a highly complex system with many moving parts, don't try to fit it into a single 30-minute meeting.  Break it into smaller chunks, or schedule more time, like a day-long retreat (if it is important).
  9. Do not rely on the "Subject" line of a meeting invite or email to convey the goal; be explicit in the body of the invite or the email.
  10. Despite the best of preparations and intention, a meeting will sometimes head off into the weeds and cease to be useful.  Never be afraid to pull the plug, and live to meet another day.
I'll post notes about the Designing Storage Architectures for Digital Preservation event - definitely not a rubbish meeting! - later this week.

Photo credit:  http://vitaminsea.typepad.com/.a/6a00d83451d84969e2010535dbc2a6970c-320wi

Friday, September 23, 2011

TRAC series

I've been posting about TRAC @ ICPSR on most Fridays over the summer.  At this point (if you go far enough back in the blog's history) you'll find an entry for every item in sections B and C.

I started with section C since it is very heavy on technology, and then moved to section B since it too contains a lot of technology-oriented (or technology-dependent) items.  I'm planning to move on to the shakier ground (technologically speaking) of section A starting in October.

Wednesday, September 21, 2011

ICPSR Web availability through August 2011



I had been publishing these monthly, but I don't think I've put one out lately.

Web availability has been pretty good so far this (ICPSR fiscal) year.  The two main problems we had in July were an 18-minute outage on our Child Care and Early Education Research Connections portal (due to a software fault) and a slightly longer cross-portal problem when our JanRain Engage service stopped working.  (JanRain Engage is the "social login" service that enables logging into ICPSR's web site with a Google ID or Facebook account.)

In August we saw a fault in our Oracle database server and another in the Solr search engine, each of which accounted for a brief outage.  A longer outage in Amazon Web Services' US-EAST region was the final contributor to our downtime total.
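
For readers who like to see the arithmetic, here is a tiny sketch of how outage minutes turn into an availability percentage.  Only the 18-minute Research Connections figure comes from the post above; the other durations are placeholders.

    # Rough availability arithmetic for one month of service.
    # Only the 18-minute outage is real; the other durations are placeholders.
    outages_minutes = {
        "Research Connections (software fault)": 18,
        "JanRain Engage (social login)": 45,      # placeholder
        "AWS US-EAST region": 90,                 # placeholder
    }

    minutes_in_month = 31 * 24 * 60
    downtime = sum(outages_minutes.values())
    availability = 100.0 * (minutes_in_month - downtime) / minutes_in_month
    print(f"downtime: {downtime} min, availability: {availability:.3f}%")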

Monday, September 19, 2011

Convergence

The always entertaining and highly informative Moxie Marlinspike gave a very interesting talk at Black Hat USA 2011 about SSL.  This is the technology that (in theory) secures our communication channels on the Internet, keeping information like credit card numbers out of the hands of the bad guys.  I've seen past talks by Moxie where he describes the many flaws with SSL, but in this talk he introduces a new solution called Convergence.


The talk is fascinating, and I highly recommend watching it.  (It's on YouTube.)  It's about 45 minutes long, so enjoy over lunch.

In brief, Moxie cites two problems with the current SSL model, which requires all of us to trust Certificate Authorities (CAs) - organizations that have been hacked with increasing frequency, and that have also demonstrated drunk and disorderly behavior at times.  One, we have to trust them forever.  Two, there is no reasonable way to change who you trust.  For example, if one decided that Comodo (one of the largest CAs) just could not be trusted any longer, one could delete Comodo from his/her browser's "trust database."  But doing this would make a large number of Internet web sites (20%) unusable.

Convergence replaces CAs with one or more self-selected "notaries" each of which can use a different method to ascertain whether a certificate is valid, including a self-signed certificate.  One may also use a "bounce notary" to separate those that know who you are from those that know where you are browsing.
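
To make the notary idea concrete, here is a minimal sketch: several independent observers fetch a site's certificate and compare fingerprints, and a mismatch suggests someone is tampering.  A real Convergence notary is a remote service reached over the network; the host name and the three local "observations" below are just placeholders for illustration.

    # Sketch of the notary idea behind Convergence: independent observers
    # fetch a site's certificate and agree (or not) on its fingerprint.
    # Here each "notary" is just a local fetch, which only illustrates
    # the comparison step, not the distributed service itself.
    import hashlib
    import ssl

    def certificate_fingerprint(host, port=443):
        pem = ssl.get_server_certificate((host, port))
        der = ssl.PEM_cert_to_DER_cert(pem)
        return hashlib.sha256(der).hexdigest()

    def notaries_agree(observations):
        """True if every observed fingerprint matches."""
        return len(set(observations)) == 1

    if __name__ == "__main__":
        host = "www.icpsr.umich.edu"   # placeholder target
        # Pretend three notaries looked at the site from different networks.
        views = [certificate_fingerprint(host) for _ in range(3)]
        print(host, "consensus" if notaries_agree(views) else "MISMATCH")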

Friday, September 16, 2011

ICPSR is a .........

I read an interesting article last week about Zynga, the company that makes many of the most popular games available at Facebook.  (The article is behind the WSJ paywall, but here is a link that subscribers can use.) 

The essence of the article is that Zynga has discovered a way to generate real revenues from virtual products, and that their extensive use of data and analytics have enabled this capability.  This short paragraph caught my eye:
"We're an analytics company masquerading as a games company," said Ken Rudin, a Zynga vice president in charge of its data-analysis team, in one of a series of interviews with Zynga executives prior to the company's July filing for an initial public offering.
We often say the same sort of thing about ICPSR, particularly within the technology team. 

This happens most often when we've just inked a new grant or contract with an organization.  On the surface the agreement is all about science and investigation, promoting research, and enabling good data management.  But just underneath there is a different story, one that often shows up in the budget.  The project is, in fact, all about building technology, and will support a large team of web designers, software developers, business analysts, and project managers to define the scope of the deliverable, and then to build it.  And this leads to:
We're a web development company masquerading as a data archive.
or something similar echoing in the halls outside the IT bay.  Of course, it isn't true, but that doesn't stop us from saying it anyway.  And, of course, one could reverse the roles:
We're a data archive masquerading as a web development company.
 to get a different twist.

Do you ever describe your own organization in this way?

Wednesday, September 14, 2011

August 2011 deposits

Time for the monthly deposit statistics:


# of files   # of deposits   File format
1            1               application/msaccess
2            1               application/msoffice
130          22              application/msword
104          6               application/octet-stream
715          29              application/pdf
30           10              application/vnd.ms-excel
6            2               application/vnd.ms-powerpoint
1            1               application/x-dosexec
1            1               application/x-empty
23           7               application/x-sas
67           12              application/x-spss
14           7               application/x-stata
4            3               application/x-zip
6            2               image/jpeg
6            3               message/rfc8220117bit
34           6               text/html
5            3               text/plain; charset=iso-8859-1
8            4               text/plain; charset=unknown
420          28              text/plain; charset=us-ascii
1            1               text/plain; charset=utf-8
17           2               text/rtf
5            2               text/x-c; charset=unknown
7            1               text/x-c; charset=us-ascii
113          2               text/xml
2            1               very short file (no magic)

Lots of the usual kinds of stuff in August; maybe even a bit more than one would expect given the time of year.

There are the usual mistakes made by our file identification service; we're going to look at replacing or augmenting the current system (the UNIX file utility with a greatly expanded localmagic database, plus a wrapper that inspects the file extension) with something else.  We've spent just a tiny amount of time tinkering with Tika from the Apache project, and that looks promising.  This might even grow into a web service that we would share with others.
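
For the curious, a quick-and-dirty comparison of the two identifiers might look something like the sketch below.  It assumes the UNIX file utility and the tika-app jar are installed; the jar path is a placeholder, not our actual setup.

    # Compare MIME types reported by the UNIX `file` utility and Apache Tika
    # for each file given on the command line.  The jar path is a placeholder.
    import subprocess
    import sys

    TIKA_JAR = "/usr/local/lib/tika-app.jar"   # placeholder path

    def identify_with_file(path):
        out = subprocess.run(["file", "--brief", "--mime-type", path],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()

    def identify_with_tika(path):
        out = subprocess.run(["java", "-jar", TIKA_JAR, "--detect", path],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            f, t = identify_with_file(path), identify_with_tika(path)
            flag = "" if f == t else "   <-- disagreement"
            print(f"{path}: file={f} tika={t}{flag}")

Running something like this over a month of deposits would tell us how often the two tools disagree, and on which formats, before we commit to a replacement.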

There are also a couple of unusual items that merit closer inspection, such as the purported DOS executable and a bunch of (basically unrecognized) bitstreams.


Monday, September 12, 2011

DuraCloud Pilot talk at Educause 2011

Our colleague from DuraSpace, CEO Michele Kimpton, is giving a talk at this year's Educause conference.  Here is a link to the abstract.

The topic is the DuraCloud pilot that concluded earlier this year, and in which ICPSR was an active participant.  I've been impressed with the level of commitment and the service orientation of the entire DuraCloud team, and we've since moved from "pilot user" to "customer."


Friday, September 9, 2011

TRAC: B6.10: Linking DIPs to AIPs

B6.10 Repository enables the dissemination of authentic copies of the original or objects traceable to originals.

Part of trusted archival management deals with the authenticity of the objects that are disseminated. A repository’s users must be confident that they have an authentic copy of the original object, or that it is traceable in some auditable way to the original object. This distinction is made because objects are not always disseminated in the same way, or in the same groupings, as they are deposited. A database may have subsets of its rows, columns, and tables disseminated so that the phrase “authentic copy” has little meaning. Ingest and preservation actions may change the formats of files, or may group and split the original objects deposited.

The distinction between authentic copies and traceable objects can also be important when transformation processes are applied. For instance, a repository that stores digital audio from radio broadcasts may disseminate derived text that is constructed by automated voice recognition from the digital audio stream. Derived text may be imperfect but useful to many users, though these texts are not authentic copies of the original audio. Producing an authentic copy means either handing out the original audio stream or getting a human to verify and correct the transcript against the stored audio.

This requirement ensures that ingest, preservation, and transformation actions do not lose information that would support an auditable trail between the original deposited object and the eventual disseminated object. For compliance, the chain of authenticity need only reach as far back as ingest, though some communities, such as those dealing with legal records, may require chains of authenticity that reach back further.

A repository should be able to demonstrate the processes to construct the DIP from the relevant AIP(s). This is a key part of establishing that DIPs reflect the content of AIPs, and hence of original material, in a trustworthy and consistent fashion. DIPs may simply be a copy of AIPs, or may result from a simple format transformation of an AIP. But in other cases, they may be derived in complex ways from a large set of AIPs. A user may request a DIP consisting of the title pages from all e-books published in a given period, for instance, which will require these to be extracted from many different AIPs. A repository that allows requests for such complex DIPs will need to put more effort into demonstrating how it meets this requirement than a repository that only allows requests for DIPs that correspond to an entire AIP.

A repository is not required to show that every DIP it provides can be verified as authentic at a later date; it must show that it can do this when it is required at the time of production of the DIP. The level of authentication is to be determined by the designated community(ies). This requirement is meant to enable high levels of authentication, not to impose it on all copies, since it may be an expensive process.

Evidence: System design documents; work instructions (if DIPs involve manual processing); process walkthroughs; production of a sample authenticated copy; documentation of community requirements for authentication.



ICPSR has a long and interesting history in the context of this TRAC requirement.

I would assert that for ICPSR's first few decades of existence it considered itself more of a data library than a digital repository.  My sense is that there were not strong bonds between what one might call an AIP and a DIP in those early days.

Things seemed to change a bit in the 1990's, and I see evidence that the organization started to distinguish those items we received (acquisitions is the nomenclature used locally) from those items we produced (turnovers is the nomenclature we still use today).  Content began falling into more of a simple hierarchy with the acquisitions being kept in one place, and the turnovers being kept in a different place.

Connections are still pretty loose in the 90's, and one has to infer certain relationships.  Content is identified in the aggregate, rather than at the individual file level, and the identity of the person who "owned" or managed the collection figures prominently in the naming conventions.  If the earlier times were the digital Dark Ages at ICPSR in terms of digital preservation practice, the 90's were the Middle Ages.  Better, but still not modern.

When then-Director Myron Gutmann asked my team to automate much of the workflow in the mid 2000's (I came to ICPSR late in 2002), this began a process of building stronger connections between the content we received and the content we produced.  This was necessary since a lot of information that was captured only on paper or in the heads of people now needed to be in databases and programs.  Two people - Peggy Overcashier and Cole Whiteman - deserve most of the credit for this automation, but it was a very considerable team effort that involved many different parts of my team and ICPSR as a whole.  To keep the metaphor going, perhaps 2006 was the Renaissance of digital preservation practice.  And, not coincidentally, this was also the time that Nancy McGovern joined ICPSR as our Digital Preservation Officer.

My sense is that we now have good connections between AIP-like objects and DIP-like objects, but only at the aggregate level.  Even today we do not have crisply defined AIPs and DIPs, and we do not have the relationships recorded at the file level.

This is due to two main problems that we hope to address in a new project code-named FLAME.  (This will be the subject of many future posts.)

One problem is that all of our DIPs are made by humans, and are made at the same time as the AIPs.  A future workflow should support the automatic generation of DIPs from AIPs, and this would allow us, for example, to update many of our DIPs automatically in response to changes in versions of SAS, SPSS, and Stata.
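
To make that concrete, here is a toy sketch of what "regenerate the DIPs from the AIP" could look like.  The directory layout, the CSV-as-canonical-format assumption, and the use of pandas are all illustrative; this is not our actual workflow or file naming.

    # Toy sketch: derive dissemination copies (DIPs) from a canonical
    # data file stored in an AIP.  Paths and formats are invented.
    from pathlib import Path
    import pandas as pd

    def generate_dips(aip_dir, dip_dir):
        aip_dir, dip_dir = Path(aip_dir), Path(dip_dir)
        dip_dir.mkdir(parents=True, exist_ok=True)
        for canonical in aip_dir.glob("*.csv"):
            df = pd.read_csv(canonical)
            stem = dip_dir / canonical.stem
            df.to_csv(stem.with_suffix(".csv"), index=False)              # plain text
            df.to_csv(stem.with_suffix(".tsv"), sep="\t", index=False)    # tab-delimited
            df.to_stata(stem.with_suffix(".dta"), write_index=False)      # Stata
        # When a new Stata/SPSS version arrives, re-run this step instead
        # of rebuilding the DIPs by hand.

    if __name__ == "__main__":
        generate_dips("archival_storage/study_12345/aip",
                      "dissemination/study_12345/dip")

The point of the sketch is the shape of the process, not the tooling: one canonical object in the AIP, fanned out automatically into however many dissemination formats the current statistical packages demand.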

The other problem is that when we automated systems in the mid-2000's we didn't really fix the processes.  We made things faster, and we made things less error-prone, but we did not address some of the fundamental quirks in ICPSR's primary business processes.  Changing these processes from their current "study"-centric view of the universe to one that is more "file"-centric (or "object"-centric) will be the next big challenge ahead.  Stay tuned for details on this as we launch FLAME.

Wednesday, September 7, 2011

Another (32-bit) one bites the dust



In 2002 ICPSR had three main servers.  All ran Solaris 8; one was a SunFire 280R used for testing new web server software, and the other two were our production systems: a dedicated web server and a system that did double duty as an Oracle database server and a shared staff login machine.  Both of the production machines were bigger E9000 systems.  All of the machines were pretty new at that time.

When the machines were tired and ready to be upgraded in 2005 or 2006 we made the move to Red Hat Linux and inexpensive, 32-bit Intel machines (mostly from Dell).  Most of these initial machines have been retired over the past year or so, and only a handful remain.  And, of course, they are the machines that deliver the most mission-critical services and so are the most difficult to upgrade.

By my count we now have over a dozen 64-bit Intel machines running RH and only two 32-bit machines remaining (the web servers for our staging content and our production content).  We retired the 32-bit machine that served as the primary data processing platform last week; it had been replaced a few months ago by a 64-bit machine that runs inside our Secure Data Environment.
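
For anyone auditing their own fleet, a trivial check like the following (just standard Python platform calls, run on each host) reports whether a machine is 32-bit or 64-bit:

    # Quick check of whether the current host runs a 32- or 64-bit
    # kernel and Python build -- handy when tallying a mixed fleet.
    import platform

    print("machine:", platform.machine())          # e.g. i686 vs x86_64
    print("python :", platform.architecture()[0])  # e.g. 32bit vs 64bit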


Monday, September 5, 2011

Designing Storage Architectures for Preservation Collections

Tech@ICPSR will be heading to DC in late September to attend another annual installment of Designing Storage Architectures for Preservation Collections hosted by the Library of Congress (link to last year's meeting page).  This promises to be another useful meeting, and as I've done in the past, I'll post some of my notes from the meeting.

The past couple of meetings have given considerable attention to the cost of the hardware and software systems that supply the basic storage platform.  There are usually a lot of interesting tidbits in those conversations, but my sense is that the people costs of ingest and curation are the major costs at ICPSR.  I can't tell if that's unusual amongst this crowd (e.g., they have way more content than we do, and it requires far less human touch), or if it is the proverbial elephant in the room that no one mentions.

Friday, September 2, 2011

TRAC: B6.9: Fulfilling and rejecting requests

B6.9 Repository demonstrates that all access requests result in a response of acceptance or rejection.

Eventually a request must succeed or fail, and there must be limits on how long it takes for the user to know this. Access logs are the simplest way to demonstrate response time, even if the repository does not retain this information for long. However, a repository can demonstrate compliance if it can show that all failed requests result in an error log of some sort, and that requests are bounded in duration in some way.

Evidence: System design documents; work instructions (if DIPs involve manual processing); process walkthroughs; logs of orders and DIP production.






We capture this type of information in two places.

One location is the standard Apache httpd access and error logs.  These record access attempts, failures, and much more.  The logs grow quite large, so we keep only a limited history.
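
As an illustration of how those logs support this requirement, here is a minimal sketch that tallies accepted versus rejected requests from a Common Log Format access log.  The log path and the "status below 400 means accepted" rule are simplifications, not our production reporting code.

    # Summarize an Apache access log into accepted vs. rejected requests.
    import re
    from collections import Counter

    LOG = "/var/log/httpd/access_log"   # placeholder path
    # Common Log Format: host ident user [time] "request" status bytes
    LINE = re.compile(r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+')

    def summarize(path):
        counts = Counter()
        with open(path) as log:
            for line in log:
                match = LINE.match(line)
                if not match:
                    continue
                status = int(match.group(1))
                counts["accepted" if status < 400 else "rejected"] += 1
        return counts

    if __name__ == "__main__":
        print(summarize(LOG))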

The other location is what we call our "order history" database.  This database contains very detailed records of accepted requests, and we keep those records indefinitely.  Because our delivery system evolves over time, older records contain less information than newer records; for example, records that predate MyData accounts and profiles do not contain this information.