Technology at ICPSR: 2009

Friday, December 25, 2009

TRAC: C1.9: Change testing

C1.9 Repository has a process for testing the effect of critical changes to the system.

Changes to critical systems should be, where possible, pre-tested separately, the expected behaviors documented, and roll-back procedures prepared. After changes, the systems should be monitored for unexpected and unacceptable behavior. If such behavior is discovered the changes and their consequences should be reversed.

Whole-system testing or unit testing can address this requirement; complex safety-type tests are not required. Testing can be very expensive, but there should be some recognition of the fact that a completely open regime where no changes are ever evaluated or tested will have problems.

Evidence: Documented testing procedures; documentation of results from prior tests and proof of changes made as a result of tests.

When I arrived at ICPSR in 2002 the same systems were used for both production services and for new development. In fact, the total number of systems at ICPSR was very small; for example, a single machine was our production database machine, our shared "time sharing" machine for UNIX-based data processing, our anonymous ftp server, and pretty much everything else.

Our model is quite different these days. Most of our software development occurs on the desktop of the programmer, and new code is rolled out to a staging server for testing and evaluation. The level and intensity of the testing varies widely at ICPSR; my sense is that public-facing systems get the most evaluation, and internal-only systems receive very little. Nonetheless, unless the evaluation reveals serious flaws, the software is then rolled into production on a fixed deployment schedule. Because all software resides in a repository, we can back out changes easily if needed.

The last six software developers we've hired have all worked in the Java environment, and we're in the process of moving our Perl/CGI team to Java as well. My sense is that getting all of the major systems in Java will make it easier to use unit-testing tools, like JUnit for example.
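To make the unit-testing idea concrete, here is a minimal sketch written with Python's built-in unittest module rather than JUnit. The function under test, normalize_study_id, is purely hypothetical and does not correspond to any ICPSR system; it only gives the tests something to exercise.

```python
import unittest


def normalize_study_id(raw):
    """Hypothetical helper: turn input such as ' icpsr 4201 ' into 'ICPSR-04201'."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    if not digits:
        raise ValueError(f"no study number in {raw!r}")
    return f"ICPSR-{int(digits):05d}"


class NormalizeStudyIdTest(unittest.TestCase):
    def test_pads_and_prefixes(self):
        self.assertEqual(normalize_study_id(" icpsr 4201 "), "ICPSR-04201")

    def test_rejects_input_without_number(self):
        with self.assertRaises(ValueError):
            normalize_study_id("icpsr")


def run_tests():
    """Run the test case above and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeStudyIdTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Running a suite like this on the staging server before each scheduled deployment would give a documented, repeatable record of what was tested.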

Monday, December 21, 2009

CNI Fall 2009 Membership Meeting

I gave a project briefing at the Coalition for Networked Information (CNI) Fall 2009 Membership Meeting on ICPSR's work on cloud computing and confidential research data. I have placed a copy of the presentation deck at SlideShare.

Most of the talks I attended were quite good, and the brief notes I've entered here are by no means complete summaries. But they will give people some flavor of the meeting, and the types of topics that one will find at a CNI meeting. I should note that I really find the meetings useful; it is a great way to keep up with what's going on at the intersection of IT, libraries, and data, and I usually meet several interesting people.

Opening Plenary - Overview of the 2009-2010 Program Plan (Cliff Lynch) - As usual Cliff opened the meeting and went through the CNI Program Plan for 2009-2010, hitting a wide array of topics including open data, open access, the financial crisis in higher education (particularly in the UC system), sustainability, linked data, the contrast between the centralized databases of the 70s, 80s, and 90s v. more diffuse collections today, and reaching deeper into membership organizations.

He drew a distinction between data curation (focus on re-use and the lifecycle) and data preservation (focus on long-term retention). My recollection is that he thought the former was more likely to attract community engagement, and the latter was a tough sell to funders, membership organizations, and business. I've heard others make similar comments, most recently Kevin Schurer from the UK Data Archive, who distinguished between research data management and data preservation.

Cliff then spoke about the usefulness of attaching annotations to networked information, perhaps in reference to a talk (which I wasn't able to attend) from the Open Annotation Community project later in the day.

Thorny Staples and Valerie Hollister gave a brief talk about DuraSpace's work to facilitate "solution communities" to help people solve problems using Fedora Commons and/or DSpace.

Randy Frank gave Internet2 kudos for creating good tech for demos and labs, but told the audience in his project briefing that he wanted to bring the tech closer to the production desktop at member institutions.

Simeon Warner described how arXiv would be soliciting its top downloaders for donations to help keep the service running. Its current host, Cornell University Library, spends about $400k per year (largely on people) for the service, and naturally they would like to find others to help pay for this community service.

Friday, December 11, 2009

TRAC: C1.8: Change management

C1.8 Repository has a documented change management process that identifies changes to critical processes that potentially affect the repository’s ability to comply with its mandatory responsibilities.

Examples of this would include changes in processes in data management, access, archival storage, ingest, and security. The really important thing is to be able to know what changes were made and when they were made. Traceability makes it possible to understand what was affected by particular changes to the systems.

Evidence: Documentation of change management process; comparison of logs of actual system changes to processes versus associated analyses of their impact and criticality.

Establishing and following a change management process is a lot like stretching before working out. We know we should do it, we feel better when we do, but, wow, is it ever hard to make a point to do it.

I've used a change management process in the past. At ANS Communications we would only make major configuration changes to the network on certain days and at certain times, and we would drive the changes from a central database. This was particularly important when we made a long and very painful transition away from the NSFnet-era equipment that had formed our backbone network to new equipment from a company known as Bay Networks.

At ICPSR I think we could use something fairly straightforward. There are only a handful of critical software systems, and they don't change that often. We already track software-level changes in CVS, and we already announce feature-level changes to the designated community (e.g., ICPSR staff for internal systems), and so we might pull it all together by linking the announcements with the code changes in JIRA. I could also imagine a thread on our Intranet (which is Drupal-based) which could form a central summary of changes: what, when, how, and links to more details.
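One way to picture that central summary is as a small record per change that ties the announcement to the code revisions. The sketch below is only an illustration: the field names, issue keys, and revision numbers are all hypothetical, and nothing here reflects an actual JIRA or CVS interface.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChangeRecord:
    system: str        # which critical system changed
    when: date         # deployment date
    what: str          # feature-level summary announced to staff
    issue_key: str     # hypothetical tracker key, e.g. "WEB-123"
    revisions: tuple   # hypothetical repository revisions behind the change


def changes_for(records, system):
    """Return the records for one system, oldest first, for traceability."""
    return sorted((r for r in records if r.system == system), key=lambda r: r.when)


def summary_line(record):
    """Render one line of a central 'what, when, how' summary."""
    revs = ", ".join(record.revisions)
    return (f"{record.when.isoformat()} {record.system}: "
            f"{record.what} [{record.issue_key}; {revs}]")
```

Keeping the records immutable (frozen) matches the intent of a change log: entries are appended, never edited after the fact.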

Monday, December 7, 2009

A Newbie's Guide to Serving on an NSF Panel

I had the opportunity to serve on a National Science Foundation review panel a while ago. It was quite an undertaking to read through many different kinds of proposals, re-read them to the point of really understanding what the proposal author was saying, and then to distill it down to a brief summary, the strengths and weaknesses, and how the proposal does (or doesn't) address the goals of the program well. Pretty exhausting!

But... It was great; very, very interesting to go through the process, see how the system works, and participate in the system. Having done it once, I'd love to do it again. And it certainly gives one a very fine appreciation for what to do (and what to avoid) in a proposal so that the job of the reviewer is made easier. Even something seemingly simple, like page numbers, ends up being pretty important.

The details of the actual panel (the specific panel, program, and set of reviewers), and the content of the proposals need to be kept confidential. However, I kept thinking "Hey, I wish I knew about that before I started on this." And so I offer my list of eleven things you should know, do, or not do if you are about to serve on your first review panel.

Do not print the proposals. You won't need them at the NSF, and you'll just need to shred them later. Everything you'll need is on-line in the Interactive Panel system. And there may not be much space at the panel to keep the hard copies handy.
Do not bring a laptop. The nice folks at the NSF supply a laptop with a wired network connection, and convenient desktop icons to all of the stuff you'll need. If you bring your own laptop, you'll need to have the NSF IT guys scan it before they'll allow it to connect to the network. And it won't have all of the convenient desktop icons. (All this said, I still brought my little HP Mini to use at the hotel.)
Bring water. The government doesn't believe in bottled water, and if you like to drink plenty of water during the workday, you'll wish you had ready access to some bottles.
Being a Scribe is a lot more work than being a Lead. Reading and reviewing proposals is a lot of work. For some, you'll also be the Lead. That's actually not a big deal at all; it just means that you have to give a brief summary of the proposal to the other reviewers, and hit its strengths and weaknesses. Ideally you'll also drive the conversation around the proposal, but the NSF program officers are there to help out, and they are really good at this sort of thing. Now, for some proposals, you may be a Scribe, and that is a lot of work since it's up to you to take the minutes of the conversation, generate a nice summary that reflects all the key evaluation points, and then coax the rest of the reviewers to read and approve your work.
Invest plenty of time in your reviews. Take your time. Write good, clear prose. But be succinct. Make it easy to read so that your other reviewers can read through it quickly. You'll be glad you took the time to do this when you're the Lead. And the other reviewers will also thank you for it.
Keep your receipts. The government will be paying you via a 1099, and you'll want to deduct the expenses from your income.
Be sure to select the box on FastLane to disable the automatic screen refresh. If you don't do this, the system will refresh the screen on you at the most inopportune times.
Submit your reviews early. The program officers and other reviewers will be able to factor in your comments if they have them before the panel. Definitely do not wait until you get to the panel. And be sure that you use the Submit Review button, not the Save Review button, once your review is ready.
Be very clear about the review criteria. In addition to any standard criteria, there may also be program-specific criteria, and perhaps additional specific areas which will require extra attention.
The Hilton Arlington is very, very close to the NSF. There is even a covered, overhead pedestrian walkway from the Hilton to the Stafford Building. This can be very nice if it is raining.
The NSF is at 4201 Wilson Blvd. It is in a large, shared office building called Stafford Place. I'm not sure that I ever found the street address in any convenient place on the NSF Visitors page.

Friday, November 20, 2009

ICPSR and the Cloud

Interest in the cloud is heating up on the University of Michigan campus. In the past day or two I've seen surveys asking campus IT leaders to comment on their exploratory interest in the cloud, and have also answered email queries about who is dabbling with the cloud.

ICPSR started exploring the Amazon cloud in late 2008, and by early 2009 we had set up our first production service, a stealth authoritative DNS server for our domains, and a replica of our Web service infrastructure. We're primarily users of the Elastic Compute Cloud (EC2) and Simple Storage Service (S3). We're also looking at CloudFront, but to be honest, I'm not sure we generate the volume of traffic that would make it super useful to us.

Since then we've also launched an instance in Amazon's EU zone for disaster recovery purposes, and launched a handful of new sites on cloud instances rather than local hardware. Here's a complete roster of instances as of today:

Web server replica
Oracle database replica
CCERC Web replica
Solr search replica
Stealth authoritative DNS server (used in our DR process)
Teaching With Data NSDL Pathway production service
Teaching With Data NSDL Pathway staging server
LDAP server supporting TWD
SSDAN next generation production service
SSDAN next generation development and staging server
Content server for deep DR (located in the EU zone)

I think it's likely we'll move some of the replicas we run on behalf of the Minnesota Population Center to the cloud as well.

We've had great experiences with Amazon Web Services (AWS) so far: very low barrier to entry, and a very nice management tool in Elasticfox. The on-going operations have been stable and secure, and the price is right. And while I'm not sure I'd rely solely on the cloud for my archival storage solution, using the cloud to stash away one additional copy is very attractive.

Thursday, November 19, 2009

The ICPSR Pipeline Process

After arriving at ICPSR in 2002 one of the first things Myron asked me to do was to automate the "data processing" function at ICPSR. As I began exploring that business process, it became very clear to me that (1) the process was actually a series of six or more inter-related processes with different owners, inputs, and outputs, and (2) no single person at ICPSR had a crisp, clear understanding of the entire process. And so my dilemma: How to design and build a software system to facilitate a process which isn't documented, and isn't even well understood?

Fortunately a colleague of mine at UUNET agreed to join ICPSR: Cole Whiteman. Cole is very strong at analyzing and documenting business process, and has a particular gift for coaxing out the details and then rendering the system in an easy to understand, easy to read format. I've included the latest "Whiteman" above as a sample of his art.

Cole spent many months interviewing staff, drawing pictures, interviewing more staff, refining pictures, and so on, until he had a picture that both generated agreement - "Yep, that's the process we use!" - and demonstrated bottlenecks. Now the way was clear for automation.

Consequently, ICPSR has invested tremendous resources over the past few years building a collection of inter-connected systems that enable workflow at ICPSR. These workflow systems now form the core business process infrastructure of ICPSR, and give us the capability to support a very high level of business. When talking to my colleagues at other data archives, my sense is that ICPSR has a unique asset. Here's a thumbnail sketch of the systems.

Deposit Form - This system manages information from the time of its first arrival via upload, until the time the depositor signs the form, transferring custody to ICPSR. The form has the capacity to collect a lot of descriptive metadata at the start of the process, and also automatically generates appropriate preservation metadata upon custody (e.g., fingerprints for each file deposited).
Deposit Viewer - This might be more appropriately named the Deposit Manager since it not only lets ICPSR staff search, browse, and view metadata about deposits, it also enables staff to manage information about deposits. For example, this is the tool we use to assign a deposit to a data manager. We also use this tool to connect deposits to studies.
Metadata Editor - This is the primary environment for creating, revising, and managing descriptive and administrative metadata about a study. Abstracts, subject terms, titles, etc. are all available for management, along with built-in connections to ICPSR business rules that control or limit selections. The system also contains the business logic that controls quality assurance.
Hermes - Our automation tool for producing the ready-to-go formats we deliver on our web site, and variable-level DDI XML for digital preservation. This system takes an SPSS System file as its input, and produces a series of files as output, some of which end up on our Web site for download, and others of which enter our archival storage system.
Turnover - Data managers use this tool to perform quality assurance tests on content which is ready for ingest, and to queue content both for insertion into archival storage and for release on the ICPSR Web site. An accompanying web application enables our release management team to accept well-formed content, and to reject objects which aren't quite ready for ingest.
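The fingerprinting step mentioned under the Deposit Form can be sketched roughly as follows. This is an illustration only, not the Deposit Form's actual code: the choice of SHA-256 and the function names are assumptions.

```python
import hashlib
from pathlib import Path


def fingerprint(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of one file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)  # algorithm choice is an assumption
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(deposit_dir):
    """Map each file's path (relative to the deposit) to its fingerprint."""
    root = Path(deposit_dir)
    return {
        p.relative_to(root).as_posix(): fingerprint(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

A manifest like this, captured at the moment custody transfers, gives later pipeline stages a fixed reference against which to confirm that nothing changed along the way.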

Wednesday, November 18, 2009

Good to Great

I recently finished reading Good to Great by Jim Collins. I used to read business-oriented books on a more regular basis when I was working for America Online's ANS Communications division and then UUNET, and it was nice returning to that style.

The subtitle of the book is Why Some Companies Make the Leap... and Others Don't. And while the book is focused on the business world, and a common metric of success such as exceeding the average return in the major stock markets, it would be a mistake to think that this book can't teach us about the not-for-profit world that ICPSR occupies.

One tenet of the story the book tells is that organizations often lose their focus, wander into the weeds, and then suffer failure, sometimes catastrophic failure. The successful companies figure out their core mission, keep it simple, and then slowly but inexorably gain momentum to dominate and win. For example, the book contrasts the story of Gillette and Warner-Lambert. While the former focused squarely on its core, Warner-Lambert flailed between different goals, eventually being swallowed up by Pfizer.

The book refers to this type of focus as the Hedgehog Concept and breaks it into three elements:

What you are deeply passionate about
What drives your economic engine
What you can be the best in the world at

My sense is that this is an important message, particularly for successful organizations. It's easy to grow heady with success and start chasing bigger and more diverse deals, losing focus on what led to success.

Another interesting element of successful organizations was their use of "stop-doing" lists. While all organizations keep track of their "to-do" lists, which get longer and longer and longer and ..., the highly successful organizations made conscious decisions about what to stop doing. This too resonates with me, and my experience is that if organizations don't make the hard decisions about what to stop doing, they end up spreading their resources too thinly, and then nothing gets done well.

A final interesting item I'll note here is how the budget process is described at highly successful organizations. It isn't an opportunity to ration income across a myriad of areas; rather it is an exercise to decide which areas are core and should be funded fully and completely, and which areas are not core, and should be funded not at all. Once again the root message is about focus.

There are many other very interesting observations from the research behind the book, and I'd recommend it to anyone who plays a leadership role at an organization.

Monday, November 16, 2009

ICPSR Content and Availability

Legend:

Blue = Archival Storage
Yellow = Access Holdings
Green = both Archival Storage and Access Holdings
Red Outline = Web-delivered copy of Access Holdings

We're getting close to the one-year anniversary of the worst service outage in (recent?) ICPSR history. On Monday, December 28th, 2008 powerful winds howled through southeastern lower Michigan, knocking out power to many, many thousands of homes and businesses. One business that lost power was ICPSR.

No data was lost, and no equipment was damaged, but ICPSR's machine room went without power nearly until New Year's Day. In many ways we were lucky: The long outage happened during a time when most scholars and other data users are enjoying the holidays, and there was no physical damage to repair. The only "fix" was to power up the equipment once the building had power again.

However, this did serve as a catalyst for ICPSR to focus resources and money on its content delivery system, and therefore on its content replication story too. Some elements of the story below predate the 2008 winter storm, but many of the elements are relatively new.

ICPSR manages two collections of content: archival storage and access holdings.

Archival storage consists of any digital object that we intend to preserve. Examples include original deposits, normalized versions of those deposits, normalized versions of processed datasets, technical documentation in durable formats such as TIFF or plain text, metadata in DDI XML, and so on. If a particular study (collection of content) has been through ICPSR's pipeline process N different times, say due to updates or data resupplies, then there will be N different versions of the content in archival storage.

Access holdings consist of only the latest copy of an object, and often include formats that we do not preserve. For example, while we might preserve only a plain text version of a dataset, we might make the dataset available in contemporary formats such as SPSS, SAS, and Stata to make it easy for researchers to use. Anything in our access holdings would be available for download on our Web site, and therefore doesn't contain confidential or sensitive data. Much of the content, particularly more modern files, would have passed through a rigorous disclosure review process.

The primary location of ICPSR's archival storage is an EMC Celera NS501 Network Attached Storage device. In particular, a multi-TB filesystem created from our pool of SATA drives provides a home for all of our archival holdings.

ICPSR replicates its archival storage in three locations:

San Diego Supercomputer Center (synchronized via the Storage Resource Broker)
MATRIX - The Center for Humane Arts, Letters, & Science Online at Michigan State University (synchronized via rsync)
A tape backup system at the University of Michigan (snapshots)

We are also working on adding a fourth replica at the H. W. Odum Institute for Research in Social Science at the University of North Carolina - Chapel Hill.

Some of our content stored at the San Diego Supercomputer Center - a snapshot in time from 2008 - is also replicated in the Chronopolis Digital Preservation Demonstration Project, and that gives us two additional copies of many objects.

An automated process compares the stored digital signature of each object in archival storage with a digital signature calculated "on the fly." If the signatures do not match, the object is flagged for further investigation.
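A check of this kind can be sketched in a few lines. The version below is a minimal illustration, not ICPSR's actual tool: it assumes the recorded signatures are SHA-256 digests held in a simple mapping, and the function names are made up for the example.

```python
import hashlib
from pathlib import Path


def digest_of(path, algorithm="sha256"):
    """Compute a file's digest "on the fly" by streaming its contents."""
    h = hashlib.new(algorithm)  # algorithm choice is an assumption
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def audit(root, recorded):
    """Compare recorded digests with fresh ones; return objects to investigate.

    `recorded` maps relative paths to the digest stored at ingest time.
    """
    flagged = {}
    for rel_path, expected in recorded.items():
        target = Path(root) / rel_path
        if not target.is_file():
            flagged[rel_path] = "missing"
        elif digest_of(target) != expected:
            flagged[rel_path] = "digest mismatch"
    return flagged
```

An empty result means every object matched its recorded signature; anything else is a candidate for repair from one of the replicas.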

The primary location for ICPSR's access holdings is also the EMC NAS. But in this case, the content is stored on a much smaller filesystem built from our pool of high-speed, FC disk drives.

ICPSR replicates its access holdings in five locations:

San Diego Supercomputer Center (synchronized via the Storage Resource Broker)
A tape backup system at the University of Michigan (snapshots)
A file storage cloud hosted by the University of Michigan's Information Technology Services
An Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance located in the EU region
An Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance located in the US region

Only the last replica above contains the necessary software and support systems (e.g., an Oracle database system) to actually deliver ICPSR's content; all of the other systems contain a complete snapshot of our access holdings, but not the platform with which to deliver the content.

The AWS-hosted replica has been used twice so far in 2009. We performed a "lights out" test of the replica in mid-March, and we performed a "live" failover due to another power outage in May. In both cases the replica worked as expected, and the amount of downtime was reduced dramatically.

And, finally, our access holdings and our delivery platform are available on the ICPSR Web staging system. But because the purpose of this system is to stage and test new software and new Web content, this is very much an "emergency only" option for content delivery.

Friday, November 13, 2009

TRAC: C1.7: Refreshing/migrating content

C1.7 Repository has defined processes for storage media and/or hardware change (e.g., refreshing, migration).

The repository should have triggers for initiating action and understanding of how long it will take for storage media migration, or refreshing — copying between media without reformatting the bitstream. Will it finish before the media is dead, for instance? Copying large quantities of data can take a long time and can affect other system performance. It is important that the process includes a check that the copying has happened correctly.

Repositories should also consider the obsolescence of any/all hardware components within the repository system as potential trigger events for migration. Increasingly, long-term, appropriate support for system hardware components is difficult to obtain, exposing repositories to risks and liabilities should they choose to continue to operate the hardware beyond the manufacturer or third-party support.

Evidence: Documentation of processes; policies related to hardware support, maintenance, and replacement; documentation of hardware manufacturers’ expected support life cycles.

ICPSR's archival storage consumes less than 6 TB of storage today. Over the past month we've made copies in other locations, and the time to copy it across a network is anywhere from a day to a week, depending upon the speed of the network. So that's much shorter than the lifespan of the media. :-)
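As a back-of-the-envelope check on the "day to a week" range, the copy time is just the size divided by the effective throughput. The sketch below takes 6 TB as an upper bound; the link speeds and the 80% efficiency factor are assumptions chosen for illustration, not measurements.

```python
def transfer_days(size_tb, link_mbps, efficiency=0.8):
    """Estimate days needed to copy `size_tb` terabytes over a `link_mbps` link.

    `efficiency` is an assumed fraction of the nominal rate actually achieved.
    """
    bits = size_tb * 1e12 * 8  # decimal terabytes to bits
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86_400


for mbps in (100, 1000):
    print(f"{mbps:>5} Mb/s: {transfer_days(6, mbps):.1f} days")
```

Under those assumptions the estimate comes to roughly 6.9 days at 100 Mb/s and 0.7 days at 1 Gb/s, which brackets the range observed in practice.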

The master copy resides on an EMC Celera NAS. From time to time one of the SATA drives that underpins archival storage will fail, and the Celera will fail over to its hot spare, and make a phone call for EMC to schedule a replacement. And so, in some odd way, the media gets refreshed on an incremental basis slowly over time.

We bought our Celera in 2005, and my expectation is that we'll likely replace it with something else in 2010; 2011 at the very latest. And so it's timely to start thinking about a written procedure for moving the master copy of the content from the Celera to the next storage platform. I don't think it will be a complicated procedure, and putting it together might make for a good future post.

Friday, November 6, 2009

Back to the Fedora: Part 4

This is the final post in the series.

So far we have introduced a pair of Content Model objects: one for social science data, and one for social science data documentation. In this post we introduce a third Content Model object for social science: an aggregate level object that has some content of its own (descriptive metadata and preservation metadata), but serves largely to group together related objects.

The Content Model object is to the left. It must have two Datastreams: one for the descriptive metadata in DDI XML format, and one for preservation metadata in PREMIS XML format. Note that we may discover that we can use DDI for both purposes, and in that case, the PREMIS Datastream will drop out as a required element.

Like past posts, the image to the left is a link to the ICPSR Fedora test repository, and will return the "home page" for the Content Model object pictured.

To the right we have a Fedora data object which conforms to the Content Model above.

Like the Content Model image, this image is also a link to our Fedora test repository, and clicking it will navigate to the matching data object.

This object has one relationship asserted per member object. In this case we assert three hasMember relationships: one for the survey data object; one for the survey documentation object; and, one for the survey instrument object. These correspond to isMemberOf relationships asserted in those objects, and together they assert a series of bilateral relationships.
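Because each relationship is asserted twice, once in the aggregate and once in the member, it is easy for the two sides to drift apart. The following sketch shows one way to spot one-sided assertions. It works on plain dictionaries rather than on a live Fedora repository, and the function name and any identifiers passed to it are hypothetical.

```python
def missing_reciprocals(has_member, is_member_of):
    """List (aggregate, member) pairs asserted in only one direction.

    has_member:   aggregate PID -> set of member PIDs (hasMember assertions)
    is_member_of: member PID -> set of aggregate PIDs (isMemberOf assertions)
    """
    forward = {(agg, m) for agg, members in has_member.items() for m in members}
    backward = {(agg, m) for m, aggs in is_member_of.items() for agg in aggs}
    # The symmetric difference keeps pairs present on exactly one side.
    return sorted(forward ^ backward)
```

An empty list means every hasMember assertion has a matching isMemberOf and vice versa.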

The object contains the two required Datastreams. In this case the actual XML is somewhat stylized, and may not be "clean" XML. In particular the PREMIS Datastream is very much a work in progress here at ICPSR, and may bear little resemblance to high-quality PREMIS XML.

Thursday, November 5, 2009

SUMIT 2009 followup

This is a follow-up post to my short piece on SUMIT 09, the U-M IT security symposium.

The talk by Moxie Marlinspike was really, really good, and pretty scary. I found a copy of his presentation on the Black Hat site, and while you won't get his commentary by just looking through the deck, you'll definitely come to understand how weak many implementations of SSL are (were?), and how Moxie was able to exploit them. If you have traditionally felt pretty secure when using a web site via SSL, make heavy use of software with automated updates and downloads (like Mozilla), or think you can avoid problems by typing the URL into the address bar v. clicking links on web pages, this will make you reconsider your position.

I also started poking around his web site, thoughtcrime.org, and highly recommend reading some of his stories. I've read all but a few, and most have been pretty interesting. Not at all techie stuff; just good reads.

Wednesday, November 4, 2009

TRAC: C1.6: Reporting and repairing loss

C1.6 Repository reports to its administration all incidents of data corruption or loss, and steps taken to repair/replace corrupt or lost data.

Having effective mechanisms to detect bit corruption and loss within a repository system is critical, but is only one important part of a larger process. As a whole, the repository must record, report, and repair as possible all violations of data integrity. This means the system should be able to notify system administrators of any logged problems. These incidents, recovery actions, and their results must be reported to administrators and should be available.

For example, the repository should document procedures to take when loss or corruption is detected, including standards for measuring the success of recoveries. Any actions taken to repair objects as part of these procedures must be recorded. The nature of this recording must be documented by the repository, and the information must be retrievable when required. This documentation plays a critical role in the measurement of the authenticity and integrity of the data held by the repository.

Evidence: Preservation metadata (e.g., PDI) records; comparison of error logs to reports to administration; escalation procedures related to data loss.

My sense is that this requirement is just as much about policy as it is about process. Fortunately for our data holdings (but unfortunately for TRAC preparation), data loss or corruption is a very infrequent event, and therefore as one might expect, the set of policies and written processes is pretty small.

As a point of comparison, if we look at our policies and processes for handling loss with "working files" we will find a much richer set of policies and systems. We have established infrastructure (an EMC Network Attached Storage (NAS) appliance and associated Dell tape management solution); we have internal policies and processes that document how to retrieve lost content; we have external policies that describe which parts of the NAS are written to tape, and the schedule of tape backups; and, we exercise the system on a regular basis as people inadvertently delete or damage files with which they are working actively.

On the Archival Storage side - or even the Access side, where we also look for loss and corruption - the number of data loss or data corruption events is very, very low. Email reports come out on a regular basis, but they (almost) always say that everything is fine. And on that rare occasion where there is an issue, the remedy is quick.

Perhaps the right solution here is to use the small sample of issues that have arisen over the years as our baseline for writing up a process, and then posting that process on our internal web site. That would be easy to do. But then the concern is this: If a policy is used very, very infrequently, it is likely to fall into disrepair. It is also likely to become forgotten. Maybe the tool that examines for loss or corruption should also contain a link to the relevant policies and recovery processes?
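The idea of having the checking tool point at the policy could look something like the sketch below, which builds the text of a periodic report and always appends a link to the recovery procedure. The report layout and the policy location are hypothetical, chosen only to illustrate the idea.

```python
def fixity_report(checked, flagged, policy_url):
    """Build the text of a periodic integrity report.

    checked:    number of objects examined
    flagged:    dict of object identifier -> problem description
    policy_url: where the recovery procedure lives (hypothetical location)
    """
    lines = [f"Objects checked: {checked}", f"Problems found: {len(flagged)}"]
    for name in sorted(flagged):
        lines.append(f"  {name}: {flagged[name]}")
    # Always include the link so the procedure stays visible even when all is well.
    lines.append(f"Recovery procedure: {policy_url}")
    return "\n".join(lines)
```

Including the link in every report, not just the bad ones, keeps a rarely used procedure in front of the people who would have to follow it.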

What strategies have others used to address this TRAC requirement?

Monday, November 2, 2009

Confidential Data and the Cloud

I have a new post on our NIH Challenge Grant project, but it's in our project blog rather than here.

So for you loyal readers who follow this blog, but not our Challenge Grant blog, here's the link: http://enclavecloud.blogspot.com/2009/11/high-level-system-architecture.html

I'll also be giving a talk on this at the Fall 2009 Coalition for Networked Information (CNI) Membership Meeting. If you're there, please drop by to say hello!

Friday, October 30, 2009

Back to the Fedora: Part 3

This is the penultimate post in this series. The final post will describe an aggregate object ("study") that will contain relatively little content, but which serves as a grouping element for more basic elements.

The object to the left is a conventional Fedora Data Object, but I include it here as an example where we have important content to preserve and deliver, and where the content is somewhat of a "one off" and doesn't conform to a unique Content Model.

In this case we have the survey instrument that was used to collect the data in icpsr:eager-survey-data-25041.

The instrument is available in two different languages (English and Spanish), and while the original deposit was PDF-format, we have also produced a TIFF-format of each version for preservation purposes. This translates into a simple object with four Datastreams, one for each (language, format) combination.

We assert membership to the aggregate "study" object in RELS-EXT. We also assert a connection to the associated dataset using a custom relationship we minted: isInstrumentFor. It isn't clear (yet) if having a specialized relationship such as this will be any more useful than a less descriptive relationship (e.g., isRelatedTo, to make one up).

Wednesday, October 28, 2009

TRAC: C1.5: Detecting corruption and loss

C1.5 Repository has effective mechanisms to detect bit corruption or loss.

The repository must detect data loss accurately to ensure that any losses fall within the tolerances established by policy (see A3.6). Data losses must be detected and detectable regardless of the source of the loss. This applies to all forms and scope of data corruption, including missing objects and corrupt or incorrect or imposter objects, corruption within an object, and copying errors during data migration or synchronization of copies. Ideally, the repository will demonstrate that it has all the AIPs it is supposed to have and no others, and that they and their metadata are uncorrupted.

The approach must be documented and justified and include mechanisms for mitigating such common hazards as hardware failure, human error, and malicious action. Repositories that use well-recognized mechanisms such as MD5 signatures need only recognize their effectiveness and role within the overall approach. But to the extent the repository relies on homegrown schemes, it must provide convincing justification that data loss and corruption are detected within the tolerances established by policy.

Data losses must be detected promptly enough that routine systemic sources of failure, such as hardware failures, are unlikely to accumulate and cause data loss beyond the tolerances established by the repository’s policy or specified in any relevant deposit agreement. For example, consider a repository that maintains a collection on identical primary and backup copies with no other data redundancy mechanism. If the media of the two copies have a measured failure rate of 1% per year and failures are independent, then there is a 0.01% chance that both copies will fail in the same year. If a repository’s policy limits loss to no more than 0.001% of the collection per year, with a goal of course of losing 0%, then the repository would need to confirm media integrity at least every 72 days to achieve an average time to recover of 36 days, or about one tenth of a year. This simplified example illustrates the kind of issues a repository should consider, but the objective is a comprehensive treatment of the sources of data loss and their real-world complexity. Any data that is (temporarily) lost should be recoverable from backups.
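The arithmetic in the TRAC example above can be checked with a short sketch. The model here is an assumption on my part: annual loss is approximated as the chance that one copy fails, times the chance that the other copy fails during the window before the first is repaired. The function names are illustrative only.

```python
def max_recovery_window_years(annual_failure_rate, loss_tolerance):
    """Longest mean time to recover (in years) that keeps loss within tolerance.

    Assumes two independent copies and approximates annual loss as
    p * (p * window), where window is the fraction of a year a failed
    copy stays unrepaired.
    """
    both_fail_same_year = annual_failure_rate ** 2  # 1% * 1% = 0.01%
    return loss_tolerance / both_fail_same_year


def check_interval_days(window_years, days_per_year=365):
    """A failure happens, on average, halfway between two checks,
    so the check interval is twice the mean time to recover."""
    return 2 * window_years * days_per_year


if __name__ == "__main__":
    window = max_recovery_window_years(0.01, 0.00001)  # figures from the example
    print(f"mean time to recover: {window * 365:.1f} days")
    print(f"check interval: {check_interval_days(window):.1f} days")
```

Under these assumptions the window comes out at about a tenth of a year, in line with the rounded 36 and 72 days in the quoted text.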

Evidence: Documents that specify bit error detection and correction mechanisms used; risk analysis; error reports; threat analyses.

For each object in Archival Storage, ICPSR computes an MD5 hash. This "fingerprint" is then stored as metadata for each object.

Automated jobs "prowl" Archival Storage on a regular basis, computing the current MD5 hash for an object and comparing it to the stored version. In the case where the hashes differ, an exception is generated, and this information is reported to the appropriate staff for diagnosis and correction.
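A minimal sketch of this kind of fixity check follows. It is not our production job; the function names and the idea of passing the stored hashes as a simple path-to-hash mapping are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exceptions(stored_hashes):
    """Compare each file's current MD5 with the stored one.

    `stored_hashes` maps a file path to the hash recorded as metadata
    (a hypothetical stand-in for wherever that metadata really lives).
    Returns (path, reason) tuples for staff to diagnose.
    """
    exceptions = []
    for path, expected in stored_hashes.items():
        if not Path(path).is_file():
            exceptions.append((path, "missing"))
        elif md5_of(path) != expected:
            exceptions.append((path, "hash mismatch"))
    return exceptions
```

Reading in chunks keeps memory use flat even for very large files, and reporting a missing file separately from a mismatch makes the two kinds of loss easier to tell apart.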

In practice we see very few exceptions such as these, and the most common cause is a blend of human error and software failing to handle the error gracefully.

Recovery is quick. If the problem was caused by human error and the mtime (last-modified) timestamp has changed, then any copies managed via rsync may also be damaged, and we instead need to fetch the original object from a different source (e.g., tape or a copy managed via SRB's Srsync). If the problem occurred without the mtime also changing, then we also have the option of fetching an original copy from one of our rsync-managed copies.

Tuesday, October 27, 2009

Exciting News from Amazon

Amazon announced three new offerings in their cloud platform today. All sound very interesting, and all have potential utility to ICPSR.

One, Amazon now offers a bona fide relational database (MySQL-type) in the cloud. They handle the patching, scaling, and other classic DBA functions; you provide the data. We use Oracle heavily today, but make little use of Oracle-only features.

Two, they are now offering "high-memory" instances: the High-Memory Double Extra Large Instance, with 34.2 GB of memory, 13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each), 850 GB of instance storage, and a 64-bit platform ($1.20/hour); and the High-Memory Quadruple Extra Large Instance, with 68.4 GB of memory, 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of instance storage, and a 64-bit platform ($2.40/hour).

Three, they are dropping the price of "on-demand" instances by 15% effective Nov 1. We've switched to reserved instances for some of our long-lived virtual systems, but we still have a handful of on-demand systems, and so this will have an immediate positive impact on our monthly bill.

Definitely a nice "treat" from Amazon this Halloween!

Thursday, October 22, 2009

How To Lose a Customer

I visited the web site of a major domain registry this afternoon, logged in, and saw that ICPSR had zero domains registered with them.

I smiled.

It wasn't always this way. Just a few months ago I registered seven new domains with this company to support our project to build and host a National Science Digital Library Pathway for quantitative social science. These seven domains - teachingwithdata.net is one - joined dozens of others I had registered with them over the years. We were a pretty good customer.

Now, the domain registration game has always seemed like a scam to me. Why it costs $20 or more per year for someone to take information that I enter into a web form, and hand it off to other registries and DNS root operators, I cannot fathom. Surely this is a business where the profit margins are unconscionably high. And yet I was OK with giving them hundreds of dollars every year for the privilege of entering registry information into their web site.

But then they broke their end of the promise.

They may not have known it, but by charging me these hundreds of dollars and forcing me to use their web site to manage my information, they were establishing a de facto promise: "We will take your money, we will give you poor tools, but in return, we will cause you no harm."

And then they did.

A software developer on my team noticed that the recently registered NSDL domains weren't working. Instead of the root DNS servers delegating the domains to us, they were still listed with the registry's DNS servers. At first I thought that I had screwed up. The tools are pretty bad, and it was certainly possible that as I was attempting to avoid all of the "upgrades" I was being offered ("Private registrations!"), I neglected to click the right series of icons and links to delegate the domains. And so I went back to them and delegated the domains again.

But, by the next morning, my changes had been discarded. Silently.

I tried again. And again, my changes appeared to work, but later were discarded without notice.

I opened up a trouble ticket. I received an auto-reply, and then a follow-up that (1) closed the ticket, and (2) gave me the URL of a web site that I could use to open a trouble ticket. Nice.

And so I did what any reasonable consumer would do: I changed vendors.

To their credit, the registry performed at their very best as I transferred domains away. Sure, the tools were still just as poor, but when they didn't work, they helped me out. No valid Administrative Contact listed in WHOIS despite one being listed with the registry? No apparent way to fix it? No problem, the registry solved the problem in three days. Within a week or two I had transferred away all of our domains.

My new registry is the University of Michigan, which acts as a front-end for Tucows. UMich doesn't make me use any awful web forms, and they even answer the phone when I call. And they don't charge any more than the former registry.

It's enough to make me smile again.

Wednesday, October 21, 2009

TRAC: C1.4: Synchronizing objects

C1.4 Repository has mechanisms in place to ensure any/multiple copies of digital objects are synchronized.

If multiple copies exist, there has to be some way to ensure that intentional changes to an object are propagated to all copies of the object. There must be an element of timeliness to this. It must be possible to know when the synchronization has completed, and ideally to have some estimate beforehand as to how long it will take. Depending whether it is automated or requires manual action (such as the retrieval of copies from off-site storage), the time involved may be seconds or weeks. The duration itself is immaterial—what is important is that there is understanding of how long it will take. There must also be something that addresses what happens while the synchronization is in progress. This has an impact on disaster recovery: what happens if a disaster and an update coincide? If one copy of an object is altered and a disaster occurs while other copies are being updated, it is essential to be able to ensure later that the update is successfully propagated.

Evidence: Workflows; system analysis of how long it takes for copies to synchronize; procedures/documentation of operating procedures related to updates and copy synchronization; procedures/documentation related to whether changes lead to the creation of new copies and how those copies are propagated and/or linked to previous versions.

I think we have a good story to tell.

As new objects enter Archival Storage at ICPSR, they reside in a well-known, special-purpose location. Automated, regularly scheduled system jobs synchronize those objects with remote locations using standard, established tools such as rsync, as well as less common ones such as Srsync, one of the Storage Resource Broker (SRB) command-line utilities.

The output of these system jobs is captured and delivered nightly to a shared electronic mailbox. The mailbox is reviewed on a daily basis; this task belongs to the member of the ICPSR IT team who is currently on-call. When a report is missing or when a report indicates an error, the problem is escalated to someone who can diagnose and correct the problem. One common problem, for example, occurs when an object larger than 2GB enters Archival Storage and the SRB Srsync utility faults. (SRB limits objects to 2GB.) We then remove this object from the list of items to be synchronized with SRB.
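The 2GB workaround could be applied before the job runs rather than after it faults. The sketch below is only an illustration: the function name, the input mapping of object names to sizes, and the exact byte value used for "2GB" are all assumptions.

```python
SRB_SIZE_LIMIT = 2 * 1024**3  # "2GB" from the text; the exact byte count is assumed


def partition_by_size(sizes, limit=SRB_SIZE_LIMIT):
    """Split {object_name: size_in_bytes} into two sorted name lists:
    objects that fit under the limit, and objects that exceed it."""
    syncable = sorted(name for name, size in sizes.items() if size <= limit)
    too_large = sorted(name for name, size in sizes.items() if size > limit)
    return syncable, too_large
```

The second list is the one that would be left out of the SRB run and flagged in the nightly report, so the omission is recorded rather than silent.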

Because the synchronization process is incremental, it has a very short duration. However, if we needed to synchronize ALL content, it would take on the order of days or even weeks. For example, we recently synchronized a copy of our Access holdings to a computing instance residing in Amazon's EC2 EU-West region, and we found it took approximately one week to copy about 500GB. As another example, we recently synchronized a copy of our Archival Storage (which is much larger than the Access collection) to a system that, like ICPSR and the University of Michigan, is connected to Internet2's Abilene network, and that took far less time.
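One way to turn an observation like the EC2 copy into the "estimate beforehand" that TRAC asks for is to derive a throughput from a past run and apply it to a planned one. This is a rough sketch with illustrative function names; it assumes decimal gigabytes and a steady transfer rate, neither of which is guaranteed in practice.

```python
SECONDS_PER_DAY = 86400


def throughput_bytes_per_sec(bytes_copied, days_taken):
    """Average rate achieved by a past synchronization run."""
    return bytes_copied / (days_taken * SECONDS_PER_DAY)


def estimate_days(bytes_to_copy, rate_bytes_per_sec):
    """Days a full copy would take at the given average rate."""
    return bytes_to_copy / rate_bytes_per_sec / SECONDS_PER_DAY


if __name__ == "__main__":
    # About 500GB in about one week, as in the EC2 EU-West example.
    rate = throughput_bytes_per_sec(500e9, 7)
    print(f"observed rate: {rate / 1e6:.2f} MB/s")
    # A hypothetical 2TB collection over the same path.
    print(f"estimated full sync: {estimate_days(2000e9, rate):.0f} days")
```

The observed rate works out to a little under 1 MB/s, which is why a full copy over that path is measured in weeks.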

SUMIT 09 - Annual UMich IT Security Symposium

I attended a very interesting symposium at UMich on Tuesday. It's an annual event called SUMIT, and the focus is on IT-related security. The event includes a series of speakers who have interesting stories to tell, and this year was no exception.

I arrived rather late to the event, and only caught the final part of what appeared to be a very interesting talk by Wade Baker, Verizon Business Security Solutions: Cybercrime: The Actors, Their Actions, and What They're After. Wade's experience has been that data loss is often left undiscovered for five or six months, and is often discovered only when that data is used to commit a crime, such as fraud. His sense is that targets are often repositories of information rather than individual systems (e.g., credit companies v. a home PC with information about only a single credit card). He went on to say that most organizations do not know where most of their sensitive data is located; they'll believe that it is located only in areas X and Y, but then discover that someone made a copy in area Z as well. When asked by the audience what single activity is most effective at increasing data security, Wade suggested audits: Organizations often have adequate security policies in place, but all too often they are not followed or enforced, and an audit will reveal this.

The second speaker, Moxie Marlinspike, Institute of Disruptive Technologies, gave a very, very interesting talk entitled Some Tricks for Defeating SSL in Practice. Moxie gave a detailed and clear explanation of a tool he created, sslsniff, and how it can be used in a man-in-the-middle attack to hijack a supposedly secure web connection using SSL. Further, by taking advantage of weak integrity checking by both certificate authorities and certificate-handling software, he demonstrated how one can obtain a "wildcard cert" which allows one to spoof many different web sites. And, as if that isn't scary enough, he also demonstrated how this allows one to inject software onto a machine via automated software-update jobs (e.g., Mozilla's update feature).

The next speaker, Adam Shostack, Microsoft, discussed the economic side of computer security in his talk, New School of Information Security. Adam spoke about how there was a dearth of available data for making decisions about computer security, but that the growing body of "breach data" was improving the situation. Adam pointed to http://datalossdb.org/ as a good example of freely available breach data.

Terry Berg, US Attorney, described the pursuit and resolution of a high-profile case against the spammer, Alan Ralsky, in his talk, To Catch (and Prosecute) a Spammer. In brief, while technology was essential for both perpetrating and later solving the crime, the law enforcement team relied heavily on old-fashioned techniques such as cooperating witnesses to make its case.

The last speaker, Alex Halderman, University of Michigan, discussed a method of defeating secure disk storage through "cold boot" attacks in his talk, Cold-Boot Attacks Against Disk Encryption. It turns out that volatile RAM is not quite so volatile after all, and if one can sufficiently chill a memory chip, one can remove it from a victim PC, install it in a new machine, boot a minimal kernel, and then search the memory for the disk encryption key. Finding the key is easier than one may think because most encryption mechanisms maintain multiple derivatives of the key, which greatly facilitates its theft. The moral of the story is that one should always shut down a computer or laptop if it contains sensitive data and will be taken through an insecure location (e.g., airport).

Monday, October 19, 2009

Interoperability Between Institutional Data Repositories: a Pilot Project at MIT

Kate McNeill from MIT pointed me to this interesting paper from the IASSIST Quarterly: Interoperability Between Institutional Data Repositories: a Pilot Project at MIT. (PDF format)

As Kate mentioned to me, this paper describes a tool which transformed DDI-format XML into METS, and it would be well worth exploring if this tool could be used in some way to support a deliverable on our EAGER grant: a tool which transforms DDI-format XML into FOXML.

Fedora supports several ingest formats, including METS and its own native FOXML, and so if there is already a tool that generates METS, that would be a good starting point for a FOXML version. Further, an interesting science experiment would be to take DDI, transform it into both METS and FOXML, ingest both objects, and see whether they differ in any significant manner.

Friday, October 16, 2009

Back to the Fedora: Part 2

To go along with our survey data object, we'll also need a survey documentation object. We'll relate the objects via RDF in the RELS-EXT Datastream, and we'll also relate the documentation object to the higher-level, aggregate object, "social science study." The image to the left is clickable, and will take one to the "home page" for this Content Model object in the ICPSR Fedora test server.

Note that the name of this Content Model object is somewhat of a misnomer. Even though a common use-case is survey data, we may use the same type of object for other social science data that are not survey data, such as government-generated summary statistics about health, crime, demographics, or all sorts of other things.

The heart of the Content Model is in the DS-COMPOSITE-MODEL Datastream where we require a large number of Datastreams: a "setups" Datastream for each of the common statistical packages; a DDI XML Datastream that documents the associated survey data object; and a pair of Datastreams for the human-readable technical documentation (the "codebook"). A future refinement might be to replace the pair - one PDF, one TIFF - with a single Datastream which is both durable for preservation purposes and capable of rich display of information (PDF/A?).

At the right we have a data object that conforms to the Content Model object above. Of course, it contains all of the required Datastreams, most of which are stored as simple text files. The DDI is actually a very large bit of XML which is currently being stored in a separate file rather than as in-line XML (i.e., Control Group M rather than Control Group X in the FOXML).

The relationships in the RELS-EXT Datastream are congruent with those in the associated survey data object. Both assert a hasModel relationship to the applicable Content Model, and both assert an isMemberOf relationship to the higher-level object that "contains" them. Here, though, we use the isDescriptionOf relationship to show that this documentation object is a description of its related survey data object; in that object we asserted a hasDescription relationship to this object.

Of course, there is nothing preventing us from adding additional Datastreams to an object like this when they are available, such as unstructured notes from the original data collector. However, since that content isn't always available, we don't make it a required Datastream in the Content Model.

Clicking the image to the right will take one to its "home page" on the ICPSR Fedora test server. All of the Datastreams are identical to those on the ICPSR web site, except for the TIFF codebook and variable-level DDI, which we usually do not make available.

Wednesday, October 14, 2009

TRAC: C1.3: Managing all objects

C1.3 Repository manages the number and location of copies of all digital objects.

The repository system must be able to identify the number of copies of all stored digital objects, and the location of each object and their copies. This applies to what are intended to be identical copies, not versions of objects or copies. The location must be described such that the object can be located precisely, without ambiguity. It can be an absolute physical location or a logical location within a storage media or a storage subsystem. One way to test this would be to look at a particular object and ask how many copies there are, what they are stored on, and where they are. A repository can have different policies for different classes of objects, depending on factors such as the producer, the information type, or its value. Some repositories may have only one copy (excluding backups) of everything, stored in one place, though this is definitely not recommended. There may be additional identification requirements if the data integrity mechanisms use alternative copies to replace failed copies.

Evidence: random retrieval tests; system test; location register/log of digital objects compared to the expected number and location of copies of particular objects.

Our story here is a mixed bag of successes and barriers.

For the master copy of any object we can easily and quickly specify its location. And for the second (tape) copy, we can also easily specify the location as long as we're not too specific. For example, we can point to the tape library and say, "It's in there." And, of course, with a little more work, we can use our tape management system to point us to the specific tape, and the location on that tape. Maintaining this information outside of the tape management system would be expensive, and it's not clear if there would be any true benefit.

The location of other copies can be derived easily, but those specific locations are not recorded in a database. For example, let's say that the master copy of every original deposit we have is stored in a filesystem hierarchy like /archival-storage/deposits/deposit-id/. And let's say that on a daily basis we synchronize that content via rsync to an off-site location, say, remote-location.icpsr.umich.edu:/archival-storage/deposits/deposit-id/. And let's also say that someone reviews the output of the rsync run on a daily basis, and also performs a random spot-check on an irregular basis.

In this scenario we might have a large degree of confidence that we could find a copy of any given deposit on that off-site location. We know it's there because rsync told us it put it there. But we don't have a central catalog that says that deposit #1234 is stored under /archival-storage/deposits/1234, on tape, and at remote-location.icpsr.umich.edu:/archival-storage/deposits/1234. One could build exactly such a catalog, of course, and then create the process to keep it up to date, but would it have much value? What if all we did was tell a wrapper around rsync to capture the output and update the catalog?

Probably not.

And so if we interpret the TRAC requirement to build a location register to mean that we should have a complete, enumerated list of each and every copy, then we don't do so well here. But if we interpret the requirement to mean that we can find a copy by looking on a list (i.e., the catalog proper) or look at a rule (i.e., if the master copy is in location x, then two other copies can be found by applying functions f(x) and g(x)), then we're doing pretty well after all.
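The rule-based reading can be made concrete with a small sketch. It reuses the hypothetical paths and host name from the scenario above; the function names and the "tape-library:" label are made up for illustration, and the real tape location would still come from the tape management system.

```python
from pathlib import PurePosixPath

MASTER_ROOT = PurePosixPath("/archival-storage/deposits")
REMOTE_HOST = "remote-location.icpsr.umich.edu"  # host from the scenario above


def master_location(deposit_id):
    """x: where the master copy of a deposit lives."""
    return str(MASTER_ROOT / str(deposit_id)) + "/"


def rsync_copy_location(master_path):
    """f(x): the off-site copy mirrors the master path on the remote host."""
    return f"{REMOTE_HOST}:{master_path}"


def tape_copy_location(master_path):
    """g(x): the tape copy is known only at the library level here."""
    return f"tape-library:{master_path}"  # hypothetical label


def copy_locations(deposit_id):
    """All copies of a deposit, derived by rule rather than looked up."""
    x = master_location(deposit_id)
    return [x, rsync_copy_location(x), tape_copy_location(x)]
```

For deposit 1234 this yields the master path, the same path on the remote host, and a pointer to the tape library, with no catalog to keep up to date. The trade-off is that the rules are only as trustworthy as the synchronization reports behind them.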

Limitations in storage systems also add complexity. For instance, I was once looking at Amazon's S3 as a possible location for items in archival storage. But S3 doesn't let me have objects bigger than 5GB, and since I sometimes have very large files, this means that the record-keeping would be even more complicated. For an object with name X, you can find it in this S3 bucket, unless it is bigger than 5GB, in which case you need to look for N different objects and join them together. Ick.
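To show how the record-keeping gets more complicated, here is a sketch of the lookup rule for a size-limited store. The ".partNNNN" naming scheme and the function name are invented for illustration, and the byte value used for "5GB" is an assumption; nothing here calls the S3 API.

```python
S3_OBJECT_LIMIT = 5 * 1024**3  # "5GB" from the text; the exact byte count is assumed


def s3_keys_for(name, size_bytes, limit=S3_OBJECT_LIMIT):
    """Return the key(s) under which an object of the given size would be stored.

    Objects within the limit are stored under their own name; larger ones
    are split into N parts that must be joined back together on retrieval.
    """
    if size_bytes <= limit:
        return [name]
    parts = -(-size_bytes // limit)  # ceiling division without floats
    return [f"{name}.part{index:04d}" for index in range(parts)]
```

A 12GB object named X would therefore be looked up as three separate keys, which is exactly the extra rule one would prefer not to maintain.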

Monday, October 12, 2009

OR Meeting 2009 - Live Chat with Bryan Beecher and Nancy McGovern

Nancy McGovern and I co-hosted a "live chat" session at this year's meeting for Organizational Representatives (ORs). The video content of this is pretty light - just a few slides I put together to help generate discussion.

You can also find this session - and many more - on the ICPSR web site: http://www.icpsr.umich.edu/icpsrweb/ICPSR/or/ormeet/program/index.jsp.

Sunday, October 11, 2009

ICPSR Technology Job Posting - Senior Software Developer

Job ID:	34780
Job Title:	Software Developer Senior
Job/Career Family:	Information Technology
Job Description and Responsibilities:
Market Title: Software Developer Senior
Job/Career Family: Information Technology
FLSA: Exempt
Salary Range: $70,000 - $85,000 depending on qualifications and experience of selected candidate
Hours/Week: 40 Hours
Shift/Hours/Days: Regular Business

The Inter-university Consortium for Political and Social Research (ICPSR), established in 1962, is an integral part of the international infrastructure of social science research. ICPSR's unique combination of data resources, user support, and training in quantitative methods make it a vital resource for fostering inquiry and furthering the social sciences. ICPSR maintains and provides access to a vast archive of social science data for research and instruction. A unit within the Institute for Social Research at the University of Michigan, ICPSR is a membership-based organization, with over 600 member colleges and universities around the world. A Council of leading scholars and data professionals guides and oversees the activities of ICPSR.

ICPSR offers a work environment that is a combination of the best aspects of a small nonprofit or business, established within a university setting. ICPSR is small enough that each person can make a difference, yet large enough to offer a variety of career opportunities. We have a relaxed, collegial atmosphere that fosters communication and networking within and between departments. We are family-friendly, offering flexibility with work hours, and we have a diverse staff that enriches the workplace with their skills and experience. ICPSR offers a competitive total compensation package providing full access to the University of Michigan benefits. More information can be found about ICPSR at www.icpsr.umich.edu.

The ICPSR computing environment consists of Windows desktop workstations and UNIX servers. The desktop workstations run typical business applications such as Microsoft Office, but also run statistical software such as SAS and SPSS. The UNIX servers are based on the Intel/Linux platform and include Oracle databases, web server software such as Apache, and a number of other major systems (e.g., tomcat, cocoon).

Responsibilities: This position will be responsible for designing relational databases, developing ETL scripts, converting relational data to XML, writing XSLT scripts, configuring Solr/Lucene search indices and indexing jobs, specifying object-relational mapping (ORM) and caching strategies, and developing Java web applications. Additional activities will include coordination of software development activities with other ICPSR development projects; estimation of task level details and associated delivery timeframes; source code control and version management; release management and coordination with ICPSR staff; documentation production and management; training materials production and management; and, software support and trouble-shooting. Finally the person in this position will be expected to freshen, broaden, and deepen their professional and technical skills via regular participation in professional development activities such as training, seminars, and tutorials.

NOTE: Part of this job may require some work outside normal working hours to analyze and correct critical problems that arise in ICPSR's 24 hours per day operational environment.

Job Requirements:	Qualifications:
- Bachelor Degree in Computer Science or Computer Engineering, or the equivalent education and experience is required
- Masters Degree in Computer Science or Computer Engineering is desired
- 5 or more years of professional software development experience using Java / J2EE
- RDBMS vendor (Oracle, Microsoft, or MySQL) certification preferable
- Sun Java Developer certification preferable
- Extensive knowledge of XML and XSLT is required
- Linux systems usage; Windows XP or Vista usage, including common applications such as Word, Excel and Outlook
Department Name:	ICPSR
Org Group:	INST SOC RESEARCH
Campus:	Ann Arbor
Minimum Salary:	0
Maximum Salary:	0
Salary Frequency:	Annual
PTO:
Job Type:	Regular
Full Time:	Yes
Date Posted:	Oct 09 2009
Employee Referral Bonus:
Position Level:
City:	Ann Arbor
State/Province:	Michigan
Country:	United States of America
Postal Code:	48106
Area Code:	734

Friday, October 9, 2009

Back to the Fedora: Part 1

Now that the NSF EAGER grant has arrived, it's time to get restarted on Fedora. We'll start this iteration with a trio of Content Model objects, and kick it off with the first one in this post.

The first - displayed in a clickable, linked, visual format to the left - is a Content Model object for social science survey data. In addition to the objectProperties and the required Datastreams (AUDIT, DC, RELS-EXT), there is also the standard DS-COMPOSITE-MODEL Datastream found in Content Model objects.

For our purposes we'll require each object that purports to conform to a social science survey data object to have three required Datastreams: ORIGINAL, for original survey data that was supplied by the depositor; NORMALIZED, for a plain text version of the file that the repository prepares; and TRANSFORM, which is a record that describes how the ORIGINAL became the NORMALIZED. This last Datastream is typically constructed as an SPSS Setups file at ICPSR, and internally it is often referred to as the "processing history" file. It contains the roadmap of how to move between the two versions of the data.
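The conformance rule itself is simple enough to sketch outside of Fedora. This is not Fedora's validation mechanism or API, just a plain illustration of the check implied by the three required Datastreams; the function name and the example list of Datastream IDs are assumptions.

```python
REQUIRED_DATASTREAMS = {"ORIGINAL", "NORMALIZED", "TRANSFORM"}


def missing_datastreams(datastream_ids, required=REQUIRED_DATASTREAMS):
    """Return the required Datastream IDs that an object lacks, sorted.

    An empty list means the object has everything this Content Model asks
    for; extra Datastreams (DC, RELS-EXT, and so on) are simply ignored.
    """
    return sorted(set(required) - set(datastream_ids))


if __name__ == "__main__":
    # A hypothetical object that has not yet received its processing history.
    print(missing_datastreams(["DC", "RELS-EXT", "ORIGINAL", "NORMALIZED"]))
```

Ignoring extra Datastreams matches the intent here: the Content Model states a minimum, and an object is free to carry more.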

It may also be the case that we have other Datastreams, perhaps items that will only receive bitwise digital preservation, such as original deposits in SAS or SPSS format. And, in practice, we might want to use Fedora's XACML mechanism to restrict access to the ORIGINAL Datastream since it could contain confidential information.

To the right we have a sample Fedora data object that asserts conformance with our Content Model object above. Like the one above it is also clickable, and will take you to the Fedora repository server ICPSR is using for testing.

In addition to the hasModel relationship, this object also asserts that it is a member of a higher-level object (ICPSR Study 25041), and is described by another object (which we'll look at in the next post).

As required to validate against the Content Model, it has the three required Datastreams. In this particular case, rather than including the original data and processing history transform, I've simply copied the NORMALIZED Datastream content verbatim into the other two Datastreams.

Not shown in the schematic to the right are other possible, optional Datastreams we could include. For instance, it looks like this object was derived from a deposit that began its life at ICPSR as a SAS Transport file. It would certainly be possible to include that as another Datastream that would have value for a limited period of time. Or, another approach would be to collect the deposited items in their own set of Fedora objects, and then assert a relationship to them in the RELS-EXT section.

Next up in this series: the Content Model for technical documentation.

Thursday, October 8, 2009

Cold, Dark, and Lonely: An Archive Moves On-Line.

Carol Minton Morris of DuraSpace called me the other day with some good news: She told me that a short piece I wrote about Fedora and ICPSR would be published in their blog. The piece is called Cold, Dark, and Lonely: An Archive Moves On-Line.

While my colleagues at ICPSR have been alarmed by the title and suggested I seek immediate therapy for what must be an overwhelming foreboding of dread, the title was actually a poor riff on Thomas Friedman's Hot, Flat and Crowded tag. At least I think it was. (Maybe I should make that call after all.....)

Carol has also invited me to participate in the Sun/DuraSpace/SPARC webinar next week, All About Repositories. Should be a lot of fun!

Wednesday, October 7, 2009

ICPSR Job Posting in Technology - Cloud Computing Developer

We've posted the following position on the U-M employment site. (The site is just awful, but don't let that scare you off.)

We've listed it as a two-year appointment to match the NIH Challenge Grant, but we've had a lot of success keeping staff employed quite happily and busily by generating more and more grant activity.

Job ID:	34671
Job Title:	Cloud Computing Developer
Job/Career Family:	Information Technology
Country:	United States of America
State:	Michigan
City:	Ann Arbor
Job Type:	Regular
Full Time:	Yes
Date Posted:	Oct 07 2009
Minimum Salary:	0
Maximum Salary:	0
Salary Frequency:	Annual
Job Description and Responsibilities:

Market Title: Systems Analyst Senior
Working Title: Cloud Computing Developer
FLSA: Exempt
Salary Range: $70,000 - $80,000 depending on qualifications and experience of selected candidate
Hours/Week: 40 Hours
Shift/Hours/Days: Regular Business

Please note this is a two year term limited appointment.

The Inter-university Consortium for Political and Social Research (ICPSR), established in 1962, is an integral part of the international infrastructure of social science research. ICPSR's unique combination of data resources, user support, and training in quantitative methods make it a vital resource for fostering inquiry and furthering the social sciences. ICPSR maintains and provides access to a vast archive of social science data for research and instruction. A unit within the Institute for Social Research at the University of Michigan, ICPSR is a membership-based organization, with over 600 member colleges and universities around the world. A Council of leading scholars and data professionals guides and oversees the activities of ICPSR.

ICPSR offers a work environment that is a combination of the best aspects of a small nonprofit or business, established within a university setting. ICPSR is small enough that each person can make a difference, yet large enough to offer a variety of career opportunities. We have a relaxed, collegial atmosphere that fosters communication and networking within and between departments. We are family-friendly, offering flexibility with work hours, and we have a diverse staff that enriches the workplace with their skills and experience. ICPSR offers a competitive total compensation package providing full access to the University of Michigan benefits. More information can be found about ICPSR at www.icpsr.umich.edu.

The ICPSR computing environment consists of Windows desktop workstations and UNIX servers. The desktop workstations run typical business applications such as Microsoft Office, but also run statistical software such as SAS and SPSS. The UNIX servers are based on the Intel/Linux platform and include Oracle databases, World Wide Web server software such as Apache, and a number of other major systems (e.g., tomcat, cocoon).

Responsibilities

Build a prototype secure data computing environment using public utility computing (as provided by the Amazon Elastic Computing Cloud's EC2) at the Inter-university Consortium for Political and Social Research (ICPSR) that will provision an analytic computing instance that conforms to the underlying security requirements for data distributed under restricted use agreements and meets the analytic needs of end users and their research teams. Test the performance, security and usability of both the provisioning infrastructure and the analytic computing interface. Standard methods of testing system performance and security will be used as well as independent security assessments through white hat hacking.

This position reports to the Assistant Director, Computer & Network Services, ICPSR, but projects will be assigned and priorities designated by the Project Principal Investigator.

Note: Part of this job may require some work outside normal working hours to analyze and correct critical problems that arise in ICPSR's 24 hours per day operational environment.

Duties

- Works with technical staff to design, implement, and support cloud based and/or virtualized computing platforms for both internal and external users.
- Creates automated process requiring little manual input for the creation of virtualized computer instances, user accounts, and data access.
- Analyzes, proposes and designs the implementation of security interfaces in systems, applications, and network software.
- Participates in the evaluation of proposed systems, applications, and network software to determine security, data integrity, and usability implications. Assess risks to data security and identify countermeasures, plan and implement technologies.
- Provides third-level technical support for desktop and network systems, both virtualized and non-virtualized.

Job Requirements:

- Bachelor's degree in computer science, information systems, or equivalent combination of education and experience.
- Experience with Cloud computing and provisioning (preferably with Amazon Elastic Computing Cloud).
- 4+ years experience with collecting and documenting business requirements from users; then researching, designing, implementing, and supporting computing systems to meet those requirements.
- 5+ years experience and expertise with installing, configuring, and programming Windows software in both virtualized and non-virtualized settings.
- Experience and expertise in industry security training, such as SANS GIAC, or have work experience in security consulting or network security.
- Experience with social science concepts, social science data, and analysis methods, and statistical applications (SAS, SPSS, etc) preferred.
- Experience with Wise Package Studio or other MSI-packaging software preferred.
- Ability to explain complex technical concepts to non-technical users and stakeholders.
- Excellent customer service skills and customer-oriented focus.
- Attentiveness to detail.
- Excellent writing skills (writing samples will be required)
- Ability to work independently while meeting deadlines, communicating issues, and providing detailed project status updates.
- Ability to work within a team.

TRAC: C1.2: Backup infrastructure

C1.2 Repository ensures that it has adequate hardware and software support for backup functionality sufficient for the repository’s services and for the data held, e.g., metadata associated with access controls, repository main content.

The repository needs to be able to demonstrate the adequacy of the processes, hardware and software for its backup systems. Some will need much more elaborate backup plans than others.

Evidence: Documentation of what is being backed up and how often; audit log/inventory of backups; validation of completed backups; disaster recovery plan—policy and documentation; “firedrills”—testing of backups; support contracts for hardware and software for backup mechanisms.

ICPSR has extensive documentation and infrastructure to support its core access functions even when a catastrophic failure disables its primary location in Ann Arbor, Michigan. The documentation (planning documents and instructions) resides in a Google Group; all members of the IT team and two of ICPSR's senior staff outside of IT are members of the group. The process has been used twice in 2009, once as a test, and once when the Ann Arbor site suffered a power failure.

ICPSR has a less well documented, but fairly prosaic, backup solution in place. All non-ephemeral content at ICPSR resides on a large Network Attached Storage (NAS) appliance. The IT team has configured the NAS to "checkpoint" each filesystem once per day, and each checkpoint is retained for 30 days. Checkpoints provide a read-only, self-serve backup system for those instances where a member of the staff has inadvertently damaged or destroyed non-archival content.

Further, we write all filesystems to a tape library, which is located in a different machine room than the NAS. Every two weeks tapes are removed from the tape library and stored in yet another building. We retain the last four weekly backups and the last twelve monthly backups. The system is exercised on an infrequent but regular basis when we restore files that were damaged or destroyed beyond the thirty-day checkpoint window.
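The retention numbers above can be written down as a small sketch. This is not how the NAS or the tape library is actually configured; it is plain Python over lists of dates, the function names are hypothetical, and treating a checkpoint that is exactly 30 days old as still retained is my assumption.

```python
from datetime import date, timedelta

# Figures taken from the text; everything else here is illustrative.
CHECKPOINT_RETENTION_DAYS = 30
WEEKLY_TAPES_KEPT = 4
MONTHLY_TAPES_KEPT = 12


def live_checkpoints(checkpoint_dates, today):
    """Checkpoints still inside the retention window, newest first.

    Assumption: a checkpoint exactly 30 days old is still retained.
    """
    cutoff = today - timedelta(days=CHECKPOINT_RETENTION_DAYS)
    return sorted((d for d in checkpoint_dates if d >= cutoff), reverse=True)


def newest(backup_dates, keep):
    """The `keep` most recent backups, newest first."""
    return sorted(backup_dates, reverse=True)[:keep]


if __name__ == "__main__":
    today = date(2009, 10, 7)
    daily = [today - timedelta(days=n) for n in range(45)]
    weekly = [today - timedelta(weeks=n) for n in range(10)]
    print(len(live_checkpoints(daily, today)))
    print(newest(weekly, WEEKLY_TAPES_KEPT))
```

Anything that falls outside `live_checkpoints` is the case described above where a restore has to come from tape instead.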

Finally, unlike "working" files where all copies reside locally, and where we retain only one year of content, our archival storage solution consists of copies in at least four locations. The master copy (1) is on the NAS; a copy (2) is written to tape each month; a copy (3) is synchronized daily with the San Diego Supercomputer Center's storage grid; and, a copy (4) is synchronized daily with the MATRIX Center at Michigan State University. Furthermore, archival content collected prior to 2009 has also been copied into the Chronopolis project storage grid, which adds two additional copies.

One area with room for improvement would be regular "fire drills" where we would attempt to retrieve a random number of random objects from an arbitrarily selected archival storage location.
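A minimal sketch of what such a fire drill might look like follows. It assumes we keep a manifest of expected SHA-256 checksums per object and have some way to fetch an object from a named location; the `fetch` callable, the manifest layout, and the `fire_drill` function are all hypothetical, and the in-memory stores in the demo merely stand in for the real storage locations.

```python
import hashlib
import random


def fire_drill(manifest, locations, fetch, max_objects=5, rng=None):
    """Retrieve a random number of random objects from one random location.

    manifest: {object_id: expected SHA-256 hex digest} (assumed layout)
    locations: list of location names
    fetch: callable (location, object_id) -> bytes (hypothetical)
    Returns (location, sampled_ids, failures).
    """
    rng = rng or random.Random()
    location = rng.choice(locations)
    count = rng.randint(1, min(max_objects, len(manifest)))
    sample = rng.sample(sorted(manifest), count)
    failures = []
    for object_id in sample:
        try:
            data = fetch(location, object_id)
        except Exception as exc:  # any retrieval problem counts as a failure
            failures.append((object_id, f"retrieval failed: {exc!r}"))
            continue
        if hashlib.sha256(data).hexdigest() != manifest[object_id]:
            failures.append((object_id, "checksum mismatch"))
    return location, sample, failures


if __name__ == "__main__":
    # In-memory stand-ins for the archival copies named in the text.
    objects = {f"obj-{n}": f"content {n}".encode() for n in range(20)}
    manifest = {k: hashlib.sha256(v).hexdigest() for k, v in objects.items()}
    stores = {name: dict(objects) for name in ("NAS", "tape", "SDSC", "MATRIX")}
    location, sample, failures = fire_drill(
        manifest, list(stores), lambda loc, oid: stores[loc][oid]
    )
    print(location, len(sample), failures)
```

Passing a seeded `random.Random` as `rng` makes a drill repeatable, which helps when a failure needs to be investigated afterwards.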