Friday, October 30, 2009

Back to the Fedora: Part 3


This is the penultimate post in this series. The final post will describe an aggregate object ("study") that will contain relatively little content, but which serves as a grouping element for more basic elements.

The object to the left is a conventional Fedora Data Object, but I include it here as an example where we have important content to preserve and deliver, and where the content is somewhat of a "one off" and doesn't conform to a unique Content Model.

In this case we have the survey instrument that was used to collect the data in icpsr:eager-survey-data-25041.

The instrument is available in two languages (English and Spanish), and while the original deposit was in PDF format, we have also produced a TIFF version of each for preservation purposes. This translates into a simple object with four Datastreams, one for each (language, format) combination.

We assert membership in the aggregate "study" object in RELS-EXT. We also assert a connection to the associated dataset using a custom relationship we minted: isInstrumentFor. It isn't clear (yet) whether having a specialized relationship such as this will be any more useful than a less descriptive relationship (e.g., isRelatedTo, to make one up).
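To make that concrete, here is a minimal sketch of what this object's RELS-EXT might contain. The instrument and study PIDs and the ICPSR relationship namespace URI are hypothetical placeholders; only the data object's PID and the isInstrumentFor name come from the description above.

    # Sketch of the RELS-EXT RDF for the survey instrument object. The instrument
    # and study PIDs and the ICPSR relationship namespace URI are hypothetical;
    # only icpsr:eager-survey-data-25041 and isInstrumentFor come from the post.
    INSTRUMENT_PID = "icpsr:eager-survey-instrument-25041"   # hypothetical
    STUDY_PID = "icpsr:eager-study-25041"                    # hypothetical
    DATA_PID = "icpsr:eager-survey-data-25041"

    RELS_EXT = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rel="info:fedora/fedora-system:def/relations-external#"
             xmlns:icpsr="http://www.icpsr.umich.edu/relations#">
      <rdf:Description rdf:about="info:fedora/%s">
        <rel:isMemberOf rdf:resource="info:fedora/%s"/>
        <icpsr:isInstrumentFor rdf:resource="info:fedora/%s"/>
      </rdf:Description>
    </rdf:RDF>""" % (INSTRUMENT_PID, STUDY_PID, DATA_PID)

    print(RELS_EXT)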

Wednesday, October 28, 2009

TRAC: C1.5: Detecting corruption and loss

C1.5 Repository has effective mechanisms to detect bit corruption or loss.

The repository must detect data loss accurately to ensure that any losses fall within the tolerances established by policy (see A3.6). Data losses must be detected and detectable regardless of the source of the loss. This applies to all forms and scope of data corruption, including missing objects and corrupt or incorrect or imposter objects, corruption within an object, and copying errors during data migration or synchronization of copies. Ideally, the repository will demonstrate that it has all the AIPs it is supposed to have and no others, and that they and their metadata are uncorrupted.

The approach must be documented and justified and include mechanisms for mitigating such common hazards as hardware failure, human error, and malicious action. Repositories that use well-recognized mechanisms such as MD5 signatures need only recognize their effectiveness and role within the overall approach. But to the extent the repository relies on homegrown schemes, it must provide convincing justification that data loss and corruption are detected within the tolerances established by policy.

Data losses must be detected promptly enough that routine systemic sources of failure, such as hardware failures, are unlikely to accumulate and cause data loss beyond the tolerances established by the repository’s policy or specified in any relevant deposit agreement. For example, consider a repository that maintains a collection on identical primary and backup copies with no other data redundancy mechanism. If the media of the two copies have a measured failure rate of 1% per year and failures are independent, then there is a 0.01% chance that both copies will fail in the same year. If a repository’s policy limits loss to no more than 0.001% of the collection per year, with a goal of course of losing 0%, then the repository would need to confirm media integrity at least every 72 days to achieve an average time to recover of 36 days, or about one tenth of a year. This simplified example illustrates the kind of issues a repository should consider, but the objective is a comprehensive treatment of the sources of data loss and their real-world complexity. Any data that is (temporarily) lost should be recoverable from backups.
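As an aside, the arithmetic in TRAC's example works out as follows; this is just a quick sketch using the figures quoted above.

    # Sketch of the arithmetic in the example above (figures taken from the text).
    copy_failure_rate = 0.01                 # 1% annual failure rate per copy
    both_fail = copy_failure_rate ** 2       # independent failures
    print(both_fail)                         # 0.0001, i.e., a 0.01% chance per year

    tolerance = 0.00001                      # policy limit: 0.001% loss per year
    # If a failed copy goes unrepaired for an average of T years, the rough chance
    # of losing an object is 0.01 * (0.01 * T); keeping that under the tolerance
    # gives T <= 0.1 year (about 36 days on average), i.e., a check at least every
    # ~72 days.
    mean_time_to_recover = tolerance / (copy_failure_rate ** 2)
    print(mean_time_to_recover * 365)        # ~36.5 days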

Evidence: Documents that specify bit error detection and correction mechanisms used; risk analysis; error reports; threat analyses.



For each object in Archival Storage, ICPSR computes an MD5 hash. This "fingerprint" is then stored as metadata for each object.

Automated jobs "prowl" Archival Storage on a regular basis, computing the current MD5 hash for each object and comparing it to the stored version. When the hashes differ, an exception is generated, and this information is reported to the appropriate staff for diagnosis and correction.
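The checking job itself can be quite small. The sketch below is not ICPSR's actual code; it assumes the stored hashes are available as a simple path-to-MD5 mapping and that Archival Storage is an ordinary filesystem tree.

    import hashlib
    import os

    def md5_of(path, chunk_size=1 << 20):
        """Compute the MD5 hash of a file, reading it in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def check_fixity(archival_root, stored_hashes):
        """Compare current MD5 hashes against the stored values; return mismatches."""
        exceptions = []
        for dirpath, _dirs, files in os.walk(archival_root):
            for name in files:
                path = os.path.join(dirpath, name)
                expected = stored_hashes.get(path)
                if expected is not None and md5_of(path) != expected:
                    exceptions.append(path)
        return exceptions   # anything here becomes an exception report for staff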

In practice we see very few exceptions such as these, and the most common cause is a blend of human error and software failing to handle the error gracefully.

Recovery is quick. If the problem was caused by human error and the ctime (inode change) timestamp has changed, then any copies managed via rsync may also be damaged, and we instead need to fetch the original object from a different source (e.g., tape or a copy managed via SRB's Srsync). If the problem occurred without the ctime changing, then we also have the option of fetching an original copy from one of our rsync-managed copies.

Tuesday, October 27, 2009

Exciting News from Amazon


Amazon announced three new offerings in their cloud platform today. All sound very interesting, and all have potential utility to ICPSR.

One, Amazon now offers a bona fide relational database (MySQL-type) in the cloud. They handle the patching, scaling, and other classic DBA functions; you provide the data. We use Oracle heavily today, but make little use of Oracle-only features.

Two, they are now offering "high-memory" instances:
  1. High-Memory Double Extra Large Instance: 34.2 GB of memory, 13 EC2 Compute Units (4 virtual cores with 3.25 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform ($1.20/hour)
  2. High-Memory Quadruple Extra Large Instance: 68.4 GB of memory, 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform ($2.40/hour)

Three, they are dropping the price of "on-demand" instances by 15% effective Nov 1. We've switched to reserved instances for some of our long-lived virtual systems, but we still have a handful of on-demand systems, and so this will have an immediate positive impact on our monthly bill.

Definitely a nice "treat" from Amazon this Halloween!

Thursday, October 22, 2009

How To Lose a Customer

I visited the web site of a major domain registrar this afternoon, logged in, and saw that ICPSR had zero domains registered with them.

I smiled.

It wasn't always this way. Just a few months ago I registered seven new domains with this company to support our project to build and host a National Science Digital Library Pathway for quantitative social science. These seven domains - teachingwithdata.net is one - joined dozens of others I had registered with them over the years. We were a pretty good customer.

Now, the domain registration game has always seemed like a scam to me. Why it costs $20 or more per year for someone to take information that I enter into a web form and hand it off to the registries and DNS root operators, I cannot fathom. Surely this is a business where the profit margins are unconscionably high. And yet I was OK with giving them hundreds of dollars every year for the privilege of entering registration information into their web site.

But then they broke their end of the promise.

They may not have known it, but by charging me these hundreds of dollars and forcing me to use their web site to manage my information, they were establishing a de facto promise: "We will take your money, we will give you poor tools, but in return, we will cause you no harm."

And then they did.

A software developer on my team noticed that the recently registered NSDL domains weren't working. Instead of the root DNS servers delegating the domains to us, they were still listed with the registrar's DNS servers. At first I thought that I had screwed up. The tools are pretty bad, and it was certainly possible that as I was attempting to avoid all of the "upgrades" I was being offered ("Private registrations!"), I had neglected to click the right series of icons and links to delegate the domains. And so I went back to them and delegated the domains again.

But, by the next morning, my changes had been discarded. Silently.

I tried again. And again, my changes appeared to work, but later were discarded without notice.

I opened up a trouble ticket. I received an auto-reply, and then a follow-up that (1) closed the ticket, and (2) gave me the URL of a web site that I could use to open a trouble ticket. Nice.

And so I did what any reasonable consumer would do: I changed vendors.

To their credit, the registrar performed at its very best as I transferred domains away. Sure, the tools were still just as poor, but when they didn't work, they helped me out. No valid Administrative Contact listed in WHOIS despite one being listed with the registrar? No apparent way to fix it? No problem; they solved it in three days. Within a week or two I had transferred away all of our domains.

My new registrar is the University of Michigan, which acts as a front-end for Tucows. UMich doesn't make me use any awful web forms, and they even answer the phone when I call. And they don't charge any more than the former registrar.

It's enough to make me smile again.

Wednesday, October 21, 2009

TRAC: C1.4: Synchronizing objects

C1.4 Repository has mechanisms in place to ensure any/multiple copies of digital objects are synchronized.

If multiple copies exist, there has to be some way to ensure that intentional changes to an object are propagated to all copies of the object. There must be an element of timeliness to this. It must be possible to know when the synchronization has completed, and ideally to have some estimate beforehand as to how long it will take. Depending whether it is automated or requires manual action (such as the retrieval of copies from off-site storage), the time involved may be seconds or weeks. The duration itself is immaterial—what is important is that there is understanding of how long it will take. There must also be something that addresses what happens while the synchronization is in progress. This has an impact on disaster recovery: what happens if a disaster and an update coincide? If one copy of an object is altered and a disaster occurs while other copies are being updated, it is essential to be able to ensure later that the update is successfully propagated.

Evidence: Workflows; system analysis of how long it takes for copies to synchronize; procedures/documentation of operating procedures related to updates and copy synchronization; procedures/documentation related to whether changes lead to the creation of new copies and how those copies are propagated and/or linked to previous versions.



I think we have a good story to tell.

As new objects enter Archival Storage at ICPSR, they reside in a well-known, special-purpose location. Automated, regularly scheduled system jobs synchronize those objects with remote locations using standard, established tools such as rsync, along with less common tools such as the Storage Resource Broker (SRB) command-line utility Srsync.

The output of these system jobs is captured and delivered nightly to a shared electronic mailbox. The mailbox is reviewed on a daily basis; this task belongs to the member of the ICPSR IT team who is currently on-call. When a report is missing or when a report indicates an error, the problem is escalated to someone who can diagnose and correct the problem. One common problem, for example, occurs when an object larger than 2GB enters Archival Storage and the SRB Srsync utility faults. (SRB limits objects to 2GB.) We then remove this object from the list of items to be synchronized with SRB.
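As an illustration only (the real jobs presumably look quite different, and every name below is hypothetical), a wrapper of this shape captures the rsync output for the nightly report and filters out objects that exceed SRB's 2GB limit before Srsync is invoked.

    import os
    import subprocess

    TWO_GB = 2 * 1024**3   # SRB's per-object size limit

    def sync_to_remote(source, rsync_target, report_lines):
        """Mirror new archival content to an rsync-managed remote copy,
        capturing the output for the nightly report."""
        result = subprocess.run(["rsync", "-av", source, rsync_target],
                                capture_output=True, text=True)
        report_lines.append(result.stdout)
        if result.returncode != 0:
            report_lines.append("ERROR: rsync exited %d\n%s"
                                % (result.returncode, result.stderr))

    def srb_candidates(paths):
        """Drop objects larger than SRB's 2GB limit from the Srsync list."""
        return [p for p in paths if os.path.getsize(p) <= TWO_GB]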

Because the synchronization process is incremental, it has a very short duration. However, if we needed to synchronize ALL content, it would take on the order of days or even weeks. For example, we recently synchronized a copy of our Access holdings to a computing instance residing in Amazon's EC2 EU-West region, and we found it took approximately one week to copy about 500GB. As another example, we recently synchronized a copy of our Archival Storage (which is much larger than the Access collection) to a system which, like ICPSR and the University of Michigan, is connected to Internet2's Abilene network, and that took far less time.

SUMIT 09 - Annual UMich IT Security Symposium

I attended a very interesting symposium at UMich on Tuesday. It's an annual event called SUMIT, and the focus is on IT-related security. The event includes a series of speakers who have interesting stories to tell, and this year was no exception.

I arrived rather late to the event, and only caught the final part of what appeared to be a very interesting talk by Wade Baker of Verizon Business Security Solutions: Cybercrime: The Actors, Their Actions, and What They're After. Wade's experience has been that data loss often goes undiscovered for five or six months, and often only comes to light when that data is used to commit a crime, such as fraud. His sense is that targets are often repositories of information rather than individual systems (e.g., credit companies v. a home PC with information about only a single credit card). He went on to say that most organizations do not know where most of their sensitive data is located; they'll believe that it is located only in areas X and Y, but then discover that someone made a copy in area Z as well. When asked by the audience what single activity is most effective at increasing data security, Wade suggested audits: organizations often have adequate security policies in place, but all too often they are not followed or enforced, and an audit will reveal this.

The second speaker, Moxie Marlinspike, Institute of Disruptive Technologies, gave a very, very interesting talk entitled Some Tricks for Defeating SSL in Practice. Moxie gave a detailed and clear explanation of a tool he created, sslsniff, and how it can be used in a man-in-the-middle attack to hijack a supposedly secure web connection using SSL. Further, by taking advantage of weak integrity checking by both certificate authorities and certificate-handling software, he demonstrated how one can obtain a "wildcard cert" that allows one to spoof many different web sites. And, as if that isn't scary enough, he also demonstrated how this allows one to inject software onto a machine via automated software-update jobs (e.g., Mozilla's update feature).

The next speaker, Adam Shostack, Microsoft, discussed the economic side of computer security in his talk, New School of Information Security. Adam spoke about how there was a dearth of available data for making decisions about computer security, but that the growing body of "breach data" was improving the situation. Adam pointed to http://datalossdb.org/ as a good example of freely available breach data.

Terry Berg, US Attorney, described the pursuit and resolution of a high-profile case against the spammer, Alan Ralsky, in his talk, To Catch (and Prosecute) a Spammer. In brief, while technology was essential both in perpetrating and later in solving the crime, the law enforcement team relied heavily on old-fashioned techniques such as cooperating witnesses to make its case.

The last speaker, Alex Halderman, University of Michigan, discussed a method of defeating secure disk storage through "cold boot" attacks in his talk, Cold-Boot Attacks Against Disk Encryption. It turns out that volatile RAM is not quite so volatile after all, and if one can sufficiently chill a memory chip, one can remove it from a victim PC, install it in a new machine, boot a minimal kernel, and then search the memory for the disk encryption key. Finding the key is easier than one may think because most encryption mechanisms maintain multiple derivatives of the key, which greatly facilitates its theft. The moral of the story is that one should always shut down a computer or laptop that contains sensitive data before taking it through an insecure location (e.g., an airport).

Monday, October 19, 2009

Interoperability Between Institutional Data Repositories: a Pilot Project at MIT

Kate McNeill from MIT pointed me to this interesting paper from the IASSIST Quarterly: Interoperability Between Institutional Data Repositories: a Pilot Project at MIT. (PDF format)

As Kate mentioned to me, this paper describes a tool that transformed DDI-format XML into METS, and it would be well worth exploring whether this tool could be used in some way to support a deliverable on our EAGER grant: a tool that transforms DDI-format XML into FOXML.

Fedora supports several ingest formats, including METS and its own native FOXML, and so if there is already a tool that generates METS, that would be a good starting point for a FOXML version. Further, an interesting science experiment would be to take DDI, transform it into both METS and FOXML, ingest both objects, and see if they differ in any significant manner.
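Mechanically, the experiment boils down to applying two stylesheets to the same DDI document and comparing the results. A minimal sketch, assuming DDI-to-METS and DDI-to-FOXML stylesheets are in hand (the file names are placeholders):

    from lxml import etree

    def transform(ddi_path, xslt_path):
        """Apply an XSLT stylesheet to a DDI document and return the result tree."""
        ddi = etree.parse(ddi_path)
        stylesheet = etree.XSLT(etree.parse(xslt_path))
        return stylesheet(ddi)

    # Hypothetical usage; the stylesheet and DDI file names are placeholders.
    # mets  = transform("study-25041-ddi.xml", "ddi-to-mets.xsl")
    # foxml = transform("study-25041-ddi.xml", "ddi-to-foxml.xsl")
    # print(etree.tostring(foxml, pretty_print=True).decode())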

Friday, October 16, 2009

Back to the Fedora: Part 2


To go along with our survey data object, we'll also need a survey documentation object. We'll relate the objects via RDF in the RELS-EXT Datastream, and we'll also relate the documentation object to the higher-level, aggregate object, "social science study." The image to the left is clickable, and will take one to the "home page" for this Content Model object in the ICPSR Fedora test server.

Note that the name of this Content Model object is somewhat of a misnomer. Even though a common use-case is survey data, we may use the same type of object for other social science data that are not survey data, such as government-generated summary statistics about health, crime, demographics, or all sorts of other things.

The heart of the Content Model is in the DS-COMPOSITE-MODEL Datastream, where we require a large number of Datastreams: a "setups" Datastream for each of the common statistical packages; a DDI XML Datastream that documents the associated survey data object; and a pair of Datastreams for the human-readable technical documentation (the "codebook"). A future refinement might be to replace the pair - one PDF, one TIFF - with a single Datastream that is both durable for preservation purposes and allows the rich display of information (PDF/A?).
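For readers who haven't seen one, a DS-COMPOSITE-MODEL is roughly an XML list of required Datastream IDs and their permitted formats. The sketch below is illustrative only; the Datastream IDs and MIME types are invented for this post, not taken from our actual Content Model.

    # Illustrative DS-COMPOSITE-MODEL content; Datastream IDs and MIME types
    # are invented for this sketch, not taken from the ICPSR Content Model.
    DS_COMPOSITE_MODEL = """\
    <dsCompositeModel xmlns="info:fedora/fedora-system:def/dsCompositeModel#">
      <dsTypeModel ID="SETUP-SAS"><form MIME="text/plain"/></dsTypeModel>
      <dsTypeModel ID="SETUP-SPSS"><form MIME="text/plain"/></dsTypeModel>
      <dsTypeModel ID="SETUP-STATA"><form MIME="text/plain"/></dsTypeModel>
      <dsTypeModel ID="DDI"><form MIME="text/xml"/></dsTypeModel>
      <dsTypeModel ID="CODEBOOK-PDF"><form MIME="application/pdf"/></dsTypeModel>
      <dsTypeModel ID="CODEBOOK-TIFF"><form MIME="image/tiff"/></dsTypeModel>
    </dsCompositeModel>"""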



At the right we have a data object that conforms to the Content Model object above. Of course, it contains all of the required Datastreams, most of which are stored as simple text files. The DDI is actually a very large bit of XML which is currently being stored in a separate file rather than as in-line XML (i.e., Control Group M rather than Control Group X in the FOXML).

The relationships in the RELS-EXT Datastream are congruent with those in the associated survey data object. Both assert a hasModel relationship to the applicable Content Model, and both assert an isMemberOf relationship to the higher-level object that "contains" them. Here, though, we use the isDescriptionOf relationship to show that this documentation object is a description of its related survey data object; in that object we asserted a hasDescription relationship back to this one.

Of course, there is nothing preventing us from adding additional Datastreams to an object like this when they are available, such as unstructured notes from the original data collector. However, since that content isn't always available, we don't make it a required Datastream in the Content Model.

Clicking the image to the right will take one to its "home page" on the ICPSR Fedora test server. All of the Datastreams are identical to those on the ICPSR web site, except for the TIFF codebook and variable-level DDI, which we usually do not make available.

Wednesday, October 14, 2009

TRAC: C1.3: Managing all objects

C1.3 Repository manages the number and location of copies of all digital objects.

The repository system must be able to identify the number of copies of all stored digital objects, and the location of each object and their copies. This applies to what are intended to be identical copies, not versions of objects or copies. The location must be described such that the object can be located precisely, without ambiguity. It can be an absolute physical location or a logical location within a storage media or a storage subsystem. One way to test this would be to look at a particular object and ask how many copies there are, what they are stored on, and where they are. A repository can have different policies for different classes of objects, depending on factors such as the producer, the information type, or its value. Some repositories may have only one copy (excluding backups) of everything, stored in one place, though this is definitely not recommended. There may be additional identification requirements if the data integrity mechanisms use alternative copies to replace failed copies.

Evidence: random retrieval tests; system test; location register/log of digital objects compared to the expected number and location of copies of particular objects.



Our story here is a mixed bag of successes and barriers.

For the master copy of any object we can easily and quickly specify its location. And for the second (tape) copy, we can also easily specify the location, as long as we're not too specific. For example, we can point to the tape library and say, "It's in there." And, of course, with a little more work, we can use our tape management system to point us to the specific tape, and the location on that tape. Maintaining this information outside of the tape management system would be expensive, and it's not clear whether there would be any true benefit.

The location of other copies can be derived easily, but those specific locations are not recorded in a database. For example, let's say that the master copy of every original deposit we have is stored in a filesystem hierarchy like /archival-storage/deposits/deposit-id/. And let's say that on a daily basis we synchronize that content via rsync to an off-site location, say, remote-location.icpsr.umich.edu:/archival-storage/deposits/deposit-id/. And let's also say that someone reviews the output of the rsync run on a daily basis, and also performs a random spot-check on an irregular basis.

In this scenario we might have a large degree of confidence that we could find a copy of any given deposit on that off-site location. We know it's there because rsync told us it put it there. But we don't have a central catalog that says that deposit #1234 is stored under /archival-storage/deposits/1234, on tape, and at remote-site.icpsr.umich.edu/archival-storage/deposits/1234. One could build exactly such a catalog, of course, and then create the process to keep it up to date, but would it have much value? What if all we did was tell a wrapper around rsync to capture the output and update the catalog?

Probably not.

And so if we interpret the TRAC requirement to build a location register to mean that we should have a complete, enumerated list of each and every copy, then we don't do so well here. But if we interpret the requirement to mean that we can find a copy by looking on a list (i.e., the catalog proper) or look at a rule (i.e., if the master copy is in location x, then two other copies can be found by applying functions f(x) and g(x)), then we're doing pretty well after all.
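In code, the "rule" interpretation amounts to a couple of trivial functions. This sketch just restates the hypothetical layout from the paragraphs above; the paths and host name are the made-up examples, not our real layout.

    # Derive copy locations from the master location (hypothetical layout).
    MASTER_ROOT = "/archival-storage/deposits"
    REMOTE_HOST = "remote-location.icpsr.umich.edu"

    def master_location(deposit_id):
        return "%s/%s/" % (MASTER_ROOT, deposit_id)

    def rsync_copy_location(deposit_id):       # f(x): the off-site rsync mirror
        return "%s:%s/%s/" % (REMOTE_HOST, MASTER_ROOT, deposit_id)

    def tape_copy_location(deposit_id):        # g(x): delegate to the tape system
        return "tape library: look up deposit %s in the tape management system" % deposit_id

    print(master_location(1234))
    print(rsync_copy_location(1234))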

Limitations in storage systems also add complexity. For instance, I was once looking at Amazon's S3 as a possible location for items in archival storage. But S3 doesn't let me have objects bigger than 5GB, and since I sometimes have very large files, this means that the record-keeping would be even more complicated. For an object with name X, you can find it in this S3 bucket, unless it is bigger than 5GB, in which case you need to look for N different objects and join them together. Ick.
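The record-keeping wrinkle looks something like this sketch; the .part- naming convention is invented for illustration.

    import math

    S3_OBJECT_LIMIT = 5 * 1024**3   # S3's per-object size limit (at the time)

    def s3_keys_for(name, size_in_bytes):
        """Derive the S3 key(s) that would hold an object of the given size."""
        if size_in_bytes <= S3_OBJECT_LIMIT:
            return [name]
        n = math.ceil(size_in_bytes / S3_OBJECT_LIMIT)
        return ["%s.part-%04d" % (name, i) for i in range(n)]   # joined on retrieval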

Monday, October 12, 2009

OR Meeting 2009 - Live Chat with Bryan Beecher and Nancy McGovern



Nancy McGovern and I co-hosted a "live chat" session at this year's meeting for Organizational Representatives (ORs). The video content of this is pretty light - just a few slides I put together to help generate discussion.

You can also find this session - and many more - on the ICPSR web site: http://www.icpsr.umich.edu/icpsrweb/ICPSR/or/ormeet/program/index.jsp.

Sunday, October 11, 2009

ICPSR Technology Job Posting - Senior Software Developer

Job ID: 34780
Job Title: Software Developer Senior
Job/Career Family: Information Technology
Job Description and Responsibilities: Market Title: Software Developer Senior
Job/Career Family: Information Technology
FLSA: Exempt
Salary Range: $70,000 - $85,000 depending on qualifications and experience of selected candidate
Hours/Week: 40 Hours
Shift/Hours/Days: Regular Business


The Inter-university Consortium for Political and Social Research (ICPSR), established in 1962, is an integral part of the international infrastructure of social science research. ICPSR's unique combination of data resources, user support, and training in quantitative methods make it a vital resource for fostering inquiry and furthering the social sciences. ICPSR maintains and provides access to a vast archive of social science data for research and instruction. A unit within the Institute for Social Research at the University of Michigan, ICPSR is a membership-based organization, with over 600 member colleges and universities around the world. A Council of leading scholars and data professionals guides and oversees the activities of ICPSR.

ICPSR offers a work environment that is a combination of the best aspects of a small nonprofit or business, established within a university setting. ICPSR is small enough that each person can make a difference, yet large enough to offer a variety of career opportunities. We have a relaxed, collegial atmosphere that fosters communication and networking within and between departments. We are family-friendly, offering flexibility with work hours, and we have a diverse staff that enriches the workplace with their skills and experience. ICPSR offers a competitive total compensation package providing full access to the University of Michigan benefits. More information can be found about ICPSR at www.icpsr.umich.edu. The ICPSR computing environment consists of Windows desktop workstations and UNIX servers. The desktop workstations run typical business applications such as Microsoft Office, but also run statistical software such as SAS and SPSS. The UNIX servers are based on the Intel/Linux platform and include Oracle databases, web server software such as Apache, and a number of other major systems (e.g., tomcat, cocoon).

Responsibilities:
This position will be responsible for designing relational databases, developing ETL scripts, converting relational data to XML, writing XSLT scripts, configuring Solr/Lucene search indices and indexing jobs, specifying object-relational mapping (ORM) and caching strategies, and developing Java web applications. Additional activities will include coordination of software development activities with other ICPSR development projects; estimation of task level details and associated delivery timeframes; source code control and version management; release management and coordination with ICPSR staff; documentation production and management; training materials production and management; and, software support and trouble-shooting. Finally the person in this position will be expected to freshen, broaden, and deepen their professional and technical skills via regular participation in professional development activities such as training, seminars, and tutorials.

NOTE: Part of this job may require some work outside normal working hours to analyze and correct critical problems that arise in ICPSR's 24 hours per day operational environment.
Job Requirements: Qualifications:

-Bachelor Degree in Computer Science or Computer Engineering, or the equivalent education and experience is required
-Masters Degree in Computer Science or Computer Engineering is desired
-5 or more years of professional software development experience using Java / J2EE
-RDBMS vendor (Oracle, Microsoft, or MySQL) certification preferable
-Sun Java Developer certification preferable
-Extensive knowledge of XML and XSLT is required
-Linux systems usage; Windows XP or Vista usage, including common applications such as Word, Excel and Outlook
Department Name: ICPSR
Org Group: INST SOC RESEARCH
Campus: Ann Arbor
Minimum Salary: 0
Maximum Salary: 0
Salary Frequency: Annual
PTO:
Job Type: Regular
Full Time: Yes
Date Posted: Oct 09 2009
Employee Referral Bonus:
Position Level:
City: Ann Arbor
State/Province: Michigan
Country: United States of America
Postal Code: 48106
Area Code: 734

Friday, October 9, 2009

Back to the Fedora: Part 1

Now that the NSF EAGER grant has arrived, it's time to get restarted on Fedora. We'll start this iteration with a trio of Content Model objects, and kick it off with the first one in this post.

The first - displayed in a clickable, linked, visual format to the left - is a Content Model object for social science survey data. In addition to the objectProperties and the required Datastreams (AUDIT, DC, RELS-EXT), there is also the standard DS-COMPOSITE-MODEL Datastream found in Content Model objects.

For our purposes we'll require each object that purports to conform to this Content Model to have three required Datastreams: ORIGINAL, for the original survey data that was supplied by the depositor; NORMALIZED, for a plain text version of the file that the repository prepares; and TRANSFORM, a record that describes how the ORIGINAL became the NORMALIZED. This last Datastream is typically constructed as an SPSS Setups file at ICPSR, and internally it is often referred to as the "processing history" file. It contains the roadmap of how to move between the two versions of the data.
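Fedora's Content Model Architecture is what actually enforces conformance, but the gist of the requirement fits in a few lines. This is only a sketch of the idea, not how the repository checks it.

    REQUIRED = {"ORIGINAL", "NORMALIZED", "TRANSFORM"}

    def conforms(datastream_ids):
        """Check that a data object carries the Datastreams this Content Model requires."""
        missing = REQUIRED - set(datastream_ids)
        return (not missing), sorted(missing)

    # An object with only a normalized file would fail the check:
    print(conforms(["DC", "RELS-EXT", "NORMALIZED"]))
    # -> (False, ['ORIGINAL', 'TRANSFORM'])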

It may also be the case that we have other Datastreams, perhaps items that will only receive bitwise digital preservation, such as original deposits in SAS or SPSS format. And, in practice, we might want to use Fedora's XACML mechanism to restrict access to the ORIGINAL Datastream since it could contain confidential information.



To the right we have a sample Fedora data object that asserts conformance with our Content Model object above. Like the one above it is also clickable, and will take you to the Fedora repository server ICPSR is using for testing.

In addition to the hasModel relationship, this object also asserts that it is a member of a higher-level object (ICPSR Study 25041), and is described by another object (which we'll look at in the next post).

As required to validate against the Content Model, it has the three required Datastreams. In this particular case, rather than including the original data and processing history transform, I've simply copied the NORMALIZED Datastream content verbatim into the other two Datastreams.

Not shown in the schematic to the right are other possible, optional Datastreams we could include. For instance, it looks like this object was derived from a deposit that began its life at ICPSR as a SAS Transport file. It would certainly be possible to include that as another Datastream that would have value for a limited period of time. Or, another approach would be to collect the deposited items in their own set of Fedora objects, and then assert a relationship to them in the RELS-EXT section.

Next up in this series: the Content Model for technical documentation.

Thursday, October 8, 2009

Cold, Dark, and Lonely: An Archive Moves On-Line.

Carol Minton Morris of DuraSpace called me the other day with some good news: She told me that a short piece I wrote about Fedora and ICPSR would be published in their blog. The piece is called Cold, Dark, and Lonely: An Archive Moves On-Line.

While my colleagues at ICPSR have been alarmed by the title and suggested I seek immediate therapy for what must be an overwhelming foreboding of dread, the title was actually a poor riff on Thomas Friedman's Hot, Flat, and Crowded tag. At least I think it was. (Maybe I should make that call after all.....)

Carol has also invited me to participate in the Sun/DuraSpace/SPARC webinar next week, All About Repositories. Should be a lot of fun!

Wednesday, October 7, 2009

ICPSR Job Posting in Technology - Cloud Computing Developer

We've posted the following position on the U-M employment site. (The site is just awful, but don't let that scare you off.)

We've listed it as a two year appointment to match the NIH Challenge Grant, but we've had a lot of success keeping staff employed quite happily and busily by generating more and more grant activity.


Job ID: 34671
Job Title: Cloud Computing Developer
Job/Career Family: Information Technology
Country: United States of America
State: Michigan
City: Ann Arbor
Job Type: Regular
Full Time: Yes
Date Posted: Oct 07 2009
Minimum Salary: 0
Maximum Salary: 0
Salary Frequency: Annual
Job Description
and Responsibilities:
Market Title: Systems Analyst Senior
Working Title: Cloud Computing Developer
FLSA: Exempt
Salary Range: $70,000 - $80,000 depending on qualifications and experience of selected candidate
Hours/Week: 40 Hours
Shift/Hours/Days: Regular Business

** Please note this is a two year term limited appointment**



The Inter-university Consortium for Political and Social Research (ICPSR), established in 1962, is an integral part of the international infrastructure of social science research. ICPSR's unique combination of data resources, user support, and training in quantitative methods make it a vital resource for fostering inquiry and furthering the social sciences. ICPSR maintains and provides access to a vast archive of social science data for research and instruction. A unit within the Institute for Social Research at the University of Michigan, ICPSR is a membership-based organization, with over 600 member colleges and universities around the world. A Council of leading scholars and data professionals guides and oversees the activities of ICPSR.

ICPSR offers a work environment that is a combination of the best aspects of a small nonprofit or business, established within a university setting. ICPSR is small enough that each person can make a difference, yet large enough to offer a variety of career opportunities. We have a relaxed, collegial atmosphere that fosters communication and networking within and between departments. We are family-friendly, offering flexibility with work hours, and we have a diverse staff that enriches the workplace with their skills and experience. ICPSR offers a competitive total compensation package providing full access to the University of Michigan benefits. More information can be found about ICPSR at www.icpsr.umich.edu.

The ICPSR computing environment consists of Windows desktop workstations and UNIX servers. The desktop workstations run typical business applications such as Microsoft Office, but also run statistical software such as SAS and SPSS. The UNIX servers are based on the Intel/Linux platform and include Oracle databases, World Wide Web server software such as Apache, and a number of other major systems (e.g., tomcat, cocoon).

Responsibilities

Build a prototype secure data computing environment using public utility computing (as provided by the Amazon Elastic Computing Cloud's EC2) at the Inter University Consortium for Social and Political Research (ICPSR) that will provision an analytic computing instance that conforms to the underlying security requirements for data distributed under restricted use agreements and meets the analytic needs of end users and their research teams.

Test the performance, security and usability of both the provisioning infrastructure and the analytic computing interface. Standard methods of testing system performance and security will be used as well as independent security assessments through white hat hacking.

This position reports to the Assistant Director, Computer & Network Services, ICPSR, but projects will be assigned and priorities designated by the Project Principal Investigator.

Note: Part of this job may require some work outside normal working hours to analyze and correct critical problems that arise in ICPSR's 24 hours per day operational environment.


Duties
-Works with technical staff to design, implement, and support cloud based and/or virtualized computing platforms for both internal and external users.
-Creates automated process requiring little manual input for the creation of virtualized computer instances, user accounts, and data access.
-Analyzes, proposes and designs the implementation of security interfaces in systems, applications, and network software.
-Participates in the evaluation of proposed systems, applications, and network software to determine security, data integrity, and usability implications. Assess risks to data security and identify countermeasures, plan and implement technologies.
-Provides third-level technical support for desktop and network systems, both virtualized and non-virtualized.
Job Requirements: -Bachelor's degree in computer science, information systems, or equivalent combination of education and experience.
-Experience with Cloud computing and provisioning (preferably with Amazon Elastic Computing Cloud).
- 4+ years experience with collecting and documenting business requirements from users; then researching, designing, implementing, and supporting computing systems to meet those requirements.
- 5+ years experience and expertise with installing, configuring, and programming Windows software in both virtualized and non-virtualized settings.
-Experience and expertise in industry security training, such as SANS GIAC, or have work experience in security consulting or network security.
-Experience with social science concepts, social science data, and analysis methods, and statistical applications (SAS, SPSS, etc) preferred.
-Experience with Wise Package Studio or other MSI-packaging software preferred.
-Ability to explain complex technical concepts to non-technical users and stakeholders.
-Excellent customer service skills and customer-oriented focus.
-Attentiveness to detail.
-Excellent writing skills (writing samples will be required)
-Ability to work independently while meeting deadlines, communicating issues, and providing detailed project status updates.
-Ability to work within a team.

TRAC: C1.2: Backup infrastructure

C1.2 Repository ensures that it has adequate hardware and software support for backup functionality sufficient for the repository’s services and for the data held, e.g., metadata associated with access controls, repository main content.

The repository needs to be able to demonstrate the adequacy of the processes, hardware and software for its backup systems. Some will need much more elaborate backup plans than others.

Evidence: Documentation of what is being backed up and how often; audit log/inventory of backups; validation of completed backups; disaster recovery plan—policy and documentation; “firedrills”—testing of backups; support contracts for hardware and software for backup mechanisms.



ICPSR has extensive documentation and infrastructure to support its core access functions even when a catastrophic failure disables its primary location in Ann Arbor, Michigan. The documentation - planning documents and instructions - resides in a Google Group, and all members of the IT team, plus two of ICPSR's senior staff outside of IT, are members of the group. The process has been used twice in 2009, once as a test, and once when the Ann Arbor site suffered a power failure.

ICPSR has a less well documented, but fairly prosaic, backup solution in place. All non-ephemeral content at ICPSR resides on a large Network Attached Storage (NAS) appliance. The IT team has configured the NAS to "checkpoint" each filesystem once per day, and each checkpoint is retained for 30 days. Checkpoints provide a read-only, self-serve backup system for those instances where a member of the staff has inadvertently damaged or destroyed non-archival content.

Further, we write all filesystems to a tape library, which is located in a different machine room than the NAS. Every two weeks tapes are removed from the tape library and stored in yet another building. We retain the last four weekly backups and the last twelve monthly backups. The system is exercised on an infrequent but regular basis when we restore files that were damaged or destroyed beyond the thirty-day checkpoint window.

Finally, unlike "working" files where all copies reside locally, and where we retain only one year of content, our archival storage solution consists of copies in at least four locations. The master copy (1) is on the NAS; a copy (2) is written to tape each month; a copy (3) is synchronized daily with the San Diego Supercomputer Center's storage grid; and, a copy (4) is synchronized daily with the MATRIX Center at Michigan State University. Furthermore, archival content collected prior to 2009 has also been copied into the Chronopolis project storage grid, which adds two additional copies.

One area with room for improvement would be regular "fire drills" in which we would attempt to retrieve a random sample of objects from an arbitrarily selected archival storage location.
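A fire drill like that could be driven by a very small script. This sketch only shows the shape of it; the location names and the fetch step are placeholders, not how our storage locations are actually addressed.

    import random

    LOCATIONS = ["nas", "tape", "sdsc", "msu-matrix"]   # placeholder names

    def fetch_from(location, object_id):
        """Placeholder: in practice this would be an rsync, an SRB Sget, or a tape restore."""
        return True   # stub so the sketch runs end-to-end

    def fire_drill(object_ids, sample_size=5):
        """Pick a few objects and one copy location at random and try to fetch each."""
        sample = random.sample(object_ids, min(sample_size, len(object_ids)))
        location = random.choice(LOCATIONS)
        return [(location, obj) for obj in sample if not fetch_from(location, obj)]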

Monday, October 5, 2009

DuraCloud


I attended a webinar on DuraCloud last Wednesday. As a big fan of "the cloud" I was very interested to hear about what's been built, how it could be used, and a roadmap for the future. I learned a little bit about all three topics on the webinar.

Gina Jones from the Library of Congress hosted the meeting, and the main speaker was Michele Kimpton.

DuraCloud is being built as an OSGi container sitting on top of cloud storage providers. Customers can view DuraCloud as a buying club that secures lower prices and eases the burden of learning the administrative and software interfaces of each cloud provider.

DuraCloud is starting a pilot project with four cloud providers: (1) Amazon, (2) EMC, (3) Rackspace, and (4) Sun. They are also working actively to add Microsoft as a fifth cloud provider. They have two content providers signed up for the pilot: the New York Public Library, and the Biodiversity Heritage Library.

The NYPL has 800k objects and 50TB of content. They'd like to use DuraCloud to make a copy of their materials, and to transform content from TIFF format to JPEG2000. The JPEG2000 images would then be pulled back out of the cloud to local storage at the NYPL.

The BHL has 40TB of content, and is hoping to use DuraCloud to distribute its content across multiple locations (US, EU), and as a platform for hosting computationally intensive data mining.

The pilot is running through the end of the calendar year, and DuraSpace intends to have a pricing model in place by Q2 2010, and to launch a production service in Q3.

In response to a question from a participant, Michele indicated that the focus was NOT on securing sensitive data, but rather on hosting public data with open access. So DuraCloud might be a good bet for some of the content ICPSR delivers on its web site, for example, but not for medical records, confidential data, etc.

Friday, October 2, 2009

OR Meeting - Live Chat Session - Technology & Preservation at ICPSR

Next Wednesday I'll be participating in a "live chat" session with Nancy McGovern. I thought I might seed the technology portion of the conversation with these topics:

  1. Fedora Repository software
  2. OpenID authentication for the ICPSR Web site
  3. Cloud services from Amazon Web Services
Since we only have 30 minutes, and since the agenda is for both technology and preservation, I don't want to include too many topics.

Anything important I may have left out?

Thursday, October 1, 2009

Teaching With Data Launched


ICPSR launched a new portal this week - TeachingWithData. This is part of ICPSR's grant to build a National Science Digital Library (NSDL) Pathway dedicated to social science data and teaching. I won't say too much more about the content in this post, but will instead focus on our technology selection.

The two main technical requirements were: (1) Fedora-based, and (2) fast deployment. We think we've achieved both with the site.

We implemented the site on an Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance; this allowed us to stand up a base platform very quickly. We now have nearly a dozen instances running in AWS. Some are delivering production services, like our search technology and this Pathway, and others are used for science experiments and front-ends for storage. We've found that it's more convenient to synchronize and organize content in AWS by sticking a Linux instance in front so that we can use tools like rsync. Using Simple Storage Service (S3) directly is less convenient.

After looking at a variety of stacks that sit on top of Fedora, we selected Muradora for three main reasons:
  1. Open source software
  2. Active development community
  3. Best technology fit with other technology platforms in use at ICPSR
Selecting Muradora allowed us to stand up the portal relatively quickly, and over time we'll evaluate how well the system is meeting our needs.

Enclave in the Clouds blog

My colleagues and I have started a separate blog to report progress on our NIH Challenge Grant. It's called Enclave in the Clouds, and the first post is from Felicia LeClere, PI on the grant.