Monday, April 30, 2012

Disaster Recovery at ICPSR : Part 4

Part 3 described how we activate the replica, and how it works.

Using the replica

We've used our replica system several times over the past three years.  Our usage falls into three categories:

Scheduled maintenance.  There have been a couple of times where we've had scheduled maintenance, and we've pressed the replica into service.

The most recent instance was on February 12, 2012 when the campus networking guys upgraded the gear that connects ICPSR's home in the Perry Building to the backbone.  We executed the failover early on a Sunday, and then moved traffic back once we got the "all clear" signal.  This scenario tends to produce very good outcomes since we can plan for the transfer, and we aren't simultaneously trying to recover from some other problem.

Emergency failover.  The most common instances in this category are when the Perry Building loses power unexpectedly, and we need to move traffic over to the cloud replica as soon as possible.

This scenario also tends to have good outcomes since we can focus solely on the transfer, and there is relatively little we can do except wait for the power to be restored.  One complication can occur if the on-call engineer is not near a computer, and so there is a delay as s/he gets to the closest one.  Or, if the outage happens during the business day, we may need to execute the failover very quickly, before our UPS systems become drained.

Emergency non-failover.  This is the category that corresponds to those times when we actually do NOT press the replica into service, but should have in retrospect.

A common scenario is that we see an alert for a single service (say our Solr search engine), and we begin to troubleshoot the problem.  Initially we may not know whether the problem will be fixed in just a few minutes, or if it will turn into a multi-hour process.  My usual rule of thumb is to press the replica into service if 30 minutes have elapsed and it feels like we're not very close to solving the problem.

This can go very wrong, of course, if my "feeling" is wrong, and can go very, very wrong if my "feeling" is wrong and we are short-handed, and I'm the one who is knee-deep in troubleshooting.  It can be very easy to look up 90 minutes later and say, "Oops."
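One way to take the "feeling" out of the timing half of that rule is to let a clock make the call. The following is only a minimal sketch: the 30-minute threshold comes from the rule of thumb above, while the function name and the `fix_is_close` flag are hypothetical stand-ins for a judgment a human still has to supply.

```python
from datetime import datetime, timedelta

# Threshold taken from the rule of thumb above; the names are hypothetical.
FAILOVER_THRESHOLD = timedelta(minutes=30)


def should_fail_over(alert_started, now, fix_is_close):
    """Return True once 30 minutes have elapsed and a fix does not look close."""
    elapsed = now - alert_started
    return elapsed >= FAILOVER_THRESHOLD and not fix_is_close


if __name__ == "__main__":
    started = datetime(2012, 4, 30, 9, 0)
    # 45 minutes into troubleshooting with no fix in sight.
    print(should_fail_over(started, started + timedelta(minutes=45), fix_is_close=False))
```

A check like this could run from whatever sends the on-call alerts, so the reminder arrives even when the person troubleshooting has lost track of time.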

Managing the replica

In general managing the replica is very inexpensive and requires little monitoring (by humans).  We have found that the main effort occurs when we are making a major upgrade in a core piece of technology such as the hardware platform (32-bit to 64-bit), the operating system (RHEL 5 to RHEL 6), or the web server itself.  In practice it means that in addition to upgrading the staging and development environments at ICPSR, we also need to upgrade the replica environment.  This adds more of the same type of work, not a new type of work.

Friday, April 27, 2012

Disaster Recovery at ICPSR : Part 3

Part 2 described the virtual infrastructure we built in Amazon Web Services to deploy a replica of ICPSR's content delivery system.

Monitoring the replica system

This turns out to be pretty tricky.

The University of Michigan Network Operations Center (NOC) monitors both our physical servers located in the ICPSR machine room and our virtual servers in Amazon Web Services (AWS).  Monitoring the physical servers is very straightforward, but monitoring the virtual servers is not.

For the virtual machines we need to pick a URL which does not require authentication or a cookie, and which will not be redirected.  We also need a URL that points to a simple page so that the monitoring system does not grab page elements from the production web server rather than the replica.  In practice we have found the barriers to be so plentiful and so daunting we have, in fact, done a pretty crummy job of keeping an eye on the health of the replica.

Until recently.

We now run an additional instance in AWS which has one sole purpose:  monitor the replica system.  And to make this fool-proof, we add the same little "lie" to /etc/hosts on the monitoring machine and point names like and to the cloud replicas rather than the production systems.  This has worked very well for us so far in 2012.
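The criteria for a usable monitoring URL (no authentication, no cookie, no redirect) can be checked mechanically before a URL is handed to a monitoring system. This is a rough sketch rather than the check we actually run: the function names are hypothetical, and treating any `Set-Cookie` header as a disqualifier is an assumption that may be stricter than necessary.

```python
import http.client
from urllib.parse import urlsplit


def is_good_probe(status, headers):
    """Check one response against the criteria above.

    headers is a dict mapping response header names to values.
    """
    names = {name.lower() for name in headers}
    if status != 200:
        return False  # rejects redirects (3xx) and auth challenges (401)
    if "set-cookie" in names or "www-authenticate" in names:
        return False
    return True


def fetch_without_redirects(url, timeout=10):
    """Fetch url once; http.client does not follow redirects on its own."""
    parts = urlsplit(url)
    conn_class = (http.client.HTTPSConnection if parts.scheme == "https"
                  else http.client.HTTPConnection)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    conn = conn_class(parts.netloc, timeout=timeout)
    try:
        conn.request("GET", target)
        response = conn.getresponse()
        return response.status, dict(response.getheaders())
    finally:
        conn.close()
```

Run on the monitoring instance, where /etc/hosts already points the production names at the replica, `is_good_probe(*fetch_without_redirects(url))` would exercise the replica rather than the production system.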

Initiating failover

Imagine that the ICPSR "on-call" has just received a series of SMS messages on the on-call cell phone.  Everything at ICPSR is down, and the campus alert system reports that the Perry Building (ICPSR's home) has lost power.  There is no estimated time for repair.  The world looks like this:

We initiate the failover procedure by changing the DNS CNAME records for and  Instead of "pointing" to the physical machines in the Perry Building, we point them to the cloud replicas.  If the failure did not include the production DNS server, we would make the change there.  However, in this scenario, the entire building has lost power, and so we need to make the change on the stealth slave server in AWS.

Now, as it turns out, the stealth slave server is recognized as a master server by the other slave servers for ICPSR's domain:  one at University of Michigan central IT and one at the San Diego Supercomputer Center.  Once we make a change to our server here (or in the cloud) those slave servers will pick it up within a few minutes.  And once they do, web requests start hitting our replica system rather than the production system.  And so the world changes from this:

to this:

in just a few minutes.

We can reverse the failover by making the same simple DNS record change, but in reverse.  We change the pointer from the cloud back to the physical systems in the ICPSR machine room.
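The record change itself is small enough to sketch. The example below assumes a BIND-style zone file with one CNAME record per line and uppercase `IN CNAME` keywords; the record names and targets in the usage example are hypothetical placeholders under example.org, not our real records.

```python
import re


def repoint_cname(zone_text, name, new_target):
    """Return zone_text with the CNAME record for name pointing at new_target."""
    pattern = re.compile(
        r"^(%s\s+(?:\d+\s+)?(?:IN\s+)?CNAME\s+)\S+" % re.escape(name),
        re.MULTILINE,
    )
    # A function is used for the replacement so that new_target is inserted literally.
    updated, count = pattern.subn(lambda m: m.group(1) + new_target, zone_text)
    if count != 1:
        raise ValueError("expected exactly one CNAME record for %r" % name)
    return updated


if __name__ == "__main__":
    zone = "www    IN CNAME web.perry.example.org.\nmail   IN A 192.0.2.10\n"
    print(repoint_cname(zone, "www", "web.cloud.example.org."))
```

Reversing the failover is the same call with the original target. In a real zone the SOA serial number also has to be incremented, since that is how the slave servers notice that the zone has changed.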

Next: Part 4: Our experience with the replica over the past three years

Wednesday, April 25, 2012

Disaster Recovery at ICPSR - Part 2

Part 1 ended with ICPSR embarking on a project to build an off-site replica of its delivery system.

Amazon Web Services

I had been exploring Amazon Web Services (AWS) a little bit in late 2008, and had found it to be a very quick and easy way to stand up technical infrastructure.  In contrast to the process we had been using to try to locate equipment at a University of Michigan data center, locating (virtual) equipment in AWS was astonishingly easy. I needed only a credit card and a Firefox plug-in to get started, and by using the excellent AWS-supplied tutorials I had soon deployed a stealth, slave DNS server for in AWS.  (A stealth server does not appear in the NS records for a domain.)

Also, AWS made it easy to grow into the cloud a little bit at a time.  Is a "small" virtual server under-powered for a replica of our production web server?  No problem, just terminate that virtual machine and relaunch the image on a "medium" virtual server.  Likewise we could add storage space when we needed it rather than investing in a storage array which would be obsolete within two years.

We soon built enough infrastructure in AWS to serve as a replica, and it looks like this:

Touring the replica

In addition to the slave DNS server we also stood up three additional servers in AWS.

One, a replica Oracle database server.  This is what AWS calls a c1.medium-sized instance, and mirrors the content we store in our production database.  We export content from the production database each morning, copy it to AWS, and then import it into the replica.

Two, a replica of our Child Care and Early Education Research Connections (CCEERC) web portal.  This portal runs on a virtual interface on the production web server, but it isn't so easy to add virtual interfaces to AWS instances.  This is what AWS calls an m1.small-sized instance, and provides the same basic content and functionality as  We use rsync over ssh twice each day to keep content and web applications up to date.

Three, a replica of our main web portal.  This runs on what AWS calls an m1.large-sized instance since it bears the largest burden of any component.  As with the CCEERC replica, we synchronize content here on a twice-daily basis.  We also disable certain web applications, like the Deposit System, so that we do not introduce potentially sensitive content to the cloud.  However, common services like search, browse, download, and analyze online are all available.
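The twice-daily synchronization is, at heart, a single rsync invocation run on a schedule. Here is a minimal sketch of what such a job might look like; the paths and host name in the usage note are hypothetical, and the `--delete` flag is an assumption about how strictly the replica should mirror production.

```python
import subprocess


def build_rsync_command(source_dir, remote_host, remote_dir):
    """Build an rsync-over-ssh command that mirrors source_dir to the replica."""
    return [
        "rsync",
        "-a",          # archive mode: recurse and preserve permissions, times, links
        "--delete",    # remove files on the replica that are gone from the source
        "-e", "ssh",   # use ssh as the transport
        source_dir.rstrip("/") + "/",   # trailing slash: copy the contents, not the directory
        "%s:%s" % (remote_host, remote_dir),
    ]


def sync(source_dir, remote_host, remote_dir):
    """Run the sync; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(source_dir, remote_host, remote_dir), check=True)
```

A call such as `sync("/var/www/html", "replica.example.org", "/var/www/html/")` could then be scheduled from cron twice a day.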

Each replica has a list of little white lies inside /etc/hosts that lead each machine to believe that and really do reside in AWS.  This trick allows us to run the same apps in the cloud without resorting to fragile, high maintenance software modifications that try to distinguish between systems in the cloud and systems in ICPSR's machine room.
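Those /etc/hosts entries are easy to generate from a small table of overrides, which helps keep every replica telling the same lies. The sketch below is illustrative only: the host names and addresses in the usage example are made up, and the function name is hypothetical.

```python
def hosts_lines(overrides):
    """Format {hostname: ip_address} as /etc/hosts lines, one line per address."""
    by_ip = {}
    for hostname, ip in sorted(overrides.items()):
        by_ip.setdefault(ip, []).append(hostname)
    return ["%s\t%s" % (ip, " ".join(names)) for ip, names in sorted(by_ip.items())]


if __name__ == "__main__":
    # Hypothetical names and private addresses for the cloud replicas.
    for line in hosts_lines({"www.example.org": "10.0.0.5",
                             "portal.example.org": "10.0.0.6"}):
        print(line)
```

Because the resolver consults /etc/hosts before DNS on a typical Linux configuration, applications on the replica resolve the production names to the local addresses without any code changes.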

Next up: Part 3: Using the replica

Monday, April 23, 2012

Disaster Recovery at ICPSR - Part 1

I'll be running a series on disaster recovery planning (DRP) and execution at ICPSR.  I'm responsible for ensuring that we have a working disaster recovery plan for two key areas of ICPSR:  delivery of content via our web site, and preservation of content via archival storage.  The requirements and solutions of the two areas are quite different, and I'll address each one separately.

This first post will focus on disaster recovery for our web-based delivery system.


After a particularly long outage (3-4 days) in late 2008 due to a major ice storm that knocked out the power to our building, ICPSR made the decision to invest in a disaster recovery plan for our web-based delivery system.  The idea was to create a plan which would allow my team to have the process and infrastructure in place so that we could recover from a disaster befalling our delivery system.  We defined "disaster" to be an outage which could conceivably last for many hours or even days.  And the goal was to be able to recover from a disaster within one hour.

It is important to note that we were not intending to build a "high availability" delivery system.  The goal of that type of system would be to move ICPSR into the so-called "five nines" level of availability, meaning that our infrastructure would be available at least 99.999% of the time.  Converting ICPSR's plethora of legacy systems and infrastructure into such a high availability system would be a major project requiring a significant investment over several years.

Instead we set the bar lower, but not too low.  What if ICPSR had a goal of 99% availability each month? In that scenario we do not need the level of investment and infrastructure to avoid almost all down-time; we only need to be able to recover from down-time quickly, and to prevent any long outages.  The investment to reach that goal would be much smaller, and it would serve our community well.
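The gap between the two targets is easier to appreciate as a downtime budget. A quick calculation, assuming a 30-day month (the function name is just for illustration):

```python
def allowed_downtime_minutes(availability_percent, days_in_month=30):
    """Minutes of downtime permitted in a month at the given availability."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for target in (99.0, 99.999):
        print(target, round(allowed_downtime_minutes(target), 3))
```

At 99% the budget is 432 minutes, or a little over seven hours per month. At "five nines" it is about 0.43 minutes, roughly 26 seconds, which is why the two goals call for such different levels of investment.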

The Starting Point

At this point in time we already had reasonably robust local systems - powerful servers for web and database services, an enterprise-class storage system, and UPS backup for all of our systems.  In addition, the University of Michigan Network Operations Center (NOC) was monitoring our systems 24 x 7.  The NOC's network monitoring system (NMS) sent automated emails to us whenever a component faulted.

However, we did not have any sort of on-call rotation ensuring that a fault would be caught and corrected quickly, and we also did not have any backup or replica system which could be pressed into service if, say, our building lost power for several hours (or days).  So we were exposed to short outages becoming unnecessarily long, and to long outages where we had no control over the recovery time.

We were able to address the first issue quickly and effectively by establishing an on-call rotation, where the "on-call" served one week at a time and carried a cell phone which received SMS alerts from the NOC's NMS.  This meant that faults would now be picked up and acted upon immediately by someone on the ICPSR IT team.  This alone would eliminate one class of long-lived outages, for example, where a fault would occur late on a weekend night, but not be picked up for repair until Monday morning.

The next step was to design, build, deploy, and maintain a replica of our delivery system.  But where?

Next up:  Part 2:  Building the replica

Friday, April 20, 2012

It's official - we have no love for khugepaged

As web site visitors - and the IT staff - experienced through February and March after we upgraded to RHEL 6, life was tough.  Very tough.

We saw two months of ghastly web service availability, well below the 99% goal.  Lots of pages in the middle of the night.  Lots of trips to the office at all hours and all days to cycle power on the server.

It was clear that khugepaged was involved somehow.  Was it the victim of something else?  Or the cause?

Based on the most scanty of evidence and great desperation we disabled khugepaged on March 29.  And since then?

[ sound of knocking on wood ]

The machine is back to its old self.  One very short-lived (seven minutes) outage, caused by a bad rewrite rule that we added in response to a request and then had to back out.

Who knew that this simple command:

root# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

could generate so much happiness?
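A value written this way lasts only until the next reboot, so it is worth being able to confirm the current state. The kernel reports the available choices on one line with the active one in square brackets. The sketch below parses that line; the default path is the RHEL 6 one from the command above, and the function names are hypothetical.

```python
def active_thp_setting(contents):
    """Return the bracketed (active) choice from a line like 'always madvise [never]'."""
    for word in contents.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError("no active setting found in %r" % contents)


def read_thp_setting(path="/sys/kernel/mm/redhat_transparent_hugepage/enabled"):
    """Read the current setting from the RHEL 6 path used in the command above."""
    with open(path) as handle:
        return active_thp_setting(handle.read())
```

A monitoring job could call `read_thp_setting()` and raise an alert whenever the answer is anything other than `never`.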

Wednesday, April 18, 2012

Great FLAMEing file identification service

Some parts of the FLAME project will lend themselves to a microservices approach.  Microservices, like cloud computing, is a trendy, useful concept, but without a crystal clear definition.  But my take is that a microservice is something that performs one small, but useful bit of work, and which can be swapped in and out of an overall architecture at a component level.  It needs to have very clear inputs and outputs, and cannot contain any "secret sauce" that isn't part of its functional role.

Do not try this street magic at home.
One common activity at ICPSR is automated file identification.  Historically we've done this with the venerable UNIX utility file, but where we modify the magic database heavily, particularly for the formats we see most often.  We also post-process the output from file where we need additional handling above and beyond the capabilities of the magic database (e.g., making decisions based on the name or extension of the file).

Managing the magic database is not for the faint of heart.  (Try updating the Vorbis section.)  And this management has gotten both harder -- RHEL 6 uses a new format for its magic database which is incompatible with RHEL 5 -- and easier -- the new format eliminates the pesky magic.mime database.  However, we've gotten reasonably competent at managing magic and have come to rely on it for file format identification.

In support of the FLAME project we even created a little web service that takes a file's content and its name as input, and delivers a little snippet of XML as the output.  The XML contains the "human readable" answer from our magic database and the "MIME type" too.  This is our first FLAME-inspired web service.
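The core of a service like this is small. The following is a rough sketch, not our production code: it shells out to the `file` utility and wraps the two answers in XML. The element names match the sample output shown in this post, while the function names and the optional `magic_file` parameter are illustrative assumptions.

```python
import subprocess
from xml.sax.saxutils import escape


def run_file(path, mime=False, magic_file=None):
    """Ask the file utility about path; --brief omits the filename from the output."""
    args = ["file", "--brief"]
    if mime:
        args.append("--mime")
    if magic_file:
        args += ["-m", magic_file]   # use a custom magic database
    args.append(path)
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def to_xml(description, mime_type, upload_type="application/octet-stream"):
    """Wrap the two answers in the XML shape the service returns."""
    return ('<?xml version="1.0" encoding="utf-8"?>'
            "<wsifile><ifile>%s</ifile><ifilemime>%s</ifilemime>"
            "<uploadInfo>%s</uploadInfo></wsifile>"
            % (escape(description), escape(mime_type), escape(upload_type)))
```

For an uploaded file saved at some temporary path, `to_xml(run_file(path), run_file(path, mime=True))` would produce the response body. Escaping matters here because descriptions from the magic database can contain characters such as `&` and `<`.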

If you'd like to try it, you can use your favorite form-capable URL transfer utility to do so.  Here's an example where I have run curl on one of our RHEL machines:

dhcp-bryan:; curl -F "file=@uuid-comparison.xlsx;filename=uuid-comparison.xlsx"
<?xml version="1.0" encoding="utf-8"?><wsifile><ifile>Microsoft Excel</ifile><ifilemime>application/zip; charset=binary</ifilemime><uploadInfo>application/octet-stream</uploadInfo></wsifile>

feeding in an Excel file as the input, and another with a plain text file:

dhcp-bryan:; curl -F "file=@/etc/resolv.conf;filename=resolv.conf"
<?xml version="1.0" encoding="utf-8"?><wsifile><ifile>ASCII text</ifile><ifilemime>text/plain; charset=us-ascii</ifilemime><uploadInfo>application/octet-stream</uploadInfo></wsifile>

and an interesting MS Word file:

dhcp-bryan:; curl -F "file=@2011-03CouncilPandAminutes.doc;filename=2011-03CouncilPandAminutes.doc"
<?xml version="1.0" encoding="utf-8"?><wsifile><ifile>CDF V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 1200, Number of Characters: 0, Name of Creating Application: Aspose.Words for Java, Number of Pages: 1, Revision Number: 1, Security: 0, Template:, Number of Words: 0</ifile><ifilemime>application/msword; charset=binary</ifilemime><uploadInfo>application/octet-stream</uploadInfo></wsifile>

Feel free to try it out, and to post reactions and suggestions here.

Monday, April 16, 2012

The nature of ICPSR's holdings

At the end of 2011 ICPSR had about 9TB of content stored in Archival Storage.  This measurement includes everything we have collected over the past 50 years, including content which is not packaged into "studies" for dissemination, such as TIGER/Line files and data packaged for SDA.  This content is not compressed, and contains many duplicates[1], and so should be considered an upper bound.

As we head into the start of Q2 in 2012 the quantity of content in Archival Storage has edged up just a little bit; it may be as much as 9.1 TB now.  And I would guess that we have another 100GB or so of content in Ingest storage, making its way through the ICPSR data curation process.

The big news, though, is the amount of non-survey content one finds in Ingest storage:  7.4TB.  And growing.  Fast.

As video content from the Bill and Melinda Gates Foundation Measures of Effective Teaching project continues to arrive it won't be much longer before the amount of video content equals the amount of survey data content.  By the end of the calendar year I expect that we will have more video than survey data.

Long-time ICPSR staff tell the story of how the 2000 Census doubled the size of ICPSR's holdings.  (I'll speculate that perhaps ICPSR went from about 3TB of content prior to the 2000 Census, and then grew to 6TB thereafter.)  In 2012-2013 ICPSR is likely to quadruple the size of its holdings, growing from about 9TB to nearly 40TB.

Wednesday, April 11, 2012

March 2012 Web availability

March 2012 was not kind to us.

Clicking the image will display a full-size chart.  But please don't.  It is too ugly.

The main culprit in March was a continuing problem with the reliability of the production web server.  The environment - cooling, electricity, humidity - was fine, and the individual web applications were also fine, but something is not quite right with the kernel.  I think.  (If you would like to join the team as our new Senior Systems Architect and help solve the problem, see my post from last week.)

In March we saw multiple outages, each lasting over an hour.  The script always went something like this:

  1. Load average increases by 5000-10000%
  2. One web application stops responding and logging
  3. KERN.INFO error messages from khugepaged and jsvc appear in syslog
  4. Attempt to restart web application
  5. Fail
  6. Attempt to restart all web apps and their containers
  7. Fail
  8. Attempt to reboot machine
  9. Fail
  10. Optional:  Drive into office if weekend or early morning
  11. Attempt to cycle power on machine
  12. Mix of foul language and prayer
  13. Repeat step #12
  14. Success - machine is working again
Because we use the cloud instead of local, physical servers for many services, and because we haven't had all that many times where the machine needed its power cycled to solve a problem, we don't have things set up for remote power access.  We'd like to address that.  (If you would like to join the team as our new Senior Systems Architect and help solve the problem, see my post from last week.)

So here's the plan to have a better April:
  1. Disable khugepaged, hoping this might stop the machine from seizing up
  2. Drive faster to the Perry Building, hoping this might result in faster applications of turning the power off and on
  3. Hire the Senior Systems Architect, hoping that having a third pair of eyes on the problem might reveal its true cause and solution
  4. Mix of foul language and prayer, hoping it will ease the pain
And, more seriously, we have also updated a few apps (like Solr) to use local storage rather than NFS-mounted storage for their work, particularly if the app tends to do a lot of writing to the filesystem.  NFS seems to be part of the mystery too.

Monday, April 9, 2012

March deposits at ICPSR

Chart?  Chart.
# of files   # of deposits   File format
5            2               text/plain; charset=unknown
87           16              text/plain; charset=us-ascii
1            1               text/x-c; charset=unknown
7            1               text/x-c; charset=us-ascii

A blissfully normal month of deposits.  Usual types.  Usual volumes.

Still need to tweak the automated MIME type detector to stop reporting that it is finding C source code.  The eight files above are most likely plain text files that just happen to have something like a pound-sign or "slash-star" sequence starting in the first column.
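One possible tweak is a post-processing step that trusts a `text/x-c` answer only when the file name agrees with it. This is a sketch of the idea rather than the fix we will necessarily ship; the function name and the list of extensions are assumptions.

```python
import os

# Extensions that plausibly indicate real C source; an illustrative list.
C_EXTENSIONS = {".c", ".h"}


def adjust_mime_type(mime_type, filename):
    """Downgrade a text/x-c guess to text/plain unless the name looks like C source."""
    media_type, _, params = mime_type.partition(";")
    extension = os.path.splitext(filename)[1].lower()
    if media_type.strip() == "text/x-c" and extension not in C_EXTENSIONS:
        return "text/plain;" + params if params else "text/plain"
    return mime_type


if __name__ == "__main__":
    print(adjust_mime_type("text/x-c; charset=us-ascii", "codebook.txt"))
```

The charset parameter is carried over unchanged, so a file reported as `text/x-c; charset=us-ascii` would be recorded as `text/plain; charset=us-ascii`.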

Not shown here - because it isn't passing through the deposit system - is a considerable volume of video content from the Gates Foundation.  We have a bit over 6TB that we received in early 2012, and about 1TB of a 20TB collection that will arrive in a steady stream over the next 12-16 months.

If our policy is that the ICPSR deposit system is just one of many mechanisms for ICPSR to accept content, then this seems OK.

But, if we expect the deposit system to be the complete and correct record of ALL incoming content, then we do have a problem.  A 7TB problem that will grow up to be a big and strong 26TB problem at some point.

Friday, April 6, 2012

Dead On Annihilator Superhammer

I can see all sorts of good uses for this.  Every machine room should include one.

In addition to helping with routine maintenance, I think this would also be perfect for any future Zombie Apocalypse.

UnDead On Annihilator Superhammer?

Thanks to Cory and the gang at Boing Boing for pointing this out.

This is the sort of thing that we should be giving away at conferences with a nice ICPSR logo on it.

Wednesday, April 4, 2012

Systems Architect Senior

We're still looking to add a senior person to the team!

The University of Michigan jobs site is notorious for discarding information about jobs, so I will include the posting below and the link here.  In brief we're looking for a senior systems person who can design and build delivery and preservation systems, and who (ideally) also has some experience with commercial cloud providers, such as Amazon.

Systems Architect Senior

Job Summary

The Inter-university Consortium for Political and Social Research (ICPSR), the world's largest archive of digital social science data, is now accepting applications for a Systems Architect Senior. ICPSR is a unit within the Institute for Social Research at the University of Michigan. ICPSR's data are the foundation for thousands of research articles, reports, and books. Findings from these data are put to use by scholars, policy analysts, policy makers, the media, and the public. This position reports to the Assistant Director, Technology Operations, Computer and Network Services.


This position is responsible for the design, implementation, maintenance, and regular management of ICPSR's web systems development, staging, production, and disaster recovery operational environments. This consists of several distinct platforms, including local physical hardware and virtual systems hosted in Amazon's Elastic Compute Cloud (EC2). The successful candidate will also work closely with the Assistant Director, Software Development, Computer and Network Services to define and implement functional requirements.

One, this position will select, install, and manage integrated development environment (IDE) software on developer workstations, and the underlying software repository. The incumbent systems are Eclipse and CVS, respectively.

Two, this position will select, install, manage, and maintain the testing, staging, and production platform environments used by ICPSR to deploy and test new web applications. The incumbent web application server is Apache Tomcat, sometimes run as a stand-alone web server and sometimes as a client to Apache Httpd. The incumbent server platform is a mix of local, physical servers and Elastic Compute Cloud (EC2) instances running within Amazon Web Services (AWS). ICPSR has interest in exploring a more complete, cloud-based web application platform such as AWS Elastic Beanstalk.

Three, this position will manage the AWS-hosted replica of ICPSR's production web environment. This includes building and maintaining tools that synchronize software, static content, and database content between the production environment and the replica environment.

Four, the over-arching responsibility of this position is to maintain and improve ICPSR's capacity for delivering high-availability, high-performance web-based services, managing the tension between the desire to have well-defined, documented, predictable deployments and business processes and the desire to have fast moving, fluid, and flexible deployments and business processes.

Required Qualifications*

BS in Computer Science, Computer Engineering, or at least eight years of experience with designing and managing complex web application hosting environments
Two or more years of experience with J2EE application servers (such as Tomcat)
Two or more years of experience with virtualization products or services, such as Amazon Web Services
In-depth expertise with Red Hat Enterprise Linux 5 and 6
In-depth knowledge of networking principles and network support
In-depth knowledge of Web technologies (Apache, Tomcat)
Experience operating an RDBMS (Oracle, MySQL)
Experience working with monitoring tools, version control software (CVS, Subversion, Perforce), and build tools (Make, Ant)
Strong understanding of web application architectures
Enthusiastic self-starter who works well with other team members
Excellent inter-personal skills with the ability to communicate clearly to peers, vendors, customers, and colleagues

Desired Qualifications*

MS in Computer Science or Computer Engineering
At least five years of experience with J2EE application servers (such as Tomcat)
Experience with Veritas products (Veritas Netbackup)
Storage experience (EMC)
Experience as an Infrastructure Engineer in a high availability environment
Expertise working with monitoring tools, version control software (CVS, Subversion, Perforce), and build tools (Make, Ant)

Underfill Statement

This position may be underfilled at a lower classification depending on the qualifications of the selected candidate.

U-M EEO/AA Statement

The University of Michigan is an equal opportunity/affirmative action employer.

Monday, April 2, 2012

FLAME update

We have three different tracks running on FLAME.

One track is conducting an analysis of the business requirements ICPSR has for what we are calling our "self archived" collection.  This is a collection of material best represented today by our Publication Related Archive, a set of materials that receives very, very little scrutiny between time of deposit and time of release on the web site.  We are imagining a future world where the quantity of "self-archived" materials increases dramatically from today's volumes, driven by NIH and NSF requirements to share and manage data.

I see the following questions generating the most discussion on this track:  How much disclosure review is necessary before releasing the content publicly?  Should the depositor have "edit" access to the metadata?  If so, should it be moderated or completely open?  How much "touch" does ICPSR need to have on these materials?

Another track is working on a crisp, concrete definition of what it means to "normalize" a system file from SAS, SPSS, or Stata.  ICPSR has long said that our approach is to "normalize" such files, producing plain ASCII data and set-ups, but what does that really mean?  And is that really possible?

I see the following questions generating the most discussion on this track:  Is ASCII the right thing, or ought it be a Unicode character set?  Are set-ups the right documentation or should it be DDI XML?  If we choose the former, is it sufficient to produce set-ups compatible with the original content type (e.g., SAS setups for a SAS file)?  What about precision?  Length of variable names?  Question text?  Is it possible to normalize without loss, and if not, how much loss is acceptable?  Can a computer do this without human intervention 99% of the time?

And the last track is working on a matrix that maps a set of parameters (inputs) to a resulting preservation commitment and set of actions.  For example, if one has a file which contains "documentation" (the type of content) in XML format in the UTF-8 character set (the format of the file), then perhaps the preservation commitment is "full preservation."
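In its simplest form such a matrix is just a lookup table. The sketch below is purely illustrative: only the first row comes from the example above, and both the choice of parameters and the fallback to "bit-level preservation" are assumptions, not decisions the track has made.

```python
# Fallback when no row matches; an assumption made for this sketch.
DEFAULT_COMMITMENT = "bit-level preservation"

# Keys are (content type, file format, character set). Only this one row
# comes from the example in the text; a real matrix would have many more.
COMMITMENTS = {
    ("documentation", "XML", "UTF-8"): "full preservation",
}


def preservation_commitment(content_type, file_format, charset):
    """Look up the commitment for one file, falling back to the default."""
    return COMMITMENTS.get((content_type, file_format, charset), DEFAULT_COMMITMENT)


if __name__ == "__main__":
    print(preservation_commitment("documentation", "XML", "UTF-8"))
```

Writing the matrix down this way makes the open questions concrete: every element of the key needs a controlled vocabulary, and every value needs a definition of the actions it implies.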

The key questions here, I believe, will be around what the right list of parameters is.  And if any of the parameters uses a controlled vocabulary, what's in the CV?  And what exactly does it mean to have a "full preservation" commitment?  What's involved beyond just keeping the bits around, which is presumably all one does with "bit-level preservation?"