Wednesday, July 25, 2012

My Amazon Web Services wishlist for the U-M

I'm serving on a University of Michigan task force that is looking at ways in which we can make cloud computing easier for faculty, students, and staff to consume.  This presupposes, of course, that at least some of the university community has research, business, or other needs that would be well served by a cloud-type solution.

For those of us who are already using the cloud to solve a few different problems -- off-site archival copies, a disaster recovery solution for delivery systems, among others -- the problem isn't so much how to get us to use cloud computing, but how the U-M can help us get the most value for our dollar.

With this in mind I offer my Amazon Web Services (AWS) wishlist for the U-M:

  • Build Amazon Machine Images (AMI) for 64-bit Red Hat Linux (and, optionally, 64-bit Windows Server). Put any security or system or software goodies into the image that would be available to the entire university community (IT directors, grad students, casual users). This saves us from needing to build and maintain our own AMI or, worse, using one from a third party.
  • Deploy an AWS Virtual Private Cloud (VPC) that connects our own little piece of AWS “cloud space” to the rest of campus over a secure link. Allow instances running within this VPC to access infrastructure such as Active Directory. Treat this part of AWS as if it were just another network (or data center) on campus. This enables us to deploy services dependent upon campus infrastructure in AWS more easily.
  • Deploy an AWS Direct Connect between the VPC and UMnet (or Merit [the State of Michigan research and education network] or Abilene [Internet2's national network]). This grants us a fast, secure, inexpensive pipe for moving content between campus and AWS. We could start to deploy I/O-intensive resources in AWS more readily if we don’t have to pay for the bits individually.
  • Implement an agreement where AWS has one customer (the University of Michigan) rather than many. (ICPSR alone has four different identities within AWS, largely so that we can map expenses from one identity to a university account.) This one customer would have different sub-accounts, and the usage across ALL of the sub-accounts would roll up to set pricing. ICPSR stores over 1TB of content in AWS S3, for example, and so our rate is $0.11/GB/month. Other users at U-M who store content in AWS S3, but less than 1TB, are paying over $0.12/GB/month. That is only about $0.01/GB more, but it adds up across ALL accounts each month.
  • Explore the feasibility of allowing one to use U-M credentials (via Shibboleth?) to access key web applications at AWS, such as the AWS Management Console. We currently have to provision a separate email address and local (to AWS) password.
  • Explore the feasibility of using an AWS Storage Gateway as a way to absorb bursty or short-lived storage needs. It would be fabulous if we could buy nearly unlimited space in the U-M storage cloud. This is more feasible if we can use AWS storage for short-lived "bursts" of temporary storage.
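The consolidated-billing arithmetic behind that wishlist item can be sketched in a few lines of Python. The tier boundary and rates below are illustrative assumptions based only on the figures quoted in the post, not AWS's official price list:

```python
# Illustrative only: the tier boundary and rates are assumptions based on
# the figures quoted above, not AWS's official S3 price list.
TIERS = [
    (1024, 0.125),          # first ~1 TB (in GB) at ~$0.125/GB/month
    (float("inf"), 0.110),  # everything beyond at ~$0.11/GB/month
]

def monthly_s3_cost(gb):
    """Cost of storing `gb` gigabytes for one month under the tiered rates."""
    cost, remaining = 0.0, gb
    for size, rate in TIERS:
        used = min(remaining, size)
        cost += used * rate
        remaining -= used
        if remaining <= 0:
            break
    return cost

# Two separate accounts, each under the tier boundary...
separate = monthly_s3_cost(800) + monthly_s3_cost(800)
# ...versus one consolidated account whose usage rolls up past the boundary.
consolidated = monthly_s3_cost(1600)
print(round(separate - consolidated, 2))  # consolidated is cheaper
```

The per-gigabyte gap is small, but a consolidated account crosses into the cheaper tier sooner, which is exactly the roll-up effect a single university-wide agreement would capture.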

Monday, July 23, 2012

ICPSR web site maintenance

We ran into a few problems last Wednesday during our system update.  We rolled back the changes, and are giving it another go this evening at 10pm EDT.  We're going to move traffic to our replica in the cloud during the maintenance so that we have more time for troubleshooting.

Our cloud replica has many of the features of our main site (search, download, analyze), but does not include features that transfer materials to ICPSR, such as our online Deposit Form.

Friday, July 20, 2012

Amazon's loss is SDSC's gain

One of the recent Amazon Web Services (AWS) power outages has left some of my EBS volumes in an inconsistent state.  If these were simple volumes, each containing a filesystem, then the fix is easy:  just dismount the filesystem, run fsck to check it, and then remount the filesystem after it has been fixed.  We have done this on several of our EC2 instances that had inconsistent volumes.

Unfortunately, for these particular volumes we have bonded them together to form a virtual RAID.  And this RAID is used as a single multi-TB filesystem which is much bigger than fsck can handle.  So we are kind of stuck.

One option would be to newfs the big filesystem, and to move the several TBs of content back into AWS, but that would be very slow.  And if there is another power outage......

So instead we called up our pals at DuraCloud and asked them if they could help us enable replication of our content to a second provider.  (The first provider is - ironically - AWS.  But their S3 service, not their EC2/EBS service.)  They said they'd be happy to help, and, in fact, they will start replicating our content later this week.  (Now that's service!)

The new copy of our content will now be replicated in...... SDSC's storage cloud.  This really brings us full circle at ICPSR since our very first off-site archival copy was stored at SDSC. Back then (like in 2008) it was stored in their Storage Resource Broker (SRB) system, and we used a set of command-line utilities to sync content between ICPSR and SDSC.  

The SRB stuff was kind of clunky for us, especially given our large number of files, our sometimes large files (>2GB), and our sometimes poorly named files (e.g., control characters in file names).  Our content then moved into Chronopolis from SRB, and then at the end of the demonstration project, we asked SDSC to dispose of the copy they had.  But now it is coming back......

Wednesday, July 18, 2012

ICPSR web maintenance

We're updating a few pieces of core technology on our web server this afternoon:  httpd, mod_perl, Perl, and a few others.  Normally we like to perform maintenance like this during off-hours, but we're doing it at 12:30pm EDT today so that we have "all hands on-deck" to troubleshoot and solve problems.

We've already performed this maintenance on our staging server, and that went smoothly.  Our expectation is that this maintenance will last 15-30 minutes.

Monday, July 16, 2012

ICPSR 2012 technology recap - where did the money come from?

We're putting together some summary numbers for technology spending and investments at ICPSR for FY 2012.  (The ICPSR fiscal year is the same as the University of Michigan's, and runs from July 1 to June 30.  We've just recently closed FY 2012.)

The first set of numbers shows the allocation of effort in FY 2012 by funding source. The unit of measurement in this pie chart is HOURS (not DOLLARS) that were expended in FY 2012 by each funding source.  (We originally wanted to calculate dollars, but that turns out to be an even bigger effort.)  Here's an interactive chart:

This is an interactive Google Docs chart.  If you click slices of the pie, it will identify the funding source.

The main source of technology effort funding comes from the Computer Recharge, an hourly "tax" that ICPSR levies against all projects.  Although it is a single funding source (nearly 45% of hours worked in FY 2012 were billed against it), I have split it into two sub-categories, one for what I am calling "IT" and one for "SW" (software).

The "SW" portion includes the effort of all staff who are professional software developers.  The type of work performed by this team using this account includes enhancements and maintenance for ICPSR's core data curation and data management systems, and investments in new products and services such as software developed to support our IDARS system for applying for access to datasets.

The "IT" portion includes the effort of the remainder of the staff, which tends to include systems administrators, architects, network managers, and desktop support specialists.  I also allocate my own time to this bucket since the majority of my non-contract, non-grant effort over the past year has been in building and architecting technology systems.

Other big slices of the funding pie include the work of staff members who are explicitly funded by projects such as our CCEERC and RCMD web portals, our two Bill and Melinda Gates Foundation grants, the ICPSR Summer Program, and many more.  In fact, there are over 20 separate funding sources used to support technology at ICPSR; the pie chart shows 18 because I grouped several small ones into a category called "Misc."

If this gives the impression that there are many, many projects and activities at ICPSR that involve technology, that's good!  That is certainly the case.

However, if "focus wins" then we're in a little bit of trouble.  My sense is that each of these 20-some funding sources has at least one unique project with its own business analysis and project management needs, and it is sometimes the case that different projects have antithetical technology needs.  I see this play out in all phases of the OAIS lifecycle.  ("I want you to build a system that makes it as easy as possible to fetch datasets from ICPSR" v. "I want you to build a system that requires significant effort and oversight to fetch datasets from ICPSR.")

Friday, July 13, 2012

Surviving the move to Google

The University of Michigan is moving its business productivity systems (mail, calendar, among others) to Google this year.  Some university institutes and colleges have already made the move, and others will make the move later this year.  ICPSR and its parent, the Institute for Social Research, will move in August, although about 70 staff at ISR will make the move early.  These "Google Guides" will help others with the transition.

One concept that might be helpful for the move is to distinguish between an email address and email mailbox.

An email address is basically a pointer.  It can point to another email address, or it can point to an email mailbox.  People publish and share their email address with others, and this is the piece of information we use to target a piece of email.

An email mailbox is a place where email lands.  You log in to an email system (Gmail, Exchange, AOL, and many more) with an ID and password, and once there, you can search for messages, read messages, sort, filter, send, and the rest.
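The pointer-versus-mailbox distinction can be modeled as a tiny lookup table. Everything here - the addresses, the mailbox names, and the `resolve` helper - is hypothetical, just to illustrate the idea:

```python
# A toy model of the distinction: an address either points at another
# address or at a mailbox. All names here are made up for illustration.
MAILBOXES = {"exchange-mailbox", "gmail-mailbox"}

# A chain of pointers ending at an Exchange mailbox.
POINTERS = {
    "alias@directory.example.edu": "me@mail.example.edu",
    "me@mail.example.edu": "exchange-mailbox",
}

def resolve(address, pointers):
    """Follow address-to-address pointers until we land on a mailbox."""
    seen = set()
    while address not in MAILBOXES:
        if address in seen:
            raise ValueError("mail loop at " + address)
        seen.add(address)
        address = pointers[address]
    return address

print(resolve("alias@directory.example.edu", POINTERS))  # exchange-mailbox
```

No matter which address in the chain you publish, following the pointers always lands the message in the same mailbox - which is why a migration can swap the mailbox at the end of the chain without anyone's published address changing.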

The big change at the U-M this year with email is with everyone's email mailbox.  It's changing from some legacy system at the U-M to Gmail.

Below is a typical example of how things work today at ICPSR.  The email address is just a pointer to the U-M enterprise directory entry.  That entry is also just a pointer to another email address.  And that email address is the thing that actually points to the email mailbox, which in our case is the ISR Exchange server.  The first diagram below shows the relationship between email addresses and email mailboxes in the current system for my own email:

Contrast that with the second diagram, which shows the relationships after the move to Google.  There are two main changes.

The first is that the email mailbox now lives in Google Gmail rather than Exchange, and my email software is a web browser rather than Outlook.  This is a very big change.  Some will find it a pleasant change, and others will hate the new system.

The second is that the roles of the two email addresses have reversed.  The address that used to point directly at the mailbox is now just a pointer to another email address, and the other address is the one that "points" directly to Gmail.  And, of course, just like before, one can publish or use any of the three addresses, and the mail goes to the same email mailbox.

Wednesday, July 11, 2012

Tagging EC2 instances and EBS volumes

Adding this to my Amazon (Web Services) wishlist....

Optional Billing tag which can be set when an EBS volume is created or when an EC2 instance is launched

There is a lot of convenience in having a single AWS account.  It makes it easier to find running instances in the AWS Console.  It eliminates the need to share AMIs across accounts.  It obviates the need to remember (and record) multiple logins and passwords.

However, there is one big win in having multiple AWS accounts:  It makes it easier to tie the charges for one set of cloud technology (one account) to a revenue source.  And so we often have four or five different AWS accounts for the four or five different projects we have underway.

It would give me the best of both worlds if I could keep my single AWS account, but then specify a special-purpose tag (say, Billing) when I provision a piece of cloud infrastructure.  This would be an optional tag that I could set when I launch an instance or create a volume.  This tag would control the format and grouping of charges on my monthly AWS invoice for that account.

For example, say I launch a small instance and set the value of the Billing tag to U12345 (a made-up University of Michigan account number).  And then I launch a second one with a Billing tag of F56789.  And then in addition to the usual AWS invoice with a line item like this:

AWS Service Charges

Amazon Elastic Compute Cloud
US East (Northern Virginia) Region
    Amazon EC2 running Linux/UNIX
        $0.080 per Small Instance (m1.small) instance-hour  1440 hours   $115.20

I would see an additional section:

AWS Service Charges by tag

Billing: U12345
Amazon Elastic Compute Cloud
US East (Northern Virginia) Region
    Amazon EC2 running Linux/UNIX
        $0.080 per Small Instance (m1.small) instance-hour  720 hours   $57.60

Billing: F56789
Amazon Elastic Compute Cloud
US East (Northern Virginia) Region
    Amazon EC2 running Linux/UNIX
        $0.080 per Small Instance (m1.small) instance-hour  720 hours   $57.60

This would make it easy for me to take my single invoice from Amazon and "allocate" it (a term from Concur, the system we use for managing this sort of thing) to the right internal account.
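As a sketch of what that allocation might look like programmatically, here is a hypothetical roll-up of instance-hours by a Billing tag. The record format, the tag name, and the instance IDs are all made up for illustration - AWS offered no such billing tag at the time:

```python
from collections import defaultdict

# Sketch of the proposed invoice grouping. The record shape, the "Billing"
# tag, and the instance IDs are hypothetical, mirroring the example above.
RATE = 0.080  # $/instance-hour for an m1.small, per the example invoice

usage = [
    {"instance": "i-aaaa1111", "hours": 720, "tags": {"Billing": "U12345"}},
    {"instance": "i-bbbb2222", "hours": 720, "tags": {"Billing": "F56789"}},
]

def charges_by_billing_tag(records, rate):
    """Roll instance-hours up into one line item per Billing tag."""
    totals = defaultdict(float)
    for rec in records:
        tag = rec["tags"].get("Billing", "(untagged)")
        totals[tag] += rec["hours"] * rate
    return dict(totals)

print(charges_by_billing_tag(usage, RATE))  # one ~$57.60 line item per tag
```

With something like this behind the invoice, allocating a single Amazon bill across internal university accounts becomes a mechanical grouping step rather than a manual bookkeeping exercise.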

Monday, July 9, 2012

June 2012 deposits at ICPSR

Deposit numbers from June:

# of files   # of deposits   File format
36           1               text/plain; charset=iso-8859-1
31           3               text/plain; charset=unknown
389          10              text/plain; charset=us-ascii
1            1               text/plain; charset=utf-8
4            3               text/x-mail; charset=us-ascii

Quite a bit of Stata this month, much more than normal.

Wednesday, July 4, 2012

Ixia Communications acquires BreakingPoint Systems

A friend of mine mailed me a link to a TechCrunch article that got me thinking about ICPSR:

Network Testing Consolidation: Ixia Pays $160M Cash For Security-Focused BreakingPoint Systems

So what does this have to do with ICPSR?

Almost eleven years ago, Ixia made its very first acquisition:

Ixia Announces the Acquisition of Caimis, Inc.

(The link above is from the Internet Archive's Wayback Machine.)

Caimis was a small software company that a handful of us founded in 2000.  Some had come from a pioneering Internet company called ANS Communications, and were looking for something very different after having been acquired by Worldcom in 1998.  And others were from CAIDA, which is very much still alive and well (unlike ANS Communications or Worldcom).

Founding and growing Caimis was an exciting time, and selling the company to Ixia was a hard, but good, decision for us.  The deal closed in late 2001 just after the 9/11 attacks, and that made the long flight to Los Angeles to finalize the papers even more "exciting" than usual.

Ixia was a maker of hardware and had a pretty thorough process for manufacturing systems, assigning part numbers to every last item, and managing projects with an amped-up version of MS Project.  We were a very small, very loose software company with very little process.  This led to a gigantic clash in cultures, and things took a turn for the worse after six months:  Ixia decided to close down the Ann Arbor office and shut down several projects.

Like a few others, I decided to stay in Ann Arbor, and was looking around for the next thing to do in mid-2002.  Eventually I came across an ad in the NYT or perhaps the Chronicle saying that a place called ICPSR was looking to hire a new technology director.

Working at an organization which was unlikely to be sold, or moved, or merged, or.... was very attractive at that time. Also, working in a more stable situation was highly desirable after seven years of constant change and turmoil (some good but some not very nice at all).

The job itself looked interesting.  Basically a CIO/CTO type job at a medium-sized not-for-profit.  Technology leader.  Part of the senior management team.  Work closely with the CEO.  And the entire team was in Ann Arbor - no more late night and early morning phone calls on a routine basis with colleagues and employees all over the world!  So I interviewed and got the job, and I have been having a lot of fun at ICPSR ever since.

Monday, July 2, 2012

Leap second - No, sir, don't like it

I remember the big Y2K todo.  Lots of hype, lots of worry, lots of prep, and then nothing happened.

But this Leap Second from last Saturday.  Sheesh.

I saw our production web server's Java-based webapps go into tight little loops that consumed lots of CPU and did very little web application serving.  I rebooted it that Saturday night.

Then on Sunday I saw a report that our (non-production) video streaming server (based on Wowza, which is a Java webapp too) had become unresponsive.  We rebooted that early, early Monday morning.

And then our staging server freaked out too, and we rebooted it on Monday morning.

I don't like leap second.

Amazon Web Services makes Tech@ICPSR weep

June 2012 was looking to be a great, great month for uptime.  We were on track to have our best month of the fiscal year since November 2011 - just 60 minutes of downtime across all services and all applications.  It was going to be beautiful.

And then Amazon Web Services had another power failure.

And then we wept.

The power failure took the TeachingWithData portal out of action.  (To be fair, it was already having significant problems due to its creaky technology platform, but this took it all the way out of action.)  The failure also took our delivery replica out of action, and gave Tech@ICPSR the joy of rebuilding it over the weekend.

But the real trouble was with a company called Janrain.

Janrain sells a service called Engage.  Engage is what allows content providers (like ICPSR) to use identity providers (like Google, Facebook, Yahoo, and many more) so that their clients (like you) do not need to create yet another account and password.  Engage is a hosted solution that we use for our single sign-on service using existing IDs, and it works 99.9% of the time.

However, this hosted solution lives in the cloud.  We just point the name at an IP address we get from Janrain, plug in calls to their API, and then magic happens.

Except when the cloud breaks.

Amazon took Engage off-line for nearly four hours.  And then once it came back up, it was thoroughly confused for another three hours.  Ick.

So, counting all of that time as "downtime", our fabulous June 2012 numbers suddenly became our awful June 2012 numbers.  Here they are:

If you click on the image above, Blogger will make it bigger.

Of course, during a lot of that downtime, all of the features on the web site except for third-party login worked fine.  And most of the problem happened late on a Friday night and Saturday morning during the summer, so that's a good time for something bad to happen, if it has to happen at all.
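For a rough sense of what those minutes do to a monthly availability figure, here is some back-of-the-envelope Python. The 60-minute and roughly seven-hour downtime figures come from the post; everything else is simple arithmetic:

```python
# Back-of-the-envelope availability arithmetic. The downtime figures are
# taken from the post (60 minutes planned, ~7 hours from the Janrain/AWS
# outage) and rounded for illustration.
MINUTES_IN_JUNE = 30 * 24 * 60  # 43,200 minutes

def availability(downtime_minutes, total_minutes=MINUTES_IN_JUNE):
    """Percentage of the month the service was up."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes

good_month = availability(60)          # the month we almost had
bad_month = availability(60 + 7 * 60)  # counting the outage as downtime
print(round(good_month, 3), round(bad_month, 3))  # → 99.861 98.889
```

A single seven-hour incident is enough to drag a month from roughly "three nines" territory down below 99%, which is why one upstream outage can wreck an otherwise beautiful uptime report.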