Technology at ICPSR

Wednesday, July 25, 2012

My Amazon Web Services wishlist for the U-M

I'm serving on a University of Michigan task force that is looking at ways in which we can make cloud computing easier for faculty, students, and staff to consume. This presupposes, of course, that at least some of the university community have research or business or other needs that would be well served by a cloud-type solution.

For those of us who are already using the cloud to solve a few different problems -- off-site archival copies and disaster recovery for delivery systems, among others -- the problem isn't so much how to get us to use cloud computing, but how the U-M can help us get the most value for our dollar.

With this in mind I offer my Amazon Web Services (AWS) wishlist for the U-M:

Build Amazon Machine Images (AMI) for 64-bit Red Hat Linux (and, optionally, 64-bit Windows Server). Put any security or system or software goodies into the image that would be available to the entire university community (IT directors, grad students, casual users). This saves us from needing to build and maintain our own AMI or, worse, using one from a third party.
Deploy an AWS Virtual Private Cloud (VPC) that connects our own little piece of AWS “cloud space” to the rest of campus over a secure link. Allow instances running within this VPC to access infrastructure such as Active Directory. Treat this part of AWS as if it were just another network (or data center) on campus. This enables us to deploy services dependent upon campus infrastructure in AWS more easily.
Deploy an AWS Direct Connect between the VPC and UMnet (or Merit [the State of Michigan research and education network] or Abilene [Internet2's national network]). This grants us a fast, secure, inexpensive pipe for moving content between campus and AWS. We could start to deploy I/O-intensive resources in AWS more readily if we don’t have to pay for the bits individually.
Implement an agreement where AWS has one customer (the University of Michigan) rather than many. (ICPSR alone has four different identities within AWS, largely so that we can map expenses from one identity to a university account.) This one customer would have different sub-accounts, and the usage across ALL of the sub-accounts would roll up to set pricing. ICPSR stores over 1TB of content in AWS S3, for example, and so our rate is $0.11/GB/month. Other users at U-M who store less than 1TB of content in AWS S3 are paying over $0.12/GB/month. The difference is only a little more than $0.01/GB, but it adds up across ALL accounts each month.
Explore the feasibility of allowing one to use U-M credentials (via Shibboleth?) to access key web applications at AWS, such as the AWS Management Console. We currently have to provision a separate email address and local (to AWS) password.
Explore the feasibility of using an AWS Storage Gateway to meet bursty or short-lived storage needs. It would be fabulous if we could buy nearly unlimited space in the U-M storage cloud. This is more feasible if we can use AWS storage for short-lived "bursts" of temporary storage.
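
To make the roll-up pricing item concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration only: the two rates are the rounded figures quoted above rather than the real AWS price schedule, I am assuming that only the storage beyond the first 1TB gets the lower rate, and the unit names and sizes are made up.

```python
from decimal import Decimal

# Assumptions for illustration only; not the actual AWS price schedule.
TIER_LIMIT_GB = 1024               # assumed size of the first pricing tier (1TB)
RATE_FIRST_TIER = Decimal("0.12")  # $/GB/month, rounded figure from the post
RATE_NEXT_TIER = Decimal("0.11")   # $/GB/month, rounded figure from the post


def monthly_cost(gb):
    """Monthly storage cost for one billing identity holding `gb` gigabytes."""
    first = min(gb, TIER_LIMIT_GB)
    rest = max(gb - TIER_LIMIT_GB, 0)
    return first * RATE_FIRST_TIER + rest * RATE_NEXT_TIER


def separate_vs_rolled_up(usage_gb):
    """Compare billing every account alone with billing the combined usage."""
    separate = sum(monthly_cost(gb) for gb in usage_gb.values())
    rolled_up = monthly_cost(sum(usage_gb.values()))
    return separate, rolled_up


# Hypothetical units and storage sizes in GB.
usage = {"unit_a": 2048, "unit_b": 500, "unit_c": 300}
separate, rolled_up = separate_vs_rolled_up(usage)
print("separate accounts:", separate)
print("one customer:     ", rolled_up)
print("monthly savings:  ", separate - rolled_up)
```

Under these assumptions the savings come entirely from the smaller accounts, whose storage would be billed at the lower rate once the first tier has been used up by the university as a whole.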

Monday, July 23, 2012

ICPSR web site maintenance

We ran into a few problems last Wednesday during our system update. We rolled back the changes, and are giving it another go this evening at 10pm EDT. We're going to move traffic to our replica in the cloud during the maintenance so that we have more time for troubleshooting.

Our cloud replica has many of the features of our main site (search, download, analyze), but does not include features that transfer materials to ICPSR, such as our online Deposit Form.

Friday, July 20, 2012

Amazon's loss is SDSC's gain

One of the recent Amazon Web Services (AWS) power outages has left some of my EBS volumes in an inconsistent state. If these were simple volumes, each containing a filesystem, then the fix would be easy: just unmount the filesystem, run fsck to check and repair it, and then remount the filesystem. We have done this on several of our EC2 instances that had inconsistent volumes.

Unfortunately, for these particular volumes we have bonded them together to form a virtual RAID. And this RAID is used as a single multi-TB filesystem which is much bigger than fsck can handle. So we are kind of stuck.

One option would be to newfs the big filesystem, and to move the several TBs of content back into AWS, but that would be very slow. And if there is another power outage......

So instead we called up our pals at Duracloud and asked them if they could help us enable replication of our content to a second provider. (The first provider is - ironically - AWS. But their S3 service, not their EC2/EBS service.) They said they'd be happy to help, and, in fact, they will start replicating our content later this same week. (Now that's service!)

The new copy of our content will now be replicated in...... SDSC's storage cloud. This really brings us full circle at ICPSR since our very first off-site archival copy was stored at SDSC. Back then (like in 2008) it was stored in their Storage Resource Broker (SRB) system, and we used a set of command-line utilities to sync content between ICPSR and SDSC.

The SRB stuff was kind of clunky for us, especially given our large number of files, our sometimes large files (>2GB), and our sometimes poorly named files (e.g., control characters in file names). Our content then moved into Chronopolis from SRB, and then at the end of the demonstration project, we asked SDSC to dispose of the copy they had. But now it is coming back......

Wednesday, July 18, 2012

ICPSR web maintenance

We're updating a few pieces of core technology on our web server this afternoon: httpd, mod_perl, Perl, and a few others. Normally we like to perform maintenance like this during off-hours, but we're doing it at 12:30pm EDT today so that we have "all hands on-deck" to troubleshoot and solve problems.

We've already performed this maintenance on our staging server, and that went smoothly. Our expectation is that this maintenance will last 15-30 minutes.

Monday, July 16, 2012

ICPSR 2012 technology recap - where did the money come from?

We're putting together some summary numbers for technology spending and investments at ICPSR for FY 2012. (The ICPSR fiscal year is the same as the University of Michigan's, and runs from July 1 to June 30. We've just recently closed FY 2012.)

The first set of numbers shows the allocation of effort in FY 2012 by funding source. The unit of measurement in this pie chart is HOURS (not DOLLARS) that were expended in FY 2012 by each funding source. (We originally wanted to calculate dollars, but that turns out to be an even bigger effort.) Here's an interactive chart:

This is an interactive Google Docs chart. If you click slices of the pie, it will identify the funding source.

The main source of funding for technology effort is the Computer Recharge, an hourly "tax" that ICPSR levies against all projects. Although it is one single funding source (nearly 45% of hours worked in FY 2012 were billed against this source), I have split it into two sub-categories, one for what I am calling "IT" and one for "SW" (software).

The "SW" portion includes the effort of all staff who are professional software developers. The type of work performed by this team using this account includes enhancements and maintenance for ICPSR's core data curation and data management systems, and investments in new products and services such as software developed to support our IDARS system for applying for access to datasets.

The "IT" portion includes the effort of the remainder of the staff which tends to include systems administrators, architects, network managers, and desktop support specialists. I also allocate my own time to this bucket since the majority of my non-contract, non-grant effort over the past year has been in building and architecting technology systems.

Other big slices of the "IT pie" include the work of staff members who are explicitly funded by projects such as our CCEERC and RCMD web portals; our two Bill and Melinda Gates Foundation grants; the ICPSR Summer Program, and many more. In fact, there are over 20 separate funding sources used to support technology at ICPSR; the pie chart shows 18 because I grouped several small ones into a category called "Misc."

If this gives the impression that there are many, many projects and activities at ICPSR that involve technology, that's good! That is certainly the case.

However, if "focus wins" then we're in a little bit of trouble. My sense is that each of these 20-some funding sources has at least one unique project with its own business analysis and project management needs, and it is sometimes the case that different projects have antithetical technology needs. I see this play out in all phases of the OAIS lifecycle. ("I want you to build a system that makes it as easy as possible to fetch datasets from ICPSR" v. "I want you to build a system that requires significant effort and oversight to fetch datasets from ICPSR.")

Friday, July 13, 2012

Surviving the move to Google

The University of Michigan is moving its business productivity systems (mail, calendar, among others) to Google this year. Some university institutes and colleges have already made the move, and others will make the move later this year. ICPSR and its parent, the Institute for Social Research, will move in August, although about 70 staff at ISR will make the move early. These "Google Guides" will help others with the transition.

One concept that might be helpful for the move is the distinction between an email address and an email mailbox.

An email address is basically a pointer. It can point to another email address, or it can point to an email mailbox. People publish and share their email address with others, and this is the piece of information we use to target a piece of email.

An email mailbox is a place where email lands. You log in to an email system (Gmail, Exchange, AOL, and many more) with an ID and password, and once there, you can search for messages, read messages, sort, filter, send, and the rest.

The big change at the U-M this year with email is with everyone's email mailbox. It's changing from some legacy system at the U-M to Gmail.

Below is a typical example of how things work today at ICPSR. The @icpsr.umich.edu email address is just a pointer to the U-M enterprise directory entry. That entry - which ends with @umich.edu - is also just a pointer to another email address ending in @isr.umich.edu. And that email address is the thing that actually points to the email mailbox, which in our case is the ISR Exchange server. The first diagram below shows the relationship between email addresses and email mailboxes in the current system for my own email:

Contrast that with the image just above that shows the relationships after the move to Google. There are two main changes.

The first is that the email mailbox now lives in Google Gmail rather than Exchange, and my email software is a web browser rather than Outlook. This is a very big change. Some will find it a pleasant change, and others will hate the new system.

The second is that the roles of the @isr.umich.edu and @umich.edu email addresses have reversed. The @isr.umich.edu email address is now just a pointer to another email address, and the @umich.edu email address is the one that "points" directly to Gmail. And, of course, just like before, one can publish or use any of the three addresses, and the mail goes to the same email mailbox.
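
For those who think in code, the pointer idea can be sketched in a few lines of Python. This is only an illustration, not how the U-M directory actually works: the address "someone" is made up, the function and table names are hypothetical, and I am assuming that the @icpsr.umich.edu address keeps pointing at the @umich.edu address after the move.

```python
def resolve_mailbox(address, pointers, mailboxes):
    """Follow address-to-address pointers until one lands in a mailbox."""
    seen = set()
    while address not in mailboxes:
        if address in seen or address not in pointers:
            raise LookupError(f"{address} does not lead to a mailbox")
        seen.add(address)
        address = pointers[address]
    return mailboxes[address]


# Before the move: the chain ends at the ISR Exchange server.
BEFORE_POINTERS = {
    "someone@icpsr.umich.edu": "someone@umich.edu",
    "someone@umich.edu": "someone@isr.umich.edu",
}
BEFORE_MAILBOXES = {"someone@isr.umich.edu": "ISR Exchange"}

# After the move: the @isr and @umich roles are reversed (assumed layout).
AFTER_POINTERS = {
    "someone@icpsr.umich.edu": "someone@umich.edu",
    "someone@isr.umich.edu": "someone@umich.edu",
}
AFTER_MAILBOXES = {"someone@umich.edu": "Gmail"}

ADDRESSES = [
    "someone@icpsr.umich.edu",
    "someone@umich.edu",
    "someone@isr.umich.edu",
]

for addr in ADDRESSES:
    before = resolve_mailbox(addr, BEFORE_POINTERS, BEFORE_MAILBOXES)
    after = resolve_mailbox(addr, AFTER_POINTERS, AFTER_MAILBOXES)
    print(addr, "| before:", before, "| after:", after)
```

Whichever of the three addresses is published, the lookup ends in a single mailbox; only the place where the chain ends changes.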

Wednesday, July 11, 2012

Tagging EC2 instances and EBS volumes

Adding this to my Amazon (Web Services) wishlist....

Optional Billing tag which can be set when an EBS volume is created or when an EC2 instance is launched

There is a lot of convenience in having a single AWS account. It makes it easier to find running instances in the AWS Console. It eliminates the need to share AMIs across accounts. It obviates the need to remember (and record) multiple logins and passwords.

However, there is one big win in having multiple AWS accounts: It makes it easier to tie the charges for one set of cloud technology (one account) to a revenue source. And so we often have four or five different AWS accounts for the four or five different projects we have underway.

It would give me the best of both worlds if I could have my single AWS account, but then specify a special-purpose tag (say, Billing) when I provision a piece of cloud infrastructure. This would be an optional tag that I could set when I launch an instance or create a volume. This tag would control the format and grouping of charges on my monthly AWS invoice for that account.

For example, say I launch a small instance and set the value of the Billing tag to U12345 (a made-up University of Michigan account number). And then I launch a second one with a Billing tag of F56789. And then in addition to the usual AWS invoice with a line item like this:

AWS Service Charges

Amazon Elastic Compute Cloud
US East (Northern Virginia) Region
Amazon EC2 running Linux/UNIX
$0.080 per Small Instance (m1.small) instance-hour 1440 hours $115.20

I would see an additional section:

AWS Service Charges by tag

U12345

Amazon Elastic Compute Cloud
US East (Northern Virginia) Region
Amazon EC2 running Linux/UNIX
$0.080 per Small Instance (m1.small) instance-hour 720 hours $57.60

F56789

Amazon Elastic Compute Cloud
US East (Northern Virginia) Region
Amazon EC2 running Linux/UNIX
$0.080 per Small Instance (m1.small) instance-hour 720 hours $57.60

This would make it easy for me to take my single invoice from Amazon and "allocate" it (a term from Concur, the system we use for managing this sort of thing) to the right internal account.
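
Until something like this exists, the same grouping can be approximated by hand. The sketch below is a hypothetical illustration in Python, not an AWS API: the usage records are typed in manually, the function name is made up, and the rate and account numbers are the ones from the example invoice above.

```python
from collections import defaultdict
from decimal import Decimal

# $ per m1.small instance-hour, taken from the example invoice above.
RATE_PER_HOUR = Decimal("0.080")


def charges_by_tag(usage_records):
    """Sum instance-hour charges per Billing tag.

    `usage_records` is a list of (billing_tag, hours) pairs; records
    without a tag (None) are grouped under "untagged".
    """
    totals = defaultdict(Decimal)
    for tag, hours in usage_records:
        totals[tag or "untagged"] += hours * RATE_PER_HOUR
    return dict(totals)


# Two small instances, each running 720 hours, with made-up account numbers.
records = [("U12345", 720), ("F56789", 720)]
totals = charges_by_tag(records)

for tag, amount in totals.items():
    print(f"{tag}: ${amount:.2f}")
print(f"Total: ${sum(totals.values()):.2f}")
```

The per-tag subtotals add up to the single line item on the regular invoice, which is exactly what I would need in order to allocate the charges.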