Wednesday, June 27, 2012

Video at ICPSR - OAIS and Access

We're taking a pretty close look at Kaltura as the access platform for a video collection we are ingesting. Here's why....

If we look at the Open Archival Information System (OAIS) lifecycle, most of the Ingest work is taking place outside of ICPSR. (In fact, other than providing much of the basic IT resources, like disk storage, our role is very small in this part of the lifecycle.) Managing the content and keeping copies in Archival Storage is a good fit for ICPSR's strengths; the content is in MP4 format and has metadata marked up in Media RSS XML, so that's relatively solid.

The big questions for us are all on the Access side of OAIS. Questions like:

  • How many of the 20k videos will be viewed on a routine basis?  Or ever?
  • How many people will want to view videos simultaneously?
  • Will viewers be connected to high-speed networks that can stream even high-def video effortlessly, or will most of the clientele be located on broadband connections?  Is adaptive streaming important?
  • Will support for iOS devices - which do not tend to do well with Flash-based video players - be important?
  • Can people comment on videos?  Share them?  Clip them?  Share the clips?

I have a requirement from one of our partners to build enough capacity to stream a pair of videos - these are classroom observations, and each includes a blackboard video and a classroom video - for up to 1,000 simultaneous viewers.  That's 2,000 simultaneous streams at a bit rate of roughly 800Kb/s, so maybe about 1.6Gb/s of total bandwidth required at peak.

And I have the same requirement from one of our other partners who is serving a separate audience.  So that is a total of 3.2Gb/s.  That is a big pipe by ICPSR standards.  (Our entire building that we share with others has only a single Gb/s connection to the U-M campus network.)
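The arithmetic behind those numbers is simple enough to sketch (the figures are from above; the variable names are my own):

```python
# Back-of-the-envelope peak-bandwidth estimate for the streaming requirement.
streams_per_viewer = 2      # blackboard video + classroom video
viewers = 1000              # simultaneous viewers, per partner
partners = 2                # two partners, each with the same requirement
bitrate_kbps = 800          # rough per-stream bit rate

total_streams = streams_per_viewer * viewers * partners
peak_gbps = total_streams * bitrate_kbps / 1_000_000
print(total_streams, peak_gbps)  # prints: 4000 3.2
```

Halve `partners` and you get the 1.6Gb/s figure for a single partner's audience.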

If we try to build this ourselves, we need a pretty big machine with lots of fast disk (20TB+), lots of memory, and lots of network bandwidth. If we build it too small, the service will be awful; if we build it too big, we will waste a lot of money and time.

So a cloud solution that can scale up and down easily is looking pretty good as an Access platform.

Next post:  Why Kaltura?

Monday, June 25, 2012

Video and ICPSR

I've posted a few times about a large collection of video that ICPSR will be preserving and disseminating as part of a grant from the Bill and Melinda Gates Foundation.  I'll devote some time this week to a couple of detailed posts about what we're doing, but one vendor that I'd like to mention briefly today is Kaltura.

Kaltura is a video content management and delivery service that offers both a hosted and on-premise solution.  The University of Michigan is entering into a relationship with Kaltura, and I'm serving on a committee which is helping shape that relationship.  (More on this later.)

I have early access to Kaltura's hosted solution for video content, and I've used that access to upload a few pieces of public domain content plus some minimal metadata.  I then used Kaltura's tools to assemble combinations of video collections (playlists) and video players, mixing and matching liberally to get a sense of what is possible.

Here's what I have so far: [embedded Kaltura player demo]

More on "video @ ICPSR" later this week.

Friday, June 22, 2012

AWS power outage aftermath

As it turns out, it doesn't take all that long to run fsck on a large filesystem comprised of multiple AWS Elastic Block Storage (EBS) volumes:


[root@cloudora ~]# df -h /dev/md0
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              4.9T  2.9T  1.7T  64% /arcstore
 
[root@cloudora ~]# fsck -y /dev/md0
fsck 1.39 (29-May-2006)
e2fsck 1.39 (29-May-2006)
/dev/md0 is mounted.
WARNING!!!  Running e2fsck on a mounted filesystem may cause
SEVERE filesystem damage.
Do you really want to continue (y/n)? yes
/dev/md0 has gone 454 days without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Error allocating icount link information: Memory allocation failed
e2fsck: aborted

I saw a few posts about coaxing e2fsck to use the filesystem for scratch space rather than memory, but unfortunately the older version of the program available on this EC2 instance does not support that option.
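For reference, newer releases of e2fsprogs let e2fsck spill its in-memory bookkeeping to on-disk scratch files via /etc/e2fsck.conf; a minimal sketch of that configuration (the directory path here is illustrative, not one of our actual paths):

```
[scratch_files]
directory = /var/cache/e2fsck
```

With that stanza in place, e2fsck stores its inode-count structures in files under the named directory instead of allocating them in RAM, trading speed for the ability to check very large filesystems on memory-constrained machines.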

So I think that we may end up blowing away this copy of archival storage and replacing it with a fresh one.

Wednesday, June 20, 2012

Amazon power outage and ICPSR

Amazon suffered a power outage in their northern Virginia data center last week.  Here is my abridged timeline of events from the Amazon Service Health Dashboard:

Jun 14, 8:50 PM PDT We are investigating degraded performance for some volumes in a single AZ in the us-east-1 region.
Jun 14, 10:29 PM PDT We can confirm a portion of a single Availability Zone in the US-EAST-1 Region lost power. We are actively restoring power to the effected EC2 instances and EBS volumes. We are continuing to see increased API errors. Customers might see increased errors trying to launch new instances in the Region.
Jun 15, 12:11 AM PDT As a result of the power outage tonight in the US-EAST-1 region, some EBS volumes may have inconsistent data. As we bring volumes back online, any affected volumes will have their status in the "Status Checks" column in the Volume list in the console listed as "Impaired." You can use the console to re-enable IO by clicking on "Enable Volume IO" in the volume detail section, after which we recommend you verify the consistency of your data by using a tool such as fsck or chkdsk. If your instance is stuck, depending on your operating system, resuming IO may return the instance to service. If not, we recommend rebooting your instance after resuming IO.
Jun 15, 3:26 AM PDT The service is now fully recovered and is operating normally. Customers with impaired volumes may still need to follow the instructions above to recover their individual EC2 and EBS resources. We will be following up here with the root cause of this event.
And, indeed, Amazon did follow up on the root cause of the problem.  Based on the post-mortem that has been reported in several venues, the root cause was a fault in commercial power.  And a generator.  And an electrical panel.  One view is that Amazon got very unlucky with power problems; another view is that they did not test their fail-over thoroughly enough.  I lean more toward the former view.

ICPSR didn't suffer any outages.  For example, our cloud-based replica was available to us the entire time.  We did receive notifications from Amazon that specific EBS volumes (basically a virtual block device that may be attached to a cloud-based machine) may have been corrupted, and should be inspected.  Amazon included the specific volume.  Here's an example notification:
Dear ICPSR Technology ,
Your volume may have experienced data inconsistency issues due to failures during the 6/14/2012 power failure in the US-EAST-1 region. To restore access to your data we have re-enabled IO but we recommend you validate consistency of your data with a tool such as fsck or chkdsk. For more information about impaired volumes see:
http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/monitoring-volume-status.html
Sincerely,
EBS Support
So this did create a bit of unscheduled work for the technology team because we had four affected volumes.

One was not attached to anything, and was not in use.  

One was attached to a machine we had recently retired.

But two were attached to a machine that stores an encrypted copy of our archival holdings.  The volumes are each 1TB and part of a multi-TB virtual RAID.  This makes for a very, very long-running fsck to inspect for problems.

I'll have the conclusion on Friday.

Tuesday, June 19, 2012

Brief hiatus ending

Dear loyal readers, sorry for the unexpected brief hiatus in posting.  The recent 50th anniversary celebration and accompanying ICPSR Council meeting have kept things hopping lately, and blog posting was one of the casualties.

Monday, June 11, 2012

May 2012 deposits at ICPSR

Stats?  Stats.

# of files   # of deposits   File format
       1             1       application/msaccess
     564            22       application/msword
     334             3       application/octet-stream
     136            22       application/pdf
       4             3       application/vnd.ms-excel
      16             1       application/x-arcview
       4             1       application/x-dbase
      31             7       application/x-sas
     281            19       application/x-spss
      17             3       application/x-stata
       5             3       application/x-zip
      18             7       image/jpeg
       2             2       message/rfc8220117bit
       4             3       text/html
       4             4       text/plain; charset=iso-8859-1
     125             3       text/plain; charset=unknown
     332            20       text/plain; charset=us-ascii
       1             1       text/plain; charset=utf-8
       9             3       text/rtf
       1             1       text/x-makefile; charset=us-ascii

Nothing too interesting this month.  We have the usual formats, and in the usual proportions.  We did seem to get an unusually large number of MS Word files last month, and we also have a pretty large set of unidentified files (at least in terms of MIME type).

Friday, June 8, 2012

May 2012 web availability

Web availability was good, but not great, in May 2012:

[availability chart]

Five main episodes of 24 to 59 minutes account for almost all of the 242 minutes of unavailability in May 2012.
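As a quick sanity check on that figure (May has 31 days, so 44,640 minutes):

```python
# Rough availability math for May 2012: 242 minutes of downtime
# out of a 31-day month.
minutes_in_may = 31 * 24 * 60        # 44,640 minutes
downtime_minutes = 242
availability = 1 - downtime_minutes / minutes_in_may
print(f"{availability:.2%}")         # prints: 99.46%
```

Good, but short of the "three nines" (99.9%) that would cap downtime at about 45 minutes for the month.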

On May 13 the production web server seemingly lost power, and it required an additional reboot and some TLC to bring the system back on-line fully (33 minutes).

On May 18 the search index on our CCEERC web portal became corrupted, and that disabled much of the usefulness of the site for nearly an hour (49 minutes).

On May 22 we saw the first of two episodes where the proxy (AJP) between Apache httpd and Apache tomcat faulted.  This did not recover on its own and required some help from the technology team.  This resulted in a medium-duration outage (27 minutes), and another similar fault occurred on the evening of May 31 (24 minutes).

On May 31 our production database server faulted, requiring a manual power-cycle, and also requiring the production web server to be rebooted (59 minutes).

My sense is that while we're in much better shape with regard to the instability caused by khugepaged, we are starting to see something a little amiss with the Apache proxy system.  It isn't clear to us at this time whether the issue is faulty software or a suboptimal configuration on our part.
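For what it's worth, mod_proxy's worker parameters can paper over brief AJP backend faults; a hypothetical httpd snippet (the path and port are illustrative, not our actual configuration):

```
# Illustrative mod_proxy_ajp worker tuning: ping probes the backend
# with a CPING/CPONG before forwarding each request, and retry controls
# how many seconds a failed worker stays disabled before httpd tries it again.
ProxyPass /myapp ajp://localhost:8009/myapp ping=2 retry=30
```

This doesn't fix a wedged Tomcat, but it can keep httpd from queuing requests against a backend that has stopped answering.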

Wednesday, June 6, 2012

ORCID: Open Researcher and Contributor ID

Nature ran a piece recently about ORCID.  ORCID stands for Open Researcher and Contributor ID, and has the goal of making it easier to know exactly who is responsible for a given bit of research.  ICPSR is a member of ORCID.

The article points out that the most prolific researcher (by name) in 2011 was Y Wang, who published nearly four thousand times!  And, of course, this is because although Y Wang is a single name, it belongs to many different people.  So how do you know how many of those publications to credit to any one of the people known as Y Wang?

Having a unique way to identify someone is certainly useful, but it is mainly useful if EVERYONE participates in the scheme.  For example, what if all of the people who already have a ResearcherID (such as myself - E-9184-2010) opt out of getting an ORCID too?  Or will either of these be any better than using an email address to identify someone?

Monday, June 4, 2012

Hiring in IT is really tough these days

A recent short piece in NetworkWorld cites a study by the staffing firm ManpowerGroup which found that engineers and IT staff are among the hardest positions to fill.

This has certainly been our experience at ICPSR over the past six to eight months.  We posted one position in November and another in December, and have yet to fill them.  Both are more senior positions: one a software developer and the other a systems architect.

My experience is that technology positions at ICPSR are some of the hardest to fill.  When ICPSR hires data managers at the entry level, there are usually many good candidates in the pool, including former temps and interns.  And when we fill more senior positions in those ranks, it is usually from within (e.g., a promotion from a mid-level data manager to senior-level, or into a supervisory or management position).  But when the technology shop needs to fill a position, it is nearly impossible to hire from within.  ("Do you know Java?  Or how to manage a firewall?"  "No, but I have a background in Political Science and know how to use SAS.")

It could be worse, though.  Creating and filling director-level positions in our archives is probably the most difficult job.  In this case we're looking for someone who can manage a small team of professionals, manage a relationship with a government agency, and manage a portfolio of grants and contracts.  The ideal candidate is also successful in academia, but not so successful that they have no interest in pausing (or greatly shrinking) their research endeavors.  And it would also be great if they worked at ICPSR only part-time, keeping an appointment at some other university.  And be willing to be in the office in Ann Arbor a few days a week.  And....

Friday, June 1, 2012

ICPSR system outage - 5/31/2012

ICPSR's content delivery systems faulted at approximately 8:30pm EDT on Thursday, May 31, 2012.  The oncall engineer discovered that the production Oracle database server had become unresponsive, and this disabled most features of most of our web portals.

After arriving on-site she rebooted the database server, but by then the production web server had become hopelessly confused.  She then rebooted that system as well, and all systems were back in service a bit before 9:30pm EDT.

The ICPSR technology team is reviewing system logs and access records to see if any further corrective action is required.

Our apologies for the inconvenience this no doubt caused to many of you.

Shorter URLs

I've been using TinyURL (www.tinyurl.com) for many years to generate short versions of longer URLs.  One of the things I've liked most about TinyURL is that I can specify part of the URL, which allows me to use something both short and meaningful.

But lately I've switched to Google's URL shortener, http://goo.gl/, instead.  It does not let me customize the URL at all, which I miss a little bit, but since it is tied into my Google ID, it gives me nice features like analytics (e.g., link use counts, demographics, etc.), a "home page" listing all of the shortened URLs I have created recently, and very short URLs.

But one word of caution:  the links and analytics are world-readable, so do not use the Google URL trimmer if you have anything you want to keep semi-private.