Friday, November 16, 2012

Web availability at ICPSR - October 2012

October was a very good month for system uptime - over 99.9% availability:

That's good news after a much rougher September.  So far things look good this month, although a number of very short-lived outages have already pushed us below 99.9% for the month.

Wednesday, November 14, 2012

A commentary on MOOCs from Clay Shirky

Some of my colleagues - past and present - are attending classes in Massive Open Online Courses (MOOCs).  I've been following their stories and also columnists who have been talking about MOOCs and education.  It is a very interesting time.

Clay Shirky has a long post (Napster, Udacity, and the Academy) about MOOCs that is well worth reading.  Some highlights:
The recording industry concluded this new audio format would be no threat, because quality mattered most. Who would listen to an MP3 when they could buy a better-sounding CD at the record store? Then Napster launched, and quickly became the fastest-growing piece of software in history. The industry sued Napster and won, and it collapsed even more suddenly than it had arisen.

If Napster had only been about free access, control of legal distribution of music would then have returned to the record labels. That’s not what happened. Instead, Pandora happened. Last.fm happened. Spotify happened. iTunes happened. Amazon began selling songs in the hated MP3 format.
and
It’s been interesting watching this unfold in music, books, newspapers, TV, but nothing has ever been as interesting to me as watching it happen in my own backyard. Higher education is now being disrupted; our MP3 is the massive open online course (or MOOC), and our Napster is Udacity, the education startup.

We have several advantages over the recording industry, of course. We are decentralized and mostly non-profit. We employ lots of smart people. We have previous examples to learn from, and our core competence is learning from the past. And armed with these advantages, we’re probably going to screw this up as badly as the music people did.

Monday, November 12, 2012

Nick Carr, MOOCs, and ethics

In his post The ethics of MOOC research, Nick Carr describes a note he received from a colleague in academia who comments on the research agenda of Massive Open Online Courses:
The MOOCs’ research agenda seems entirely wholesome. But it does raise some tricky ethical issues, as a correspondent from academia pointed out to me after my article appeared. “At most institutions,” he wrote, the kind of behavioral research the MOOCs are doing “would qualify as research on human subjects, and it would have to be approved and monitored by an institutional review board, yet I have heard nothing about that being the case with this new adventure in technology.” Universities are, for good reason, very careful about regulating, approving, and monitoring biological and behavioral research involving human subjects. In addition to the general ethical issues raised by such studies, there are strict federal regulations governing them. I am no expert on this subject, but my quick reading of some of the federal regulations suggests that certain kinds of purely pedagogical research are exempt from the government rules, and it may well be that the bulk of the MOOC research falls into that category.
Given the intense energy ICPSR has been putting into its systems for protecting confidential research data and facilitating requests for using such data, I found this very interesting.

I see parallels here with collecting and using personal information.  If one conducts a survey and asks personal questions of well-consented adults, the results might one day become an interesting, restricted-use dataset.  But if the same information is harvested from freely and openly available blogs, tweets, and wall posts, would it too become restricted-use data?

Monday, October 15, 2012

Artificial Intelligence as defined by Nick Carr

Nick Carr has a short post here marking the occasion of Facebook's one billionth member.  He goes on to talk a bit about some work at Google on neural nets, but then includes this gem on artificial intelligence:
Forget the Turing Test. We’ll know that computers are really smart when computers start getting bored. If you assign a computer a profoundly tedious task like spotting potential house numbers in video images, and then you come back a couple of hours later and find that the computer is checking its Facebook feed or surfing porn, then you’ll know that artificial intelligence has truly arrived.
It's a short post and a good read.

Wednesday, October 10, 2012

September 2012 deposits at ICPSR

The numbers from September are in:


# of files  # of deposits  File format
29          1              F 0x07 video/h264
145         17             application/msword
5           1              application/octet-stream
292         11             application/pdf
14          5              application/vnd.ms-excel
1           1              application/vnd.ms-powerpoint
120         1              application/x-arcview
31          1              application/x-dbase
1           1              application/x-rar
24          6              application/x-sas
14          6              application/x-spss
13          5              application/x-stata
5           5              application/x-zip
1           1              image/jpeg
32          1              image/x-3ds
70          2              multipart/appledouble
10          3              text/plain; charset=unknown
62          18             text/plain; charset=us-ascii
2           1              text/rtf
29          2              text/xml

Interesting month in that we have the usual stuff in the usual quantities, but we also have a large number of unusual formats hitting the doorstep, such as ArcView and Apple Double, as well as a usual format in an unusually high quantity (MS Word).

Friday, October 5, 2012

September 2012 web availability

September was an OK, but not great month for web availability:


We eliminated one frequent, but short-lived source of downtime when we stopped exporting the content of our Oracle database nightly.  We are now doing it only on the weekend, and while that adds some risk, we're gaining significant uptime.  (For some reason that we do not understand, our Oracle instance stops answering queries for 15-20 minutes about ten minutes AFTER the export completes.)  We have a new server racked and ready to install, and we're hoping that a fast new machine with solid-state drives will solve the problem for us.

We did run into some trouble mid-month when some routine maintenance went awry, and we had to fail over to our replica over the weekend of September 15 and 16.  The total downtime was about 90 minutes over the course of the weekend, but the replica kept the problem from clobbering our service completely.

After that it was pretty smooth sailing: just 16 minutes of downtime for the rest of the month.

Wednesday, October 3, 2012

Setting up Kaltura - part VI

We'll focus on the Kaltura Drop Folder feature today.  The Drop Folder offers a mechanism whereby an enterprise can bulk upload content without human intervention.  In principle this is an excellent way for a library or archive to ingest many objects into Kaltura without some poor archivist performing individual (or group) uploads via a web GUI.  In practice the mechanism works smoothly when things are going well, but it can be a little difficult to diagnose problems when things go awry.

For example, here's a sample display from the Drop Folders panel from our Kaltura Management Console (KMC), which serves as an all-in-one dashboard for managing content:


According to this display we have just ingested three items:  an XML file and two video files.  In this particular case the XML file contained all of the metadata for the two video files, and contained instructions that told Kaltura that these were new items to ingest.  The Status field shows a value of Done, and the Error Description field is empty.  This seems good.

We can also see status information if we navigate to the Upload Control panel and select the Bulk Upload view.  Here we see similar info:

Again, this seems like good news.  The Notification column shows a value of Finished successfully.  Hooray!

But not so fast, my friend....

If we examine one of the video files under the Content panel (Entries tab), we see that none of the extended metadata is present.  We can see the Custom Data fields, but they are all empty.  Hmm, what happened?

If we navigate back to the Upload Control display, the last column offers some possible help:


There is an Action available to download a log file.  That sounds promising.  Let's do that.

The log file is in XML format, and if we open it up in a good browser or text editor or XML editor, we find XML that looks very much like the ingest XML we used in the Drop Folder.  And if we scroll all the way down to the bottom, we find this snippet:

<item><result><errorDescription>customDataItems failed: invalid metadata data: Element 'METXVideoSubmissionElectronicBoardUsed': [facet 'enumeration'] The value ' ' is not an element of the set {'Y', 'N'}. at line 87 Element 'METXVideoSubmissionElectronicBoardUsed': ' ' is not a valid value of the local atomic type. at line 87 </errorDescription></result>
This is telling us that we messed up one of the metadata fields.  If we look at the original ingest XML and find the statement that is supposed to be setting METXVideoSubmissionElectronicBoardUsed, sure enough, there is no value.  (The error occurs on line 296, not 87, which is a bit confusing.)

So the good news is that if we notice the error, we can find a log that will point us at the error.  But detecting the error is a little tricky, and it is easy to see how this would be difficult if we were ingesting, say, 100 items at a time.  So this is not awful, but is also not quite as nice as we might like.
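
In the meantime, one way to reduce the odds of hitting this at all is to validate the Custom Data fragments against the XSD we exported from the KMC before anything lands in the Drop Folder.  Here is a minimal sketch of that pre-flight check, assuming the lxml library and local file names of our own choosing (met_extension.xsd for the exported schema, ingest.xml for the Drop Folder XML):

 # Pre-flight check (our own idea, not a Kaltura feature): pull each Custom
 # Data <metadata> fragment out of the ingest XML and validate it against
 # the XSD exported from the KMC, before the files reach the Drop Folder.
 from lxml import etree

 schema = etree.XMLSchema(etree.parse("met_extension.xsd"))  # exported from the KMC
 ingest = etree.parse("ingest.xml")                          # the Drop Folder XML

 for i, metadata in enumerate(ingest.getroot().iter("metadata"), start=1):
     if not schema.validate(metadata):
         for error in schema.error_log:
             print("item %d: line %d: %s" % (i, error.line, error.message))

A nice side effect is that the line numbers reported here refer to the ingest XML we actually wrote, which sidesteps the confusing log-file line numbers mentioned above.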

Suggestions:
  1. If the XML contains Custom Data, and the Custom Data has errors, but the video still ingests, show a Status of something like "Done with errors" (in the Drop Folders display) or "Finished with Custom Data errors" (in the Bulk Upload Log display).
  2. Make the diagnostic message (errorDescription) available without needing to download a file.  This could appear in a new column, or perhaps in a text pop-up.
  3. If N - 1 elements of the Custom Data are good, but one is bad, it would be nice if the other N - 1 Custom Data fields are set.  That would make it possible to correct the error manually in the KMC rather than copying fresh XML into the Drop Folder.
  4. Suppress the line numbers since they are relative to the log file XML, not the original XML.
Again, overall the Drop Folder feature is very nice, and we will indeed use it to ingest the 20-30,000 video files in our collection.  But since it is likely that we will sometimes make a mistake within the XML (say, forgetting to escape a certain character), it would be great if the KMC made it easier to detect and diagnose those mistakes.

Monday, October 1, 2012

ICPSR Director of Curation Services

We have interviewed all of the candidates.
See http://dilbert.com/strips/comic/2011-10-30 for the full cartoon
I had the opportunity to meet with several of the candidates, and we have some excellent ones.  With a little bit of luck we should be able to announce who will be filling the position sometime soon.

And then s/he can explain exactly what Curation Services are. :-)

Friday, September 28, 2012

Be careful when answering, "Both"

When I am helping someone work through the requirements of a new system or project, I will often ask a series of questions that help shape my understanding of what the person wants.  Often these fall into a pattern where I ask a series of either/or questions, like this:

Do you want it optimized for security or ease of use?

Does this fit into a wholesale or retail delivery paradigm?

Will this be used by external customers or internal staff?

Is this intended to drive new revenues or decrease current costs?

In many ways this is like going to the ophthalmologist who has you look through lens A and then lens B, and then asks the question, "Which was better, A or B?"  Both of us are trying to bring the problem into focus.

The single most dreaded (by me) answer to these questions is: Both.

In some cases this answer really means, "I am not sure what I want."  Or, "I'm too busy to think about this."  Or, "I don't care."

This, obviously, does not help when gathering requirements.  And so it is a real barrier to scoping the project. Sometimes, of course, an answer of Both is a fine start to a longer answer.
We really do need both in this case.  We want to build a system for managing metadata that can be used by both the staff and external people equally easily.  We are changing our entire workflow so that either population can manage our metadata, and this is our new business practice.
That is a fine use of Both.  In fact, if the person gave a different answer, we might needlessly limit the usefulness of the system we build.

Another fine answer, just like the one we sometimes give the ophthalmologist, is I don't know.  There's nothing wrong with that answer.  However, just as with the ophthalmologist, when I hear this answer, I reach into my bag of lenses and try another pair to bring the issue into focus.

Wednesday, September 26, 2012

Kaltura pilot at the U-M and ICPSR

The University of Michigan in-house newspaper/newsletter ran a nice piece on the Video Content Management pilot (using Kaltura), which includes our Bill and Melinda Gates Foundation project:  Measures of Effective Teaching - Extension.

Here is the part about us:
• ISR and the School of Education are engaged in a collaborative research project in which a large collection of video assets is a primary data-type. This data will become a shared repository available to research partners at universities, public agencies, and private foundations.
The timing here was perfect for us since we were in the market for a system to manage and stream about 20TB of video to thousands of simultaneous users.

Monday, September 24, 2012

DuraSpace announces SDSC as storage partner

DuraSpace recently announced its relationship with SDSC as a storage provider for DuraCloud.  As I posted a while ago, we have been using both DuraCloud and their SDSC storage partner for a while.  It's great to see DuraCloud continue to grow.

Friday, September 21, 2012

Setting up Kaltura - part V

In this post we will look at the XML we use to ingest content into Kaltura through its Drop Folder mechanism.  To re-cap an earlier post about the Drop Folder, ours is a subdirectory under the home directory of the 'kaltura' user.  We provisioned this account on a special-purpose machine that Kaltura accesses via sftp to fetch content without human intervention.

Kaltura has a pretty nice guide to building the XML, which looks an awful lot like Media RSS.  And they also make some short examples available, but we always find it useful to have a real-world example.  Here's ours.

Preface: Our stuff is a little unusual.  That is:
  1. We always have pairs of videos, one classroom and one blackboard
  2. We have lots of metadata and it applies to both videos, and so the metadata gets repeated in the XML.  I have excised much of the metadata in our XML for this post
  3. All of the metadata is fake; it is not real metadata about an actual classroom video
  4. You can find a copy of the XML that we diagram below at this URL http://goo.gl/OxbyU
  5. Our use of Kaltura is in support of the Measures of Effective Teaching (Extension) project, and so there are many references to 'metext' in the metadata
  6. We will be generating the XML for ingest programmatically (a rough sketch appears at the end of this post)
Here goes.

 <?xml version="1.0"?>  
 <mrss xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="ingestion.xsd">  
     <channel>  
         <item>  
             <!-- These do not change -->  
             <action>add</action>  
             <type>1</type>  
             <!-- This changes for each video -->  
             <referenceId>12345-Board-video-rendition-MP4-H264v-Standard.mp4</referenceId>  
             <!-- This does not change -->  
             <userId>metext</userId>  
             <!-- This changes for each video -->  
             <name>RCN 12345 board video</name>  
             <!-- This changes for each video -->  
             <description>RCN 12345 board video</description>  

We have the usual XML stuff at the beginning, and then the start of the Media RSS.

Action = add for new content.

Type = 1 for video content.

ReferenceID is the name of the original file.

UserID is the pseudo-user who will be linked to the content in Kaltura.

Name and Description are exposed as base metadata in Kaltura.

             <!-- Always assign two tags, one called metext and the other board or classroom -->  
             <tags>  
                 <tag>metext</tag>  
                 <tag>board</tag>  
             </tags>  
             <!-- This does not change -->  
             <categories>  
                 <category>metext</category>  
             </categories>  
             <!-- This does not change -->  
             <media>  
                 <mediaType>1</mediaType>  
             </media>  

We tag everything with the project name and indicate whether it is a video of the blackboard or the classroom.

Kaltura uses Categories as a main way to browse and find content.  We treat this as if it were a type of "is in collection" sort of attribute.


MediaType = 1 for video.

             <!-- This changes for each video -->  
             <contentAssets>  
                 <content>  
                     <dropFolderFileContentResource filePath="12345-Board-video-rendition-MP4-H264v-Standard.mov"/>  
                 </content>  
             </contentAssets>  

This tells Kaltura that the video file is in the Drop Folder along with the XML.

Now for our project-specific metadata, which fits into a Kaltura structure called Custom Data:

             <!-- This changes for each video -->  
             <customDataItems>  
                 <customData metadataProfileId="22971">  
                     <xmlData>  
                         <metadata>  
                             <METXDistrictDistrictName>Ann Arbor</METXDistrictDistrictName>  
                             <METXDistrictDistrictNum>20</METXDistrictDistrictNum>  
                             <METXSchoolSchoolName>Huron High School</METXSchoolSchoolName>  
                             <METXSchoolSchoolMETXID>33</METXSchoolSchoolMETXID>  
                             <!-- Kaltura wants the date to be in xs:long format, that is, seconds from the epoch -->  
                             <METXVideoSubmissionCaptureDate>1344440267</METXVideoSubmissionCaptureDate>  
                         </metadata>  
                     </xmlData>  
                 </customData>  
             </customDataItems>  

The ID attribute is from our Kaltura KMC.  I had to create the Custom Data schema first, and then reference it in the ingest XML here.

Most of the metadata fields are simple strings or strings from a controlled vocabulary.  We do have one date item, and sadly Kaltura expects it to be in a difficult-to-use format, seconds since the epoch.
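
For what it is worth, producing that value is easy enough.  Here is a quick sketch of how we might turn a capture date into the epoch seconds Kaltura wants; the date below is just the example used in this post, and we assume the capture time is UTC:

 # Convert a capture date to seconds since the epoch for
 # METXVideoSubmissionCaptureDate.  We assume the capture time is UTC.
 import calendar
 from datetime import datetime

 capture_date = datetime(2012, 8, 8, 15, 37, 47)             # example date
 epoch_seconds = calendar.timegm(capture_date.timetuple())   # 1344440267, as in the sample XML
 print(epoch_seconds)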

After this section of the XML is a closing tag for item, and then the whole thing repeats with only minor variation for the classroom video.  I'll include it below for completeness.


         </item>  
         <item>  
             <!-- These do not change -->  
             <action>add</action>  
             <type>1</type>  
             <!-- This changes for each video -->  
             <referenceId>12345-Classroom-video-rendition-MP4-H264v-Standard.mp4</referenceId>  
             <!-- This does not change -->  
             <userId>metext</userId>  
             <!-- This changes for each video -->  
             <name>RCN 12345 classroom video</name>  
             <!-- This changes for each video -->  
             <description>RCN 12345 classroom video</description>  
             <!-- This changes for each video -->  
             <tags>  
                 <tag>metext</tag>  
                 <tag>classroom</tag>  
             </tags>  
             <!-- This does not change -->  
             <categories>  
                 <category>metext</category>  
             </categories>   
             <!-- This does not change -->  
             <media>  
                 <mediaType>1</mediaType>  
             </media>  
             <!-- This changes for each video -->  
             <contentAssets>  
                 <content>  
                     <dropFolderFileContentResource filePath="12345-Classroom-video-rendition-MP4-H264v-Standard.mov"/>  
                 </content>  
             </contentAssets>  
             <!-- This changes for each video -->  
             <customDataItems>  
                 <customData metadataProfileId="22971">  
                     <xmlData>  
                         <metadata>  
                             <METXDistrictDistrictName>Ann Arbor</METXDistrictDistrictName>  
                             <METXDistrictDistrictNum>20</METXDistrictDistrictNum>  
                             <METXSchoolSchoolName>Huron High School</METXSchoolSchoolName>  
                             <METXSchoolSchoolMETXID>33</METXSchoolSchoolMETXID>  
                             <!-- Kaltura wants the date to be in xs:long format, that is, seconds from the epoch -->  
                             <METXVideoSubmissionCaptureDate>1344440267</METXVideoSubmissionCaptureDate>  
                         </metadata>  
                     </xmlData>  
                 </customData>  
             </customDataItems>  
         </item>  
     </channel>  
 </mrss>  

And that's it.
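
And since, as noted above, we will be generating the ingest XML programmatically, here is a rough sketch of the sort of generator we have in mind.  It is illustrative only: the helper name is ours, only one Custom Data field is shown, and the real job emits all of the metadata.

 # Rough sketch of generating the ingest XML with ElementTree.  Only one
 # Custom Data field is shown; the real generator emits all of them, and it
 # also sets the schema-location attributes on the <mrss> root (omitted here).
 import xml.etree.ElementTree as ET

 def build_item(rcn, kind, filename, profile_id="22971"):
     item = ET.Element("item")
     ET.SubElement(item, "action").text = "add"
     ET.SubElement(item, "type").text = "1"
     # referenceId uses the .mp4 name; filePath names the .mov in the Drop Folder
     ET.SubElement(item, "referenceId").text = filename.replace(".mov", ".mp4")
     ET.SubElement(item, "userId").text = "metext"
     ET.SubElement(item, "name").text = "RCN %s %s video" % (rcn, kind)
     ET.SubElement(item, "description").text = "RCN %s %s video" % (rcn, kind)
     tags = ET.SubElement(item, "tags")
     ET.SubElement(tags, "tag").text = "metext"
     ET.SubElement(tags, "tag").text = kind
     categories = ET.SubElement(item, "categories")
     ET.SubElement(categories, "category").text = "metext"
     media = ET.SubElement(item, "media")
     ET.SubElement(media, "mediaType").text = "1"
     content = ET.SubElement(ET.SubElement(item, "contentAssets"), "content")
     ET.SubElement(content, "dropFolderFileContentResource", filePath=filename)
     custom = ET.SubElement(ET.SubElement(item, "customDataItems"),
                            "customData", metadataProfileId=profile_id)
     metadata = ET.SubElement(ET.SubElement(custom, "xmlData"), "metadata")
     ET.SubElement(metadata, "METXDistrictDistrictName").text = "Ann Arbor"
     return item

 mrss = ET.Element("mrss")
 channel = ET.SubElement(mrss, "channel")
 for kind in ("board", "classroom"):
     filename = "12345-%s-video-rendition-MP4-H264v-Standard.mov" % kind.capitalize()
     channel.append(build_item("12345", kind, filename))
 ET.ElementTree(mrss).write("ingest.xml", encoding="UTF-8", xml_declaration=True)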

Wednesday, September 19, 2012

An Inconvenient Outage

Some of you probably noticed that we had a rough weekend with the web site.  We first saw trouble around 4pm EDT on Saturday.  After some troubleshooting and investigation left us unsure of the root cause, we failed things over to our replica around 5pm EDT.  We then ran off the replica overnight and through the following morning.

The big breakthrough came at 1:30pm or so Sunday when we isolated the cause, and then it took only a few minutes to correct the problem, test the solution, and finally roll service back to the production site. As with any longer outage this one pointed out a bunch of small, but important, changes to make in procedures and documentation.

My apologies if you happened to be using our web site late afternoon on Saturday; that was certainly the roughest time.

Monday, September 17, 2012

Introducing ICPSR's Virtual Data Enclave (VDE)

The ICPSR Virtual Data Enclave (VDE) is a secure, virtual environment in which a researcher can analyze sensitive data, create research products, and then take possession of those products and analyses.  And while the VDE is not a substitute for a physical enclave and the types of security protocols it facilitates, the VDE is very much a potential substitute for the traditional practice of distributing confidential data via removable media, such as CD-ROMs.

The VDE uses much of the same technology that ICPSR uses internally for its Secure Data Management Environment (SDE), which we have described a few times.  In brief, we use a virtual desktop environment that is operated by the University of Michigan's central IT shop and connect it to what we call our Private Network Attached Storage (NAS) appliance.  Both the virtual desktop and NAS are behind a firewall, and we use the firewall and Windows group policies to restrict what actions one may perform.  Download?  Nope.  Cut-and-paste between the virtual desktop and the real desktop? Uh uh.  Capture screenshots by taking a picture of your monitor?  Well, ......

The virtual environment keeps sensitive datasets under lock and key at ICPSR, but makes them available to researchers.  The environment contains the usual array of applications used in the social sciences (but no email!), exactly the same sort of stuff we might set up for a visiting scholar or OR.

The researcher accesses the environment through a small, easy-to-download and -install client based on VMware View Client.  Authentication takes place using standard University of Michigan credentials which we (ICPSR) and others at UMich can issue to "friends."  Access between the real desktop and the virtual desktop is encrypted, and we are in the process of adding IPSEC encryption between the virtual desktop and the NAS.  (This latter traffic passes over UMich's data backbone, and access to those routers is limited to UMich central IT network engineers.)

The virtual machine is completely ephemeral and can be wiped after each use.  Any intermediate research or results are stored on the ICPSR NAS.  Our NAS is backed up weekly, and tapes are cycled off-site quarterly.  Once the research has been completed ICPSR retains a "just in case you need it" snapshot for up to three years.

Friday, September 14, 2012

Setting up Kaltura - part IV

We have been working on getting a Kaltura Drop Folder set up.  A Drop Folder is a mechanism where an organization spools content to be ingested in a fixed location, and Kaltura polls the location, watching for content to ingest.

In our case it took about a month to get the Drop Folder configured, and much of that delay is avoidable if you steer clear of the pitfalls we hit.  So in the spirit of giving back to the community, here are seven things to know when setting up a Drop Folder.

  1. Host the Drop Folder yourself, do not host it at Kaltura.
  2. Set up an account and a password on the machine, and share them with Kaltura.  To keep things very simple I created an account called 'kaltura'.
  3. Create a subdirectory under the kaltura user's home directory that will actually contain the content to be ingested.  To keep things very simple I used the name 'dropfolder'.
  4. Make sure that the kaltura user owns the Drop Folder directory, and that its access controls grant appropriate rights to other users that may need to ingest content.
  5. Tell Kaltura the name of the machine.  To keep things very simple I created a DNS CNAME record, kaltura.icpsr.umich.edu, that points to the right machine.
  6. Be sure you have ssh installed and running on port 22 on the machine.  If you normally do not run ssh on port 22 (we don't), don't forget to open a hole in your firewall so that Kaltura machines can reach the Drop Folder.
  7. Tell Kaltura to use sftp and port 22 to connect to your Drop Folder.  Do not try to use a port other than 22.
To re-cap the values we used:
  • Host: kaltura.icpsr.umich.edu
  • Protocol: sftp 
  • Port: TCP 22
  • Login: kaltura
  • Password: XXXXXXXX
  • Drop folder: dropfolder 
  • Drop folder UID:GID:  kaltura:met  
  • Drop folder mode: 2775
(We have a big video project called MET, and automated jobs running with the 'met' GID will need write access to the drop folder.)

You may be tempted to suggest using ssh keys or non-standard ports for ssh.  Fight those temptations.

Kaltura will offer to auto-delete content once it has been ingested.  Accept that offer.

Know that when you delete items from the Drop Folder status window in your Kaltura KMC it will also delete them from the Drop Folder.  This is not obvious, but turns out to be useful.

Now all you need are automated jobs that place content and Kaltura-style Media RSS XML into the Drop Folder.  Kaltura has some nice examples on-line, but they are somewhat trivial.  We'll post some more complex, real-world examples next week.
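
To give a flavor of what such an automated job looks like, here is a minimal sketch that copies a video and its companion ingest XML into the Drop Folder and leaves them group-writable for the 'met' jobs mentioned above.  The source paths and file names are examples only:

 # Minimal sketch of an automated "drop" job: copy a video and its ingest XML
 # into the Drop Folder and keep them writable by the 'met' group.  The setgid
 # bit on the folder (mode 2775) means new files inherit the group automatically.
 # Source paths and file names below are examples, not Kaltura requirements.
 import os
 import shutil

 DROP_FOLDER = "/home/kaltura/dropfolder"

 def drop(video_path, xml_path):
     # Copy the video first, then the XML that references it.
     for src in (video_path, xml_path):
         dst = os.path.join(DROP_FOLDER, os.path.basename(src))
         shutil.copy2(src, dst)
         os.chmod(dst, 0o664)   # owner and group read-write, world read
     # With auto-delete enabled, Kaltura removes both files after ingest.

 drop("/data/met/12345-Board-video-rendition-MP4-H264v-Standard.mov",
      "/data/met/12345-board-ingest.xml")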

Friday, September 7, 2012

August 2012 deposits

Light month for deposits:

# of files  # of deposits  File format
56          30             application/msword
69          26             application/pdf
9           8              application/vnd.ms-excel
2           1              application/vnd.wordperfect
4           1              application/x-dosexec
24          7              application/x-sas
47          21             application/x-spss
7           2              application/x-stata
26          2              application/x-zip
1           1              image/gif
18          4              image/jpeg
1           1              message/rfc8220117bit
3           3              text/plain; charset=iso-8859-1
17          10             text/plain; charset=unknown
32          9              text/plain; charset=us-ascii
3           2              text/plain; charset=utf-8
39          7              text/rtf
1           1              text/x-mail; charset=iso-8859-1
4           2              text/x-mail; charset=unknown
2           2              text/x-mail; charset=us-ascii
1           1              text/xml

Just the usual stuff, but in pretty low quantities.

Wednesday, September 5, 2012

ICPSR web availability - August 2012



August was not our best month.

We did a bit better than 99.3% uptime.  Almost all of the downtime is due to a recurring, as-yet-unsolved problem we are having with our Oracle database platform.  The primary symptom is that the database platform stops fielding queries for about 5-15 minutes, which disables our production web site.  The platform does this about 30-45 minutes AFTER it has finished a full export using the Oracle datapump system.

Because our existing Oracle hardware is old and has a relatively slow disk I/O system, we're going to try to solve this problem by throwing hardware at it.  For well under $10k we can replace our five-year-old hardware with something much newer.  Goodbye RAID-5 SCSI, hello SSD.


Wednesday, August 29, 2012

Setting up Kaltura - part III

Some good news and some bad news today.

The good news is that I've tested XML Ingest via the Drop Folder feature, and it seems to work very well.  I was able to upload two videos, add their extensive metadata, and get it working well after fixing a simple (but dumb) typo I made in the XML.  In terms of creating the right XML content, we are in great shape.

The bad news is that I've run into a couple of snags with the Kaltura Software as a Service (SaaS) offering.

The first is actually with the Drop Folder service.  The current solution we are using is where Kaltura hosts the Drop Folder and we use sftp with a password to transfer a bundle of content - Kaltura-customized Media RSS XML plus a pair of video files.  However, what we really need is a locally hosted Drop Folder (which costs less to operate) and a way for Kaltura to fetch the content.  So far Kaltura hasn't been able to make this work.  We had hoped that they could use an ssh key-pair (private at Kaltura and public installed in $HOME/.ssh/authorized_keys) for access, but at this time, Kaltura does not support ssh keys.  So we are kind of stuck waiting for ssh key-pair support to appear, or we are stuck uploading content via sftp and typing passwords (i.e., a manual solution).  Ick.

The second issue is around counting bits.  Kaltura charges by the bit - storing them and streaming them.  That makes a lot of sense, and is actually fine by us.  However......  so far we are not seeing reports or analytics that report usage by the bit, only by the minute.

In a case where one has a single collection with a single delivery platform operated by a single administrative unit, this may work quite well.  The monthly bill from Kaltura goes to one place, and the analytics are very useful for reviewing what's getting played, how often, how much, etc.

But in a case where one has a single collection (like Measures of Effective Teaching) with multiple delivery platforms operated by multiple administrative units, this will prove problematic.  In this scenario we really need a bill that breaks out usage like this:

  • Bits stored for the month - XX GB
  • Bits streamed by delivery platform A - XX GB
  • Bits streamed by delivery platform B - XX GB
  • Bits streamed by delivery platform X - XX GB

Then the partners could carve up responsibility for the bill.  For example, maybe ICPSR pays for ALL of the storage and for the bits streamed by its delivery platform, the School of Education pays for the bits streamed by its delivery platform, and a future partner-to-be-named pays for the bits it streams.

We had hoped to start using Kaltura in September, but my sense is that the Drop Folder issue will push this back.  But the issue around counting is the big one, and I don't have a sense yet for whether this is easy to address or very difficult to address.

Friday, August 17, 2012

Setting up Kaltura - part II

I mentioned in the last Kaltura post that we've set up a Custom Data schema to hold the descriptive metadata for our video content.  Setting up that schema took considerable effort, and I thought I might share some of the details in this post.  For context, we are using the newly released Falcon edition of the Kaltura Management Console (KMC) web application.

One creates a Custom Data schema by using the Settings menu in the KMC, and then by navigating to the Custom Data tab.  A button allows one to Add New Schema, and I happened to give ours the name of "MET Extension" since that is the name of the project generating the video.  I did not set a System Name for the schema, and while Kaltura said that was required, the KMC does not enforce it, and lack of a System Name has not yet proven to be a problem.

One adds fields/elements to the schema one at a time using a pop-up window, Edit Entries Metadata Schema.  This can be seriously laborious if you have a large schema like mine with 40 elements.  Lots of cutting and pasting.  Kaltura allows one to export the schema as an XSD XML file, but one cannot import such a file to create or update the schema.
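
The export is still handy for sanity-checking the schema after all of that cutting and pasting.  Here is a small sketch that lists the element names and types in the exported XSD (met_extension.xsd is just our local name for the file):

 # Small sketch: list the element names and types in the Custom Data schema
 # exported from the KMC.  "met_extension.xsd" is our local name for that file.
 import xml.etree.ElementTree as ET

 XS = "{http://www.w3.org/2001/XMLSchema}"
 schema = ET.parse("met_extension.xsd")

 for element in schema.iter(XS + "element"):
     name = element.get("name")
     if name:
         print("%-45s %s" % (name, element.get("type", "(locally defined type)")))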

Elements can be Text, Date, Text Select List, or Entry-id List.  Each can be single or multi-value, and each can be indexed for search (or not).  The KMC allows one to supply both a short and longer description for each element.

Text fields are exactly what you would expect, and Text Select Lists are basically pick-lists.  The Entry-id field is useful if you want to store the ID of an extant Kaltura object.  Date is a little tricky since the format one uses to supply a value for this field works one way in the KMC interface - conventional calendar format or pick from a calendar widget - but a very different way when ingesting via XML where it expects an integer number of seconds since the epoch.

We will be ingesting tens of thousands of videos into Kaltura, and so we will NOT be using the KMC to upload videos and to compose metadata.  Instead we will be using their Drop Folder mechanism where one puts XML metadata (including pointers to video files) in a special-purpose local location to which Kaltura has access.  Preparing the XML content that includes both the descriptive and technical metadata is our current project, and I'll report on that process - and the Drop Folder process - next.

Friday, August 10, 2012

July 2012 web availability at ICPSR

July 2012 was a pretty good month:


We had only 36 minutes of downtime in July, and 28 of those were due to maintenance as we tried (but failed) to update apache, perl, and mod_perl on our production web server.  We discovered some interesting idiosyncrasies in some perl libraries during the maintenance.  (Summary:  Multi-word time zones like "New York" are trouble.)

Wednesday, August 8, 2012

July 2012 deposits at ICPSR

The totals for July 2012:

# of files  # of deposits  File format
1           1              F 0x07 video/h264
520         11             application/dicom
1           1              application/msaccess
112         23             application/msword
121         4              application/octet-stream
241         41             application/pdf
6           5              application/vnd.ms-excel
1           1              application/vnd.ms-powerpoint
1           1              application/x-7z-compressed
44          1              application/x-arcview
83          1              application/x-dbase
73          3              application/x-dosexec
5           1              application/x-empty
11          3              application/x-sas
1           1              application/x-shellscript
150         21             application/x-spss
23          6              application/x-stata
28          4              application/x-zip
1002        45             image/jpeg
8           1              image/png
4           1              image/x-ms-bmp
2           2              message/rfc8220117bit
15          1              multipart/appledouble
5           5              text/html
1           1              text/plain; charset=iso-8859-1
4           3              text/plain; charset=unknown
157         38             text/plain; charset=us-ascii
7           4              text/plain; charset=utf-8
7           4              text/rtf
4           3              text/x-mail; charset=us-ascii
43          4              text/xml
25          1              video/unknown


Lots of image content this month to go with the usual stuff (e.g., SPSS) in the usual volumes.

Monday, August 6, 2012

Setting up Kaltura - part I

I've mentioned in previous posts that the University of Michigan is implementing Kaltura as its video content management solution.  Kaltura is an open-source video platform that one can install and operate locally, and is also available in a software as a service (SaaS) version.  The U-M is making use of the SaaS edition, and ICPSR is one of three pilot testers.

The off-the-shelf web application provided by Kaltura to manage content, collect analytics, publish content, create custom players, set access controls, etc. is called a Kaltura Management Console (KMC, for short).  A major question for any enterprise using Kaltura is:  How many KMCs do we need?  The answer is:  Just enough, and not one more.

It is very difficult to share content between KMCs, and so there is a major incentive to have the smallest number of KMCs, perhaps only a single one.  However, it is also difficult to "hide" content from others who are sharing the same KMC, and that can cause concerns about privacy, access control, and inadvertent use (or mis-use) of content.  In my mind giving someone an account on a KMC is like giving someone root access on a UNIX machine.

The solution we used at the U-M was to deploy two KMCs for now.  One is for two types of content:  video which is generally available to the public, such as promotional materials, and video which is used in courses via our local Sakai implementation, CTools.  We provisioned a second KMC for ICPSR to use for its content, which falls more into the "research data" category.  This content will require signed agreements for access.

Once Kaltura provisioned our KMC I performed a few initial house-keeping chores:

  1. Created accounts (Authorized KMC Users) for my colleagues on the project.  Each has a Publisher Administrator role.  (Administration tab in the Falcon release of the KMC.)
  2. Changed the Default Access Control Profile to require a Kaltura Session (KS) token.  All content managed by this KMC should require the player to present a KS token.  (Settings - Access Control tab)
  3. Created a new Access Control Profile (called Open) which does not require KS.  I don't know if I will need this, but want to have a more open profile available.
  4. Changed the Default Transcoding Flavors to (only) "Source."  Our content has already been transcoded, and so we don't need to pay for the time and storage for additional flavors such as HD, Editable, iPad, Mobile, etc.  (Settings - Transcoding Settings)
  5. Created a Custom Data Schema to hold the extensive descriptive metadata that accompanies the content generated in our project (MET Extension).  This step is extraordinarily tedious since it has to be done field-by-field through a web GUI.  I can download a copy of the schema I created in XSD format; wish I could upload one to create it.  (Settings - Custom Data)
  6. Created a slightly customized player for use with our content.  Wanted to size it to fit our content, remove the Download button, etc.  This is super easy.  (Studio tab)
  7. Created a Category which we will use to "tag" our content.  (In this case I created one called MET-Ext.)  This is mostly useful for searching and browsing within the KMC interface.  (Content - Categories tab)
  8. Uploaded a few videos and set a few of the metadata fields.  (Content - Entries)
  9. Put in a request to our account manager to enable a locally hosted Drop Folder.  This is a mechanism whereby we create a local "fetch" location where an automated Kaltura job can pull content and ingest it.  While one would think that this is a common mechanism for submitting content, the process is slow and poorly documented, and cannot be managed via the KMC.  I'll post more details about the process once I have a working Drop Folder in place.
  10. Created the local infrastructure for the Drop Folder which is really just identifying a machine to play host, and then creating an account Kaltura can use.

These steps got us to the point where we could start putting Media RSS files containing the metadata and pointers to the video into our Drop Folder for ingest into Kaltura.




Friday, August 3, 2012

How can I put a meeting "out to bid?"

At the University of Michigan if one wants to spend $5,000 to buy a small rack-mount server with a five-year service life one needs to put the request out for bid.  We then need to justify the vendor selected, and if we choose a bid which is NOT the lowest price, we have even more explaining to do.

It makes sense that the U-M would want to make sure that major purchases receive the right level of oversight and review.  One can argue if the $5,000 limit is the right number, but it seems like some number (maybe a higher number?) is the right one.

However, if one wants to schedule a recurring one-hour monthly meeting with nine other people for five years (so 60 meetings total, with 10 people at each meeting), the only barrier is sending out the invitation and getting people to attend.  Now, of course, if one is a very senior-level person inviting direct reports or dotted-line reports, it is pretty easy to get people to attend.

Each meeting might cost around $500 in staff time; perhaps closer to $1000 if the people are senior (expensive) and we include benefits and any other hourly fees.  And so 60 such meetings will cost the U-M somewhere between $30,000 and $60,000 over the course of five years.

So here's the question:  If the goal is to make sure that the U-M is spending its resources wisely, who's making sure that the resources spent on meetings are used wisely?

Meetings - not hardware, software, licenses, cloud computing, paper, printers, etc - are the biggest expense we have.

Wednesday, August 1, 2012

ICPSR 2012 technology recap - where did the money go?

A post from a week or so ago showed the sources of money flowing into the technology organization at ICPSR.  This week's post will focus on how that money is spent.

I should note that the focus of this post is on what we call the "Computer Recharge" at ICPSR.  This is a tax paid by all FTEs at ICPSR that is levied on an hourly basis.  Each time an employee completes his or her timesheet and allocates time to a project (exceptions: sick, holiday, and vacation time), a small amount of money also accumulates in the "Recharge."  In FY 2012 this accounted for about 45% of the technology revenue.

Unlike a direct charge for technology where a project or grant may be paying for a dedicated server, extra storage space, or a fraction of a software developer working on custom systems, the Computer Recharge dollars are used to fund technology expenses that benefit the entire organization.  This includes expenses as prosaic as printers and desktop computers, but also includes systems and software development for our delivery systems, ingest systems, and digital preservation systems.

Here's the breakdown in chart form:

No surprise that most of the money goes to pay people:  systems administrators, desktop support specialists, software developers, and a group of very hands-on managers who do a lot of the same work plus project management and business analysis.  This accounts for $862k out of $1218k.

Equipment ($214k) represents desktop machines, new storage capacity, printers, virtual machines, cloud storage, software licenses, and almost every non-salary expense.

Transfers ($103k) is an interesting category.  This is money that we collect as major systems depreciate.  But since the U-M doesn't really "do" depreciation, we use the following process instead:


  1. Buy the item (using money from Equipment pot)
  2. Estimate the lifetime of the item (say, five years)
  3. Collect 1/5 of the purchase price each year for five years
  4. At the end of each year, move the 1/5 collected into Transfers
  5. When the item needs to be replaced, use the money in the Transfers pot


So the $103k represents money that was collected in FY 2012 for items that were purchased in earlier years, and which will need to be replaced in FY 2013 or beyond.

All of our other expenses are tiny by comparison:  $19k for Internet access, $7k for telephones and service monitoring, $6k for travel, $5k for maintenance contracts, and $2k for miscellaneous fees and expenses.

One take-away from this is that the essential element in technology budgeting isn't the purchase price of the server, or the annual cost of cloud storage, but rather the recurring cost of the people who will build, maintain, enhance, and customize the technology portion of the business.

Wednesday, July 25, 2012

My Amazon Web Services wishlist for the U-M

I'm serving on a University of Michigan task force that is looking at ways in which we can make cloud computing easier for faculty, students, and staff to consume.  This presupposes, of course, that at least some of the university community have research or business or other needs that would be well served by a cloud-type solution.

For those of us that are already using the cloud to solve a few different problems -- off-site archival copies, disaster recovery solution for delivery systems, among others -- the problem isn't so much how to get us to use cloud computing, but how the U-M can help us get the most value for our dollar.


With this in mind I offer my Amazon Web Services (AWS) wishlist for the U-M:

  • Build Amazon Machine Images (AMI) for 64-bit Red Hat Linux (and, optionally, 64-bit Windows Server). Put any security or system or software goodies into the image that would be available to the entire university community (IT directors, grad students, casual users). This saves us from needing to build and maintain our own AMI or, worse, using one from a third-party.
  • Deploy an AWS Virtual Private Cloud (VPC) that connects our own little piece of AWS “cloud space” to the rest of campus over a secure link. Allow instances running within this VPC to access infrastructure such as Active Directory. Treat this part of AWS as if it were just another network (or data center) on campus. This enables us to deploy services dependent upon campus infrastructure in AWS more easily.
  • Deploy an AWS Direct Connect between the VPC and UMnet (or Merit [the State of Michigan research and education network] or Abilene [Internet2's national network]). This grants us a fast, secure, inexpensive pipe for moving content between campus and AWS. We could start to deploy I/O-intensive resources in AWS more readily if we don’t have to pay for the bits individually.
  • Implement an agreement where AWS has one customer (the University of Michigan) rather than many. (ICPSR alone has four different identities within AWS, largely so that we can map expenses from one identity to a university account.) This one customer would have different sub-accounts, and the usage across ALL of the sub-accounts would roll up to set pricing. ICPSR stores over 1TB of content in AWS S3, for example, and so our GB/month rate is $0.11. Other users at U-M who store content in AWS S3, but less than 1TB, are paying over $0.12/GB/month. That difference is only a little more than $0.01/GB, but it adds up over ALL accounts each month.
  • Explore the feasibility of allowing one to use U-M credentials (via Shibboleth?) to access key web applications at AWS, such as the AWS Management Console. We currently have to provision a separate email address and local (to AWS) password.
  • Explore the feasibility of using an AWS Storage Gateway as a means to meet bursty or short-lived storage needs. It would be fabulous if we could buy nearly unlimited space in the U-M storage cloud. This is more feasible if we can use AWS storage for short-lived "bursts" of temporary storage.

Monday, July 23, 2012

ICPSR web site maintenance

We ran into a few problems last Wednesday during our system update.  We rolled back the changes, and are giving it another go this evening at 10pm EDT.  We're going to move traffic to our replica in the cloud during the maintenance so that we have more time for troubleshooting.

Our cloud replica has many of the features of our main site (search, download, analyze), but does not include features that transfer materials to ICPSR, such as our online Deposit Form.

Friday, July 20, 2012

Amazon's loss is SDSC's gain

One of the recent Amazon Web Services (AWS) power outages has left some of my EBS volumes in an inconsistent state.  If these were simple volumes, each containing a filesystem, then the fix is easy:  just dismount the filesystem, run fsck to check it, and then remount the filesystem after it has been fixed.  We have done this on several of our EC2 instances that had inconsistent volumes.

Unfortunately, for these particular volumes we have bonded them together to form a virtual RAID.  And this RAID is used as a single multi-TB filesystem which is much bigger than fsck can handle.  So we are kind of stuck.

One option would be to newfs the big filesystem, and to move the several TBs of content back into AWS, but that would be very slow.  And if there is another power outage......

So instead we called up our pals at DuraCloud and asked them if they could help us enable replication of our content to a second provider.  (The first provider is - ironically - AWS.  But their S3 service, not their EC2/EBS service.)  They said they'd be happy to help, and, in fact, they will start replicating our content later this same week.  (Now that's service!)

The new copy of our content will now be replicated in...... SDSC's storage cloud.  This really brings us full circle at ICPSR since our very first off-site archival copy was stored at SDSC. Back then (like in 2008) it was stored in their Storage Resource Broker (SRB) system, and we used a set of command-line utilities to sync content between ICPSR and SDSC.  

The SRB stuff was kind of clunky for us, especially given our large number of files, our sometimes large files (>2GB), and our sometimes poorly named files (e.g., control characters in file names).  Our content then moved into Chronopolis from SRB, and then at the end of the demonstration project, we asked SDSC to dispose of the copy they had.  But now it is coming back......

Wednesday, July 18, 2012

ICPSR web maintenance

We're updating a few pieces of core technology on our web server this afternoon:  httpd, mod_perl, Perl, and a few others.  Normally we like to perform maintenance like this during off-hours, but we're doing it at 12:30pm EDT today so that we have "all hands on-deck" to troubleshoot and solve problems.

We've already performed this maintenance on our staging server, and that went smoothly.  Our expectation is that this maintenance will last 15-30 minutes.

Monday, July 16, 2012

ICPSR 2012 technology recap - where did the money come from?

We're putting together some summary numbers for technology spending and investments at ICPSR for FY 2012.  (The ICPSR fiscal year is the same as the University of Michigan's, and runs from July 1 to June 30.  We've just recently closed FY 2012.)

The first set of numbers shows the allocation of effort in FY 2012 by funding source.  The unit of measurement in this pie chart is HOURS (not DOLLARS) that were expended in FY 2012 against each funding source.  (We originally wanted to calculate dollars, but that turns out to be an even bigger effort.)

[Pie chart: FY 2012 technology effort, in hours, by funding source]

The main source of technology effort funding comes from the Computer Recharge, an hourly "tax" that ICPSR levies against all projects.  Although it is one single funding source (nearly 45% of hours worked in FY 2012 were billed against this source), I have split it into two sub-categories, one for what I am calling "IT" and one for "SW" (software).

The "SW" portion includes the effort of all staff who are professional software developers.  The type of work performed by this team using this account includes enhancements and maintenance for ICPSR's core data curation and data management systems, and investments in new products and services such as software developed to support our IDARS system for applying for access to datasets.

The "IT" portion includes the effort of the remainder of the staff which tends to include systems administrators, architects, network managers, and desktop support specialists.  I also allocate my own time to this bucket since the majority of my non-contract, non-grant effort over the past year has been in building and architecting technology systems.

Other big slices of the "IT pie" include the work of staff members who are explicitly funded by projects such as our CCEERC and RCMD web portals; our two Bill and Melinda Gates Foundation grants; the ICPSR Summer Program, and many more.  In fact, there are over 20 separate funding sources used to support technology at ICPSR; the pie chart shows 18 because I grouped several small ones into a category called "Misc."

If this gives the impression that there are many, many projects and activities at ICPSR that involve technology, that's good!  That is certainly the case.

However, if "focus wins" then we're in a little bit of trouble.  My sense is that each of these 20-some funding sources has at least one unique project with its own business analysis and project management needs, and it is sometimes the case that different projects have antithetical technology needs.  I see this play out in all phases of the OAIS lifecycle.  ("I want you to build a system that makes it as easy as possible to fetch datasets from ICPSR" v.  "I want you to build a system that requires significant effort and oversight to fetch datasets from ICPSR.")


Friday, July 13, 2012

Surviving the move to Google

The University of Michigan is moving its business productivity systems (mail, calendar, among others) to Google this year.  Some university institutes and colleges have already made the move, and others will make the move later this year.  ICPSR and its parent, the Institute for Social Research, will move in August, although about 70 staff at ISR will make the move early.  These "Google Guides" will help others with the transition.

One concept that might be helpful for the move is to distinguish between an email address and email mailbox.

An email address is basically a pointer.  It can point to another email address, or it can point to an email mailbox.  People publish and share their email address with others, and this is the piece of information we use to target a piece of email.

An email mailbox is a place where email lands.  You log in to an email system (Gmail, Exchange, AOL, and many more) with an ID and password, and once there, you can search for messages, read messages, sort, filter, send, and the rest.

The big change at the U-M this year with email is with everyone's email mailbox.  It's changing from some legacy system at the U-M to Gmail.

Below is a typical example of how things work today at ICPSR.  The @icpsr.umich.edu email address is just a pointer to the U-M enterprise directory entry.  That entry - which ends with @umich.edu - is also just a pointer to another email address ending in @isr.umich.edu.  And that email address is the thing that actually points to the email mailbox, which in our case is the ISR Exchange server.  The first diagram below shows the relationship between email addresses and email mailboxes in the current system for my own email:

Contrast that with the second diagram, which shows the relationships after the move to Google.  There are two main changes.

The first is that the email mailbox now lives in Google Gmail rather than Exchange, and my email software is a web browser rather than Outlook.  This is a very big change.  Some will find it a pleasant change, and others will hate the new system.

The second is that the roles of the @isr.umich.edu and @umich.edu email addresses have reversed.  The @isr.umich.edu email address is now just a pointer to another email address, and the @umich.edu email address is the one that "points" directly to Gmail.  And, of course, just like before, one can publish or use any of the three addresses, and the mail goes to the same email mailbox.

Wednesday, July 11, 2012

Tagging EC2 instances and EBS volumes

Adding this to my Amazon (Web Services) wishlist....

Optional Billing tag which can be set when an EBS volume is created or when an EC2 instance is launched

There is a lot of convenience in having a single AWS account.  It makes it easier to find running instances in the AWS Console.  It eliminates the need to share AMIs across accounts.  It obviates the need to remember (and record) multiple logins and passwords.

However, there is one big win in having multiple AWS accounts:  It makes it easier to tie the charges for one set of cloud technology (one account) to a revenue source.  And so we often have four or five different AWS accounts for the four or five different projects we have underway.

It would give me the best of both worlds if I could have my single AWS account, but then specify a special-purpose tag (say, Billing) when I provision a piece of cloud infrastructure.  This would be an optional tag that I could set when I launch an instance or create a volume.  This tag would control the format and grouping of charges on my monthly AWS invoice for that account.
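Purely as an illustration of where such a tag would be attached, here is a sketch using the boto3 SDK.  The AMI ID and account number are placeholders, and the invoice grouping described in this post is the wished-for part; the code only attaches the tag:

# Sketch: tag an EC2 instance and an EBS volume with a "Billing" tag.
# The AMI ID and account number (U12345) are placeholders, and the
# invoice-grouping behavior is the feature being wished for here;
# this code only attaches the tag to the resources.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a small instance (placeholder AMI ID).
reservation = ec2.run_instances(
    ImageId="ami-12345678",
    InstanceType="m1.small",
    MinCount=1,
    MaxCount=1,
)
instance_id = reservation["Instances"][0]["InstanceId"]

# Create a volume to go with it.
volume = ec2.create_volume(Size=100, AvailabilityZone="us-east-1a")

# Attach the Billing tag to both resources.
ec2.create_tags(
    Resources=[instance_id, volume["VolumeId"]],
    Tags=[{"Key": "Billing", "Value": "U12345"}],
)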

For example, say I launch a small instance and set the value of the Billing tag to U12345 (a made-up University of Michigan account number).  And then I launch a second one with a Billing tag of F56789.  Then, in addition to the usual AWS invoice with a line item like this:

AWS Service Charges


Amazon Elastic Compute Cloud
US East (Northern Virginia) Region
    Amazon EC2 running Linux/UNIX
        $0.080 per Small Instance (m1.small) instance-hour  1440 hours   $115.20

I would see an additional section:

AWS Service Charges by tag
U12345

Amazon Elastic Compute Cloud
US East (Northern Virginia) Region
    Amazon EC2 running Linux/UNIX
        $0.080 per Small Instance (m1.small) instance-hour  720 hours   $57.60
F56789

Amazon Elastic Compute Cloud
US East (Northern Virginia) Region
    Amazon EC2 running Linux/UNIX
        $0.080 per Small Instance (m1.small) instance-hour  720 hours   $57.60

This would make it easy for me to take my single invoice from Amazon and "allocate" it (a term from Concur, the system we use for managing this sort of thing) to the right internal account.

Monday, July 9, 2012

June 2012 deposits at ICPSR

Deposit numbers from June:

# of files   # of deposits   File format
1            1               video/h264
368          17              application/msword
3            1               application/octet-stream
276          26              application/pdf
9            4               application/vnd.ms-excel
1            1               application/vnd.ms-powerpoint
1            1               application/x-rar
8            3               application/x-sas
118          16              application/x-spss
713          8               application/x-stata
1            1               image/jpeg
6            1               image/x-3ds
22           1               multipart/appledouble
4            4               text/html
36           1               text/plain; charset=iso-8859-1
31           3               text/plain; charset=unknown
389          10              text/plain; charset=us-ascii
1            1               text/plain; charset=utf-8
4            3               text/x-mail; charset=us-ascii
1            1               video/unknown


Quite a bit of Stata this month, much more than normal.
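In case you are curious how a tally like this comes together, here is a rough sketch.  Our actual deposit system records the declared format of each file, so this guess-by-extension approach (and the deposit directory path) is illustrative only:

# Illustrative only: tally deposited files by guessed MIME type.
# ICPSR's real deposit system records declared formats; this sketch
# just shows the general shape of the monthly count.
import mimetypes
from collections import Counter
from pathlib import Path

def tally_formats(deposit_dir):
    counts = Counter()
    for path in Path(deposit_dir).rglob("*"):
        if path.is_file():
            mime, _ = mimetypes.guess_type(path.name)
            counts[mime or "application/octet-stream"] += 1
    return counts

# Hypothetical location for one month's deposits.
for mime, n in tally_formats("/deposits/2012-06").most_common():
    print(f"{n:6d}  {mime}")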

Wednesday, July 4, 2012

Ixia Communications acquires BreakingPoint Systems

A friend of mine mailed me a link to a TechCrunch article that got me thinking about ICPSR:

Network Testing Consolidation: Ixia Pays $160M Cash For Security-Focused BreakingPoint Systems

So what does this have to do with ICPSR?

Almost eleven years ago, Ixia made its very first acquisition:

Ixia Announces the Acquisition of Caimis, Inc.

(The link above is from the Internet Archive's Wayback Machine.)

Caimis was a small software company that a handful of us founded in 2000.  Some had come from a pioneering Internet company called ANS Communications, and were looking for something very different after having been acquired by Worldcom in 1998.  And others were from CAIDA, which is very much still alive and well (unlike ANS Communications or Worldcom).

Founding and growing Caimis was an exciting time, and selling the company to Ixia was a hard, but good, decision for us.  The deal closed in late 2001 just after the 9/11 attacks, and that made the long flight to Los Angeles to finalize the papers even more "exciting" than usual.

Ixia was a maker of hardware and had a pretty thorough process for manufacturing systems, assigning part numbers to every last item, and managing projects with an amped-up version of MS Project.  We were a very small, very loose software company with very little process.  This led to a gigantic clash in cultures, and things took a turn for the worse after six months:  Ixia decided to close down the Ann Arbor office and shut down several projects.

Like a few others, I decided to stay in Ann Arbor, and was looking around for the next thing to do in mid-2002.  Eventually I came across an ad in the NYT, or perhaps the Chronicle, saying that a place called ICPSR was looking to hire a new technology director.

Working at an organization which was unlikely to be sold, or moved, or merged, or.... was very attractive at that time. Also, working in a more stable situation was highly desirable after seven years of constant change and turmoil (some good but some not very nice at all).

The job itself looked interesting.  Basically a CIO/CTO-type job at a medium-sized not-for-profit.  Technology leader.  Part of the senior management team.  Work closely with the CEO.  And the entire team was in Ann Arbor - no more routine late-night and early-morning phone calls with colleagues and employees all over the world!  So I interviewed, got the job, and have been having a lot of fun at ICPSR ever since.

Monday, July 2, 2012

Leap second - No, sir, don't like it

Leap second:

I remember the big Y2K to-do.  Lots of hype, lots of worry, lots of prep, and then nothing happened.

But this Leap Second from last Saturday.  Sheesh.

I saw our production web server's Java-based webapps go into tight little loops that consumed lots of CPU and did very little actual web serving.  I rebooted it that Saturday night.

Then on Sunday I saw a report that our (non-production) video streaming server (based on Wowza, which is a Java webapp too) had become unresponsive.  We rebooted it early, early Monday morning.

And then our staging server freaked out too, and we rebooted it Monday morning.

I don't like leap second.

Amazon Web Services makes Tech@ICPSR weep

June 2012 was looking to be a great, great month for uptime.  We were on track to have our best month of the fiscal year since November 2011 - just 60 minutes of downtime across all services and all applications.  It was going to be beautiful.

And then Amazon Web Services had another power failure.

And then we wept.

The power failure took the TeachingWithData portal out of action.  (To be fair, it was already having significant problems due to its creaky technology platform, but this took it all the way out of action.)  The failure also took our delivery replica out of action, and gave Tech@ICPSR the joy of rebuilding it over the weekend.

But the real trouble was with a company called Janrain.

Janrain sells a service called Engage.  Engage is what allows content providers (like ICPSR) to use identity providers (like Google, Facebook, Yahoo, and many more) so that their clients (like you) do not need to create yet another account and password.  Engage is a hosted solution that we use for our single sign-on service using existing IDs, and it works 99.9% of the time.
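In rough terms, the content-provider side of that arrangement looks something like the sketch below.  The endpoint URL, parameter names, and API key are placeholders, not Janrain's actual API; the point is just the shape of the hosted single-sign-on pattern:

# Sketch of the hosted-SSO pattern described above.  The endpoint URL,
# parameter names, and API key are placeholders, not Janrain's real API.
import requests

VERIFY_URL = "https://sso-provider.example.com/api/auth_info"  # hypothetical
API_KEY = "our-secret-api-key"                                 # hypothetical

def handle_sso_callback(token):
    """Called when the browser comes back from the identity provider with
    a one-time token; we trade the token for a verified identity."""
    response = requests.post(VERIFY_URL, data={"apiKey": API_KEY, "token": token})
    response.raise_for_status()
    profile = response.json()
    # Typically we would look up or create a local account keyed on the
    # provider-supplied identifier.
    return profile.get("identifier")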

However, this hosted solution lives in the cloud.  We just point the name signin.icpsr.umich.edu at an IP address we get from Janrain, plug in calls to their API, and then magic happens.

Except when the cloud breaks.

Amazon took Engage off-line for nearly four hours.  And then once it came back up, it was thoroughly confused for another three hours.  Ick.

So, counting all of that time as "downtime," our fabulous June 2012 numbers suddenly became our awful June 2012 numbers.  Here they are:





If you click on the image above, Blogger will make it bigger.

Of course, during a lot of that downtime, all of the features on the web site except for third-party login worked fine.  And most of the problem happened late on a Friday night and Saturday morning during the summer, so that's a good time for something bad to happen, if it has to happen at all.

Wednesday, June 27, 2012

Video at ICPSR - OAIS and Access

We're taking a pretty close look at Kaltura as the access platform for a video collection we are ingesting. Here's why....

If we look at the Open Archival Information System (OAIS) lifecycle, most of the Ingest work is taking place outside of ICPSR. (In fact, other than providing much of the basic IT resources, like disk storage, our role is very small in this part of the lifecycle.) Managing the content and keeping copies in Archival Storage is a good fit for ICPSR's strengths; the content is in MP4 format and has metadata marked up in Media RSS XML, so that's relatively solid.

The big questions for us are all on the Access side of OAIS. Questions like:

  • How many of the 20k videos will be viewed on a routine basis?  Or ever?
  • How many people will want to view videos simultaneously?
  • Will viewers be connected to high-speed networks that can stream even high-def video effortlessly, or will most of the clientele be on slower home broadband connections?  Is adaptive streaming important?
  • Will support for iOS devices - which do not tend to do well with Flash-based video players - be important?
  • Can people comment on videos?  Share them?  Clip them?  Share the clips?
I have a requirement from one of our partners to build enough capacity to stream a pair of videos - these are classroom observations, and each includes a blackboard video and a classroom video - for up to 1,000 simultaneous viewers.  That's 2,000 simultaneous streams at a bit rate of roughly 800Kb/s, so maybe about 1.6Gb/s of total bandwidth required at peak.

And I have the same requirement from one of our other partners, who is serving a separate audience.  So that is a total of 3.2Gb/s.  That is a big pipe by ICPSR standards.  (Our entire building, which we share with others, has only a single 1Gb/s connection to the U-M campus network.)
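The arithmetic, jotted down as a quick sketch (the only inputs are the viewer counts and bit rate quoted above):

# Back-of-the-envelope peak bandwidth estimate for the streaming requirement.
viewers_per_partner = 1000   # simultaneous viewers required by each partner
streams_per_viewer = 2       # blackboard video + classroom video
bitrate_kbps = 800           # rough per-stream bit rate, in Kb/s
partners = 2

peak_kbps = viewers_per_partner * streams_per_viewer * bitrate_kbps * partners
print(f"peak demand: {peak_kbps / 1_000_000:.1f} Gb/s")   # -> 3.2 Gb/s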

If we try to build this ourselves we need a pretty big machine with lots of fast disk (20TB+) and lots of memory and lots of network bandwidth. And if we build it too small, the service will be awful, and if we build it too big, we will waste a lot of money and time.

So a cloud solution that can scale up and down easily is looking pretty good as an Access platform.

Next post:  Why Kaltura?

Monday, June 25, 2012

Video and ICPSR

I've posted a few times about a large collection of video that ICPSR will be preserving and disseminating as part of a grant from the Bill and Melinda Gates Foundation.  I'll devote some time this week to a couple of detailed posts about what we're doing, but one vendor that I'd like to mention briefly today is Kaltura.

Kaltura is a video content management and delivery service that offers both a hosted and on-premise solution.  The University of Michigan is entering into a relationship with Kaltura, and I'm serving on a committee which is helping shape that relationship.  (More on this later.)

I have early access to Kaltura's hosted solution for video content, and I've used that access to upload a few pieces of public domain content plus some minimal metadata.  I then have used Kaltura's tools to assemble combinations of video collections (playlists) and video players, mixing and matching liberally to get a sense for what is possible.

Here's what I have so far:
More on "video @ ICPSR" later this week.

Friday, June 22, 2012

AWS power outage aftermath

As it turns out, it doesn't take all that long to run fsck on a large filesystem composed of multiple AWS Elastic Block Store (EBS) volumes:


[root@cloudora ~]# df -h /dev/md0
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0              4.9T  2.9T  1.7T  64% /arcstore
 
[root@cloudora ~]# fsck -y /dev/md0
fsck 1.39 (29-May-2006)
e2fsck 1.39 (29-May-2006)
/dev/md0 is mounted.
WARNING!!!  Running e2fsck on a mounted filesystem may cause
SEVERE filesystem damage.
Do you really want to continue (y/n)? yes
/dev/md0 has gone 454 days without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Error allocating icount link information: Memory allocation failed
e2fsck: aborted

I saw a few posts about coaxing e2fsck to use the filesystem for scratch space rather than memory, but unfortunately the older version of the program available on this EC2 instance does not support it.
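For the record, on newer e2fsprogs releases (roughly 1.40 and later, if I remember correctly) the trick is a small /etc/e2fsck.conf stanza along these lines; the directory path is just an example and needs to exist and be writable:

# /etc/e2fsck.conf -- tell e2fsck to spill its scratch data structures
# to disk instead of RAM (the directory path is an example)
[scratch_files]
        directory = /var/cache/e2fsck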

So I think that we may end up blowing away this copy of archival storage and replacing it with a fresh one.

Wednesday, June 20, 2012

Amazon power outage and ICPSR

Amazon suffered a power outage in their northern Virginia data center last week.  Here is my abridged timeline of events from the Amazon Service Health Dashboard:

Jun 14, 8:50 PM PDT We are investigating degraded performance for some volumes in a single AZ in the us-east-1 region.
Jun 14, 10:29 PM PDT We can confirm a portion of a single Availability Zone in the US-EAST-1 Region lost power. We are actively restoring power to the effected EC2 instances and EBS volumes. We are continuing to see increased API errors. Customers might see increased errors trying to launch new instances in the Region.
Jun 15, 12:11 AM PDT As a result of the power outage tonight in the US-EAST-1 region, some EBS volumes may have inconsistent data. As we bring volumes back online, any affected volumes will have their status in the "Status Checks" column in the Volume list in the console listed as "Impaired." You can use the console to re-enable IO by clicking on "Enable Volume IO" in the volume detail section, after which we recommend you verify the consistency of your data by using a tool such as fsck or chkdsk. If your instance is stuck, depending on your operating system, resuming IO may return the instance to service. If not, we recommend rebooting your instance after resuming IO.
Jun 15, 3:26 AM PDT The service is now fully recovered and is operating normally. Customers with impaired volumes may still need to follow the instructions above to recover their individual EC2 and EBS resources. We will be following up here with the root cause of this event.
And, indeed, Amazon did follow up on the root cause of the problem.  Based on the post-mortem that has been reported in several venues, the root cause was a fault in commercial power.  And a generator.  And an electrical panel.  One view is that Amazon got very unlucky with power problems; another view is that they did not test their fail-over thoroughly enough.  I lean more toward the former view.

ICPSR didn't suffer any outages.  For example, our cloud-based replica was available to us the entire time.  We did receive notifications from Amazon that specific EBS volumes (basically a virtual block device that may be attached to a cloud-based machine) may have been corrupted, and should be inspected.  Amazon included the specific volume.  Here's an example notification:
Dear ICPSR Technology ,
Your volume may have experienced data inconsistency issues due to failures during the 6/14/2012 power failure in the US-EAST-1 region. To restore access to your data we have re-enabled IO but we recommend you validate consistency of your data with a tool such as fsck or chkdsk. For more information about impaired volumes see:
http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/monitoring-volume-status.html
Sincerely,
EBS Support
So this did create a bit of unscheduled work for the technology team because we had four affected volumes.

One was not attached to anything, and was not in use.  

One was attached to a machine we had recently retired.

But two were attached to a machine that stores an encrypted copy of our archival holdings.  The volumes are each 1TB and part of a multi-TB virtual RAID.  This makes for a very, very long-running fsck to inspect for problems.
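These days one could also ask the EC2 API directly which volumes are flagged as impaired, rather than waiting for email; here is a quick sketch with the boto3 SDK (newer tooling than what we were running at the time):

# Sketch: list EBS volumes whose status checks show them as impaired,
# using boto3.  Large fleets would want to paginate the results.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.describe_volume_status()
for status in response["VolumeStatuses"]:
    if status["VolumeStatus"]["Status"] == "impaired":
        print(status["VolumeId"], "needs attention (re-enable IO, then fsck)")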

I'll have the conclusion on Friday.