Technology at ICPSR: September 2012

Friday, September 28, 2012

Be careful when answering, "Both"

When I am helping someone work through the requirements of a new system or project, I often ask a series of questions that shape my understanding of what the person wants. These often fall into a pattern of either/or questions, like this:

Do you want it optimized for security or ease of use?

Does this fit into a wholesale or retail delivery paradigm?

Will this be used by external customers or internal staff?

Is this intended to drive new revenues or decrease current costs?

In many ways this is like going to the ophthalmologist who has you look through lens A and then lens B, and then asks the question, "Which was better, A or B?" Both of us are trying to bring the problem into focus.

The single most dreaded (by me) answer to these questions is: Both.

In some cases this answer really means, "I am not sure what I want." Or, "I'm too busy to think about this." Or, "I don't care."

This, obviously, does not help when gathering requirements. And so it is a real barrier to scoping the project. Sometimes, of course, an answer of Both is a fine start to a longer answer.

We really do need both in this case. We want to build a system for managing metadata that can be used by both the staff and external people equally easily. We are changing our entire workflow so that either population can manage our metadata, and this is our new business practice.

That is a fine use of Both. In fact, if the person gave a different answer, we might needlessly limit the usefulness of the system we build.

Another fine answer, just like the one we sometimes give the ophthalmologist, is I don't know. There's nothing wrong with that answer. However, just like the ophthalmologist, when I hear it I reach into my bag of lenses and try another pair to bring the issue into focus.

Wednesday, September 26, 2012

Kaltura pilot at the U-M and ICPSR

The University of Michigan in-house newspaper/newsletter ran a nice piece on the Video Content Management pilot (using Kaltura), which includes our Bill and Melinda Gates Foundation project: Measures of Effective Teaching - Extension.

Here is the part about us:

• ISR and the School of Education are engaged in a collaborative research project in which a large collection of video assets is a primary data-type. This data will become a shared repository available to research partners at universities, public agencies, and private foundations.

The timing here was perfect for us since we were in the market for a system to manage and stream about 20TB of video to thousands of simultaneous users.

Monday, September 24, 2012

DuraSpace announces SDSC as storage partner

DuraSpace recently announced their relationship with SDSC as a storage provider for DuraCloud. As I posted a while ago, we have been using both DuraCloud and their SDSC storage partner for a while. It's great to see DuraCloud continue to grow.

Friday, September 21, 2012

Setting up Kaltura - part V

In this post we will look at the XML we use to ingest content into Kaltura through its Drop Folder mechanism. To re-cap an earlier post about the Drop Folder, ours is a subdirectory under the home directory of the 'kaltura' user. We provisioned this account on a special-purpose machine that Kaltura accesses via sftp to fetch content without human intervention.

Kaltura has a pretty nice guide to building the XML, which looks an awful lot like Media RSS. And they also make some short examples available, but we always find it useful to have a real-world example. Here's ours.

Preface: Our stuff is a little unusual. That is:

We always have pairs of videos, one classroom and one blackboard
We have lots of metadata and it applies to both videos, and so the metadata gets repeated in the XML. I have excised much of the metadata in our XML for this post
All of the metadata is fake; it is not real metadata about an actual classroom video
You can find a copy of the XML that we diagram below at this URL http://goo.gl/OxbyU
Our use of Kaltura is in support of the Measures of Effective Teaching (Extension) project, and so there are many references to 'metext' in the metadata
We will be generating the XML for ingest programmatically

Here goes.

 <?xml version="1.0"?>  
 <mrss xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="ingestion.xsd">  
     <channel>  
         <item>  
             <!-- These do not change -->  
             <action>add</action>  
             <type>1</type>  
             <!-- This changes for each video -->  
             <referenceId>12345-Board-video-rendition-MP4-H264v-Standard.mp4</referenceId>  
             <!-- This does not change -->  
             <userId>metext</userId>  
             <!-- This changes for each video -->  
             <name>RCN 12345 board video</name>  
             <!-- This changes for each video -->  
             <description>RCN 12345 board video</description>

We have the usual XML stuff at the beginning, and then the start of the Media RSS.

Action = add for new content.

Type = 1 for video content.

ReferenceID is the name of the original file.

UserID is the pseudo-user who will be linked to the content in Kaltura.

Name and Description are exposed as base metadata in Kaltura.

             <!-- Always assign two tags, one called metext and the other board or classroom -->  
             <tags>  
                 <tag>metext</tag>  
                 <tag>board</tag>  
             </tags>  
             <!-- This does not change -->  
             <categories>  
                 <category>metext</category>  
             </categories>  
             <!-- This does not change -->  
             <media>  
                 <mediaType>1</mediaType>  
             </media>

We tag everything with the project name and indicate whether it is a video of the blackboard or the classroom.

Kaltura uses Categories as a main way to browse and find content. We treat this as if it were a type of "is in collection" sort of attribute.

MediaType = 1 for video.

             <!-- This changes for each video -->  
             <contentAssets>  
                 <content>  
                     <dropFolderFileContentResource filePath="12345-Board-video-rendition-MP4-H264v-Standard.mov"/>  
                 </content>  
             </contentAssets>

This tells Kaltura that the video file is in the Drop Folder along with the XML.

Now for our project-specific metadata, which fits into a Kaltura structure called Custom Data:

             <!-- This changes for each video -->  
             <customDataItems>  
                 <customData metadataProfileId="22971">  
                     <xmlData>  
                         <metadata>  
                             <METXDistrictDistrictName>Ann Arbor</METXDistrictDistrictName>  
                             <METXDistrictDistrictNum>20</METXDistrictDistrictNum>  
                             <METXSchoolSchoolName>Huron High School</METXSchoolSchoolName>  
                             <METXSchoolSchoolMETXID>33</METXSchoolSchoolMETXID>  
                             <!-- Kaltura wants the date to be in xs:long format, that is, seconds from the epoch -->  
                             <METXVideoSubmissionCaptureDate>1344440267</METXVideoSubmissionCaptureDate>  
                         </metadata>  
                     </xmlData>  
                 </customData>  
             </customDataItems>

The metadataProfileId attribute value comes from our Kaltura KMC. I had to create the Custom Data schema first, and then reference it in the ingest XML here.

Most of the metadata fields are simple strings or strings from a controlled vocabulary. We do have one date item, and sadly Kaltura expects it to be in a difficult-to-use format, seconds since the epoch.
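For anyone else who has to produce that format, here is a minimal Python sketch of the conversion. The function names are ours, made up for illustration, and we are assuming the value should be read as UTC; the post's sample metadata is fake, so treat the date below purely as arithmetic.

```python
from datetime import datetime, timezone


def to_epoch_seconds(moment: datetime) -> int:
    """Return whole seconds since 1970-01-01 00:00:00 UTC."""
    if moment.tzinfo is None:
        # A naive datetime would be interpreted in the local zone, which is
        # easy to get wrong, so this sketch insists on an explicit zone.
        raise ValueError("pass a timezone-aware datetime")
    return int(moment.timestamp())


def from_epoch_seconds(seconds: int) -> datetime:
    """Turn an epoch value back into an aware UTC datetime for checking."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


capture = datetime(2012, 8, 8, 15, 37, 47, tzinfo=timezone.utc)
print(to_epoch_seconds(capture))                    # 1344440267
print(from_epoch_seconds(1344440267).isoformat())   # 2012-08-08T15:37:47+00:00
```

Converting back and eyeballing the result is a cheap way to catch time zone mistakes before the XML goes into the Drop Folder.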

After this section of the XML is a closing tag for item, and then the whole thing repeats with only minor variation for the classroom video. I'll include it below for completeness.

         </item>  
         <item>  
             <!-- These do not change -->  
             <action>add</action>  
             <type>1</type>  
             <!-- This changes for each video -->  
             <referenceId>12345-Classroom-video-rendition-MP4-H264v-Standard.mp4</referenceId>  
             <!-- This does not change -->  
             <userId>metext</userId>  
             <!-- This changes for each video -->  
             <name>RCN 12345 classroom video</name>  
             <!-- This changes for each video -->  
             <description>RCN 12345 classroom video</description>  
             <!-- This changes for each video -->  
             <tags>  
                 <tag>metext</tag>  
                 <tag>classroom</tag>  
             </tags>  
             <!-- This does not change -->  
             <categories>  
                 <category>metext</category>  
             </categories>   
             <!-- This does not change -->  
             <media>  
                 <mediaType>1</mediaType>  
             </media>  
             <!-- This changes for each video -->  
             <contentAssets>  
                 <content>  
                     <dropFolderFileContentResource filePath="12345-Classroom-video-rendition-MP4-H264v-Standard.mov"/>  
                 </content>  
             </contentAssets>  
             <!-- This changes for each video -->  
             <customDataItems>  
                 <customData metadataProfileId="22971">  
                     <xmlData>  
                         <metadata>  
                             <METXDistrictDistrictName>Ann Arbor</METXDistrictDistrictName>  
                             <METXDistrictDistrictNum>20</METXDistrictDistrictNum>  
                             <METXSchoolSchoolName>Huron High School</METXSchoolSchoolName>  
                             <METXSchoolSchoolMETXID>33</METXSchoolSchoolMETXID>  
                             <!-- Kaltura wants the date to be in xs:long format, that is, seconds from the epoch -->  
                             <METXVideoSubmissionCaptureDate>1344440267</METXVideoSubmissionCaptureDate>  
                         </metadata>  
                     </xmlData>  
                 </customData>  
             </customDataItems>  
         </item>  
     </channel>  
 </mrss>

And that's it.
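Since the two items differ only in a handful of values, the XML lends itself to being generated. What follows is a rough sketch of one way to do that with Python's standard library, not the generator we actually run. The function names and the shape of the metadata dictionary are assumptions for illustration; the element names and fixed values are copied from the sample above.

```python
import xml.etree.ElementTree as ET

PROJECT = "metext"        # userId, tag, and category in the sample
PROFILE_ID = "22971"      # metadataProfileId from the sample
STEM = "{rcn}-{view}-video-rendition-MP4-H264v-Standard"


def build_item(rcn, view, metadata):
    """Build one <item>; view is 'board' or 'classroom'."""
    stem = STEM.format(rcn=rcn, view=view.capitalize())
    title = f"RCN {rcn} {view} video"
    item = ET.Element("item")
    ET.SubElement(item, "action").text = "add"   # new content
    ET.SubElement(item, "type").text = "1"       # video
    # The sample uses .mp4 in referenceId and .mov in filePath; mirrored as-is.
    ET.SubElement(item, "referenceId").text = stem + ".mp4"
    ET.SubElement(item, "userId").text = PROJECT
    ET.SubElement(item, "name").text = title
    ET.SubElement(item, "description").text = title
    tags = ET.SubElement(item, "tags")
    for tag in (PROJECT, view):
        ET.SubElement(tags, "tag").text = tag
    categories = ET.SubElement(item, "categories")
    ET.SubElement(categories, "category").text = PROJECT
    media = ET.SubElement(item, "media")
    ET.SubElement(media, "mediaType").text = "1"
    content = ET.SubElement(ET.SubElement(item, "contentAssets"), "content")
    ET.SubElement(content, "dropFolderFileContentResource", filePath=stem + ".mov")
    custom_items = ET.SubElement(item, "customDataItems")
    custom = ET.SubElement(custom_items, "customData", metadataProfileId=PROFILE_ID)
    fields = ET.SubElement(ET.SubElement(custom, "xmlData"), "metadata")
    for name, value in metadata.items():
        ET.SubElement(fields, name).text = str(value)
    return item


def build_mrss(rcn, metadata):
    """Return the ingest XML for one board/classroom pair as a string."""
    mrss = ET.Element("mrss", {
        "xmlns:xsd": "http://www.w3.org/2001/XMLSchema",
        "xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
        "xsi:noNamespaceSchemaLocation": "ingestion.xsd",
    })
    channel = ET.SubElement(mrss, "channel")
    for view in ("board", "classroom"):
        channel.append(build_item(rcn, view, metadata))
    return '<?xml version="1.0"?>\n' + ET.tostring(mrss, encoding="unicode")


sample = {
    "METXDistrictDistrictName": "Ann Arbor",
    "METXDistrictDistrictNum": 20,
    "METXVideoSubmissionCaptureDate": 1344440267,
}
print(build_mrss("12345", sample))
```

The same metadata dictionary is written into both items, which matches the repetition in the hand-written XML above.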

Wednesday, September 19, 2012

An Inconvenient Outage

Some of you probably noticed that we had a rough weekend with the web site. We first saw trouble around 4pm EDT on Saturday. After some troubleshooting and investigation left us unsure of the root cause, we failed things over to our replica around 5pm EDT. We then ran off the replica overnight and through the following morning.

The big breakthrough came at 1:30pm or so Sunday when we isolated the cause, and then it took only a few minutes to correct the problem, test the solution, and finally roll service back to the production site. As with any longer outage, this one pointed out a bunch of small, but important, changes to make in procedures and documentation.

My apologies if you happened to be using our web site late afternoon on Saturday; that was certainly the roughest time.

Monday, September 17, 2012

Introducing ICPSR's Virtual Data Enclave (VDE)

The ICPSR Virtual Data Enclave (VDE) is a secure, virtual environment in which a researcher can analyze sensitive data, create research products, and then take possession of those products and analysis. And while the VDE is not a substitute for a physical enclave and the types of security protocols it facilitates, the VDE is very much a potential substitute for the traditional practice of distributing confidential data via removable media, such as CD-ROMs.

The VDE uses much of the same technology that ICPSR uses internally for its Secure Data management Environment (SDE) which we have described a few times. In brief, we use a virtual desktop environment that is operated by the University of Michigan's central IT shop and connect it to what we call our Private Network Attached Storage (NAS) appliance. Both the virtual desktop and NAS are behind a firewall, and we use the firewall and Windows group policies to restrict what actions one may perform. Download? Nope. Cut-and-paste between the virtual desktop and the real desktop? Uh uh. Capture screenshots by taking a picture of your monitor? Well, ......

The virtual environment keeps sensitive datasets under lock and key at ICPSR, but makes them available to researchers. The environment contains the usual array of applications used in the social sciences (but no email!), exactly the same sort of stuff we might set up for a visiting scholar or OR.

The researcher accesses the environment through a small, easy-to-download and -install client based on VMware View Client. Authentication takes place using standard University of Michigan credentials which we (ICPSR) and others at UMich can issue to "friends." Access between the real desktop and the virtual desktop is encrypted, and we are in the process of adding IPSEC encryption between the virtual desktop and the NAS. (This latter traffic passes over UMich's data backbone, and access to those routers is limited to UMich central IT network engineers.)

The virtual machine is completely ephemeral and can be wiped after each use. Any intermediate research or results are stored on the ICPSR NAS. Our NAS is backed up weekly, and tapes are cycled off-site quarterly. Once the research has been completed ICPSR retains a "just in case you need it" snapshot for up to three years.

Friday, September 14, 2012

Setting up Kaltura - part IV

We have been working on getting a Kaltura Drop Folder set up. A Drop Folder is a mechanism where an organization spools content to be ingested in a fixed location, and Kaltura polls the location, watching for content to ingest.

In our case it has taken about a month to get the Drop Folder configured, and much of this delay is preventable if you avoid the pitfalls we fell into. So in the spirit of giving back to the community, here are seven things to know when setting up a Drop Folder.

Host the Drop Folder yourself, do not host it at Kaltura.
Set up an account and a password on the machine, and share them with Kaltura. To keep things very simple I created an account called 'kaltura'.
Create a subdirectory under the kaltura user's home directory that will actually contain the content to be ingested. To keep things very simple I used the name 'dropfolder'.
Make sure that the kaltura user owns the Drop Folder directory, and that its access controls grant appropriate rights to other users that may need to ingest content.
Tell Kaltura the name of the machine. To keep things very simple I created a DNS CNAME record, kaltura.icpsr.umich.edu, that points to the right machine.
Be sure you have ssh installed and running on port 22 on the machine. If you normally do not run ssh on port 22 (we don't), don't forget to open a hole in your firewall so that Kaltura machines can reach the Drop Folder.
Tell Kaltura to use sftp and port 22 to connect to your Drop Folder. Do not try to use a port other than 22.

To re-cap the values we used:

Host: kaltura.icpsr.umich.edu
Protocol: sftp
Port: TCP 22
Login: kaltura
Password: XXXXXXXX
Drop folder: dropfolder
Drop folder UID:GID: kaltura:met
Drop folder mode: 2775

(We have a big video project called MET, and automated jobs running with the 'met' GID will need write access to the drop folder.)
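If the 2775 mode looks cryptic, this small Python sketch decodes it. It only interprets the number; it does not touch any directory. The helper name is ours, and the note about group inheritance describes the usual behavior of the setgid bit on directories on Linux.

```python
import stat

DROP_FOLDER_MODE = 0o2775   # the mode listed above, written as octal


def describe_directory_mode(mode: int) -> str:
    """Render a permission mode the way `ls -ld` would show it for a directory."""
    return stat.filemode(stat.S_IFDIR | mode)


print(describe_directory_mode(DROP_FOLDER_MODE))   # drwxrwsr-x
print(bool(DROP_FOLDER_MODE & stat.S_ISGID))       # True: setgid bit is set
```

Read left to right: the owner (kaltura) and the group (met) can both read, write, and traverse; everyone else can read and traverse. The 's' in the group column is the setgid bit, which on Linux typically makes new files in the directory inherit the directory's group rather than the creator's primary group.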

You may be tempted to suggest using ssh keys or non-standard ports for ssh. Fight those temptations.

Kaltura will offer to auto-delete content once it has been ingested. Accept that offer.

Know that when you delete items from the Drop Folder status window in your Kaltura KMC it will also delete them from the Drop Folder. This is not obvious, but turns out to be useful.

Now all you need are automated jobs that place content and Kaltura-style Media RSS XML into the Drop Folder. Kaltura has some nice examples on-line, but they are somewhat trivial. We'll post some more complex, real-world examples next week.

Friday, September 7, 2012

August 2012 deposits

Light month for deposits:

# of files	# of deposits	File format
56	30	application/msword
69	26	application/pdf
9	8	application/vnd.ms-excel
2	1	application/vnd.wordperfect
4	1	application/x-dosexec
24	7	application/x-sas
47	21	application/x-spss
7	2	application/x-stata
26	2	application/x-zip
1	1	image/gif
18	4	image/jpeg
1	1	message/rfc822 7bit
3	3	text/plain; charset=iso-8859-1
17	10	text/plain; charset=unknown
32	9	text/plain; charset=us-ascii
3	2	text/plain; charset=utf-8
39	7	text/rtf
1	1	text/x-mail; charset=iso-8859-1
4	2	text/x-mail; charset=unknown
2	2	text/x-mail; charset=us-ascii
1	1	text/xml

Just the usual stuff, but in pretty low quantities.

Wednesday, September 5, 2012

ICPSR web availability - August 2012

August was not our best month.

We did a bit better than 99.3% uptime. Almost all of the downtime is due to a recurring, as-yet-unsolved problem we are having with our Oracle database platform. The primary symptom is that the database platform stops fielding queries for about 5-15 minutes, which disables our production web site. The platform does this about 30-45 minutes AFTER it has finished a full export using the Oracle datapump system.
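To put that percentage in more tangible terms, here is a quick back-of-the-envelope calculation in Python. The function name is made up for this post, and 99.3% is the rounded figure from above, so the result is an upper bound rather than a measured total.

```python
def downtime_minutes(uptime_fraction: float, days_in_month: int) -> float:
    """Minutes of downtime implied by an uptime fraction over a whole month."""
    total_minutes = days_in_month * 24 * 60
    return (1.0 - uptime_fraction) * total_minutes


# August has 31 days, which is 44,640 minutes.
print(round(downtime_minutes(0.993, 31)))   # 312
```

So 99.3% uptime in August works out to roughly 312 minutes, a little over five hours, and we came in somewhat under that.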

Because our existing Oracle hardware is old and has a relatively slow disk I/O system, we're going to try to solve this problem by throwing hardware at it. For well under $10k we can replace our five-year-old hardware with something much newer. Goodbye RAID-5 SCSI, hello SSD.