Technology at ICPSR: August 2012

Wednesday, August 29, 2012

Setting up Kaltura - part III

Some good news and some bad news today.

The good news is that I've tested XML Ingest via the Drop Folder feature, and it seems to work very well. I was able to upload two videos, add their extensive metadata, and get it working well after fixing a simple (but dumb) typo I made in the XML. In terms of creating the right XML content, we are in great shape.

The bad news is that I've run into a couple of snags with the Kaltura Software as a Service (SaaS) offering.

The first is actually with the Drop Folder service. The current solution we are using is where Kaltura hosts the Drop Folder and we use sftp with a password to transfer a bundle of content - Kaltura-customized Media RSS XML plus a pair of video files. However, what we really need is a locally hosted Drop Folder (which costs less to operate) and a way for Kaltura to fetch the content. So far Kaltura hasn't been able to make this work. We had hoped that they could use an ssh key-pair (private at Kaltura and public installed in $HOME/.ssh/authorized_keys) for access, but at this time, Kaltura does not support ssh keys. So we are kind of stuck waiting for ssh key-pair support to appear, or we are stuck uploading content via sftp and typing passwords (i.e., a manual solution). Ick.

The second issue is around counting bits. Kaltura charges by the bit - storing them and streaming them. That makes a lot of sense, and is actually fine by us. However, so far we are not seeing reports or analytics that report usage by the bit, only by the minute.

In a case where one has a single collection with a single delivery platform operated by a single administrative unit, this may work quite well. The monthly bill from Kaltura goes to one place, and the analytics are very useful for reviewing what's getting played, how often, how much, etc.

But in a case where one has a single collection (like Measures of Effective Teaching) with multiple delivery platforms operated by multiple administrative units, this will prove problematic. In this scenario we really need a bill that breaks out usage like this:

Bits stored for the month - XX GB
Bits streamed by delivery platform A - XX GB
Bits streamed by delivery platform B - XX GB
Bits streamed by delivery platform X - XX GB

Then the partners could carve up responsibility for the bill. For example, maybe ICPSR pays for ALL of the storage and for the bits streamed by its delivery platform, the School of Education pays for the bits streamed by its delivery platform, and a future partner-to-be-named pays for the bits it streams.
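
To make the split concrete, here is a minimal Python sketch of how such a bill could be carved up. The per-GB rates, the usage figures, and the mapping of platforms to payers are all made up for illustration; none of them come from Kaltura's pricing or reports.

```python
# Hypothetical rates and usage; none of these figures come from Kaltura.
STORAGE_RATE_PER_GB = 0.10    # assumed dollars per GB stored per month
STREAMING_RATE_PER_GB = 0.25  # assumed dollars per GB streamed


def split_bill(stored_gb, streamed_gb_by_platform, platform_payer, storage_payer):
    """Return {payer: dollars owed} for one month.

    Storage is charged entirely to storage_payer; streaming is charged
    to whichever partner operates each delivery platform.
    """
    owed = {storage_payer: stored_gb * STORAGE_RATE_PER_GB}
    for platform, gb in streamed_gb_by_platform.items():
        payer = platform_payer[platform]
        owed[payer] = owed.get(payer, 0.0) + gb * STREAMING_RATE_PER_GB
    return owed


bill = split_bill(
    stored_gb=2000,
    streamed_gb_by_platform={"A": 400, "B": 120},
    platform_payer={"A": "ICPSR", "B": "School of Education"},
    storage_payer="ICPSR",
)
for payer, dollars in sorted(bill.items()):
    print(f"{payer}: ${dollars:,.2f}")
```

The point of the sketch is that the split only works if the streamed GB arrive already broken out by delivery platform, which is exactly the report we are not yet seeing.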

We had hoped to start using Kaltura in September, but my sense is that the Drop Folder issue will push this back. But the issue around counting is the big one, and I don't have a sense yet for whether this is easy to address or very difficult to address.

Friday, August 17, 2012

Setting up Kaltura - part II

I mentioned in the last Kaltura post that we've set up a Custom Data schema to hold the descriptive metadata for our video content. Setting up that schema took considerable effort, and I thought I might share some of the details in this post. For context, we are using the newly released Falcon edition of the Kaltura Management Console (KMC) web application.

One creates a Custom Data schema by using the Settings menu in the KMC, and then by navigating to the Custom Data tab. A button allows one to Add New Schema, and I happened to give ours the name of "MET Extension" since that is the name of the project generating the video. I did not set a System Name for the schema, and while Kaltura said that was required, the KMC does not enforce it, and lack of a System Name has not yet proven to be a problem.

One adds fields/elements to the schema one at a time using a pop-up window, Edit Entries Metadata Schema. This can be seriously laborious if you have a large schema like mine with 40 elements. Lots of cutting and pasting. Kaltura allows one to export the schema as an XSD XML file, but one cannot import such a file to create or update the schema.

Elements can be Text, Date, Text Select List, or Entry-id List. Each can be single or multi-value, and each can be indexed for search (or not). The KMC allows one to supply both a short and longer description for each element.

Text fields are exactly what you would expect, and Text Select Lists are basically pick-lists. The Entry-id field is useful if you want to store the ID of an extant Kaltura object. Date is a little tricky since the format one uses to supply a value for this field works one way in the KMC interface - conventional calendar format or pick from a calendar widget - but a very different way when ingesting via XML where it expects an integer number of seconds since the epoch.
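
For the XML side, a calendar date has to be turned into that integer first. Here is a minimal Python sketch of the conversion. The function name and the choice of midnight UTC are my assumptions; I have not confirmed which time zone Kaltura applies to the value.

```python
from datetime import datetime, timezone


def to_epoch_seconds(date_text):
    """Convert 'YYYY-MM-DD' to whole seconds since 1970-01-01 UTC.

    Assumption: midnight UTC is used; the time zone Kaltura expects
    is not stated in this post.
    """
    day = datetime.strptime(date_text, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(day.timestamp())


print(to_epoch_seconds("2012-08-17"))
```

Pinning the time zone explicitly avoids a value that shifts depending on the machine that generates the XML.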

We will be ingesting tens of thousands of videos into Kaltura, and so we will NOT be using the KMC to upload videos and to compose metadata. Instead we will be using their Drop Folder mechanism where one puts XML metadata (including pointers to video files) in a special-purpose local location to which Kaltura has access. Preparing the XML content that includes both the descriptive and technical metadata is our current project, and I'll report on that process - and the Drop Folder process - next.

Friday, August 10, 2012

July 2012 web availability at ICPSR

July 2012 was a pretty good month:

We had only 36 minutes of downtime in July, and 28 were due to maintenance as we tried (but failed) to update apache, perl, and mod_perl on our production web server. We discovered some interesting idiosyncrasies in some perl libraries during the maintenance. (Summary: Multi-word time zones like "New York" are trouble.)

Wednesday, August 8, 2012

July 2012 deposits at ICPSR

The totals for July 2012:

# of files	# of deposits	File format
1	1	F 0x07 video/h264
5201	1	application/dicom
1	1	application/msaccess
112	23	application/msword
121	4	application/octet-stream
241	41	application/pdf
6	5	application/vnd.ms-excel
1	1	application/vnd.ms-powerpoint
1	1	application/x-7z-compressed
44	1	application/x-arcview
83	1	application/x-dbase
73	3	application/x-dosexec
5	1	application/x-empty
11	3	application/x-sas
1	1	application/x-shellscript
150	21	application/x-spss
23	6	application/x-stata
28	4	application/x-zip
10024	5	image/jpeg
8	1	image/png
4	1	image/x-ms-bmp
2	2	message/rfc822 7bit
15	1	multipart/appledouble
5	5	text/html
1	1	text/plain; charset=iso-8859-1
4	3	text/plain; charset=unknown
157	38	text/plain; charset=us-ascii
7	4	text/plain; charset=utf-8
7	4	text/rtf
4	3	text/x-mail; charset=us-ascii
43	4	text/xml
25	1	video/unknown

Lots of image content this month to go with the usual stuff (e.g., SPSS) in the usual volumes.

Monday, August 6, 2012

Setting up Kaltura - part I

I've mentioned in previous posts that the University of Michigan is implementing Kaltura as its video content management solution. Kaltura is an open-source video platform that one can install and operate locally, and is also available in a software as a service (SaaS) version. The U-M is making use of the SaaS edition, and ICPSR is one of three pilot testers.

The off-the-shelf web application provided by Kaltura to manage content, collect analytics, publish content, create custom players, set access controls, etc. is called a Kaltura Management Console (KMC, for short). A major question for any enterprise using Kaltura is: How many KMCs do we need? The answer is: Just enough, and not one more.

It is very difficult to share content between KMCs, and so there is a major incentive to have the smallest number of KMCs, perhaps only a single one. However, it is also difficult to "hide" content from others who are sharing the same KMC, and that can cause concerns about privacy, access control, and inadvertent use (or mis-use) of content. In my mind giving someone an account on a KMC is like giving someone root access on a UNIX machine.

The solution we used at the U-M was to deploy two KMCs for now. One is for two types of content: video which is generally available to the public, such as promotional materials, and video which is used in courses via our local Sakai implementation, CTools. We provisioned a second KMC for ICPSR to use for its content, which falls more into the "research data" category. This content will require signed agreements for access.

Once Kaltura provisioned our KMC I performed a few initial house-keeping chores:

Created accounts (Authorized KMC Users) for my colleagues on the project. Each has a Publisher Administrator role. (Administration tab in the Falcon release of the KMC.)
Changed the Default Access Control Profile to require a Kaltura Session (KS) token. The player should need a KS token for all content managed by this KMC. (Settings - Access Control tab)
Created a new Access Control Profile (called Open) which does not require KS. I don't know if I will need this, but want to have a more open profile available.
Changed the Default Transcoding Flavors to (only) "Source." Our content has already been transcoded, and so we don't need to pay for the time and storage for additional flavors such as HD, Editable, iPad, Mobile, etc. (Settings - Transcoding Settings)
Created a Custom Data Schema to hold the extensive descriptive metadata that accompanies the content generated in our project (MET Extension). This step is extraordinarily tedious since it has to be done field-by-field through a web GUI. I can download a copy of the schema I created in XSD format; wish I could upload one to create it. (Settings - Custom Data)
Created a slightly customized player for use with our content. Wanted to size it to fit our content, remove the Download button, etc. This is super easy. (Studio tab)
Created a Category which we will use to "tag" our content. (In this case I created one called MET-Ext.) This is mostly useful for searching and browsing within the KMC interface. (Content - Categories tab)
Uploaded a few videos and set a few of the metadata fields. (Content - Entries)
Put in a request to our account manager to enable a locally hosted Drop Folder. This is a mechanism whereby we create a local "fetch" location where an automated Kaltura job can pull content and ingest it. While one would think that this is a common mechanism for submitting content, the process is slow and poorly documented, and cannot be managed via the KMC. I'll post more details about the process once I have a working Drop Folder in place.
Created the local infrastructure for the Drop Folder which is really just identifying a machine to play host, and then creating an account Kaltura can use.

These steps got us to the point where we could start putting Media RSS files containing the metadata and pointers to the video into our Drop Folder for ingest into Kaltura.

Friday, August 3, 2012

How can I put a meeting "out to bid?"

At the University of Michigan if one wants to spend $5,000 to buy a small rack-mount server with a five-year service life one needs to put the request out for bid. We then need to justify the vendor selected, and if we choose a bid which is NOT the lowest price, we have even more explaining to do.

It makes sense that the U-M would want to make sure that major purchases receive the right level of oversight and review. One can argue whether the $5,000 limit is the right number, but it seems like some number (maybe a higher number?) is the right one.

However, if one wants to schedule a recurring one-hour monthly meeting with nine other people for five years (so 60 meetings total with 10 people at each meeting), the only barrier is sending out the invitation and getting people to attend. Now, of course, if one is a very senior-level person inviting direct reports or dotted-line reports, it is pretty easy to get people to attend.

Each meeting might cost around $500 in staff time; perhaps closer to $1,000 if the people are senior (expensive) and we include benefits and any other hourly fees. And so 60 such meetings will cost the U-M somewhere between $30,000 and $60,000 over the course of five years.
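
The arithmetic is simple enough to check in a few lines of Python. The per-meeting costs are the rough estimates above, and the per-person-hour figures simply fall out of dividing by the ten attendees of a one-hour meeting.

```python
MEETINGS_PER_YEAR = 12  # monthly meeting
YEARS = 5
ATTENDEES = 10          # the organizer plus nine others


def series_cost(cost_per_meeting):
    """Total cost of the whole five-year meeting series."""
    return MEETINGS_PER_YEAR * YEARS * cost_per_meeting


low = series_cost(500)
high = series_cost(1000)
print(f"{MEETINGS_PER_YEAR * YEARS} meetings: ${low:,} to ${high:,}")

# Implied loaded cost of one person for one hour.
print(f"Per person-hour: ${500 / ATTENDEES:.0f} to ${1000 / ATTENDEES:.0f}")
```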

So here's the question: If the goal is to make sure that the U-M is spending its resources wisely, who's making sure that the resources spent on meetings are used wisely?

Meetings - not hardware, software, licenses, cloud computing, paper, printers, etc. - are the biggest expense we have.

Wednesday, August 1, 2012

ICPSR 2012 technology recap - where did the money go?

A post from a week or so ago showed the sources of money flowing into the technology organization at ICPSR. This week's post will focus on how that money is spent.

I should note that the focus of this post is on what we call the "Computer Recharge" at ICPSR. This is a tax paid by all FTEs at ICPSR that is levied on an hourly basis. Each time an employee completes his or her timesheet and allocates time to a project (exceptions: sick, holiday, and vacation time), a small amount of money also accumulates in the "Recharge." In FY 2012 this accounted for about 45% of the technology revenue.

Unlike a direct charge for technology where a project or grant may be paying for a dedicated server, extra storage space, or a fraction of a software developer working on custom systems, the Computer Recharge dollars are used to fund technology expenses that benefit the entire organization. This includes expenses as prosaic as printers and desktop computers, but also includes systems and software development for our delivery systems, ingest systems, and digital preservation systems.

Here's the breakdown in chart form:

No surprise that most of the money goes to pay people: systems administrators, desktop support specialists, software developers, and a group of very hands-on managers who do a lot of the same work plus project management and business analysis. This accounts for $862k out of $1218k.

Equipment ($214k) represents desktop machines, new storage capacity, printers, virtual machines, cloud storage, software licenses, and almost every non-salary expense.

Transfers ($103k) is an interesting category. This is money that we collect as major systems depreciate. But since the U-M doesn't really "do" depreciation, we use this interesting process instead:

Buy the item (using money from Equipment pot)
Estimate the lifetime of the item (say, five years)
Collect 1/5 of the purchase price each year for five years
At the end of each year, move the 1/5 collected into Transfers
When the item needs to be replaced, use the money in the Transfers pot

So the $103k represents money that was collected in FY 2012 for items that were purchased in earlier years, and which will need to be replaced in FY 2013 or beyond.
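
Here is a minimal Python sketch of that process for a single item. The $25,000 purchase price is a made-up example, and the function name is mine; only the "collect an equal share each year" rule comes from the steps above.

```python
def transfer_schedule(purchase_price, lifetime_years):
    """Return a list of (year, amount moved, Transfers balance) tuples.

    Each year an equal share of the purchase price is collected and
    moved into the Transfers pot.
    """
    annual = purchase_price / lifetime_years
    balance = 0.0
    schedule = []
    for year in range(1, lifetime_years + 1):
        balance += annual
        schedule.append((year, annual, balance))
    return schedule


# Hypothetical item: a $25,000 purchase with a five-year lifetime.
schedule = transfer_schedule(25_000, 5)
for year, moved, balance in schedule:
    print(f"Year {year}: move ${moved:,.0f}, Transfers balance ${balance:,.0f}")
```

By the end of the estimated lifetime the balance equals the original purchase price, which is what pays for the replacement.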

All of our other expenses are tiny by comparison: $19k for Internet access, $7k for telephones and service monitoring, $6k for travel, $5k for maintenance contracts, and $2k for miscellaneous fees and expenses.
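
As a quick check that the categories add up, here is a small Python sketch that totals the figures quoted in this post and shows each as a share of the whole. The amounts are the rounded thousands given above; the category labels are shortened by me.

```python
# FY 2012 Computer Recharge spending, in thousands of dollars,
# as quoted in this post.
expenses_k = {
    "People": 862,
    "Equipment": 214,
    "Transfers": 103,
    "Internet access": 19,
    "Telephones and service monitoring": 7,
    "Travel": 6,
    "Maintenance contracts": 5,
    "Miscellaneous fees and expenses": 2,
}

total_k = sum(expenses_k.values())
for name, amount in expenses_k.items():
    print(f"{name:<36}${amount:>4}k  {amount / total_k:6.1%}")
print(f"{'Total':<36}${total_k:>4}k")
```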

One take-away from this is that the essential element in technology budgeting isn't the purchase price of the server, or the annual cost of cloud storage, but rather the recurring cost of the people who will build, maintain, enhance, and customize the technology portion of the business.