Technology at ICPSR

Clients v. customers | services v. products

2013-07-19T06:00:00.000-04:00

Seth Godin has another excellent post. This one notes the distinction between customers (who decide whether or not to buy your product) and clients (who pay you to make things for them).

Seth Godin

In the context of ICPSR I think we have a product we call "ICPSR membership." Customers buy it (or not), and if they do, they receive a reasonably well-defined set of services, largely centered around the ability to access high-quality datasets and documentation. We have many hundreds of customers for this product. I think our Summer Program is also a product, and that too has many hundreds of customers.

We also have a smaller number, perhaps a dozen or so, of clients. In the best case we have a handful of clients who all pay us to perform a similar set of tasks for them: curate their datasets and documentation, preserve the curated artifacts, and publish the content on a specially "skinned" version of the ICPSR web site for all the world to see. Adding more clients who want us to do this kind of work benefits all of the other clients, and, often, our customers too.

And like any organization which draws much of its revenue from contract work for clients, we also have those that push us in new, different directions, sometimes for the better, and sometimes for the worse. The trick, of course, is not to try to head off in too many different directions at once. And to favor those clients who pull us in better, not worse, directions.

ICPSR Web Availability - 2012-2013

2013-07-17T06:00:00.000-04:00

Here are the final numbers for ICPSR's web site availability over our last fiscal year:

Click to embiggen

The year did not start off so well, and we reached the nadir quickly. August 2012 was our worst period of availability in a very bad year for us overall. January, March, and June 2012 also had very poor numbers.

The main antagonist we faced was a new and unusual problem with our Oracle database server. For many years we would export the content for backup purposes each evening, and it worked well for a decade. However, suddenly in 2012 we began to experience an outage just AFTER each export. Despite intensive analysis by ourselves and local Oracle experts, we never could isolate the root cause of failure.

We eventually "solved" the problem by exporting our database only once per week v. once per day. That left us more exposed to loss, of course, but it seemed to limit the outages to once per week v. once per day.

We then replaced the hardware with a new machine with a bit more processor and memory, but with blindingly fast solid-state drives. With the new machine deployed we returned to our daily export schedule, and the machine -- and our web availability -- have been in pretty good shape ever since. The machine went into service in April 2013, and the chart above makes it clear that life has been a little less hectic for our on-call engineer since then.

A tiny wishlist for Amazon Web Services' Route 53

2013-06-21T06:00:00.000-04:00

We've been using the DNS hosting service, Route 53, from Amazon Web Services (AWS). The default port for a DNS server is UDP (and TCP) 53, and I've always presumed that this was the answer to the question: Why did Amazon name its DNS service Route 53?

In general I like the Route 53 service pretty well. It's smart how the DNS servers listed for a Hosted Zone (the term AWS uses for a domain hosted in Route 53) reside in different top-level domains, like ORG, NET, COM, and even CO.UK. The UI in the AWS Management Console is fine for managing small zones that contain just a handful of records.

There's one feature that I wish Route 53 had, though, and it would be particularly useful, I think, to research organizations in higher education.

In our grants and contracts there is often a commitment to build, deploy, and operate some technology deliverable. Often the technology is a web portal of some sort, and the investigator is keen to register a new domain. This leads to an initial registration of something like:

WhizBangProject.org

The domain may have only the smallest number of records: an SOA and NS records, of course, and then perhaps an MX record routing mail to a central server, and an A record pointing to the IPv4 address of the web portal.

Soon, though, the researcher may decide to register the same name in different top-level domains, and we have:

WhizBangProject.net

WhizBangProject.com

WhizBangProject.info

joining the mix. These domains have EXACTLY the same records as the first one, and so if one is running his/her own DNS service, one can configure the DNS server to use the same zone file when loading all of the domains. This is nice - one file with one set of records to manage for many different domains.

However, it is often the case that the investigator discovers that the original name is not satisfactory, and so we then register an alternate name in several domains:

CoolBeansResearch.org

CoolBeansResearch.net

CoolBeansResearch.com

CoolBeansResearch.info

and maybe a slight variant too:

Cool-Beans-Research.org

Cool-Beans-Research.net

Cool-Beans-Research.com

Cool-Beans-Research.info

In a world where one runs one's own DNS server, the additional domains are not much extra work. Like the original solution where we pointed the new domains at the same zone file, we can just point these new domains at that same zone file.

I wish Route 53 would let me create a collection of what they call Record Sets, and then apply those same records to an arbitrary set of what they call Hosted Zones. If the SOA and NS Record Sets were unique to each Hosted Zone, that would be OK; it is really the other records - the ones we add ourselves in Route 53 - that we would want to share across all of the Hosted Zones.
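Until something like that exists, one workaround is to keep the shared records in a single template and expand it per domain before loading each Hosted Zone. The sketch below shows only that expansion step in plain Python. The function name, the tuple layout, and the sample MX target and IPv4 address are all assumptions made up for illustration, and nothing here calls the Route 53 API; the output would still need to be submitted to each Hosted Zone with whatever tooling one prefers.

```python
def expand_shared_records(template, zones):
    """Expand one shared list of records into per-zone record lists.

    template: list of (name, type, ttl, value) tuples, where the name "@"
              stands for the zone apex (the same convention zone files use).
    zones:    list of domain names that should all carry the same records.
    """
    expanded = {}
    for zone in zones:
        apex = zone.rstrip(".").lower() + "."
        records = []
        for name, rtype, ttl, value in template:
            fqdn = apex if name == "@" else f"{name}.{apex}"
            records.append((fqdn, rtype, ttl, value))
        expanded[zone] = records
    return expanded


# Hypothetical shared records: one MX and one A record, as in the example above.
SHARED_RECORDS = [
    ("@", "MX", 3600, "10 mail.example.edu."),
    ("@", "A", 3600, "192.0.2.10"),
]

if __name__ == "__main__":
    zones = ["WhizBangProject.org", "WhizBangProject.net", "CoolBeansResearch.org"]
    for zone, records in expand_shared_records(SHARED_RECORDS, zones).items():
        for record in records:
            print(zone, record)
```

SOA and NS records are left out of the template on purpose, since those are the ones that differ per Hosted Zone.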

EMC transfer_support_materials fix for anonymous ftp

2013-06-19T06:30:00.000-04:00

Last month I posted about an issue we have been having with our EMC NS 120 NAS. To re-cap briefly... When the NS 120 discovers a problem, one action it often will take is to collect up a bunch of diagnostic information, Zip it up, and then use anonymous ftp to transfer it to EMC. A shell script under the /nas/tools directory called transfer_support_materials does the dirty work. The problem we have been experiencing is rooted in this script; it would fault when trying to transfer the Zip file.

The sequence of ftp commands inside the script is simple:

1. Connect to ftp.emc.com
2. Log in using the user name anonymous and a password unique to the NS 120
3. Change directory to /incoming/APMxxxxxxxxxxx (where the string of x's is replaced with the NS 120 serial number)
4. Transfer the Zip file

The script would always fail at step #3 with the message: File unavailable.

The root of the problem is that the transfer_support_materials script expects the directory to exist, but it doesn't.

At first I thought that the problem was with the EMC anonymous ftp server. I opened several SRs trying to get someone to create the directory. None of the SRs ever reached a satisfactory closure, and I was left with the impression that the answer was, "Of course the directory doesn't exist; we delete them after a couple of days automatically."

So..... The tool to transfer diagnostics expects the directory to exist, and the business process at EMC deletes the directory as a routine matter.

At the suggestion of one of my colleagues, I ran ftp by hand, and discovered that it would happily let me create the directory. That is, I could manually do this:

1. Connect to ftp.emc.com
2. Log in using the same credentials as the NS 120
3. mkdir /incoming/APMxxxxxxxxxxx
4. cd /incoming/APMxxxxxxxxxxx
5. Transfer the Zip file

I decided to tweak transfer_support_materials, adding this new element to the existing sequence of ftp commands. The change is really simple. This:

#do transfer
LFTPCOMMANDFILE="open -u ${username},${password} $HostName;cd $remote_name;rm -f ${newfile##*/};put $newfile"

becomes this:

#do transfer
LFTPCOMMANDFILE="open -u ${username},${password} $HostName;mkdir $remote_name;cd $remote_name;rm -f ${newfile##*/};put $newfile"

I ran a quick test of the script after this change, and voilà, it works again.
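To make the change easier to see, here is a small Python sketch that assembles the same semicolon-separated command string, with the mkdir step as an optional extra. It is only an illustration of the command sequence: the function and parameter names are made up, and the real script is a shell script that builds the string inline.

```python
import os


def build_lftp_commands(username, password, host, remote_dir, local_file,
                        create_dir=True):
    """Return an lftp command string modeled on the LFTPCOMMANDFILE line above.

    With create_dir=True the string includes the added mkdir step;
    with create_dir=False it matches the original sequence.
    """
    file_name = os.path.basename(local_file)  # same idea as ${newfile##*/}
    steps = [f"open -u {username},{password} {host}"]
    if create_dir:
        steps.append(f"mkdir {remote_dir}")
    steps += [
        f"cd {remote_dir}",
        f"rm -f {file_name}",
        f"put {local_file}",
    ]
    return ";".join(steps)


if __name__ == "__main__":
    # All values below are placeholders, not real credentials or paths.
    print(build_lftp_commands(
        "anonymous", "secret", "ftp.emc.com",
        "/incoming/APM00000000000",
        "/nas/var/emcsupport/support_materials.zip",
    ))
```

If the directory already exists, the mkdir step will report an error; in my testing of the modified script that did not stop the transfer, but treat that as an observation about our setup rather than a guarantee.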

EMC anonymous ftp service and transfer_support_materials

2013-05-01T11:22:00.000-04:00

I have not seen notes about this in forums and boards, and so thought I would pass this along to others who may be using EMC gear.

About a month ago we had a small problem with one of our NS 120 Celerra NAS units. (It may have been soft errors on one of its disk drives.) The Celerra detected the problem, and went to do its usual thing: collect logs and other analytics, and then copy them to EMC's anonymous ftp site. Our Celerra uses a utility called /nas/tools/transfer_support_materials to do this. We noticed that when the Celerra tried to transfer the support materials, that too failed, and this generated an additional series of critical errors.

We logged into the Celerra's control station and ran transfer_support_materials by hand. And we saw a message like this:

[nasadmin@controller tools]$ /nas/tools/transfer_support_materials -uploadlog
transfer_support_materials[12057]: The transfer script has started.
PING ftp.emc.com (168.159.219.138) 56(84) bytes of data.
From 12.249.233.6 icmp_seq=0 Packet filtered

--- ftp.emc.com ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
, pipe 2
cd: Access failed: 550 Requested action not taken. File unavailable. (/incoming/APM00000000000)
`/nas/var/emcsupport/support_materials_APM00000000000.130407_1351.zip' at 65536 (0%) 49.1K/s eta:5m [Connection idle]

I've replaced our Celerra's serial number with the string "00000000000".

We then ran ftp by hand to see if we could replicate the error:

[nasadmin@controller tools]$ ftp ftp.emc.com
Connected to ftp.emc.com.
220-Proceeding further constitutes acknowledgement
to EMC Acceptable Use and Customer Security policies.
Anonymous uploads are immediately moved to a secure server accessible only
within EMC networks.
File downloads from ftp.emc.com are restricted to selected /pub directories, via
temporary secure accounts or via specific permanent secure accounts only.
Anonymous users please login with anonymous and email address as your password
See Powerlink emc278739 for upload instructions.
EMC staff: please refer to current services, FAQ and Best Practices documents at
http://one.emc.com/clearspace/community/active/css/projects/ftp-service
Please email all questions and concerns to ftpquestions@emc.com
220 Please reference the FTP Acceptable Use policy: http://itcentral.corp.emc.com/Policies/AcceptableUse.pdf
534 Command denied.
534 Command denied.
KERBEROS_V4 rejected as an authentication type
Name (ftp.emc.com:nasadmin): anonymous
331 User name okay, need password.
Password:
230 User logged in, proceed.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> cd /incoming/APM00000000000
550 Requested action not taken. File unavailable.
ftp>

So, the problem was that the directory that holds our support materials (/incoming/APM<serialnum>) was missing or had its mode set to something that disallowed access.

We contacted EMC, and some days later they confirmed that the problem was indeed that the directory was missing, and that they had recreated it. We then ran ftp by hand to confirm that everything was working again, and it was. That was good news, but when we tried the same thing on our second NS 120 Celerra, we discovered that it too was missing its "support directory" on the ftp server. So we added that trouble report to our service request, and some days later, EMC confirmed that it too had been missing, and recreated it as well. In speaking with EMC it is a bit unclear whether this problem is particular to us or more widespread.

The upshot of the story is that if you too run a Celerra or other product that sends support materials to EMC via anonymous ftp, this might be a good day to test out transfer_support_materials to make sure that your "support directory" is intact. If so, that's great, but if it is missing, you may want to open a service request with EMC soon so that they can recreate the directory for you. Better to have it in place before your system needs to send support materials, but is not able to do so.
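For anyone who would rather script the check than run ftp by hand, here is a minimal Python sketch using the standard ftplib module. It does what we did manually: log in anonymously and try to change into the support directory. The function names are made up for this post, and the host name and directory layout are simply the ones shown in the session above; I have not verified them against any other EMC product.

```python
from ftplib import FTP, error_perm


def support_dir_exists(ftp, path):
    """Return True if the server lets us cd into path, False if it refuses.

    ftp is a logged-in ftplib.FTP connection (or anything with a cwd method).
    A 5xx reply such as "550 ... File unavailable" raises error_perm.
    """
    try:
        ftp.cwd(path)
        return True
    except error_perm:
        return False


def check_support_dir(password, support_dir, host="ftp.emc.com"):
    """Log in anonymously and test one directory, e.g. /incoming/APM<serialnum>."""
    with FTP(host, timeout=30) as ftp:
        ftp.login("anonymous", password)
        return support_dir_exists(ftp, support_dir)
```

A call such as check_support_dir("you@example.edu", "/incoming/APM00000000000") would return False in the situation described above and True once the directory is back.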

I should note that we're still happy overall with EMC; in fact, we've just purchased the first three nodes of a new Isilon storage system from them. So the intent here isn't to excoriate them over the missing ftp directory; it was easy to reproduce the problem and to correct it. But we did wish that we had been able to learn about the problem prior to the disk failure so that it could have been corrected earlier, not when the Celerra was trying to report a disk failure.

ICPSR launches Measures of Effective Teaching web site

2013-04-29T05:00:00.000-04:00

Some of my colleagues, including ICPSR Director George Alter, gave a demo of one of our newest Web sites and collections at the American Educational Research Association 2013 annual meeting on Sunday.

Click the image to navigate to the live site

My team has built the video portal portion of the system. The portal enables a researcher to play a list of videos that s/he has selected to view based on an analysis of the associated quantitative data and tagging data. Access to the video and datasets is restricted and requires one to complete a data use agreement via ICPSR's web-based request system.

We're grateful for the support we've received from the Bill and Melinda Gates Foundation to make all of this possible.

Qualys browser checker

2013-04-22T07:00:00.000-04:00

Ever since the recent craziness with vulnerabilities in Java plugins, I've been making a conscious effort to use Qualys's browser checker - https://browsercheck.qualys.com/ - on a routine basis both at home and at the office.

Installing the tool in your browser is very easy, and the service is free and painless to use. I have been using it both to determine if my current browser and plugins are up to date, and also to identify plugins that are installed and enabled, but which I don't really need or use (e.g., Silverlight, which I often disable for long stretches at a time).

Qualys generates a nice report like the one above to let you know if everything is up-to-date.

Web availability at ICPSR - March 2013

2013-04-18T09:16:00.001-04:00

ICPSR's content delivery system showed very high availability in March 2013: a bit over 99.95% uptime. We had only two problems in March. One was a power outage that affected our headquarters on the University of Michigan campus, and we experienced a small amount of downtime as we moved service to our replica in Amazon's cloud. The second was a 21-minute outage due to a continuing -- but now solved, we think -- problem with exporting content from our Oracle database server.

Here are the overall numbers for ICPSR's 2012-2013 fiscal year:

click to enlarge

We replaced our aging Oracle database server with a new machine which has twice the memory, twice the computing power, and perhaps most impressively, has 300 times the disk I/O speed(!). The new machine has an array of solid-state drives (SSDs), and we use this for all of our database storage. (The operating system resides on conventional disk drive technology.)

Web availability at ICPSR - October 2012

2012-11-16T06:00:00.000-05:00

October was a very good month for system uptime - over 99.9% availability:

Click chart to enlarge

That's good news after a much rougher September. So far things look good this month, although a number of very short-lived outages have already pushed us below 99.9% for the month.

A commentary on MOOCs from Clay Shirky

2012-11-14T07:00:00.000-05:00

Some of my colleagues - past and present - are attending classes in Massive Open Online Courses (MOOCs). I've been following their stories and also columnists who have been talking about MOOCs and education. It is a very interesting time.

Clay Shirky has a long post (Napster, Udacity, and the Academy) about MOOCs that is well worth reading. Some highlights:

The recording industry concluded this new audio format would be no threat, because quality mattered most. Who would listen to an MP3 when they could buy a better-sounding CD at the record store? Then Napster launched, and quickly became the fastest-growing piece of software in history. The industry sued Napster and won, and it collapsed even more suddenly than it had arisen.

If Napster had only been about free access, control of legal distribution of music would then have returned the record labels. That’s not what happened. Instead, Pandora happened. Last.fm happened. Spotify happened. ITunes happened. Amazon began selling songs in the hated MP3 format.

and

It’s been interesting watching this unfold in music, books, newspapers, TV, but nothing has ever been as interesting to me as watching it happen in my own backyard. Higher education is now being disrupted; our MP3 is the massive open online course (or MOOC), and our Napster is Udacity, the education startup.

We have several advantages over the recording industry, of course. We are decentralized and mostly non-profit. We employ lots of smart people. We have previous examples to learn from, and our core competence is learning from the past. And armed with these advantages, we’re probably going to screw this up as badly as the music people did.

Nick Carr, MOOCs, and ethics

2012-11-12T16:02:00.003-05:00

In his post The ethics of MOOC research, Nick Carr describes a note he received from a colleague in academia who comments on the research agenda of Massive Open Online Courses:

The MOOCs’ research agenda seems entirely wholesome. But it does raise some tricky ethical issues, as a correspondent from academia pointed out to me after my article appeared. “At most institutions,” he wrote, the kind of behavioral research the MOOCs are doing “would qualify as research on human subjects, and it would have to be approved and monitored by an institutional review board, yet I have heard nothing about that being the case with this new adventure in technology.” Universities are, for good reason, very careful about regulating, approving, and monitoring biological and behavioral research involving human subjects. In addition to the general ethical issues raised by such studies, there are strict federal regulations governing them. I am no expert on this subject, but my quick reading of some of the federal regulations suggests that certain kinds of purely pedagogical research are exempt from the government rules, and it may well be that the bulk of the MOOC research falls into that category.

Given the intense energy ICPSR has been putting into its systems for protecting confidential research data and facilitating requests to use such data, I found this very interesting.

I see parallels here with collecting and using personal information. If one conducts a survey and asks personal questions of well-consented adults, the results might one day become an interesting, restricted-use dataset. But if the same information is harvested from freely and openly available blogs, tweets, and wall posts, would it also become restricted-use data?

Artificial Intelligence as defined by Nick Carr

2012-10-15T06:30:00.000-04:00

Nick Carr has a short post here marking the occasion of Facebook's one billionth member. He goes on to talk a bit about some work at Google on neural nets, but then includes this gem on artificial intelligence:

Forget the Turing Test. We’ll know that computers are really smart when computers start getting bored. If you assign a computer a profoundly tedious task like spotting potential house numbers in video images, and then you come back a couple of hours later and find that the computer is checking its Facebook feed or surfing porn, then you’ll know that artificial intelligence has truly arrived.

It's a short post and a good read.

September 2012 deposits at ICPSR

2012-10-10T07:00:00.000-04:00

The numbers from September are in:

# of files	# of deposits	File format
29	1	F 0x07 video/h264
145	17	application/msword
5	1	application/octet-stream
292	11	application/pdf
14	5	application/vnd.ms-excel
1	1	application/vnd.ms-powerpoint
120	1	application/x-arcview
31	1	application/x-dbase
1	1	application/x-rar
24	6	application/x-sas
14	6	application/x-spss
13	5	application/x-stata
5	5	application/x-zip
1	1	image/jpeg
32	1	image/x-3ds
70	2	multipart/appledouble
10	3	text/plain; charset=unknown
62	18	text/plain; charset=us-ascii
2	1	text/rtf
29	2	text/xml

Interesting month in that we have the usual stuff in the usual quantities, but we also have a large number of unusual formats hitting the doorstep, such as ArcView and Apple Double. And we also have a usual format in an unusually high quantity (MS Word).

September 2012 web availability

2012-10-05T06:30:00.000-04:00

September was an OK, but not great, month for web availability:

Click to enlarge

We eliminated one frequent, but short-lived source of downtime when we stopped exporting the content of our Oracle database nightly. We are now doing it only on the weekend, and while that adds some risk, we're gaining significant uptime. (For some reason that we do not understand, our Oracle instance stops answering queries for 15-20 minutes about ten minutes AFTER the export completes.) We have a new server racked and ready to install, and we're hoping that a fast new machine with solid-state drives will solve the problem for us.

We did run into some trouble mid-month when some routine maintenance went awry, and we had to fail over to our replica over the weekend of September 15 and 16. The total amount of downtime was about 90 minutes over the course of the weekend, but the replica kept the problem from clobbering our service completely.

After that we had pretty smooth sailing: just 16 minutes of downtime for the rest of the month.
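For a rough sense of what those minutes mean as a percentage, here is a small worked calculation in Python. It counts only the two figures quoted above (about 90 minutes over the failover weekend plus 16 minutes afterward) and ignores any outages earlier in the month, so the true September number would be no higher than this. The function names are made up for the example.

```python
def availability_percent(downtime_minutes, days_in_month):
    """Availability as a percentage of all minutes in the month."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def downtime_budget_minutes(target_percent, days_in_month):
    """Minutes of downtime a month can absorb and still hit the target."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - target_percent) / 100.0


if __name__ == "__main__":
    # September has 30 days, or 43,200 minutes.
    print(f"{availability_percent(90 + 16, 30):.2f}%")          # 106 minutes down
    print(f"{downtime_budget_minutes(99.9, 30):.1f} minutes")   # budget for 99.9%
```

A 30-day month allows only about 43 minutes of downtime at 99.9%, so the failover weekend alone was enough to put September below that mark.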

Setting up Kaltura - part VI

2012-10-03T06:00:00.000-04:00

We'll focus on the Kaltura Drop Folder feature today. The Drop Folder offers a mechanism whereby an enterprise can bulk upload content without human intervention. In principle this is an excellent way for a library or archive to ingest many objects into Kaltura without some poor archivist performing individual (or group) uploads via a web GUI. In practice the mechanism works smoothly when things are going well, but it can be a little difficult to diagnose problems when things go awry.

For example, here's a sample display from the Drop Folders panel from our Kaltura Management Console (KMC), which serves as an all-in-one dashboard for managing content:

Click to see a full-size image

According to this display we have just ingested three items: an XML file and two video files. In this particular case the XML file contained all of the metadata for the two video files, and contained instructions that told Kaltura that these were new items to ingest. The Status field shows a value of Done, and the Error Description field is empty. This seems good.

We can also see status information if we navigate to the Upload Control panel and select the Bulk Upload view. Here we see similar info:

Click to see a full-size image

Again, this seems like good news. The Notification column shows a value of Finished successfully. Hooray!

But not so fast, my friend....

If we examine one of the video files under the Content panel (Entries tab), we see that none of the extended metadata is present. We can see the Custom Data fields, but they are all empty. Hmm, what happened?

If we navigate back to the Upload Control display, the last column offers some possible help:

There is an Action available to download a log file. That sounds promising. Let's do that.

The log file is in XML format, and if we open it up in a good browser or text editor or XML editor, we find XML that looks very much like the ingest XML we used in the Drop Folder. And if we scroll all the way down to the bottom, we find this snippet:

<item><result><errorDescription>customDataItems failed: invalid metadata data: Element 'METXVideoSubmissionElectronicBoardUsed': [facet 'enumeration'] The value ' ' is not an element of the set {'Y', 'N'}. at line 87 Element 'METXVideoSubmissionElectronicBoardUsed': ' ' is not a valid value of the local atomic type. at line 87 </errorDescription></result>

This is telling us that we messed up one of the metadata fields. If we look at the original ingest XML and find the statement that is supposed to be setting METXVideoSubmissionElectronicBoardUsed, sure enough, there is no value. (The error occurs on line 296, not 87, which is a bit confusing.)

So the good news is that if we notice the error, we can find a log that will point us at the error. But detecting the error is a little tricky, and it is easy to see how this would be difficult if we were ingesting, say, 100 items at a time. So this is not awful, but is also not quite as nice as we might like.
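One way to make detection less painful when ingesting many items is to scan the downloaded log for errorDescription elements rather than reading it by eye. Here is a minimal Python sketch. It assumes the log is well-formed XML with item elements that contain result/errorDescription, which is what the snippet above suggests; I have not checked this against Kaltura's schema, and the function name is made up for this post.

```python
import xml.etree.ElementTree as ET


def collect_error_descriptions(log_xml):
    """Return (item_position, message) pairs for every non-empty errorDescription.

    item_position counts <item> elements in document order, starting at 1,
    because the line numbers inside the message refer to the log, not our XML.
    """
    root = ET.fromstring(log_xml)
    errors = []
    for position, item in enumerate(root.iter("item"), start=1):
        for err in item.iter("errorDescription"):
            text = (err.text or "").strip()
            if text:
                errors.append((position, text))
    return errors


if __name__ == "__main__":
    # A tiny made-up log with one clean item and one failed item.
    sample = """<mrss><channel>
      <item><result></result></item>
      <item><result><errorDescription>customDataItems failed: invalid metadata data</errorDescription></result></item>
    </channel></mrss>"""
    for position, message in collect_error_descriptions(sample):
        print(f"item {position}: {message}")
```

An empty list means no item reported an error; anything else is worth a look before trusting a status of Done.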

Suggestions:

If the XML contains Custom Data, and the Custom Data has errors, but the video still ingests, perhaps show a Status of something like "Done with errors" (in the Drop Folders display) or "Finished with Custom Data errors" (in the Bulk Upload Log display).
Make the diagnostic message (errorDescription) available without needing to download a file. This could appear in a new column, or perhaps in a text pop-up.
If N - 1 elements of the Custom Data are good, but one is bad, it would be nice if the other N - 1 Custom Data fields are set. That would make it possible to correct the error manually in the KMC rather than copying fresh XML into the Drop Folder.
Suppress the line numbers since they are relative to the log file XML, not the original XML.

Again, overall the Drop Folder feature is very nice, and we will indeed use it to ingest the 20,000-30,000 video files in our collection. But since it is likely that we will sometimes make a mistake within the XML (say, forgetting to escape a certain character), it would be great if the KMC would make it easier to detect and diagnose mistakes.
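On our side, the escaping mistake at least is easy to guard against before a file ever reaches the Drop Folder. The Python sketch below uses only the standard library: one helper escapes text when building an element by hand, and the other checks that a finished document is well-formed. Both helper names are made up for this post, and neither says anything about whether Kaltura will accept the metadata values themselves.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def element_text(tag, value):
    """Build <tag>value</tag> with &, < and > in the value escaped."""
    return f"<{tag}>{escape(value)}</{tag}>"


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(element_text("name", "Reading & Writing <draft>"))
    print(is_well_formed("<name>Reading & Writing</name>"))       # unescaped ampersand
    print(is_well_formed(element_text("name", "Reading & Writing")))
```

A well-formedness check will not catch a bad enumeration value like the one above, but it does catch the class of mistake that stops the XML from parsing at all.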

ICPSR Director of Curation Services

2012-10-01T06:30:00.000-04:00

We have interviewed all of the candidates.

See http://dilbert.com/strips/comic/2011-10-30 for the full cartoon

I had the opportunity to meet with several of the candidates, and we have several excellent ones. With a little bit of luck we should be able to announce who will be filling the position sometime soon.

And then s/he can explain exactly what Curation Services are. :-)

Be careful when answering, "Both"

2012-09-28T07:00:00.000-04:00

When I am helping someone work through the requirements of a new system or project, I will often ask a series of questions that help shape my understanding of what the person wants. Often these fall into a pattern where I ask a series of either/or questions, like this:

Do you want it optimized for security or ease of use?

Does this fit into a wholesale or retail delivery paradigm?

Will this be used by external customers or internal staff?

Is this intended to drive new revenues or decrease current costs?

In many ways this is like going to the ophthalmologist who has you look through lens A and then lens B, and then asks the question, "Which was better, A or B?" Both of us are trying to bring the problem into focus.

The single most dreaded (by me) answer to these questions is: Both.

In some cases this answer really means, "I am not sure what I want." Or, "I'm too busy to think about this." Or, "I don't care."

This, obviously, does not help when gathering requirements. And so it is a real barrier to scoping the project. Sometimes, of course, an answer of Both is a fine start to a longer answer.

We really do need both in this case. We want to build a system for managing metadata that can be used by both the staff and external people equally easily. We are changing our entire workflow so that either population can manage our metadata, and this is our new business practice.

That is a fine use of Both. In fact, if the person gave a different answer, we might needlessly limit the usefulness of the system we build.

And another fine answer, just like we sometimes tell the ophthalmologist, is "I don't know." There's nothing wrong with that answer. However, just like the ophthalmologist, when I hear this answer, I reach into my bag of lenses and try another pair to bring the issue into focus.

Kaltura pilot at the U-M and ICPSR

2012-09-26T06:30:00.000-04:00

The University of Michigan in-house newspaper/newsletter ran a nice piece on the Video Content Management pilot (using Kaltura), which includes our Bill and Melinda Gates Foundation project: Measures of Effective Teaching - Extension.

Here is the part about us:

• ISR and the School of Education are engaged in a collaborative research project in which a large collection of video assets is a primary data-type. This data will become a shared repository available to research partners at universities, public agencies, and private foundations.

The timing here was perfect for us since we were in the market for a system to manage and stream about 20TB of video to thousands of simultaneous users.

DuraSpace announces SDSC as storage partner

2012-09-24T06:00:00.000-04:00

DuraSpace announced recently their relationship with SDSC as a storage provider for DuraCloud. As I posted a while ago, we have been using both DuraCloud and their SDSC storage partner for a while. It's great to see DuraCloud continue to grow.

Setting up Kaltura - part V

2012-09-21T06:30:00.000-04:00

In this post we will look at the XML we use to ingest content into Kaltura through its Drop Folder mechanism. To re-cap an earlier post about the Drop Folder, ours is a subdirectory under the home directory of the 'kaltura' user. We provisioned this account on a special-purpose machine that Kaltura accesses via sftp to fetch content without human intervention.

Kaltura has a pretty nice guide to building the XML, which looks an awful lot like Media RSS. And they also make some short examples available, but we always find it useful to have a real-world example. Here's ours.

Preface: Our stuff is a little unusual. That is:

We always have pairs of videos, one classroom and one blackboard
We have lots of metadata and it applies to both videos, and so the metadata gets repeated in the XML. I have excised much of the metadata in our XML for this post
All of the metadata is fake; it is not real metadata about an actual classroom video
You can find a copy of the XML that we diagram below at this URL http://goo.gl/OxbyU
Our use of Kaltura is in support of the Measures of Effective Teaching (Extension) project, and so there are many references to 'metext' in the metadata
We will be generating the XML for ingest programmatically

Here goes.

 <?xml version="1.0"?>  
 <mrss xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="ingestion.xsd">  
     <channel>  
         <item>  
             <!-- These do not change -->  
             <action>add</action>  
             <type>1</type>  
             <!-- This changes for each video -->  
             <referenceId>12345-Board-video-rendition-MP4-H264v-Standard.mp4</referenceId>  
             <!-- This does not change -->  
             <userId>metext</userId>  
             <!-- This changes for each video -->  
             <name>RCN 12345 board video</name>  
             <!-- This changes for each video -->  
             <description>RCN 12345 board video</description>

We have the usual XML stuff at the beginning, and then the start of the Media RSS.

Action = add for new content.

Type = 1 for video content.

ReferenceID is the name of the original file.

UserID is the pseudo-user who will be linked to the content in Kaltura.

Name and Description are exposed as base metadata in Kaltura.
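Since we plan to generate this XML programmatically, here is a minimal Python sketch of how the fields above could be built with the standard library's ElementTree. It covers only the handful of elements shown so far and is not our production generator; the function name is made up for this post, and the element names and fixed values are copied from the example above.

```python
import xml.etree.ElementTree as ET


def build_item(reference_id, name, description, user_id="metext"):
    """Build the opening fields of one <item> for the ingest XML."""
    item = ET.Element("item")
    ET.SubElement(item, "action").text = "add"       # add = new content
    ET.SubElement(item, "type").text = "1"           # 1 = video
    ET.SubElement(item, "referenceId").text = reference_id
    ET.SubElement(item, "userId").text = user_id
    ET.SubElement(item, "name").text = name
    ET.SubElement(item, "description").text = description
    return item


if __name__ == "__main__":
    item = build_item(
        "12345-Board-video-rendition-MP4-H264v-Standard.mp4",
        "RCN 12345 board video",
        "RCN 12345 board video",
    )
    print(ET.tostring(item, encoding="unicode"))
```

A side benefit of building the tree this way rather than pasting strings together is that ElementTree escapes characters such as & and < in the text for us.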

             <!-- Always assign two tags, one called metext and the other board or classroom -->  
             <tags>  
                 <tag>metext</tag>  
                 <tag>board</tag>  
             </tags>  
             <!-- This does not change -->  
             <categories>  
                 <category>metext</category>  
             </categories>  
             <!-- This does not change -->  
             <media>  
                 <mediaType>1</mediaType>  
             </media>

We tag everything with the project name and indicate whether it is a video of the blackboard or the classroom.

Kaltura uses Categories as a main way to browse and find content. We treat this as if it were a type of "is in collection" sort of attribute.

MediaType = 1 for video.

             <!-- This changes for each video -->  
             <contentAssets>  
                 <content>  
                     <dropFolderFileContentResource filePath="12345-Board-video-rendition-MP4-H264v-Standard.mov"/>  
                 </content>  
             </contentAssets>

This tells Kaltura that the video file is in the Drop Folder along with the XML.

Now for our project-specific metadata, which fits into a Kaltura structure called Custom Data:

             <!-- This changes for each video -->  
             <customDataItems>  
                 <customData metadataProfileId="22971">  
                     <xmlData>  
                         <metadata>  
                             <METXDistrictDistrictName>Ann Arbor</METXDistrictDistrictName>  
                             <METXDistrictDistrictNum>20</METXDistrictDistrictNum>  
                             <METXSchoolSchoolName>Huron High School</METXSchoolSchoolName>  
                             <METXSchoolSchoolMETXID>33</METXSchoolSchoolMETXID>  
                             <!-- Kaltura wants the date to be in xs:long format, that is, seconds from the epoch -->  
                             <METXVideoSubmissionCaptureDate>1344440267</METXVideoSubmissionCaptureDate>  
                         </metadata>  
                     </xmlData>  
                 </customData>  
             </customDataItems>

The metadataProfileId attribute comes from our Kaltura KMC. I had to create the Custom Data schema first, and then reference it in the ingest XML here.

Most of the metadata fields are simple strings or strings from a controlled vocabulary. We do have one date item, and sadly Kaltura expects it to be in a difficult-to-use format, seconds since the epoch.
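For anyone producing that date value in code, here is a minimal Python sketch of the conversion. The function names are hypothetical, and it assumes the capture time is known with an explicit time zone; the post does not say which zone our real capture dates use, so UTC here is only for illustration.

```python
from datetime import datetime, timezone


def to_epoch_seconds(capture_time):
    """Convert a timezone-aware datetime to whole seconds since the epoch."""
    if capture_time.tzinfo is None:
        # A naive datetime would be interpreted in the local zone, which
        # makes the result depend on the machine that runs the job.
        raise ValueError("capture_time must be timezone-aware")
    return int(capture_time.timestamp())


def from_epoch_seconds(seconds):
    """Convert epoch seconds back to a UTC datetime, handy for spot checks."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    # The sample value from the XML above, decoded as UTC.
    print(from_epoch_seconds(1344440267).isoformat())
```

Decoding the sample value this way is a quick sanity check that a generated number lands on the date you expect.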

After this section of the XML is a closing tag for item, and then the whole thing repeats with only minor variation for the classroom video. I'll include it below for completeness.

         </item>  
         <item>  
             <!-- These do not change -->  
             <action>add</action>  
             <type>1</type>  
             <!-- This changes for each video -->  
             <referenceId>12345-Classroom-video-rendition-MP4-H264v-Standard.mp4</referenceId>  
             <!-- This does not change -->  
             <userId>metext</userId>  
             <!-- This changes for each video -->  
             <name>RCN 12345 classroom video</name>  
             <!-- This changes for each video -->  
             <description>RCN 12345 classroom video</description>  
             <!-- This changes for each video -->  
             <tags>  
                 <tag>metext</tag>  
                 <tag>classroom</tag>  
             </tags>  
             <!-- This does not change -->  
             <categories>  
                 <category>metext</category>  
             </categories>   
             <!-- This does not change -->  
             <media>  
                 <mediaType>1</mediaType>  
             </media>  
             <!-- This changes for each video -->  
             <contentAssets>  
                 <content>  
                     <dropFolderFileContentResource filePath="12345-Classroom-video-rendition-MP4-H264v-Standard.mov"/>  
                 </content>  
             </contentAssets>  
             <!-- This changes for each video -->  
             <customDataItems>  
                 <customData metadataProfileId="22971">  
                     <xmlData>  
                         <metadata>  
                             <METXDistrictDistrictName>Ann Arbor</METXDistrictDistrictName>  
                             <METXDistrictDistrictNum>20</METXDistrictDistrictNum>  
                             <METXSchoolSchoolName>Huron High School</METXSchoolSchoolName>  
                             <METXSchoolSchoolMETXID>33</METXSchoolSchoolMETXID>  
                             <!-- Kaltura wants the date to be in xs:long format, that is, seconds from the epoch -->  
                             <METXVideoSubmissionCaptureDate>1344440267</METXVideoSubmissionCaptureDate>  
                         </metadata>  
                     </xmlData>  
                 </customData>  
             </customDataItems>  
         </item>  
     </channel>  
 </mrss>

And that's it.
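Since we plan to generate this XML programmatically, here is a rough sketch of what that could look like in Python using only the standard library. This is not our production code: the function names (build_item, build_mrss, serialize) are made up for illustration, and the fixed values (action, type, userId, category, mediaType) are simply copied from the example above.

```python
import xml.etree.ElementTree as ET


def build_item(rcn, view, reference_id, file_path, profile_id, metadata):
    """Build one <item>; view is 'board' or 'classroom' (hypothetical helper)."""
    item = ET.Element("item")
    ET.SubElement(item, "action").text = "add"   # new content
    ET.SubElement(item, "type").text = "1"       # video
    ET.SubElement(item, "referenceId").text = reference_id
    ET.SubElement(item, "userId").text = "metext"
    title = f"RCN {rcn} {view} video"
    ET.SubElement(item, "name").text = title
    ET.SubElement(item, "description").text = title
    tags = ET.SubElement(item, "tags")
    for tag in ("metext", view):                 # project tag plus board/classroom
        ET.SubElement(tags, "tag").text = tag
    categories = ET.SubElement(item, "categories")
    ET.SubElement(categories, "category").text = "metext"
    media = ET.SubElement(item, "media")
    ET.SubElement(media, "mediaType").text = "1"
    content = ET.SubElement(ET.SubElement(item, "contentAssets"), "content")
    ET.SubElement(content, "dropFolderFileContentResource", filePath=file_path)
    custom_items = ET.SubElement(item, "customDataItems")
    custom = ET.SubElement(custom_items, "customData",
                           metadataProfileId=str(profile_id))
    fields = ET.SubElement(ET.SubElement(custom, "xmlData"), "metadata")
    for name, value in metadata.items():         # same metadata for both videos
        ET.SubElement(fields, name).text = str(value)
    return item


def build_mrss(items):
    """Wrap the items in the <mrss><channel> envelope shown above."""
    mrss = ET.Element("mrss", {
        "xmlns:xsd": "http://www.w3.org/2001/XMLSchema",
        "xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
        "xsi:noNamespaceSchemaLocation": "ingestion.xsd",
    })
    ET.SubElement(mrss, "channel").extend(items)
    return mrss


def serialize(mrss):
    """Return the document as text, with the XML declaration on top."""
    return '<?xml version="1.0"?>\n' + ET.tostring(mrss, encoding="unicode")


if __name__ == "__main__":
    shared = {"METXDistrictDistrictName": "Ann Arbor",
              "METXVideoSubmissionCaptureDate": 1344440267}
    pair = [
        build_item("12345", view,
                   f"12345-{view.capitalize()}-video-rendition-MP4-H264v-Standard.mp4",
                   f"12345-{view.capitalize()}-video-rendition-MP4-H264v-Standard.mov",
                   22971, shared)
        for view in ("board", "classroom")
    ]
    print(serialize(build_mrss(pair)))
```

Passing the reference ID and file path in as arguments, rather than deriving them, keeps the sketch from guessing at file naming rules.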

An Inconvenient Outage

2012-09-19T07:30:00.000-04:00

Some of you probably noticed that we had a rough weekend with the web site. We first saw trouble around 4pm EDT on Saturday. After some troubleshooting and investigation left us unsure of the root cause, we failed things over to our replica around 5pm EDT. We then ran off the replica overnight and through the following morning.

The big breakthrough came at 1:30pm or so Sunday when we isolated the cause, and then it took only a few minutes to correct the problem, test the solution, and finally roll service back to the production site. As with any longer outage, this one pointed out a bunch of small, but important, changes to make in procedures and documentation.

My apologies if you happened to be using our web site late afternoon on Saturday; that was certainly the roughest time.

Introducing ICPSR's Virtual Data Enclave (VDE)

2012-09-17T06:30:00.000-04:00

The ICPSR Virtual Data Enclave (VDE) is a secure, virtual environment in which a researcher can analyze sensitive data, create research products, and then take possession of those products and analysis. And while the VDE is not a substitute for a physical enclave and the types of security protocols it facilitates, the VDE is very much a potential substitute for the traditional practice of distributing confidential data via removable media, such as CD-ROMs.

The VDE uses much of the same technology that ICPSR uses internally for its Secure Data management Environment (SDE), which we have described a few times. In brief, we use a virtual desktop environment that is operated by the University of Michigan's central IT shop and connect it to what we call our Private Network Attached Storage (NAS) appliance. Both the virtual desktop and NAS are behind a firewall, and we use the firewall and Windows group policies to restrict what actions one may perform. Download? Nope. Cut-and-paste between the virtual desktop and the real desktop? Uh uh. Capture screenshots by taking a picture of your monitor? Well, ...

The virtual environment keeps sensitive datasets under lock and key at ICPSR, but makes them available to researchers. The environment contains the usual array of applications used in the social sciences (but no email!), exactly the same sort of stuff we might set up for a visiting scholar or OR.

The researcher accesses the environment through a small, easy-to-download and -install client based on VMware View Client. Authentication takes place using standard University of Michigan credentials which we (ICPSR) and others at UMich can issue to "friends." Access between the real desktop and the virtual desktop is encrypted, and we are in the process of adding IPSEC encryption between the virtual desktop and the NAS. (This latter traffic passes over UMich's data backbone, and access to those routers is limited to UMich central IT network engineers.)

The virtual machine is completely ephemeral and can be wiped after each use. Any intermediate research or results are stored on the ICPSR NAS. Our NAS is backed up weekly, and tapes are cycled off-site quarterly. Once the research has been completed ICPSR retains a "just in case you need it" snapshot for up to three years.

Setting up Kaltura - part IV

2012-09-14T06:30:00.000-04:00

We have been working on getting a Kaltura Drop Folder set up. A Drop Folder is a mechanism where an organization spools content to be ingested in a fixed location, and Kaltura polls the location, watching for content to ingest.

In our case it has taken about a month to get the Drop Folder configured, and much of that delay is preventable if you avoid the pitfalls we hit. So in the spirit of giving back to the community, here are seven things to know when setting up a Drop Folder.

Host the Drop Folder yourself; do not host it at Kaltura.
Set up an account and a password on the machine, and share them with Kaltura. To keep things very simple I created an account called 'kaltura'.
Create a subdirectory under the kaltura user's home directory that will actually contain the content to be ingested. To keep things very simple I used the name 'dropfolder'.
Make sure that the kaltura user owns the Drop Folder directory, and that its access controls grant appropriate rights to other users that may need to ingest content.
Tell Kaltura the name of the machine. To keep things very simple I created a DNS CNAME record, kaltura.icpsr.umich.edu, that points to the right machine.
Be sure you have ssh installed and running on port 22 on the machine. If you normally do not run ssh on port 22 (we don't), don't forget to open a hole in your firewall so that Kaltura machines can reach the Drop Folder.
Tell Kaltura to use sftp and port 22 to connect to your Drop Folder. Do not try to use a port other than 22.

To re-cap the values we used:

Host: kaltura.icpsr.umich.edu
Protocol: sftp
Port: TCP 22
Login: kaltura
Password: XXXXXXXX
Drop folder: dropfolder
Drop folder UID:GID: kaltura:met
Drop folder mode: 2775

(We have a big video project called MET, and automated jobs running with the 'met' GID will need write access to the drop folder.)
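If the mode 2775 looks cryptic, this small Python sketch decodes it using the standard stat module. The helper name is hypothetical. The leading 2 is the setgid bit, which on a Linux directory makes new files inherit the directory's group (here, 'met'), and 775 gives the owner and group full access while others can only read and traverse.

```python
import stat


def describe_dir_mode(octal_mode):
    """Decode an octal directory mode such as '2775' (illustrative helper)."""
    mode = int(octal_mode, 8)
    return {
        # The same string that `ls -ld` would show for a directory.
        "listing": stat.filemode(stat.S_IFDIR | mode),
        # setgid: new entries inherit the directory's group.
        "setgid": bool(mode & stat.S_ISGID),
        "group_writable": bool(mode & stat.S_IWGRP),
        "world_writable": bool(mode & stat.S_IWOTH),
    }


if __name__ == "__main__":
    for key, value in describe_dir_mode("2775").items():
        print(f"{key}: {value}")
```

The combination matters for us because both the kaltura user and the automated 'met' jobs need to create and remove files in the same directory.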

You may be tempted to suggest using ssh keys or non-standard ports for ssh. Fight those temptations.

Kaltura will offer to auto-delete content once it has been ingested. Accept that offer.

Know that when you delete items from the Drop Folder status window in your Kaltura KMC it will also delete them from the Drop Folder. This is not obvious, but turns out to be useful.

Now all you need are automated jobs that place content and Kaltura-style Media RSS XML into the Drop Folder. Kaltura has some nice examples on-line, but they are somewhat trivial. We'll post some more complex, real-world examples next week.

August 2012 deposits

2012-09-07T05:00:00.000-04:00

Light month for deposits:

# of files	# of deposits	File format
56	30	application/msword
69	26	application/pdf
9	8	application/vnd.ms-excel
2	1	application/vnd.wordperfect
4	1	application/x-dosexec
24	7	application/x-sas
47	21	application/x-spss
7	2	application/x-stata
26	2	application/x-zip
1	1	image/gif
18	4	image/jpeg
1	1	message/rfc8220117bit
3	3	text/plain; charset=iso-8859-1
17	10	text/plain; charset=unknown
32	9	text/plain; charset=us-ascii
3	2	text/plain; charset=utf-8
39	7	text/rtf
1	1	text/x-mail; charset=iso-8859-1
4	2	text/x-mail; charset=unknown
2	2	text/x-mail; charset=us-ascii
1	1	text/xml

Just the usual stuff, but in pretty low quantities.

ICPSR web availability - August 2012

2012-09-05T06:00:00.000-04:00

August was not our best month.

We did a bit better than 99.3% uptime. Almost all of the downtime is due to a recurring, as-yet-unsolved problem we are having with our Oracle database platform. The primary symptom is that the database platform stops fielding queries for about 5-15 minutes, which disables our production web site. The platform does this about 30-45 minutes AFTER it has finished a full export using the Oracle Data Pump system.
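To put that percentage in more concrete terms, here is a back-of-the-envelope Python sketch that turns an uptime figure into minutes of downtime. It assumes a 31-day month and treats the figure as exactly 99.3%, so it is an upper bound on what we actually saw; the function name is made up for illustration.

```python
def downtime_minutes(uptime_percent, days):
    """Minutes of downtime implied by an uptime percentage over a period."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    minutes = downtime_minutes(99.3, 31)
    print(f"{minutes:.0f} minutes, or about {minutes / 60:.1f} hours")
```

At 99.3% that works out to a little over five hours across the month.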

Because our existing Oracle hardware is old and has a relatively slow disk I/O system, we're going to try to solve this problem by throwing hardware at it. For well under $10k we can replace our five-year-old hardware with something much newer. Goodbye RAID-5 SCSI, hello SSD.