Tuesday, August 31, 2010

Fooled by Randomness

I just finished reading Fooled by Randomness by Nassim Nicholas Taleb. I had read his later book, The Black Swan, and because I enjoyed both his irreverent style and his logical arguments, I thought I might also like this one. I did.

In brief, his point is that humans are ruled by their emotions even when they know better, and, in fact, if we did not have our emotions to help guide us, we would find it difficult to make even the most basic decisions. However, we often spend a lot of time fitting random noise in the world, such as the stock market going up or down just a few points, to some seemingly logical story, such as "The market was down today on worries about the job market." We're hard-wired to try to fit facts into causal stories, and then store these stories in our memory rather than the random facts. And when we do this, we can find ourselves in a lot of trouble.

Taleb's book is in many ways a collection of short, highly readable, highly engaging essays, and he spares no effort in skewering what he considers pseudosciences that apply the wrong tools and the wrong approaches to problems. (He treats economists in particular without mercy.)

Taleb's writing style is conversational, and even if you don't agree with all of his conclusions, his essays are engaging and approachable.

Monday, August 30, 2010

Web sites, firewalls, and attackers

Some of you may have noticed that our web server was very slow, or even unresponsive, early in the evening (EDT) on Friday, August 6th. It was slow enough that I received a text from our automated monitoring system. (I was on call that week. Three of us take turns, and we hand off the responsibility on Fridays at noon.)

When I checked out our web server, I noticed that the load was very high, and that we were receiving dozens of web requests per second from an address in a broadband block in China. There's no way to know whether it was a denial-of-service attack or something more innocent, like a misguided attempt to harvest content from our web site. But whatever it was, it was not good.
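For the curious, spotting the pattern didn't require anything fancy. Here's a rough sketch, not the exact commands I ran (the log path is hypothetical), of the sort of per-IP tally that makes a single noisy client obvious:

  from collections import Counter

  # Tally requests per client address; in Apache's common/combined log
  # formats the client IP is the first whitespace-delimited field.
  counts = Counter()
  with open("/var/log/httpd/access_log") as log:  # hypothetical path
      for line in log:
          counts[line.split(" ", 1)[0]] += 1

  # The heaviest talkers float to the top.
  for ip, n in counts.most_common(10):
      print("%8d %s" % (n, ip))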

We haven't invested much effort (yet) in understanding the ins and outs of the ModSecurity module for Apache, which is how some folks would guard against such attacks. The module has a reputation for complexity, and it isn't often that our web site has these sorts of problems. (I'd estimate we've seen something like this once every two years or so since I arrived in 2002.)

We have, however, invested in new firewall technology. Once I knew the IP address of the "attacker," it was very easy to add a rule to our firewall blocking all traffic from that source. It took only a few minutes to create and deploy the rule, and the load on our web server immediately dropped back to its usual level. In the firewall logs I could see the "attack" continue for hours afterward, but thanks to the new rule it caused no harm. Eventually the "attack" stopped, and the logs were clear by Monday.
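I won't claim our firewall speaks iptables, but as an illustration, on a Linux host the equivalent rule is a one-liner (the address below is from the documentation range, standing in for the real source):

  iptables -I INPUT -s 203.0.113.45 -j DROP

The same idea, expressed in whatever rule language your firewall speaks, is usually the fastest first response to this kind of traffic.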

Friday, August 27, 2010

TRAC: B1.2: Defining deposit requirements

B1.2 Repository clearly specifies the information that needs to be associated with digital material at the time of its deposit (i.e., SIP).

For most types of digital objects to be ingested, the repository should have written criteria, prepared by the repository on its own or in conjunction with other parties, that specify exactly what digital object(s) are transferred, what documentation is associated with the object(s), and any restrictions on access, whether technical, regulatory, or donor-imposed.

The level of precision in these specifications will vary with the nature of the repository’s collection policy and its relationship with creators. For instance, repositories engaged in Web harvesting, or those that rescue digital materials long after their creators have abandoned them, cannot impose conditions on the creators of material, since they are not “depositors” in the usual sense of the word. But Web harvesters can, for instance, decide which metadata elements from the HTTP transactions that captured a site are to be preserved along with the site’s files, and this still constitutes “information associated with the digital material.” They may also choose to record the information or decisions—whether taken by humans or by automated algorithms—that led to the site being captured.

Evidence: Transfer requirements; producer-archive agreements.



ICPSR is pretty flexible with regard to its deposit requirements. The business rules behind our Deposit Form web app allow one to submit content for deposit at ICPSR with little more than a working title and a signature. Of course, the Deposit Form also has many, many fields for collecting additional descriptive metadata, but those fields are not required (although we do appreciate all of the descriptive metadata we can get).

On the back end, our business rules require us to create, collect, and store the following information about each file in the deposit:
  1. Original name from the deposit
  2. MIME type
  3. Unique ID (that we create and assign)
  4. Fingerprint and the method used to compute it (currently MD5 hashes)
  5. Date of deposit
  6. Location
  7. Identity of the depositor
All of this information is available to ICPSR data managers through a web app called the Deposit Viewer. The name is a bit of a misnomer, however; a more apt name would be the Deposit Manager, since the app allows data managers to take actions on a deposit, such as changing its status or downloading the files that were deposited.
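For illustration only, here's a minimal sketch of what such a per-file record might look like. This is not our actual ingest code; the function and field names are hypothetical, and a real implementation would take the storage location and depositor identity from the deposit itself rather than guessing:

  import hashlib
  import mimetypes
  import uuid
  from datetime import date

  def describe_deposited_file(path, depositor):
      """Build the seven-item record described above (a sketch)."""
      md5 = hashlib.md5()
      with open(path, "rb") as f:
          for chunk in iter(lambda: f.read(8192), b""):
              md5.update(chunk)
      return {
          "original_name": path,                       # 1. original name from the deposit
          "mime_type": mimetypes.guess_type(path)[0],  # 2. MIME type (best guess)
          "unique_id": str(uuid.uuid4()),              # 3. unique ID we create and assign
          "fingerprint": md5.hexdigest(),              # 4. fingerprint ...
          "fingerprint_method": "MD5",                 #    ... and how it was computed
          "deposit_date": date.today().isoformat(),    # 5. date of deposit
          "location": path,                            # 6. location (here, just the path)
          "depositor": depositor,                      # 7. identity of the depositor
      }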

Thursday, August 26, 2010

Building systems with Zoho

Nathan Adams, who leads the software development team at ICPSR, told me about Zoho a few months ago. I've been using it a lot recently for a project outside of ICPSR, and I've liked Zoho enough that I thought it was worth telling others about it.

Zoho fits into the "software as a service" space, not unlike the more familiar offerings from Google. So just as there is Gmail for managing and reading email, Zoho has a similar service. I've been using their Zoho Reports service. Despite the name, Zoho Reports feels more like a little database service to me. One creates tables, and the columns of the tables hold typed data (currency, text, dates, etc.). One may also create Query Tables whose content is ephemeral: it is fetched from other tables using a language that is SQLish. Query Tables support only the most basic SQL queries, but that's OK for the particular application I have.
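To give a flavor of what I mean by SQLish, a Query Table is defined by a SELECT statement. A made-up example against a hypothetical table, which is about as fancy as my queries get:

  SELECT Name, Email FROM Residents ORDER BY Name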

(I'm on the board of directors of our neighborhood association, and as the secretary, I needed a tool to manage information about residents, such as email addresses, telephone numbers, and street addresses, as well as information about homes.)

Zoho Reports also allows one to display and export information in a variety of formats. For example, one can both import and export content as CSV files, and one can export the content of a Query Table as a reasonably attractive PDF report.

I'm currently using Zoho with a free subscription, which limits the number of tables, query tables, records, users, and so on. I'm likely to upgrade to a $15/month subscription that will raise those limits and also enable an automated database backup facility.

If others have used any of the Zoho apps, I'd be interested in hearing your take on them.

Wednesday, August 25, 2010

DuraCloud pilot update


Things are starting to move along nicely with ICPSR's participation in the DuraCloud pilot. My early experience with the software and tools is mostly positive, though they are clearly still a bit rough around the edges. For example, I've run into minor bugs on the login screen of the DuraCloud Admin web app, such as needing to use the Submit button on the screen rather than the Enter key on some browsers. That said, the DuraSpace people have been fabulous: it's clear they care a lot about the project and the pilot testers, and they have been very, very responsive.

ICPSR is going to test out three parts of DuraCloud.

One, we'll execute a basic upload test, moving content from ICPSR to a "space" in DuraCloud. For this test I'll use the DuraCloud Admin tool to create a "space," which is basically the same thing as a "bucket" in Amazon's S3, and then use the DuraCloud "sync tool" to copy a subset of ICPSR's archival content into it.

Two, we're going to help spec out a "dashboard" or high-level view that shows the integrity of a collection in DuraCloud, and then execute a "fixity" test to measure performance and reliability. For example, if I have a 1TB collection in DuraCloud, replicated across N cloud storage providers, what's my cost to execute such a test every week?
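The arithmetic behind that question is simple but sobering: a fixity sweep means reading every byte back out of every provider, and cloud providers charge for data transferred out. A quick sketch with purely illustrative numbers (the transfer-out price is an assumption, not anyone's actual rate):

  # Back-of-the-envelope fixity cost; all numbers are illustrative.
  collection_gb = 1024    # 1 TB collection
  providers = 3           # replicas held at N = 3 storage providers
  egress_per_gb = 0.15    # assumed transfer-out price, $/GB

  weekly = collection_gb * providers * egress_per_gb
  print("per week: $%.2f; per year: $%.2f" % (weekly, weekly * 52))

At those made-up rates the sweep costs roughly $460 per week, or about $24,000 per year, which is exactly why we want to measure this before committing to it.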

Three, we're going to test a "replication helper" utility that facilitates replicating content across cloud providers. This is a very compelling service for me. Since we already make extensive use of AWS, using DuraCloud as a front-end to AWS alone is not very interesting; but if we can use DuraCloud as a single front-end to AWS and RackSpace and Atmos and ...., then things get more interesting, since it means we don't have to develop expertise with ALL of the cloud providers.

Monday, August 2, 2010

Motivation and the annual raise

The University of Michigan has an annual rite: departments and organizations are allocated a pool of money, typically some small fraction of the total salaries in that department or organization, and managers assign some portion of it to each staff member as an annual increase in base pay. The overall increases are not large by any means, and it isn't unusual for people to view them almost as cost-of-living increases.


Along this same topic, I've also been reading a book called Drive by a gent named Daniel Pink. It's an interesting read, and it essentially argues that reward-based systems work just fine for mechanical, repetitive work, but actually hurt performance if the work requires any sort of creativity whatsoever. In particular, when there is a task or project at hand, performance suffers when some type of "IF-THEN" statement is made up front, like: "IF you deliver the solution quickly and correctly, THEN you will get a bigger reward." He makes a pretty compelling case.

There's also a nice TED video in which Pink presents the core argument from his book. He's a good speaker, and even if you've read the book, the video is worth watching.

And so I've been thinking about how Pink's advice in Drive relates to this annual rite at Michigan.

One possible scenario is that Pink is right, and instead of differentiating raises across staff, everyone should just get the same increase. All of the tech staff do creative work on a daily basis, and if cash incentives result in worse performance, equal increases make the most sense.

Another possible scenario is that Pink is still right, but because this process occurs only once per year, performance on any one job or task is almost wholly unrelated to the increase. I cannot think of a single time when I based an annual increase on a single project or task. And because the increase does not take the form of "IF you do a good job on project X, THEN you will get a better raise," maybe the rule doesn't apply? Maybe the raise is really saying, "BECAUSE you did such a good job on so many different things over this past year, we're recognizing that with a higher increase"?

Pink does go on to say that when a reward is given after the fact, and is unexpected, it does have the desired effect and is motivating. Maybe a higher annual raise is more like that: unexpected, and more like a thank-you?