Technology at ICPSR: August 2011

Wednesday, August 31, 2011

Harvard Business Review: the Stalwarts

http://blogs.hbr.org/hbsfaculty/2011/08/stop-ignoring-the-stalwart-wor.html

I came across this picture on the Harvard Business Review blog after reading a snippet from it on Boing Boing. The blog post itself is a good read, and is also a quick read. The image and the caption are a link to the post.

The very short summary of the article is that the bulk of the employees in an organization are Stalwarts, a group of people whose motivations and goals are very different than the much smaller groups shown in the corners of the diagram. Managing this group effectively and keeping them engaged in the organization takes effort and strategies that might not work for other groups.

One reason I found this so interesting is that it rings very true at a place like the University of Michigan, and I have found the pace and culture to be very different here than in the conventional business world (or at least the telecom segment of that world).

The only part of the post that didn't match my own experience is the last "myth" the author debunks - Everyone wants to be a manager. My experience has always been that the strongest contributors on the team enjoy building or operating part of the IT ecosystem where they work. Some choose to do that by playing a leadership role so that they can influence the direction via their decisions and setting priorities; others do it by building the systems that feed into the overall picture. And so I have always seen very strong contributors who have no desire to manage a group, and, in fact, in some cases I thought they would be a disaster as a manager.

Monday, August 29, 2011

DuraSpace is bringing (King) Cloud to researchers!

http://www.flickr.com/photos/kky/704056791

I can't believe that it has been nearly a month since DuraSpace announced its new Direct-to-Researchers (DTR) platform.

What does this mean to the research community?

It's still very early after the announcement, but this new service could give researchers a platform on which to curate, preserve, and deliver their research results. DuraCloud makes it very easy to replicate content across more than one storage provider, and so making additional archival copies becomes much easier. A key question is how the researcher will experience the storage space. If they need to use special tools to move content into and out of DuraCloud, that could be a big barrier to use. But if they can "map a drive" or treat it as a virtual location available from the desktop and the web (like DropBox), that would make it very attractive.

What does this mean to a data archive like ICPSR?

I think this could head in many different directions for an "old school" data archive like ICPSR.

One world: If researchers can share and preserve their research output using a public cloud, why would anyone need a conventional data archive? In this world the key organizations are the content holders (the researchers and their cloud platforms) and the organization that can index the content across providers so that people can find what they need. Maybe in this world a place like ICPSR becomes an aggregator of metadata rather than content.

Another world: Researchers and curators collaborate in cloudspace to preserve content, and to publish appropriate elements when desired. In this world curators at a data archive might work with content which lives in the cloud rather than locally, and the preservation and delivery platform is distributed widely rather than at a central organization.

And another world: Researchers use services like DuraSpace DTR during the active life of their research project and through the early phases of its related publication lifecycle, but once the pendulum swings from "exclusive use" to "shared use," the researchers engage a place like ICPSR. They deposit their entire workspace, and the archive organizes and indexes the material to make it easier to preserve and to share. In this world life might look the same as it does today.

Friday, August 26, 2011

TRAC: B6.8: Generating correct DIPs

B6.8 Repository can demonstrate that the process that generates the requested digital object(s) (i.e., DIP) is correct in relation to the request.

The right material should be delivered and appropriate transformations should be applied, if necessary to generate the DIP. A simple example is that if the repository stores TIFF images but delivers JPEGS, the conversion should be shown to be correct to whatever standards seem appropriate. If the repository offers delivery as JPEG or PNG, the user should receive the format requested. Many repositories may apply more complex transformations to generate DIPs from AIPs.

Evidence: System design documents; work instructions (if DIPs involve manual processing); process walkthroughs; logs of orders and DIP production.

As described in last week's post, humans produce our DIPs, and the DIPs are reviewed before they are released to the public.

It seems if people were getting the wrong format of an item from our web site (e.g., went to download a dataset in Stata format, but SPSS showed up instead), they would let us know. Loudly. :-)
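The kind of mismatch described above (asking for Stata, receiving SPSS) could also be caught by a simple automated check before release. This is only a minimal sketch: the function name, the file names, and the idea of judging format by file extension are assumptions for illustration, not a description of how ICPSR's delivery system works.

```python
from pathlib import PurePosixPath

# Common file extensions for each statistical package's data files.
EXPECTED_EXTENSIONS = {
    "stata": {".dta"},
    "spss": {".sav", ".por"},
    "sas": {".sas7bdat", ".xpt"},
}


def wrong_format_files(requested_format, delivered_files):
    """Return delivered data files whose extension belongs to a different format.

    Hypothetical helper: a real check would likely inspect file contents,
    not just names.
    """
    allowed = EXPECTED_EXTENSIONS[requested_format.lower()]
    all_data_extensions = set().union(*EXPECTED_EXTENSIONS.values())
    mismatches = []
    for name in delivered_files:
        extension = PurePosixPath(name).suffix.lower()
        # Only data files are checked; documentation such as PDFs is ignored.
        if extension in all_data_extensions and extension not in allowed:
            mismatches.append(name)
    return mismatches
```

Calling `wrong_format_files("stata", ["da0001.sav", "codebook.pdf"])` would flag the SPSS file and ignore the codebook.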

Wednesday, August 24, 2011

E Amazon Unum - Out of Amazon, One

Photo from http://www.flickr.com/photos/polselli/1250189137/

Amazon announced a new "region" in the United States last week: the GovCloud region. The geographic location of the new region is on the US West Coast, but the logical location is Washington, DC.

Amazon says that ALL of its availability regions offer FISMA Moderate security controls, but this region offers one additional feature and demands one additional requirement so that it "supports the processing and storage of International Traffic in Arms (ITAR) controlled data and the hosting of ITAR controlled applications." The post goes on to say that:

As you may know, ITAR stipulates that all controlled data must be stored in an environment where logical and physical access is limited to US Persons (US citizens and permanent residents). This Region (and all of the AWS Regions) also provides FISMA Moderate controls. This means that we have completed the implementation of a series of controls and have also passed an independent security test and evaluation. Needless to say, it also supports existing security controls and certifications such as PCI DSS Level 1, ISO 27001, and SAS 70.

This gets interesting for organizations like ICPSR that conduct a lot of business with the US Government. Earlier this year we mounted a significant effort to categorize the security level for content stored in our archive, and then documented the relevant NIST security controls. It is easy to imagine that this type of effort will repeat itself as we interact with more federal agencies, and as those agencies struggle to become compliant with FISMA.

However, if I can short-circuit the process by using Amazon Web Services as my "machine room," and relying on Amazon's existing certifications and controls, then I may be able to ease the burden of writing and maintaining (and possibly implementing!) our own controls. I would not expect to eliminate the entire effort of documenting NIST security controls, but I may be able to point to Amazon's existing controls and documentation for, say, those controls related to the physical machine room. And remote access.

Indeed, instead of an AWS-hosted instance creating a barrier to a project ("oh no, if we build this in the cloud, we'll need to re-do all of the relevant NIST controls!"), it would facilitate the project.

Monday, August 22, 2011

Happy Birthday! Happy Anniversary!

It is the season for birthdays and anniversaries.

ICPSR is turning 50 years old, and will be hosting many events. We launch the year-long celebration at our biennial meeting of our Official Representatives this October. We'll host a reception at the American Political Science Association in September. And you can also find us in Las Vegas at Caesar's Palace during this year's American Sociological Association meeting. And, of course, there is our nifty new ICPSR@50 web site.

And what happens in Vegas stays in Vegas.

My tenure as the IT leader at ICPSR will turn 10 years old this fall. Since 2002 the size of the software development team, which spends most of its effort building content delivery systems for grants, contracts, and our membership business, expanded from two to eight. The number of servers - real and virtual - has expanded from two to two dozen (real) and two dozen (virtual). The amount of disk storage available has increased from less than 1TB to over 100TB (and even more if we count off-site archival storage locations where the storage is managed by someone else). ICPSR has moved into social networking, cloud computing, disk-based archival storage for preservation, and virtualized access for sensitive data. It has been a very exciting time!

And, finally, the tech@ICPSR blog turns 200 (posts) with this entry.

Friday, August 19, 2011

TRAC: B6.7: Generating complete DIPs

B6.7 Repository can demonstrate that the process that generates the requested digital object(s) (i.e., DIP) is completed in relation to the request.

If a user expects a set, the user should get the whole set. If the user expects a file, the user should get the whole file. If the user’s request cannot be satisfied, the user should be told this; for instance, resource shortages may mean a valid request cannot be satisfied.

Acceptable scenarios include:

The user receives the complete DIP asked for and it is clear to the user that this has happened.
The user is told that the request cannot be satisfied.
Part of the request cannot be satisfied, the user receives a DIP containing the elements that can be provided, and the system makes clear that the request is only partially satisfied.

Unacceptable scenarios include:

The request can only be partially satisfied and a partial DIP is generated, but it is not clear to the user that it is partial.
The request is delayed indefinitely because something it requires, such as access to a particular AIP, is not available, but the user is not notified nor is there any indication as to when the conflict will be resolved.
The user is told the request cannot be satisfied, implying nothing can be delivered, but actually receives a DIP, and is left unsure of its validity or completeness.

Evidence: System design documents; work instructions (if DIPs involve manual processing); process walkthroughs; logs of orders and DIP production; test accesses to verify delivery of appropriate digital objects.
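The acceptable scenarios above boil down to one rule: whatever the outcome, the user must be told which of the three cases applies. A minimal sketch of that rule follows; the `Outcome` enum, the `classify_dip` function, and the item names are hypothetical and are not part of TRAC or of any ICPSR system.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    UNSATISFIED = "unsatisfied"


def classify_dip(requested, delivered):
    """Compare what was requested with what can be delivered.

    Returns an (Outcome, message) pair so the user is always told
    whether the DIP is complete, partial, or unavailable.
    """
    requested, delivered = set(requested), set(delivered)
    missing = requested - delivered
    if not missing:
        return Outcome.COMPLETE, "All requested items were delivered."
    if requested & delivered:
        # Partial delivery is acceptable only if it is clearly labeled.
        detail = ", ".join(sorted(missing))
        return Outcome.PARTIAL, "Request only partially satisfied; missing: " + detail
    return Outcome.UNSATISFIED, "The request could not be satisfied."
```

For example, `classify_dip(["data", "codebook"], ["data"])` yields a partial outcome whose message names the missing codebook.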

My sense is that one of ICPSR's strengths is its delivery system for downloading packages of social science research data. Content goes through a fairly rigorous quality assurance process, and we make the content available in the most common open-and-serve formats.

Also, I know that we spend resources and staff time on a regular basis updating the oldest content, fixing it up so that it is easier to use. For example, when we first started making content available as SAS, SPSS, and Stata files, and gave web site visitors the opportunity to select just the format they wanted, we ran into problems with some of the older content. My recollection (somewhat fuzzy now) is that there were cases where studies were organized in odd ways, and one could have the same content spread across several datasets/parts, but in different formats. And this could then lead to very weird behavior if someone picked a format (e.g., SAS) that would leave mysterious "holes" in the download.

Because our DIPs are generated by a human, and reviewed before we place them on the web site for delivery, we should be delivering complete, correct DIPs. Certainly there is no evidence that the content people are downloading is flawed or incomplete on a routine basis (e.g., data without a codebook).

Wednesday, August 17, 2011

Solutions or problems?

When you need help from your IT shop, do you bring them solutions or problems?

If you are bringing them solutions, then you're wasting a valuable resource. If you've already defined the problem, decided on a technical solution, and you're just asking the IT shop to execute your solution, then you are wasting an opportunity to analyze the problem in the first place.

This happens all of the time, of course. It is exactly this sort of situation that creates absurdities where an organization invests tens of thousands of dollars automating a process, and then discovers much later that the process served no useful purpose. Or an organization builds out new technical infrastructure to support a self-imposed solution when a much cheaper alternative was readily available.

At ICPSR (and at previous jobs) I always enjoy those occasions when somebody stops by my office and begins the conversation with: "I have a problem I need to solve. Let me tell you about it." Often it leads to a long conversation about the problem, and then we find that the "problem" is actually something else entirely. And then we solve that problem. Together.

Monday, August 15, 2011

Is ICPSR more like craigslist or more like a newspaper?

Every year on the first Saturday in August our neighborhood association sponsors a subdivision-wide garage sale. This has been going on for about ten years or so.

When we first sponsored this event we placed ads in the classified section of area newspapers (Ann Arbor News, Detroit Free Press, Detroit News), placed "sandwich board" signs at the entrances of the subdivision to catch casual shoppers, and distributed balloons that people could use to "flag" their house as a participant. We also collected a list of participants and their merchandise (via email) and then redistributed that within the neighborhood (again via email) since residents seemed to like getting a "sneak preview."

A few years later we started using a Google Form instead of email to build a roster of participating households. Neighbors enter their address and a block of text about their stuff. It's just a paragraph text box, and so within a certain size limit they can write whatever they want. A few of us monitor the roster collected by the form in case someone has entered a duplicate by mistake, or has entered a new listing that should replace an earlier one. We also add the original neighborhood lot number. We make this roster world-readable, and we share it with anyone who is interested via a shortened Tiny URL version of the longer Google Forms URL.

We make an announcement about the annual garage sale on our Facebook page and on our neighborhood web site (a Google Site). And we include a link to this URL too. That helps search engines find and index it, which is a Good Thing. We also include a link to a neighborhood map that shows all of the homes and their lot number. (There isn't enough room to show the address nicely.) At this point in the process, we've collected a lot of good information and published it on the Internet in a way that makes it likely to get indexed (and found). But we still haven't done much active promotion.

And "promotion" is where I have seen the biggest change take place.

We posted an ad for the garage sale on craigslist late on a Friday night. The ad was free, and it was easy to add pictures. Adding links to our roster and map was not so easy (I had to write the HTML directly rather than using a nice widget), but the URLs worked. By the next day I could tell from Google Analytics that traffic to our neighborhood web site (particularly the page with the map) went up 30-fold!

We wanted to post an ad in one of the local papers too, but that required a phone call. We tried the number listed, and learned that we couldn't place the ad until Monday. When we called on Monday we found that text was acceptable, but pictures were not, and the price was $40. We included the URLs to the roster and map, mostly so that they would appear in the online version of the newspaper ad. (They did appear, but the links were broken - spaces had been inserted - and the URLs weren't clickable; they were just text.) The on-line ad also included a map showing the location of the garage sale, but the location shown was in a different city. The newspaper people couldn't fix the location on the map, and so they offered to delete the map instead. We accepted.

So as a data provider I had two (non-exclusive) choices: One was free, flexible, highly functional, and demonstrably effective. The other was none of those, but offered the promise of reaching an audience I might not reach with the first choice. And it seems pretty clear that most people aren't even considering this second choice any more. I don't know that we'll bother placing an ad next year in the local paper. I don't know that I'd do it even if it was free.

This got me thinking about ICPSR and its relationship with data providers. How do they perceive us as a place to host their content and reach their audience? Do they see ICPSR as craigslist or as the newspaper classified section?

Friday, August 12, 2011

TRAC: B6.6: Logging access failures

B6.6 Repository logs all access management failures, and staff review inappropriate “access denial” incidents.

A repository should have some automated mechanism to note anomalous or unusual denials and use them to identify either security threats or failures in the access management system, such as valid users being denied access. This does not mean looking at every denied access. This requirement does not apply to repositories with unrestricted access.

Evidence: Access logs; capability of system to use automated analysis/monitoring tools and generate problem/error messages; notes of reviews undertaken or action taken as result of reviews.

ICPSR maintains access logs using two methods.

One, we maintain a record at the file-level of each successful download. This captures information such as the identity of the downloader (including anonymous), a timestamp, the web property via which the download took place, the size of the download, and much more. As one might expect this information is used to generate aggregate-level reports for member Official Representatives (ORs), funding agencies, and so on.

Two, we maintain an error log that shows when something went awry. Summaries of these logs receive a light level of review on a daily basis, but a much more thorough, detailed analysis when someone reports a problem to ICPSR. A typical - but still infrequent - failure-mode is that someone is trying to download member-only data, and the system denies access because it does not believe that they have logged in from a "member-owned" IPv4 address block within the past few months. In the few times I've been involved directly in the trouble-shooting, it has been the case that a campus added a new block of IP addresses, but the OR has not notified ICPSR (probably because s/he wasn't aware of the new block either).
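The failure mode above comes down to testing whether a client address falls inside any address block registered for a member institution. Here is a minimal sketch using Python's standard `ipaddress` module. The function name is hypothetical, the blocks are reserved documentation ranges rather than real campus networks, and the sketch ignores the "within the past few months" part of the rule.

```python
import ipaddress


def is_member_address(address, member_blocks):
    """Return True if the address lies within any registered member block.

    `member_blocks` is a list of CIDR strings, e.g. "192.0.2.0/24".
    """
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(block) for block in member_blocks)


# Hypothetical registered blocks for one campus (documentation ranges).
registered = ["192.0.2.0/24"]

# A user on a newly added, unregistered campus block is denied...
denied_before = not is_member_address("198.51.100.5", registered)

# ...until the new block is reported and added to the list.
registered.append("198.51.100.0/24")
allowed_after = is_member_address("198.51.100.5", registered)
```

A denial for an otherwise valid user, followed by success once the block is added, matches the troubleshooting pattern described above.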

Tuesday, August 9, 2011

Amazon outage bites ICPSR too

Amazon reported a problem with its US-EAST region yesterday evening (EDT). Their service health dashboard reported that instances (virtual machines) in that region were having problems connecting to the Internet. That was definitely true of our stuff.

We first saw alerts for our (virtual) equipment in US-EAST at 22:25 EDT. At that time we lost connectivity to every single instance in US-EAST, but could still reach a small number of instances we have in other regions. This affected the cloud-based replica of our production web server, the Teaching With Data web portal, and our "social login" service. This latter service runs on an Amazon US-EAST instance operated by a company called Janrain, and isn't one of the instances where ICPSR has direct control.

By 22:56 EDT all of our systems were again reachable from the Internet, and no further action on our part (restoring content, restarting the instance) was necessary.

I have not yet seen a post mortem from Amazon, but based on my time in the data networking biz, my guess is that someone (Amazon or one of its transit or peering partners) made a routing change which blackholed their US-EAST traffic.

All in all Amazon continues to do a very good job with their cloud infrastructure, but this is a reminder that one would need to replicate services across several regions if one were to build a service with a very high level of availability.
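To put rough numbers on that reminder, here is a back-of-the-envelope sketch. It uses the alert and recovery times from this post, and it makes two loud assumptions: that this was the only outage in the month, and that regions fail independently of one another (a routing problem that hits several regions at once would break that assumption). The function names are made up for illustration.

```python
from datetime import datetime

# Times taken from this post: first alert and full recovery (EDT).
outage_start = datetime(2011, 8, 8, 22, 25)
outage_end = datetime(2011, 8, 8, 22, 56)
outage_minutes = (outage_end - outage_start).total_seconds() / 60

MINUTES_IN_AUGUST = 31 * 24 * 60


def monthly_availability(downtime_minutes, minutes_in_month=MINUTES_IN_AUGUST):
    """Fraction of the month the service was reachable."""
    return 1 - downtime_minutes / minutes_in_month


def combined_availability(per_region, regions):
    """Chance that at least one region is up.

    ASSUMPTION: regions fail independently, which real outages may violate.
    """
    return 1 - (1 - per_region) ** regions


single_region = monthly_availability(outage_minutes)
two_regions = combined_availability(single_region, 2)
```

Under those assumptions the unavailability of a two-region deployment is the square of the single-region figure, which is why replication across regions pays off so quickly.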

Monday, August 8, 2011

ICPSR's Secure Data Environment (SDE) - The Workflow

One of the most challenging aspects of building our Secure Data Environment (SDE) for managing social science research data was redesigning our workflow.

In the past our environment was quite open, and so the workflow did not need to be concerned with certain aspects of access control. For example, if the workflow required an ICPSR data manager to send an email to the original depositor, and the data manager wanted to include some content that was cut and pasted from the dataset, they would have been able to do that at any point in the process without any special actions. But in the SDE we have disabled email, and we tightly control how material leaves the SDE, so this sort of free access is no longer available. And so the workflow had to change.

Content arrives at ICPSR through our Deposit Form web application. In brief this is how a depositor transfers content to us and grants us non-exclusive access to manage and share the content. (They can also add descriptive metadata to the content.) The Deposit Form runs on our public-facing web server, which, of course, does not reside within our SDE.

One change we made, therefore, was to encrypt all content as it arrives. This means that the content isn't available in the clear - even accidentally - on our public web server. Next, an automated job runs on a regular, frequent basis, "sweeping" content from the public web server to the SDE. Once it arrives within the SDE we decrypt the content so that the data manager has easy access to the materials.

Data managers use another web application called the Deposit Viewer to view and manage deposits, and while they can view metadata about the deposit from the desktop and the SDE, they can only download the deposited files from within the SDE. This gives them the convenience of checking on deposit status, for example, from either environment, but ensures that the files do not leave the secure environment accidentally.

All data management functions take place within the SDE. A data manager may move content from the SDE to the outside world, but the transfer takes place via a software airlock. The airlock tracks what has been moved, who has moved it, and requires a supervisor to approve the transfer.
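The airlock's three properties (what moved, who moved it, and supervisor approval before release) can be sketched as a small record type. This is an illustration only, not ICPSR's implementation: the class and method names are hypothetical, and the rule that a requester cannot approve their own transfer is an added assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TransferRequest:
    """One request to move a file out of the secure environment."""

    filename: str
    requested_by: str
    requested_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    approved_by: Optional[str] = None

    def approve(self, supervisor):
        # ASSUMPTION: self-approval is not allowed.
        if supervisor == self.requested_by:
            raise PermissionError("requester cannot approve own transfer")
        self.approved_by = supervisor

    def release(self):
        """Release the file only after a supervisor has approved it."""
        if self.approved_by is None:
            raise PermissionError("transfer has not been approved")
        return (
            f"{self.filename} released (requested by {self.requested_by}, "
            f"approved by {self.approved_by})"
        )
```

Keeping the requester, timestamp, and approver on the same record means the audit trail is a by-product of the transfer itself rather than a separate log to maintain.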

Once the data manager has completed all data processing and quality control, s/he then uses a set of utilities to generate the "ready to go" formats that we distribute via our web site and to release the materials both to the web site and to our archival storage fabric. Like the airlock process above, this step tracks who did what and when and where, and also requires management approval. The archival copies remain within the SDE, and the public-use, ready-to-go files move to our web site. Ensuring that key software systems have access to push files out of the SDE, but ensuring that staff do not, also required a few changes to our workflow.

Friday, August 5, 2011

TRAC: B6.5: Implementing access policies

B6.5 Repository access management system fully implements access policy.

The repository must demonstrate that all access policies are implemented. Access may be managed partly by computers and partly by humans—checking passports, for instance, before issuing a user ID and password may be an appropriate part of access management for some institutions.

Evidence: Logs and audit trails of access requests; information about user capabilities (authentication matrices); explicit tests of some types of access.

For the content we deliver via web download or on-line analysis, we make a record in a database of who accessed what content. We call this database our "order history" system, and as one might expect, we use this information to produce all sorts of reports, the most common of which are to identify aggregate usage by member institution, study, etc.

For the content we deliver via removable media, the access is captured in two ways: (1) the "order history" system above, and (2) our restricted-use contracting system, which records the legal agreement between ICPSR and the data analyst.
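As a rough illustration of how an "order history" table supports the aggregate reports mentioned above, here is a minimal sketch using an in-memory SQLite database. The table layout, column names, and sample values are all hypothetical; the real system's schema is not described in this post.

```python
import sqlite3


def usage_by_institution(rows):
    """Count downloads per member institution.

    `rows` is an iterable of (user_id, institution, study, downloaded_at)
    tuples, a made-up layout for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE order_history ("
        "user_id TEXT, institution TEXT, study TEXT, downloaded_at TEXT)"
    )
    conn.executemany("INSERT INTO order_history VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT institution, COUNT(*) FROM order_history "
        "GROUP BY institution ORDER BY institution"
    ).fetchall()
    conn.close()
    return result


sample = [
    ("u1", "University A", "Study 1", "2011-08-01"),
    ("u2", "University A", "Study 2", "2011-08-02"),
    ("u3", "University B", "Study 1", "2011-08-02"),
]
report = usage_by_institution(sample)
```

Swapping `institution` for `study` in the query would give the per-study view of the same records.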

Wednesday, August 3, 2011

July 2011 deposits at ICPSR

Another month, another deposit summary:

# of files	# of deposits	File format
3	2	application/msaccess
510	16	application/msword
317	2	application/octet-stream
528	22	application/pdf
22	17	application/vnd.ms-excel
1	1	application/vnd.wordperfect
18	1	application/x-123
115	2	application/x-dbase
3	1	application/x-empty
3	3	application/x-sas
212	12	application/x-spss
5	3	application/x-stata
5	5	application/x-zip
1	1	audio/mpeg
1	1	image/jpeg
10	1	message/rfc822 7bit
2	2	text/html
5	2	text/plain; charset=unknown
232	9	text/plain; charset=us-ascii
1	1	text/rtf
1	1	text/x-c++; charset=us-ascii
50	2	text/xml
12	2	video/unknown

The usual suspects appear in the usual volumes: lots of SPSS, PDF, and MS Word. There seems to be a lot of dBase in this month's report: that is unusual, and is worth investigating. The service that generates MIME type works pretty well most of the time, but is not 100% error-free. And to that point, I suspect that the purported video files and C++ source code are actually something else.

The two deposits with many "unknown" (application/octet-stream) are worth a look too. They may be some esoteric format that we do not see all that often.
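A table like the one above takes two tallies per MIME type: how many files, and how many distinct deposits those files came from. A minimal sketch of that tally follows, assuming the MIME type of each file has already been detected. The function name and the deposit identifiers are hypothetical.

```python
from collections import defaultdict


def summarize_deposits(files):
    """Build (file_count, deposit_count, mime_type) rows.

    `files` is an iterable of (deposit_id, mime_type) pairs, one per file.
    """
    file_counts = defaultdict(int)
    deposit_ids = defaultdict(set)
    for deposit_id, mime_type in files:
        file_counts[mime_type] += 1
        # A set counts each deposit once, however many files it holds.
        deposit_ids[mime_type].add(deposit_id)
    rows = [
        (file_counts[mime], len(deposit_ids[mime]), mime)
        for mime in file_counts
    ]
    return sorted(rows, key=lambda row: row[2])


sample_files = [
    ("deposit-1", "application/pdf"),
    ("deposit-1", "application/pdf"),
    ("deposit-2", "application/pdf"),
    ("deposit-2", "application/x-spss"),
]
summary = summarize_deposits(sample_files)
```

A large gap between the two counts, as with dBase and octet-stream this month, points to one or two unusually large deposits rather than a broad shift in what depositors send.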

Monday, August 1, 2011

ICPSR's Secure Data Environment (SDE) - The Virtual Desktop

A key feature of our Secure Data Environment (SDE), the computing environment in which ICPSR staff manage and process research data and documentation, is that it doesn't exist. At least not in the material world.

The SDE makes use of a University of Michigan system called the Virtual Desktop Infrastructure (VDI) service. According to the U-M web site the service is geared to lowering support costs for departments, and we have found that it is indeed a bit easier to manage virtual machines than physical machines. However, the real selling point for us was that we could centralize access to confidential (and potentially confidential) data in a single space that we could secure, manage, grow, shrink, etc. easily.

We restrict network access to our portion of the VDI using the Virtual Firewall service that I described in an earlier post. That limits the number of potential intruders dramatically. (Anyone who runs a server which is accessible via ssh from the general Internet will know what I mean.) We use University of Michigan-assigned credentials to grant login access to our pool of virtual machines. And because U-M is able to provision credentials for colleagues and associates who are outside of the U-M, we can also grant access to others as needed.

We've found the VDI service itself to be pretty solid overall, and since most of our use falls during the typical workday, we do not find maintenance windows during off-hours or the weekend to be terribly troublesome. There are a few things we'd like to do that are not part of the existing VDI service, and we've found the U-M to be a partner willing to work with us. For example, for a certain class of access we would like to use two-factor authentication and require someone to also enter a one-time passcode from a key fob. That isn't built into today's VDI service, but it may be available in the future.