Technology at ICPSR: July 2011

Friday, July 29, 2011

TRAC: B6.4: Access policies and deposit agreements

B6.4 Repository has documented and implemented access policies (authorization rules,
authentication requirements) consistent with deposit agreements for stored objects.

User credentials are only likely to be relevant for repositories that serve specific communities or that have access restrictions on some of their holdings. A user credential may be as simple as the IP address from which a request originates, or may be a username and password, or may be some more complex and secure mechanism. Thus, while this requirement may not apply to some repositories, it may require very formal validation for others. The key thing is that the access and delivery policies are reflected in practice and that the level of validation is appropriate to the risks of getting validation wrong. Some of the requirements may emerge from agreements with producers/depositors and some from legal requirements.

Repository staff will also need to access stored objects occasionally, whether to complete ingest functions, perform maintenance functions such as verification and migration, or produce DIPs. The repository must have policies and mechanisms to protect stored objects against deliberate or accidental damage by staff (see C3.3).

Evidence: Access validation mechanisms within system; documentation of authentication and validation procedures.

Most of ICPSR's access policies are driven by the simple deposit agreements we use. We do not own any of the content; instead, we have a non-exclusive license to preserve and deliver the content for research purposes. And so our function is really to protect it on behalf of others, and make it available to one of two communities: the entire world (if the depositor is a US government agency with which we have a relationship), or ICPSR's membership. And, as mentioned briefly in the requirement above, we tend to use IP addresses to determine whether a given web site visitor is associated with a member institution or not.

In a small number of cases, deposited content will have much more restrictive conditions on the use of the content. Often the precise conditions are not known, and ICPSR later negotiates with the data provider to create an acceptable license (restricted-use data agreement) for the content. In this case the documentation of the access controls is very explicit, and ICPSR retains the executed licenses.

Wednesday, July 27, 2011

ICPSR's Secure Data Environment (SDE) - The Storage

To implement its Secure Data Environment (SDE) ICPSR replaced an aging storage array with two newer systems. The idea was to use a physical separation between storage devices to help make our data management environment more secure.

In many ways the physical separation of systems is overkill. There isn't much to be gained at the level of the individual data manager or data handler using two separate storage arrays rather than a single array that has been partitioned into two logical arrays. However, the real value comes, I think, in protecting the IT team from itself. And I include myself in that statement too.

It would be easy to have a single physical storage array with multiple virtual storage servers. That is, one can easily create a chunk of storage -- say, a filesystem called /secretStuff -- and then make it available to one virtual storage server, but not another. And by using a firewall one could then ensure that people working within the SDE would be able to access /secretStuff, and people working outside the SDE would not.

The risk, however, is that someone creates a filesystem like /secretStuff, and then accidentally makes it available across ALL virtual storage servers. And therefore, not only are SDE systems able to read files in /secretStuff, but the content also becomes, inadvertently, available to the web server too. That's not good.
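To make that risk concrete, here is a minimal sketch of the kind of audit a single partitioned array would have to depend on. Everything in it is hypothetical: the export table, the virtual server names, and the `find_overexposed` helper are invented for illustration and do not reflect any real storage vendor's interface or ICPSR's actual configuration.

```python
# Hypothetical export table: filesystem -> virtual storage servers that can see it.
# All names are invented for illustration only.
EXPORTS = {
    "/secretStuff": {"vserver-sde", "vserver-web"},  # the accidental case
    "/webContent": {"vserver-web"},
    "/sdeWork": {"vserver-sde"},
}

# Filesystems that should only ever be visible inside the SDE (assumption).
SDE_ONLY = {"/secretStuff", "/sdeWork"}
SDE_SERVERS = {"vserver-sde"}


def find_overexposed(exports, sde_only, sde_servers):
    """Return {filesystem: [unexpected servers]} for SDE-only filesystems."""
    problems = {}
    for fs in sorted(sde_only):
        extra = exports.get(fs, set()) - sde_servers
        if extra:
            problems[fs] = sorted(extra)
    return problems


if __name__ == "__main__":
    found = find_overexposed(EXPORTS, SDE_ONLY, SDE_SERVERS)
    for fs, servers in found.items():
        print(f"{fs} is visible to {', '.join(servers)}")
```

A check like this only helps if someone remembers to run it, which is part of why physical separation is attractive: the mistake it looks for cannot be made in the first place.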

We therefore placed one of our physical storage arrays on our Private network. Since this network uses private IPv4 address space, this made the array largely invisible to much of the Internet. Further, the firewall rules for the Private network are very, very restrictive, and access is available only within the SDE, and to a small number of developer workstations (and then only for ssh access). We use this storage array for all of our content which is confidential and content which might be confidential.

Our second physical storage array resides on our Semi-Private network. This too uses private IPv4 address space, and therefore is only accessible to machines within the University of Michigan enterprise network. We allow access via protocols like NFS and CIFS to the storage array within the University of Michigan environment, and we further manage detailed access control lists for individual NFS exports. The array provides storage to our web server and other public-facing machines, and also serves as the storage back-end for desktop computers. For example, if you work at ICPSR, then your My Documents folder maps to this array.

The biggest hurdle in replacing one old storage array with two new systems was separating people's storage into two categories: public stuff that they would need to access from their desktop (e.g., stuff they may want to email to someone), and private stuff that they would need to access from within the SDE (e.g., data files and documentation). This required a significant investment of time from everyone at ICPSR, and especially the IT staff. I think I spent about 10-15 weekends in the office between February and May, moving content between systems, making sure that drive mappings still worked, double-checking checkpoint schedules and backups, etc.

The separation seems to have gone relatively smoothly, at least from the perspective of the IT team. There were no major snafus during the transition, and the number of trouble tickets was relatively low.

The separation did mean that we needed to create some new systems - and tweak existing systems - to create mechanisms so that content could move between the two systems, but in a controlled way that could be audited later. I'll describe changes we made to our deposit and release systems in my next post, and will also describe our new data airlock system.

Monday, July 25, 2011

An open letter to our storage system

Dear Storage System:

I think it is time that we had a talk. Not a friendly chat, but a serious, heart-to-umm-bus talk. About real issues.

You know, we used to have one of your older cousins live with us. Your cousin's official name was just a bunch of letters and numbers, but we always said "the Naz" for short. Kind of like "the Fonz."

The Naz was sweet. Everyone loved the Naz. When the Naz felt the least bit sick, in a flash there would be a call home, and then before you could say, "broken disk drive," some dispatcher would be on the phone with us, offering to bring out a replacement part that same day. Sometimes we'd even get the phone call before we knew there was a problem. Those were the times!

Sure, some people on campus made fun of us, but we loved the Naz anyway. "Hey, why do you spend so much on storage for your Naz? Don't you know you can buy storage from us way, way cheaper?" Yes, they would say such things. Other storage providers can be so cruel.

But we didn't care. We loved our Naz. Dependable, reliable, but, sure, a little expensive. But we thought that we were spending our money wisely on the Naz. Not a bit of trouble. We loved our Naz.

But, as all good things go, so did the Naz. Parts got old. The cabinet got dusty. Sure, a 300GB disk drive seemed big enough back in 2005, but now it just seemed, I don't know, quaint. And, you know, the Naz never had a second data mover....

And, so, after a lot of planning, and a lot of work, we put the Naz out to pasture. The Naz isn't storing and serving data like in the old days; it just mills about the machine room, chewing on electricity, and soaking up the air conditioning. Visitors come by to visit, and they never stay with the Naz for very long.

Of course, as we were reading sweet bedtime stories to the Naz, we were grooming you for the job. You were new and shiny. We didn't know you very well, but we knew our Naz, and we knew you were supposed to be just like our Naz. But newer. And better. And with more blue lights. And we were excited.

Things started well. In many ways you were just like the Naz, but better. Your disk drives were hefty. You had newer software. You even seemed a little faster, just like your car does after you wash it. This was great.

But then the behavior problems started. You know what I mean.

Like the time that you knew full well that there was a problem, but did you call home for help? No. You made us do it. Why? Why wouldn't you use the nice telephone line we left for you?

And, sure, after we finally convinced you to call home, you then told the foulest lies.

"We've tried to log in to your system, but the password doesn't work!" the techs would say. Why did you give them the wrong password? Do you think this is a game?

"Please download DiskDebunker v4.3 from our mirror site. Install it on a Windows ME machine, configured for use on a private network, and use it to assess if the storage processor valve flanges are flush. This will produce a 700GB file called DataGrommit.zip, and you should then upload that to the Easy Web support site." they would say other times. I can only imagine the wild tales you must have told to confuse them so. Why couldn't you just be honest and authentic, and tell them that one of your disk drives faulted, and all we needed was a replacement? Why couldn't you be more like the Naz?

"I have read and understand all site messages." said the service requests. Yeah, sure. What did our little troublemaker say this time?

This behavior must stop!

We're scheduling a little something that some people call "an intervention." We want you to hear the problems that your mischief is causing. We want you to hear the stories. We want you to hear it from the Naz.

We don't want it to end this way. We want it to work. But we've had enough. Don't make us call the people at Property Disposition to come get you. You won't like where they will take you.

Let's give it one more try.

Your friends @ ICPSR.

Friday, July 22, 2011

TRAC: B6.3: Ensuring proper access

B6.3 Repository ensures that agreements applicable to access conditions are adhered to.

The repository must be able to show what producer/depositor agreements apply to which AIPs and must validate user identities in order to ensure that the agreements are satisfied. Although it is easy to focus on denying access when considering conditions of this kind (that is, preventing unauthorized people from seeing material), it is just as important to show that access is granted when the conditions say it should be.

Access conditions are often just about who is allowed to see things, but they can be more complex. They may involve limits on quantities—all members of a certain community are permitted to access 10 items a year without charge, for instance. Or they may involve limits on usage or type of access—some items may be viewed but not saved for later reuse, or items may only be used for private research but not commercial gain, for instance.

Various scenarios may help illustrate what is required:

If a repository’s material is all open access, the repository can simply demonstrate that access is truly available to everyone.

If all material in the repository is available to a single, closed community, the repository must demonstrate that it validates that users are members of this community, perhaps by requesting some proof of identity before registering them, or just by restricting access by network addresses if the community can be identified in that manner. It should also demonstrate that all members of the community can indeed gain access if they wish.

If different access conditions apply to different AIPs, the repository must demonstrate how these are realized.

If access conditions require users to make some declaration before receiving DIPs, the repository must show that the declarations have been made. These might be signed forms, or evidence that a statement has been viewed online and a button clicked to signify agreement. The declarations might involve nondisclosure or agreement to no commercial use, for instance.

Evidence: Access policies; logs of user access and user denials; access system mechanisms that prevent unauthorized actions (such as save, print, etc.); user compliance agreements.

Demonstrating that group X has been granted access (correctly) to resource set Y, and that group Z has been denied access (correctly) to the same resource set is either very easy or nearly impossible at ICPSR. Here's what I mean....

Most of our content is public-use and available to the entire world. Access requires only a very weak identity (MyData, or, these days, Facebook or Google IDs) and that the user click through a type of license (our terms of use). Software ensures that the person has authenticated and clicked through our license, and as long as the person performs these two steps, access is granted.

The next biggest collection of content is also public-use, but has one or two simple strings attached. In some cases, the data provider requires that access be anonymous, and so we skip the authentication step. In other cases, the content should be available only to users connected (somehow) to a member institution, and our rules for deciding if someone has such a connection are intentionally liberal. Are you using a computer with an IP address we think belongs to the member? Have you used such a computer within the past six months? Are you the Organizational Representative of a member institution, regardless of your IP address or use within the past six months? Any of these will get one to member-only content.
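The three member-access questions above amount to a simple "any of these" test. Here is a minimal sketch of that logic, not our production code: the network blocks are reserved documentation ranges standing in for real member address space, "six months" is approximated as 183 days, and the function names are made up for this example.

```python
import ipaddress
from datetime import date, timedelta

# Hypothetical member networks (reserved documentation ranges, not real members).
MEMBER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]
GRACE_PERIOD = timedelta(days=183)  # "within the past six months", approximated


def ip_is_member(ip):
    """True if the address falls inside any known member network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in MEMBER_NETWORKS)


def has_member_access(ip, last_member_ip_use, is_org_rep, today):
    """Any one of the three conditions grants access to member-only content."""
    if ip_is_member(ip):
        return True
    if last_member_ip_use is not None and today - last_member_ip_use <= GRACE_PERIOD:
        return True
    return is_org_rep
```

For example, `has_member_access("203.0.113.5", date(2011, 3, 1), False, date(2011, 7, 22))` grants access because the visitor used a member address within the grace period, even though the current address is not a member address.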

A small batch of content is restricted-use, and this too is easy. We send the content on removable media once a data use agreement (or contract) has been signed, and so ensuring that the content is going to only the right people is very straight-forward because the number of recipients is very small (i.e., one).

So that's the "very easy" part of the story.

However, there is almost always a very small collection which has very "interesting" access rules. These rules are usually short-lived, complex, and difficult to prove correct. They sometimes depend upon point solutions that need to be made "right now."

As one example, I remember a case where we needed to make a certain ICPSR study available to our membership (easy), and to anyone running a browser on a machine with an IP address which was on a special list. This list contained dozens, maybe hundreds, of IP network numbers. Now, an easy mechanism would have been to treat those IP networks as address space belonging to a member institution, but then that would have granted ALL of our member-only content rather than just this one study. So we very quickly built new capabilities into the delivery system so that content could not only be "public" or "member-only" but also "member-only + these guys too". I don't think we needed to use the capability for more than a few months, and, of course, it is very hard to know if it did exactly what it was supposed to. (The cost of error was pretty low, unlike, say, errors made by a bank. Or a nuclear reaction.)
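The shape of that "member-only + these guys too" rule can be sketched in a few lines. This is an illustration under assumptions, not the delivery system's actual code: the study number, the network block (a reserved documentation range), the access-level labels, and the `can_access` function are all hypothetical.

```python
import ipaddress

# Hypothetical data: study number -> extra networks allowed for that study only.
EXTRA_NETWORKS = {
    "1234": [ipaddress.ip_network("203.0.113.0/24")],
}


def can_access(study, access_level, ip, is_member):
    """access_level is 'public', 'member-only', or 'member-plus' (invented labels)."""
    if access_level == "public":
        return True
    if is_member:
        return True
    if access_level == "member-plus":
        # The special list applies to this one study, not to all member content.
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in EXTRA_NETWORKS.get(study, []))
    return False
```

Keying the extra networks by study is the point of the design: an address on the special list reaches study "1234" but gains nothing for any other member-only study.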

And there is almost always some similar need in production or on the radar screen, and so I consider this collection of ad hoc, short-lived access "solutions" to be the "nearly impossible" part of the story.

Wednesday, July 20, 2011

ICPSR's Secure Data Environment (SDE) - The Network

By the end of 2009 ICPSR's data network looked very much like it had in 1999. It consisted of a single virtual local area network (VLAN) that was home to a handful of IPv4 address blocks. The number of blocks had grown over the decade as ICPSR hosted more equipment in its machine room, such as servers running Stanford's LOCKSS software and Harvard's DataVerse Network (DVN) system. Also, as the ICPSR Summer Program expanded, the number of guest machines and lab machines expanded, and this too drove the acquisition of more network blocks.

The blocks were in public IPv4 space, and therefore in principle, any machine on ICPSR's VLAN could reach any location on the Internet, and vice-versa. In practice some simple devices, such as printers and network switches, used private IPv4 address space, routed only within the University of Michigan. This is a fairly common practice, of course, to conserve IPv4 address space and to protect (somewhat) systems from network-based attacks.
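For readers less familiar with the distinction, here is a small sketch of how one might test whether an address sits in private IPv4 space. It assumes "private" means the three RFC 1918 blocks, and the helper name `is_rfc1918` is made up for this example.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918 = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(ip):
    """True if the address is in one of the RFC 1918 private blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in RFC1918)
```

The explicit list is used here rather than the standard library's `is_private` attribute because that attribute also covers other reserved ranges, such as loopback and link-local addresses.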

At that time we also made use of simple Cisco access-list rules which acted as a very primitive firewall. The campus network administrators did this for us, but somewhat grudgingly since it was a non-standard practice for them, and made ICPSR's data networking equipment more difficult to manage. And it was also less than ideal for us too since we didn't have regular access to the data networking switches and routers, and so never knew exactly how they were configured at any given time.

So in a nutshell we had a very flat, very basic, and very open network.

This all changed in early 2010 when we started using a new product/service available from the campus network administrators called the Virtual Firewall (VFW). This is based upon a Checkpoint product which (I believe) is often used with commercial network providers who resell network blocks to smaller companies. Within the University of Michigan it is used by departments and organizations like ICPSR who would like all of the benefits of having a firewall, but who lack the resources and expertise to manage all of the infrastructure. In many ways it is the "cloud version" of a firewall, giving one access to the tools to manage access controls, but without the expense of managing the physical firewall itself. This has been an outstanding service.

In addition to using the new VFW we also partitioned our network into four (and later seven!) VLANs:

Public
Semi-Private
Private
Virtual desktops
Virtual Summer Program
Virtual Data Enclave
Virtual Testing and Evaluation

I'm going to skip discussion of the last three VLANs for now to focus on the first four.

The Public VLAN uses public address space and is home to all of our public-facing infrastructure. For example, it is the home of our production web server, our authoritative DNS server, and special-purpose machines running LOCKSS, DVN, etc. Access into and out of this VLAN is relatively open, but we do restrict access to certain protocols for certain machines.

The Semi-Private VLAN uses private address space and is home to all of our non-public, but non-sensitive systems such as desktop computers, printers, and so on. We make relatively light use of the VFW for this VLAN, and outbound access uses NAT so that people can reach the Internet. One of our two EMC NS-120 NAS units also resides on this VLAN.

The Private VLAN also uses private address space, and it contains all of our internal data management and archival storage systems. Our second EMC NS-120 NAS holds this content. This VLAN is heavily controlled via the VFW, and both inbound and outbound access are heavily restricted.

Finally, we use a different VLAN for a pool of virtual workstations that our data managers use to "process" research data and documentation. Like the Private VLAN, this VLAN makes extensive use of the VFW for access control. In many ways the access controls of this VLAN are similar to the Private VLAN, but we have found it useful to use two different network segments, one for the individual virtual workstations and one for the back-end systems.

Friday, July 15, 2011

TRAC: B6.2: Recording access metadata

B6.2 Repository has implemented a policy for recording all access actions (includes requests, orders etc.) that meet the requirements of the repository and information producers/depositors.

A repository need only record the actions that meet the requirements of the repository and its information producers/depositors. This may mean that little or no information is recorded about access. That is acceptable if the repository can demonstrate that it does not need to do more. Some repositories may want information about what is being accessed, but not about the users. Others may need much more detailed information about access. A policy should be established and implemented that relates to demonstrable needs. Are these figures being monitored? Are statistics produced and made available?

Evidence: Access policies; use statements.

ICPSR collects a considerable amount of information about each access: who, what, when, and where (that is, which of the properties within the portal the request came through). This allows ICPSR to assist users who are having access problems, and to produce summary reports for a member's Organizational Representative or a government agency which relies upon ICPSR to provide access to its content.
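As a rough illustration of the who/what/when/where record and the kind of summary it makes possible, here is a minimal sketch. The field names, the `AccessRecord` class, the sample values, and the `accesses_by_site` helper are all assumptions for this example, not our actual schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AccessRecord:
    who: str        # user identity
    what: str       # study or file requested
    when: datetime  # time of the access
    where: str      # portal property the request came through


def accesses_by_site(records):
    """Count accesses per portal property, e.g. for a summary report."""
    return Counter(r.where for r in records)


# Made-up sample data for illustration only.
SAMPLE = [
    AccessRecord("user-a", "study-1", datetime(2011, 7, 1, 9, 0), "NACJD"),
    AccessRecord("user-b", "study-1", datetime(2011, 7, 1, 9, 5), "ICPSR"),
    AccessRecord("user-a", "study-2", datetime(2011, 7, 2, 14, 30), "NACJD"),
]
```

With the sample data, `accesses_by_site(SAMPLE)` reports two accesses through NACJD and one through ICPSR.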

ICPSR collects even more information if the content is part of a restricted-use collection. In this case, a research plan, CV, data protection plan, and more are required.

Because a single delivery platform is serving so many different masters (the consortium of members; government agencies; individual depositors; etc), a single policy may not be particularly workable, unless it is necessarily open-ended and broad (e.g., "save as much information as you can since you never know what report you'll need to produce").

Wednesday, July 13, 2011

ICPSR's Secure Data Environment (SDE)

ICPSR has designed, built, and deployed what we call the Secure Data-processing Environment (SDE) over the past twelve months. This is a tightly managed, highly controlled environment in which many members of the ICPSR staff perform their day-to-day data management (data processing) work.

The main business requirement behind the SDE is that it should be difficult, if not impossible, for content to leak out without a member of the staff taking an explicit action, such as running a program which formally releases content on the web site and commits it to archival storage. For example, it should not be possible for someone to upload a data file into a web form, or to attach it to a piece of email.

The design called for many changes to ICPSR's technology infrastructure. We separated our storage into two pools - Private (accessed within the SDE) and Semi-Private (which is more accessible). We separated our network into three main virtual LANs - Private, Semi-Private, and Public. We also updated many, many software systems so that they would operate properly within the SDE. And we also changed processes to conform to the new business requirements. For example, if one process required a data processor to send an email containing a data file to someone else at ICPSR, we changed the process so that email was not required.

I'll post a series of articles over the next few weeks with more details about the SDE and its technology. This will include posts about how we separated storage; how we segmented the network; how we used virtualization technology to solve certain problems; how we changed key software systems; and, how the SDE changed business processes at ICPSR, and how it continues to do so even today.

Monday, July 11, 2011

June 2011 deposits at ICPSR

June 2011 was a very busy time for our deposit system. The number of deposits was pretty typical, but the number of files was enormous.

# of files	# of deposits	File format
2	1	application/msaccess
23	1	application/msoffice
165	23	application/msword
698	4	application/octet-stream
266	28	application/pdf
144	11	application/vnd.ms-excel
1	1	application/vnd.ms-powerpoint
14	2	application/vnd.wordperfect
141	1	application/x-123
4	1	application/x-arc011lzw
25	1	application/x-dbase
23	1	application/x-dosexec
1	1	application/x-empty
1	1	application/x-rar
19	5	application/x-sas
1307	21	application/x-spss
10	4	application/x-stata
3	3	application/x-zip
20	9	message/rfc8220117bit
8	7	text/html
12	6	text/plain; charset=iso-8859-1
10	6	text/plain; charset=unknown
4386	46	text/plain; charset=us-ascii
2	1	text/plain; charset=utf-8
11	4	text/rtf
7	2	text/x-c++; charset=us-ascii
1	1	text/x-c; charset=us-ascii
1	1	text/x-mail; charset=us-ascii
1	1	text/xml
153	2	video/unknown

In addition to the usual suspects like plain ASCII, SAS, SPSS, MS Word, PDF, we also have some of the usual problems, such as files being reported by the automated checker as containing C or C++ source code, when the truth is that they are likely text/plain instead.

One interesting data point is the pair of deposits that contain video files, and lots of them. Upon further review these appear to be vintage SPSS files for the IBM PC. Here's a string that appears in all of the files:

SPSS/PC+ System File Written by Data Entry II

and here is another one:

PCSPSS SYSTEM FILE. IBM PC DOS, SPSS/PC+ V3.0

From a timestamp located nearby, it looks like these files were from 1994. Or maybe they were moved from a mainframe to a PC in 1994?
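A second-pass check for these files could be as simple as looking for the two marker strings. The following is a sketch only: the 4096-byte scan window is a guess (I have not pinned down where in the files the strings sit), relabeling matches as application/x-spss is my assumption, and the function names are invented.

```python
# Marker strings observed in the misidentified files.
SPSS_PC_SIGNATURES = (
    b"SPSS/PC+ System File Written by Data Entry II",
    b"PCSPSS SYSTEM FILE",
)


def looks_like_spss_pc(data, scan_bytes=4096):
    """True if a known SPSS/PC+ marker appears near the start of the data."""
    head = data[:scan_bytes]
    return any(sig in head for sig in SPSS_PC_SIGNATURES)


def refine_mime(reported, data):
    """Override a 'video/unknown' report when the content looks like SPSS/PC+."""
    if reported == "video/unknown" and looks_like_spss_pc(data):
        return "application/x-spss"
    return reported
```

The override is deliberately narrow: it only second-guesses the checker for the one MIME type that looked wrong, and leaves every other report alone.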

And there are a few others on the list above that would benefit from some human scrutiny as well.

Friday, July 8, 2011

TRAC: B6.1: Access and delivery options

B6.1 Repository documents and communicates to its designated community(ies) what access and delivery options are available.

Repository policies should document the various aspects of access to and delivery of the preserved information. Generally, the designated community(ies) should know the policies or at least the consequences of them. The users should know what they can ask for, when, and how, and what it costs, among other things. [See Appendix 6: Understanding Digital Repositories & Access Functionality for an in-depth review of digital repository access requirements.]

Repositories might have to deal with a single, homogeneous community or with multiple or disparate communities. Different policies might be needed for different communities as well as for different collection types.

Evidence: Public versions of access policies; delivery policies; fee policies.

Access is one of ICPSR's strong suits. Evidence to support this TRAC requirement can be found across many different pages on ICPSR's public web portal.

For example, if we take study 2999 (Israeli Election Study, 1999 - the first hit when searching for 'election' on the portal), the home page displays a section called Access Notes which makes it clear what it is possible to do with the content.

If one clicks through the link to download content, the next display makes it clear what formats are available.

And there are also pages describing how to become a consortium member, how much things cost if you are not a member, etc.

And if there are versions of content available in both public-use and restricted-use versions, the site also makes that clear.

Wednesday, July 6, 2011

ICPSR web portal availability in 2010-2011

It's that time again: the end of another fiscal year. And that means it is also time for my annual summary of ICPSR web portal availability.

The leftmost month above is July 2010 and the rightmost is June 2011. The vertical axis shows availability for each month in terms of a percentage. Our goal is to hit or exceed 99% availability each month.

All in all it was a pretty good year for ICPSR's production web portal. Our web portal hosts many different sites (ICPSR proper, NACJD, NACDA, SAMHDA, DSDR, CCEERC, the ICPSR Summer Program, and many more sites). We were able to exceed 99.75% availability most months, and only had two months (January and June 2011) where our level was a bit lower.
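To put those percentages in perspective, here is the arithmetic as a small sketch (the function names are made up for this example). In a 31-day month there are 44,640 minutes, so a 99% target leaves about 446 minutes of downtime (roughly 7.4 hours), while 99.75% leaves only about 112 minutes.

```python
def allowed_downtime_minutes(target_pct, days_in_month):
    """Minutes of downtime a month can absorb while still meeting target_pct."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100 - target_pct) / 100


def availability_pct(downtime_minutes, days_in_month):
    """Availability percentage for a month with the given downtime."""
    total_minutes = days_in_month * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes


if __name__ == "__main__":
    for target in (99, 99.75):
        minutes = allowed_downtime_minutes(target, 31)
        print(f"{target}% over 31 days allows {minutes:.1f} minutes of downtime")
```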

The main cause of downtime throughout fiscal year 2011 was defects in software. As we have been retooling our technology environment from Perl and CGI scripts to Java applications, we have been making greater use of systems like Hibernate and Lucene. My sense is that we're relying more and more on open source middleware, and while that has the advantage of making it easier to develop software quickly, it also means that a problem in the underlying middleware can affect our overall availability. Some of this is due to buggy software; some is due to our learning curve on how to use the software properly; and, some of this is due to getting our arms around the optimal configuration and operation of these packages.

The January 2011 availability level - our lowest month of availability - was due largely to two problems. One was that we scheduled a maintenance window in our server room so that University of Michigan electricians could wire up a new "whole room" uninterruptible power supply, and this, of course, took our production web systems off-line. The other problem was that our regular synchronization process between our production systems and our cloud-based replica had failed in an unusual way that was difficult to detect at first. The database export/import had failed, but only partially, and that produced very odd behavior with our web portal. It took a significant amount of time to isolate the problem, and by the time we had a workaround deployed, the electricians had finished their work, and the production systems were back on-line.
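A partial export/import failure is hard to spot because the replica still answers queries. One possible safeguard, offered here purely as an assumption and not as a description of how we found or fixed the problem, is to compare per-table row counts between the two systems after each synchronization. The table names and the `find_sync_mismatches` helper below are hypothetical.

```python
def find_sync_mismatches(source_counts, replica_counts):
    """Return {table: (source_rows, replica_rows)} wherever the counts differ.

    A table missing on one side shows up with None for that side.
    """
    mismatches = {}
    for table in sorted(set(source_counts) | set(replica_counts)):
        src = source_counts.get(table)
        rep = replica_counts.get(table)
        if src != rep:
            mismatches[table] = (src, rep)
    return mismatches


# Made-up counts for illustration only.
PRODUCTION = {"studies": 8000, "files": 500000, "users": 120000}
REPLICA = {"studies": 8000, "files": 312000}
```

With the sample counts, the check flags the short `files` table and the missing `users` table. Matching row counts do not prove the copies are identical, but a mismatch is a cheap early warning.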

Friday, July 1, 2011

TRAC: B5.4: Maintaining referential integrity

B5.4 Repository can demonstrate that referential integrity is maintained between all
archived objects (i.e., AIPs) and associated descriptive information.

Particular attention must be paid to operations that affect AIPs and their identifiers and how integrity is maintained during these operations. There may be times, depending on system design, when the repository cannot demonstrate referential integrity because some system component is out of action. However, repositories must implement procedures that let them know when referential integrity is temporarily broken and ensure that it can be restored.

Evidence: Log detailing ongoing monitoring/checking of referential integrity, especially following repair/modification of AIP; legacy descriptive metadata; persistence of identifier/locator; documented relationship between AIP and metadata; system documentation and technical architecture; process workflow documentation.

I've given this TRAC requirement considerable thought, and have searched the web for examples on how others have answered this requirement, but I still don't think I have a firm grasp on exactly what it means, and how I would demonstrate compliance.

It is certainly the case that we have a list of AIPs, and each item on this list contains both a pointer to the content which we're preserving in Archival Storage and metadata about the object. So is that referential integrity? Or is it necessary, but not sufficient, for referential integrity? I don't know.

In our case at ICPSR, we just don't modify or repair AIPs all that often. But if we did, would I need to maintain a log or ledger of the "before AIP" which maps it to the "after AIP"? And having that log would be my evidence of compliance?
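One possible reading, and it is only my guess at what the requirement intends, is a two-way check: every AIP has descriptive metadata, and every metadata record points to at least one AIP that actually exists. A minimal sketch of such a check follows; the identifiers and the `check_referential_integrity` function are hypothetical.

```python
def check_referential_integrity(aip_ids, metadata):
    """metadata maps a record id to the list of AIP ids it describes."""
    aip_ids = set(aip_ids)
    described = {aip for targets in metadata.values() for aip in targets}
    return {
        # AIPs that no metadata record describes.
        "aips_without_metadata": sorted(aip_ids - described),
        # Metadata that points at AIPs which do not exist.
        "dangling_references": sorted(described - aip_ids),
        # Metadata records that point at nothing at all.
        "records_without_aip": sorted(r for r, t in metadata.items() if not t),
    }


# Made-up identifiers for illustration only.
AIPS = ["aip-1", "aip-2", "aip-3"]
METADATA = {"rec-a": ["aip-1"], "rec-b": ["aip-2", "aip-9"], "rec-c": []}
```

With the sample data the report lists `aip-3` as undescribed, `aip-9` as a dangling reference, and `rec-c` as a record with no AIP. A dated log of such reports, all three lists empty, might be the sort of evidence the requirement has in mind.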

I would be interested in hearing from others. How do you interpret this item? What is your evidence?

TRAC: B5.3: Creating referential integrity

B5.3 Repository can demonstrate that referential integrity is created between all archived objects (i.e., AIPs) and associated descriptive information.

Every AIP must have some descriptive information and all descriptive information must point to at least one AIP, such that the integrity can be validated. This should be an easy requirement to satisfy and is a prerequisite for the next one.

Evidence: Descriptive metadata; persistent identifier/locator associated with AIP; documented relationship between AIP and metadata; system documentation and technical architecture; process workflow documentation.

Our descriptive metadata resides in an Oracle database (and it is also exported into DDI XML format).

We use one piece of this metadata (the fingerprint) on a regular basis to conduct fixity checks; this is how we validate integrity.
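In outline, a fixity check recomputes a digest of the stored content and compares it with the fingerprint recorded in the metadata. The sketch below shows the idea only: SHA-256 is an assumption (this post does not say which algorithm produces our fingerprint), and the function names are made up.

```python
import hashlib


def fingerprint(data, algorithm="sha256"):
    """Hex digest of the content; the choice of algorithm is an assumption."""
    return hashlib.new(algorithm, data).hexdigest()


def fixity_ok(data, recorded_fingerprint, algorithm="sha256"):
    """True if the content still matches the fingerprint stored in the metadata."""
    return fingerprint(data, algorithm) == recorded_fingerprint
```

If a single byte of the stored object changes, the recomputed digest no longer matches the recorded one and the check fails.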