Wednesday, July 27, 2011

ICPSR's Secure Data Environment (SDE) - The Storage

To implement its Secure Data Environment (SDE), ICPSR replaced an aging storage array with two newer systems.  The idea was to use physical separation between storage devices to help make our data management environment more secure.

In many ways the physical separation of systems is overkill.  At the level of the individual data manager or data handler, there isn't much to be gained from using two separate storage arrays rather than a single array that has been partitioned into two logical arrays.  However, the real value comes, I think, in protecting the IT team from itself.  And I include myself in that statement too.

It would be easy to have a single physical storage array with multiple virtual storage servers.  That is, one can easily create a chunk of storage -- say, a filesystem called /secretStuff -- and then make it available to one virtual storage server, but not another.  And by using a firewall one could then ensure that people working within the SDE would be able to access /secretStuff, and people working outside the SDE would not.

The risk, however, is that someone creates a filesystem like /secretStuff, and then accidentally makes it available across ALL virtual storage servers.  Then not only are SDE systems able to read files in /secretStuff, but the content also becomes, inadvertently, available to the web server.  That's not good.
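To make the failure mode concrete, here is a minimal sketch of the kind of check one could run against a single shared array. Everything in it is hypothetical: the server names, the inventory format, and the find_leaks helper are illustrations, not anything ICPSR actually runs.

```python
# Hypothetical inventory: filesystem -> set of virtual storage servers exporting it.
# The names below are made up for illustration.
SECURE_FILESYSTEMS = {"/secretStuff"}
SDE_SERVERS = {"vserver-sde"}


def find_leaks(exports, secure_filesystems, sde_servers):
    """Return (filesystem, server) pairs where a secure filesystem
    is exported by a virtual server outside the SDE."""
    leaks = []
    for fs in sorted(secure_filesystems):
        # Any exporting server that is not an SDE server is a leak.
        for server in sorted(exports.get(fs, set()) - sde_servers):
            leaks.append((fs, server))
    return leaks


# Someone accidentally exported /secretStuff to the web-facing virtual server.
exports = {
    "/secretStuff": {"vserver-sde", "vserver-web"},
    "/publicStuff": {"vserver-web"},
}
leaks = find_leaks(exports, SECURE_FILESYSTEMS, SDE_SERVERS)
```

A check like this only catches the mistake after it has been made, and only if someone remembers to run it.  With two physical arrays, the mistake is much harder to make in the first place.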

We therefore placed one of our physical storage arrays on our Private network.  Since this network uses private IPv4 address space, the array is largely invisible to the Internet.  Further, the firewall rules for the Private network are very, very restrictive, and access is available only within the SDE, and to a small number of developer workstations (and then only for ssh access).  We use this storage array for all of our content that is, or might be, confidential.
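For readers unfamiliar with private address space, the sketch below checks whether an address falls in one of the RFC 1918 private IPv4 ranges, using Python's standard ipaddress module.  I am assuming "private IPv4 address space" here means the RFC 1918 blocks; the sample addresses are arbitrary and are not ICPSR's.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if the address is inside an RFC 1918 private block."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


# Arbitrary sample addresses, chosen only for illustration.
samples = ["10.1.2.3", "172.32.0.1", "192.168.1.1", "8.8.8.8"]
results = {addr: is_rfc1918(addr) for addr in samples}
```

Addresses in these blocks are not routed on the public Internet, which is why the address plan alone hides the array from most outside traffic.  The firewall rules still do the real work of deciding who inside the network can reach it.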

Our second physical storage array resides on our Semi-Private network.  This too uses private IPv4 address space, and therefore is only accessible to machines within the University of Michigan enterprise network.  We allow access via protocols like NFS and CIFS to the storage array within the University of Michigan environment, and we further manage detailed access control lists for individual NFS exports.  The array provides storage to our web server and other public-facing machines, and also serves as the storage back-end for desktop computers.  For example, if you work at ICPSR, then your My Documents folder maps to this array.

The biggest hurdle in replacing one old storage array with two new systems was separating people's storage into two categories:  public stuff that they would need to access from their desktop (e.g., stuff they may want to email to someone), and private stuff that they would need to access from within the SDE (e.g., data files and documentation).  This required a significant investment of time from everyone at ICPSR, and especially the IT staff.  I think I spent about 10-15 weekends in the office between February and May, moving content between systems, making sure that drive mappings still worked, double-checking checkpoint schedules and backups, etc.

The separation seems to have gone relatively smoothly, at least from the perspective of the IT team.  There were no major snafus during the transition, and the number of trouble tickets was relatively low.

The separation did mean that we needed to create some new systems - and tweak existing systems - to create mechanisms so that content could move between the two systems, but in a controlled way that could be audited later.  I'll describe changes we made to our deposit and release systems in my next post, and will also describe our new data airlock system.
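As a rough idea of what "controlled and auditable" can mean in practice, here is a minimal sketch of a copy that leaves an audit record behind.  This is not our deposit, release, or airlock system; the audited_copy function, its parameters, and the log format are all assumptions made up for illustration.

```python
import hashlib
import json
import shutil
import time
from pathlib import Path


def audited_copy(src, dst, who, log_path):
    """Copy src to dst and append one JSON line describing the transfer.

    Hypothetical helper: records who moved what, when, and a SHA-256
    checksum of the source so the transfer can be verified later.
    """
    src, dst = Path(src), Path(dst)
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    shutil.copy2(src, dst)  # copy file contents and metadata
    record = {
        "time": time.time(),
        "who": who,
        "src": str(src),
        "dst": str(dst),
        "sha256": digest,
    }
    # Append-only log: one JSON object per line.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record
```

The point of the checksum is that an auditor can later confirm that the file sitting on the destination array is the same file that was approved to move.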
