Wednesday, March 31, 2010

ICPSR's network migration begins

Last month I wrote about ICPSR's new network topology in my TRAC series of posts. We started the move to the new network last week, and the early signs are good, so it is time for an update.

Here's a high-level architectural view of our future network on the left.

We're breaking the current network into three virtual networks, each with a series of firewall rules to protect the people, content, and systems behind the firewall.

Moving from left to right...

Our Public (green) network has the lowest level of security. It hosts our public-facing systems such as our public web server, the authoritative DNS server for the domain (and many others!), a machine we use to mirror Web portals for partner organizations, and more. Machines on this network have publicly routable IP addresses, and there is generally free and open access between this network, its resources, and the Internet.

Our Semi-Private (purple) network has a higher level of security. It hosts the vast bulk of our network-attached devices: desktop workstations, laptops, printers, private-facing servers (like a DHCP server), and more. Machines on this network have addresses in private IP address space, and use NAT to reach the Internet. Since the addresses are not public, there is no inbound access to the network from outside the University of Michigan network (UMnet) except through Virtual Private Network technology. I'm typing this blog post on a machine connected to the Semi-Private network.

Our Private (red) network has the highest level of security. It hosts a very small number of systems, all of which are used by ICPSR data managers to process the data and documentation we receive from the research community and from government agencies. In general, there is no access between this network and any other network unless it has explicitly been granted by a firewall rule.

Access to this network flows largely through a Virtual Desktop Infrastructure (VDI) service (also red) that we are using. The VDI hosts a collection of virtual Windows 7 desktop systems which contain the same software used on physical desktop workstations at ICPSR. Once one has connected to the VDI, one may access resources on the Private network, such as newly deposited content, ICPSR archival holdings, etc. Access from the VDI to the desktop is also tightly controlled so that one may not, for example, cut-and-paste between the two. Essentially this is a virtual data enclave, but where the client is ICPSR staff, not external data analysts.
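
To make the "no access unless explicitly granted" stance of the Private network concrete, here is a minimal sketch of a default-deny rule check. The zone labels, port numbers, and the two example grants are my own assumptions for illustration; they are not the actual ICPSR firewall rules.

```python
from dataclasses import dataclass

# Zone names are hypothetical labels for the networks described above.
INTERNET = "internet"
PUBLIC = "public"
SEMI_PRIVATE = "semi-private"
VDI = "vdi"
PRIVATE = "private"


@dataclass(frozen=True)
class Rule:
    """One explicit grant: traffic from src zone to dst zone on a TCP port."""
    src: str
    dst: str
    port: int


# Illustrative grants only; the real rule set is not published here.
PRIVATE_RULES = frozenset({
    Rule(SEMI_PRIVATE, VDI, 443),  # assumed: staff reach the VDI over HTTPS
    Rule(VDI, PRIVATE, 445),       # assumed: virtual desktops mount file shares
})


def is_allowed(src: str, dst: str, port: int, rules=PRIVATE_RULES) -> bool:
    """Default deny: allow only traffic that matches an explicit rule."""
    return Rule(src, dst, port) in rules
```

With these assumed grants, a virtual desktop can reach the Private network, but a workstation on the Semi-Private network cannot reach it directly and must go through the VDI.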

Thursday, March 11, 2010

TRAC: B1.1: What we preserve

B1.1 Repository identifies properties it will preserve for digital objects.

This process begins in general with the repository’s mission statement and may be further specified in pre-accessioning agreements with producers or depositors (e.g., producer-archive agreements) and made very specific in deposit or transfer agreements for specific digital objects and their related documentation.

For example, one repository may only commit to preserving the textual content of a document and not its exact appearance on a screen. Another may wish to preserve the exact appearance and layout of textual documents, while others may choose to normalize the data during the ingest process.

Evidence: Mission statement; submission agreements/deposit agreements/deeds of gift; workflow and policy documents, including written definition of properties as agreed in the deposit agreement/deed of gift; written processing procedures; documentation of properties to be preserved.

My sense is that this TRAC requirement is more about policy than technology, but nonetheless I'll comment on it from a status quo, technological standpoint for the purposes of this post. That will give the reader a good sense for how things are working today, but leaves open the possibility (likelihood?) that ICPSR may change its mission statement, submission agreements, and/or other possibilities down the road, and then the underlying technology will change too.

The short version of the story is that ICPSR captures two essential items from each submission: the data (normalized into plain ASCII characters), and the technical documentation (normalized into DDI XML, PDF, and TIF images). And we keep the content in these formats until we need to migrate to a different format.

In our world of survey research data there is no requirement to preserve original formats (such as a SAS Portable file), look and feel (such as original technical documentation written in Word Perfect), or even low-level data format (such as numeric data coded in punched card format). We can take everything, pull out the essential intellectual content, and then normalize that content in quite durable formats.

That said, we may take these durable formats, such as DDI XML metadata and plain ASCII data, and feed them into tools that produce dissemination formats that are easier for our designated community to use. This includes common stat package formats, and also includes formats that feed into our on-line analysis system. However, these dissemination formats are not preserved, and they contain no essential properties not also contained in the preservation formats.

As I've been thinking about this particular TRAC requirement over the past week, I think we're probably missing one important piece of information for each of our data files: the character set. In practice almost all of our holdings are numeric data, and so our implicit character set ("us-ascii") is the correct one to use, but this seems like a good area for being more explicit. This would allow us to handle data files that contain non-numeric, non-US-ASCII characters in a more durable fashion.
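
As a rough sketch of what being more explicit might look like, the function below labels a data file's bytes with a character set so that the label could be stored alongside the file. The function name and the fallback order (US-ASCII, then UTF-8, then "unknown") are assumptions for illustration, not a description of any ICPSR tool.

```python
def detect_charset(raw: bytes) -> str:
    """Return a character set label for the bytes of a data file.

    Assumption: content that is not pure 7-bit US-ASCII is tried as UTF-8;
    anything else is labeled "unknown" and left for a person to review.
    """
    try:
        raw.decode("ascii")
        return "us-ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"
```

A purely numeric file would come back as "us-ascii", which matches today's implicit assumption, while a file containing accented characters would be flagged as something else rather than silently treated as ASCII.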

Monday, March 8, 2010

ICPSR Web Server Outage

The ICPSR web server was unavailable between 10:51am and 11:11am EST on Saturday, March 6, 2010.

During this time the ICPSR web server had limited availability due to a denial of service attack from an IP address belonging to a broadband network provider in China. ICPSR technical staff applied a temporary fix by denying access to the web server for that IP address. University of Michigan networking staff later applied a more permanent fix by updating access control lists within the university network.

ICPSR is in the process of adopting a centrally managed "virtual firewall" service from the University of Michigan, and that will make it faster and easier to respond to attacks such as these.

Friday, March 5, 2010

TRAC: C3.4: Disaster preparedness

C3.4 Repository has suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s).

The repository must have a written plan with some approval process for what happens in specific types of disaster (fire, flood, system compromise, etc.) and for who has responsibility for actions. The level of detail in a disaster plan, and the specific risks addressed need to be appropriate to the repository’s location and service expectations. Fire is an almost universal concern, but earthquakes may not require specific planning at all locations. The disaster plan must, however, deal with unspecified situations that would have specific consequences, such as lack of access to a building.

Evidence: ISO 17799 certification; disaster and recovery plans; information about and proof of at least one off-site copy of preserved information; service continuity plan; documentation linking roles with activities; local geological, geographical, or meteorological data or threat assessments.

Building and documenting systems and procedures for coping with a disaster has a scope well beyond just IT. But there are two key areas worth discussing that fall within the purview of IT.

One area is ensuring that ICPSR is able to deliver its content to its clients, members, and the public at all times. This is an area where we've made significant investments over the past twelve months, and where we also now have a good story to tell.

The main ICPSR delivery mechanism is its web site. The technological resources that power the primary instance of the web site reside at ICPSR itself on the campus of the University of Michigan. This consists of three main systems: a reasonably powerful server running web applications; another powerful server running an Oracle database; and our storage appliance.

Our equipment resides in an eclectic machine room. On the plus side it has items one would expect to find like equipment cabinets, local air handlers providing chilled air, and UPS to protect us from power fluctuations. On the negative side there is no raised floor or cable trays, which makes for a messy machine room, and our connection to Ann Arbor's (not U-M's) electrical grid is somewhat precarious. An off-site network operations center monitors our gear 24 x 7 and notifies us via SMS, pager, and telephone if anything looks broken.

We maintain a replica of our web environment in Amazon's cloud, and we use a simple mechanism to trigger a failover to the replica: The on-call technician changes the DNS record for the web site to point to the replica instead of the primary. The time-to-live on the record is low (five minutes), and so once the process has been followed, failover is quick. (And the change is made on a "stealth" DNS server that also lives in Amazon's cloud.)

And, finally, we synchronize the replica several times throughout the workday so that software and content are always fresh.

This setup doesn't create a web environment which promises "five nines" type of uptime (i.e., 99.999% availability), but it does give us the capability to avoid any long multi-day outage like we saw late in 2008, and it also gives us the capability to deliver content indefinitely from the replica should ICPSR suffer a disaster.
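
For a sense of scale, the arithmetic behind "five nines" can be sketched as follows. The function name is hypothetical, and the only figures taken from this post are the 99.999% target and the five-minute time-to-live.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


DNS_TTL_MINUTES = 5  # time-to-live on the DNS record, as quoted in the post

five_nines = downtime_budget_minutes(0.99999)
print(f"Five nines allows about {five_nines:.2f} minutes of downtime per year")
print(f"Old DNS answers may stay cached for up to {DNS_TTL_MINUTES} minutes")
```

The yearly budget at 99.999% works out to roughly 5.26 minutes, so a single failover, during which resolvers may keep the old answer for up to five minutes, would consume nearly all of it. That is why a manual DNS change fits the goal of avoiding long outages rather than the goal of five nines.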

The second main area where IT plays a key role is with archival storage. This is less about 24 x 7 availability, and more about ensuring that our archival holdings are available for long-term access in a robust storage fabric.

A post from November 2009 is still an accurate depiction of how we replicate our archival holdings so that we can be guaranteed to have a copy even if something catastrophic happens to our main location in Ann Arbor. I'm also interested in deploying copies outside the United States. We've had some very useful conversations with colleagues at the ANU Supercomputer Facility in Australia, and perhaps some sort of reciprocal storage arrangement might be worked out.

Thursday, March 4, 2010

ICPSR Web outage - 1:01am to 1:11am EST on March 4, 2010

The ICPSR web server was unavailable between 1:01am and 1:11am on Thursday, March 4, 2010.

The root cause of the problem is still under investigation.

When the on-call systems administrator received the page, he restarted the copy of Tomcat that hosts our primary web delivery application (icpsrweb). This single restart brought the main web site back on-line, along with other web applications (Summer Program portal, study search, and others).

Our apologies for the inconvenience this may have caused our web site visitors.