Friday, July 31, 2009
There have been many times during this process when it has felt like I learned just enough about technology X to get part of the project finished, only to discover that technology Y is also critical. And, of course, technology Y is something where I have little current knowledge.
As I repeated this process, it kept reminding me of one of my daughter's books, If You Give a Mouse a Cookie. The book is an entertaining look at unintended consequences: You give the mouse a cookie, but now it wants milk. So you get the mouse some milk, but now it wants a straw. You get the straw, and .... You get the idea.
And so with apologies to Ms. Numeroff and Ms. Bond...
If You Give a Mouse Fedora
If you give a mouse Fedora, it'll want to be sure that it is version 3.x
If you give it version 3.x, it'll want to use the new Content Model Architecture (CMA).
To use the new CMA, the mouse will ask you to learn a bit more about FOXML and Content Model objects.
As you work through Content Models, you'll come across WSDL in the Service Definitions and Service Deployments.
WSDL will remind you to brush up on Dublin Core.
So the mouse will ask you to generate Dublin Core (DC) for descriptive metadata, but will want you to use PREMIS for preservation metadata.
Once your objects have been expressed in FOXML and PREMIS and DC, the mouse will want to start adding references to Datastreams. The mouse will need a MIME-type for each Datastream.
To get good MIME types for each Datastream, the mouse will ask you to use the UNIX file command.
If you use the file command, you'll need to create your own local magic.mime database, since you'll likely have stuff that confuses file.
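As a minimal sketch of that step, here is one way to shell out to file(1) and point it at a local magic database (the helper names and the database filename are mine, not part of any Fedora tooling):

```python
import subprocess

def file_mime_cmd(path, magic_db=None):
    """Build the file(1) command line; -m swaps in a local magic database."""
    cmd = ["file", "--brief", "--mime-type"]
    if magic_db:
        # e.g. a local magic.mime with rules for the files that confuse file
        cmd += ["-m", magic_db]
    return cmd + [path]

def guess_mime(path, magic_db=None):
    """Run file(1) and return the detected MIME type as a string."""
    result = subprocess.run(file_mime_cmd(path, magic_db),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

The output of `guess_mime()` can then be dropped straight into the MIMEType attribute of each Datastream in your FOXML.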
Once you've added MIME types and ingested your objects, the mouse will want to protect them against unauthorized access. So the mouse will want you to learn XACML.
Once you've learned XACML and written policies to protect your resources, the mouse will want an LDAP server to authenticate users and register attributes.
Once you've deployed the LDAP server, you'll want to document all of the steps that you have completed. The mouse will want a place to store this documentation you've written.
And chances are if the mouse wants a place to store important information, the mouse is going to want a Fedora repository.
Sunday, July 26, 2009
But today at 4:09am I hated the cloud.
Jul 26 04:08:00 www dhclient: send_packet: Network is down
Jul 26 04:09:01 www dhclient: No DHCPOFFERS received.
This log snippet above shows the last bits of life for one of our instances, our cloud-based replica web server. This is the cloud version of a star going super-nova, or perhaps collapsing into a black hole. This message means that your instance is no longer on the network, and that means it isn't doing anything useful for you. It's still accumulating $0.10/hour or more of revenue for Amazon, of course.
The problem began during the previous afternoon when the instance went to renew its DHCP lease, as it does every eleven hours. (DHCP is a mechanism to auto-configure system attributes, and like many organizations, Amazon Web Services uses DHCP as the mechanism for virtual machines to get their network configuration).
Unfortunately for our instance, when it went to contact the DHCP server for a renewal, it never got an answer. This by itself isn't alarming; this sort of thing happens all of the time. The instance then got as impatient as a five-year-old and started bugging the DHCP server for a lease renewal. "Are we there yet?" "Are we there yet?" "Are we there yet?" Sadly the DHCP server never responded, and after eleven hours of asking, "Are we there yet?" once per minute, our poor instance just gave up.
And that takes our story to just after 4am this morning. That's when the U-M Network Operations Center noticed that our replica was no longer on the network, and that's when their network management system generated a page to wake up the on-call.
I've learned that it's never a good idea to diagnose and troubleshoot a problem while I'm extremely groggy, and so I took a little time to collect my wits before logging into the system as root and rebooting it. But reboot it I did.
Jul 26 04:26:59 www shutdown: shutting down for system reboot
And then a few minutes later, voila!, the replica was back online. If only I could fall back to sleep so quickly.
I really like the cloud. But since this is the second time we've experienced this problem in 2009 (and others have experienced it too), I really wish Amazon would fix this.
But until they do, I think I'm going to write a little utility that checks /var/log/messages to see if the instance is on the road to network meltdown. If it is, I'd rather know about it sooner rather than later so that I can reboot the instance during the day rather than the middle of the night.
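A first cut at that utility might look something like this (assuming dhclient logs the same "No DHCPOFFERS received" message shown above; the function names and the failure threshold are arbitrary choices of mine):

```python
import re

# The dhclient message that precedes a network meltdown, per the log above.
DHCP_FAILURE = re.compile(r"dhclient(\[\d+\])?: No DHCPOFFERS received")

def dhcp_failures(log_lines):
    """Count dhclient lines reporting that no DHCP offer arrived."""
    return sum(1 for line in log_lines if DHCP_FAILURE.search(line))

def network_meltdown_imminent(log_lines, threshold=5):
    """True once the log shows enough failed renewals to justify a
    daytime reboot instead of a 4am page from the NOC."""
    return dhcp_failures(log_lines) >= threshold
```

In practice this would run from cron against the tail of /var/log/messages and send mail (or a page) when `network_meltdown_imminent()` turns true.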
Thursday, July 23, 2009
In addition to the new look and feel, we've also made significant changes "under the hood." Perhaps the two biggest changes are with our search technology and with our overall technology platform.
Our new search technology is Solr, the search engine built on top of Lucene, a project of the Apache Software Foundation. Solr has all of the features and conveniences one expects in a modern search capability, and is a significant upgrade from the Autonomy Ultraseek product we have been using since the early 2000s.
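For a flavor of what talking to Solr looks like, a keyword query is just an HTTP GET against its select handler; here's a small sketch that builds such a URL (the hostname, port, and field name are hypothetical, not our actual deployment):

```python
from urllib.parse import urlencode

def solr_query_url(base, q, rows=10, start=0):
    """Build a URL for Solr's /select request handler.

    q     -- the query string, e.g. 'title:census'
    rows  -- page size; start -- offset for paging through results
    """
    params = {"q": q, "rows": rows, "start": start, "wt": "json"}
    return base.rstrip("/") + "/select?" + urlencode(params)
```

Any HTTP client can then fetch the URL and parse the JSON response, which is part of what makes Solr so easy to pair with Java servlets and JSP.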
Our new technology platform is largely Java servlets and JSP. While we still have many significant systems (e.g., our deposit submission system) on our legacy platform (Perl CGI scripts), we'll build new systems on this new platform. We've been impressed with the quantity and quality of tools and supporting technologies that play well with Java, and by using JSP we also make it easier for the software developers and web designers to work together.
Tuesday, July 14, 2009
For very sensitive data where a written security plan would ordinarily be required, our system will instead make use of three components: (1) a network scan of systems that will be used to store or analyze the data; (2) a self-conducted audit of the same systems using freely available software tools; and (3) an on-line survey asking basic, non-technical questions. Our intent is to streamline the system significantly, but at the same time raise the bar on the level of system security actually achieved.
We expect to wrap up primary development at the end of the summer, and then test the system with specific projects in the fall. We'll then open it up for a bit more testing late in the year, and then perhaps launch the product officially in early 2010.
Monday, July 6, 2009
The first half of ICPSR fiscal year 2009 was a bit rocky. There was one long outage over a weekend in August 2008 where the portion of the web site that renders study descriptions and other dynamic content was not working, and there was a very long outage over the University of Michigan holiday break when ICPSR's building and other parts of Ann Arbor lost power for three days. Since that frosty, dark, cold holiday break we've made some changes to guard against extended outages.
One, we've implemented a 24 x 7 on-call rotation. The person carrying the pager that week is notified by the University of Michigan/Merit Network Operations Center (NOC) when any part of the ICPSR web delivery service has been unavailable for three minutes. This includes the actual web server, the database server, the search engine, and a handful of key web applications that we also monitor.
One such interruption occurred recently when it was my turn to serve in the on-call rotation. A Domain Name Service (DNS) server fault tripped up our web servers at 4:30am on Sunday, June 28, 2009, but because of the early morning page and follow-up phone call from the NOC, all service was restored by 5:00am. Just bringing someone's attention to problems in real-time during off-hours is enormously helpful in minimizing the length of an interruption.
Two, we've deployed a replica of our web systems in Amazon's cloud, Amazon Web Services. We tested the failover process in March, and used it again on May 14 when ICPSR once again lost power. In this case we lost power at about 3:30am on a Thursday, and it wasn't until nearly 1:30pm before we were able to fully recover. However, we were able to fail over to the replica in AWS at 4:30am, and so the outside world would have lost access to ICPSR data for only about one hour.
Because we've had such good success with AWS so far (and such lousy luck with the power in our building), we're likely to make the AWS replica our production system before the end of this calendar year, keeping our local system only for backup purposes.
The reliability of the cloud and the transparency of the cloud providers have taken a real beating over the past year. (Here's a recent barb thrown at Google's App Engine cloud platform.) Our experience with the cloud has been very good so far: in addition to monitoring our production systems, the NOC also monitors our cloud replica, and the replica has enjoyed higher uptime than the production system.
My sense is that people are asking the wrong question. Is the cloud 100% reliable? Of course not. But is it more reliable than what you have today? Is it reliable enough to host one copy of your services or content? What's the cost to you (and your customers or members) to achieve 99% (or even 99.9%) availability using your own electrical and hardware infrastructure v. the cloud's?
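To make that last question concrete, the arithmetic on availability is simple: the downtime you're signing up for is just the shortfall from 100% multiplied by the hours in a year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours(availability):
    """Hours of downtime per year permitted at a given availability level."""
    return (1.0 - availability) * HOURS_PER_YEAR

# 99% availability still allows roughly 87.6 hours (more than three and a
# half days) of downtime per year; 99.9% allows under nine hours.
```

When you price out the generators, redundant hardware, and staffing needed to claw back that last fraction of a percent in your own machine room, the cloud comparison starts to look rather different.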