Monday, January 30, 2012

Customer service, Zingerman's style

Our parent organization, the University of Michigan's Institute for Social Research (ISR), is working with the training component of the Zingerman's family of companies - ZingTrain - to build a customized training module for use at the ISR.  The focus is, of course, on delivering excellent customer service, and I had the opportunity to attend a session led by two ZingTrain consultants.

I don't want to give away too much of their "secret sauce" but I found their interaction with the group engaging and informative.  I almost used the word "presentation" but that feels wrong; it really isn't a monologue at all.  And there are no PowerPoint slides in sight.  As you might expect, the ZingTrain folks shared some tips and techniques about how they build the right culture and the right processes.  And they brought goodies from the Bakehouse!

I started to think about some of the tips and techniques I've learned to use in the technology business over the years.  In this realm an awful lot of the interaction with others takes place electronically, and so one doesn't have all of the visual cues and tonal cues one normally can use in conversation.  For example, how do you let someone know that if the solution you have offered does not work, you want and expect the person to let you know so that you can keep trying to solve the problem?  How do you let them know that you will own the problem until it is solved?

One easy way, of course, is to be explicit.
If that doesn't do the trick, please let me know.  I have a few other ideas we can try.
By asking the person to return and letting them know that "we" can try some other things, it shows that one is engaged.  It lets them know that this is the start of a conversation, not the end of one.

On the other hand, I will often see people write this instead:
Hope this helps.
I know people often write this with the best of intentions, but consider how people may read it.  It sounds like the conversation is over.  "Here, try this.  I hope it works.  But if it doesn't, it's your problem, not mine." There's no invitation to come back for more advice, more assistance, more analysis if the issue hasn't been resolved.

And that's my customer service tip for the month.

Hope it helps. :-)


Friday, January 27, 2012

TRAC: A4.2: Charting and changing a course


A4.2 Repository has in place processes to review and adjust business plans at least annually.

The repository must demonstrate its commitment to proactive business planning by performing cyclical planning processes at least yearly. The repository should be able to demonstrate its responsiveness to audit results, for example.

Evidence: Business plans, audit planning (e.g., scope, schedule, process, and requirements) and results; financial forecasts; recent audits and evidence of impact on repository operating procedures.




It certainly is the case that ICPSR undertakes a great deal of what I will describe as "tactical" business planning, and does so on a regular basis.  For example, this is the time of the year when ICPSR begins building its draft budgets for its next fiscal year (July 1 - June 30).  For my team this means that we need to look into our crystal balls and guess what our technology equipment and software expenses might be for the following year (7 to 19 months hence) and also allocate our team across two dozen different projects (e.g., 20% of Bob will be on Project X, 10% on Project Y, and the rest on Project Z).  My experience is that we almost always create a budget which accurately reflects the needs and direction of the organization, and..... is almost always missing some major initiative that only appears much, much later in the calendar year.  And so in practice the technology team aligns about 80-90% of its resources with the plan in the budget, and 10-20% end up doing something unplanned and unbudgeted.  This keeps things exciting.

Many of the other workgroups at ICPSR are much smaller than the technology team, and they have relatively long-lived contracts and grants where the budget, deliverables, work scope, etc are reasonably well defined.  I'm sure they also encounter their fair share of curveballs from their funding sources, and also review and tweak their project-year budgets at the tactical level.

So maybe a B+ or even an A- overall for ICPSR on "regular, tactical business planning."  Not bad.

Many organizations do a much poorer job of looking more deeply into their crystal balls, attempting to peer three, four, maybe even five years down the road.  And ICPSR is no exception.  I call this "strategic" business planning, and I view its role as complementary to "tactical" business planning.

If "tactical" business planning helps you figure out what you're going to do the next year, and how you're going to get it done, "strategic" business planning helps you figure out what you're NOT going to do any time soon, and how you're going to avoid heading down the wrong roads.

The output of this type of planning isn't a spreadsheet or a list of tasks.  It isn't necessarily even a list of goals.  Instead it is a shared vision for who you are, what business you are in, and where you think you need to be in three, four, maybe even five years down the road.  Here's an example:
Today we're the largest archive of survey research data in the world.  Our operations are geared to finding, collecting, curating, preserving, and delivering survey data.  We think we are the best at these activities. 
In four years, however, we believe that survey data will account for only a tiny portion of our business.  We believe that video and social media content will be the new core research content of the future, and this content requires expertise and systems very different than we have today.  Our intent is to limit the amount of time we spend growing our staff and growing our systems to support survey data; we will maintain, but not enhance.  We will seek out grants and contracts that allow us to build infrastructure and expertise in these areas.  And we will begin to invest in our people and our systems for video and social media data.  
This may or may not be the right vision and the right "strategic" business plan to make, but it illustrates the idea that the organization has charted a course.  It lets people know where they are heading.  It does NOT tell them how they will get there.  (That needs to happen eventually too, of course.)  It tells people what they are NOT going to do.

It can be really tough to step outside of the fray and the demands of the day-to-day job to think about the longer term, but it's crucial.  Otherwise an organization just keeps heading down the same road instead of looking at other available roads that may lead to better places.

Monday, January 23, 2012

Tech@ICPSR talks about the cloud @ LA2M

Tech@ICPSR will be giving a talk on cloud computing at the February 1, 2012 LA2M meeting.  We'll be talking about the cloud; kind of a high-level overview of what different folks say the cloud is, and some of the consumer- and business-oriented services and systems that live in it.

I'll add a link to the materials shortly after the talk, and, if LA2M adds the video of the talk to their archive, I'll add a link to that as well.

Friday, January 20, 2012

TRAC: A4.1: Business planning

A4.1 Repository has short- and long-term business planning processes in place to sustain the repository over time. 

The repository must demonstrate that it has formal, cyclical, proactive business planning processes in place. A brief description of the repository’s business plan should show how the repository will generate income and assets through services, third-party partnerships, grants, and so forth. As for A1.2 (succession/contingency/escrow planning), the repository must establish these processes when it is viable to avoid business crises. These questions may be pertinent to this requirement:
  • Under this plan, to what extent is the repository supported, or expected to be supported, by revenue from content-contributing organizations and agencies, such as publishers?
  • To what extent is the repository supported, or expected to be supported, by revenue from subscribers or subscribing institutions?
  • What measures are in place, if any, to limit access by nonsubscribing stakeholders?
  • What financial incentives are offered, if any, to discourage subscribers from postponing their investment in the repository? From discontinuing investing in the repository?
  • To what extent is the repository supported, or expected to be supported, by other kinds of parties?
  • How will major future costs, such as migrations, capital improvements, enhancements, providing access in the event of publisher failure, etc., be distributed between publishers, subscribers, and other supporting parties?
  • What contingency plans are in place to cover the loss of future revenue and/or outside funding?
  • In the event of a catastrophic failure, are reserve assets sufficient to ensure the restoration of subscriber access to content reasonably quickly?
  • If this is a national or government-sponsored repository, how is it insulated from political events, such as international conflicts or diplomatic crises, that might affect its ability to serve foreign constituencies? 
Evidence: Operating plans; financial reports; budgets; financial audit reports; annual financial reports; financial forecasts; business plans; audit procedures and calendars; evidence of comparable institutions; exposure of business plan to scenarios.




An awful lot of this TRAC requirement is met simply by being a unit of the Institute for Social Research at the University of Michigan.  (This will be a recurring theme across a lot of TRAC A4.)  The U-M, like most large universities, has an elaborate bureaucracy for managing budgets and expenditures, and producing a cornucopia of reports, statements, and other fine documents.

In addition, like a lot of non-profits across the US, ICPSR produces an annual report that it makes available publicly, and this contains high-level documentation for both planning and reporting.  For example, the ICPSR annual report includes a breakdown of revenues and expenses across major areas, and usually contains several essays on recent accomplishments, upcoming initiatives, and areas of focus for the upcoming fiscal year.

Wednesday, January 18, 2012

Disaster Recovery v. High Availability

A question I often receive from customers and colleagues is:  If ICPSR has a replica of its production delivery system in Amazon's cloud, why is it that the web site is sometimes down due to scheduled maintenance or unplanned outages?

The short answer is:  ICPSR's cloud replica serves a disaster recovery (DR) purpose, but not a high availability (HA) purpose.  Of course, more often than not, this generates a look that falls somewhere between Bah! and This sounds like some made-up IT nonsense!  However, it really is the answer.  That raises the question:  What's the difference between DR and HA?  But first a trip back in time....

As some long-time ICPSR clients may recall, the ICPSR delivery system was off-line for nearly a week during the holiday break between 2008 and 2009.  The root cause was a long power outage due to a major ice storm in the Midwest which knocked out power to many homes and businesses, including many in Ann Arbor.  And because ICPSR resides in a building just a little bit off the University of Michigan's central campus, we're just like any other home or business that waits for DTE Energy to restore power.

As one might expect, both the ICPSR Director at the time, Myron Gutmann, and I were quite anxious for the power to be restored.  The storm had caused so much damage that it wasn't at all clear when the building's power would be restored.  And, after the first few days without power - and heat - the building's pipes were in danger of bursting.  Things were looking pretty bad.

However, as it turned out we had been experimenting with Amazon's new computing and storage cloud just prior to the storm.  It would be pretty easy to stand up a minimal web server in Amazon's cloud, something that would basically say Yes, we know our delivery system is down, and we're sorry about that.  And here's the best guess from the local power company about when power will be restored.  We then worked with some of our colleagues at the University of Michigan and the San Diego Computing Center to update the system that maps names (like www.icpsr.umich.edu) to network addresses so that ICPSR's URLs for its web site would point to this new, minimal web server in Amazon's cloud.  That didn't fix the problem, of course, but it let people know that ICPSR knew there was a problem, and shared the best information we had about the problem.

Once power was restored and the main delivery system came back on-line, I had a long conversation with Myron about how we wanted to position ICPSR for any future problem like this.  What if the building lost power again for an extended period?  What if a tornado knocked down the whole building?  What if the U-M suffered some catastrophic problem with its network?

One option was to change the architecture of ICPSR's delivery systems.  Rather than having a complex series of simple web applications, we could redesign and rebuild the whole system so that it would also contain a middle layer of technology that would catch and route incoming requests to one of many delivery system components.  And rather than having a single production system at the University of Michigan, we would build a multi-site production system spread across multiple network providers and service providers so that no single problem would disrupt services.  This is essentially the high availability (HA) version of ICPSR's delivery system.  It would have the virtue of providing true 99.99%+ reliability, but would cost plenty of money to design, build, and operate.  If you are running IT systems for a bank or a hospital or an aircraft carrier, you build them with HA.  But what about a data archive?

Another option was to keep the ICPSR delivery architecture the same, but replicate it somewhere off-site.  Automated jobs could keep the web content, data content, and web applications synchronized.  And an easy - but manual - process could be used to redirect traffic to the replica when needed.  In this world there would still be plenty of times where a component of ICPSR's delivery system might be off-line due to maintenance or a fault, but if the maintenance or fault was long-lived, then the replica could be pressed into service.  This type of solution would be inexpensive to design, deploy, and operate, and would deliver a credible disaster recovery (DR) story, but would probably only give us uptime somewhere between 99.0% and 99.9%.  Would that be good enough?

In the end, of course, we decided that the best use of resources would be to build a system that would still have some outages from time to time, but which would never again be off-line for an entire week.  We set an availability goal of 99.5% for each month across all components.  That is, every time a single component faults - search, download, online analysis, and so on - it counts against the uptime of the WHOLE system.  And we would leave it up to the judgement of the on-call engineer to decide when a problem was likely to be long-lived enough to warrant a switch to the replica.
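To put those availability percentages in concrete terms, here is a minimal Python sketch of the arithmetic. The function names, the 30-day month, and the sample outage intervals are assumptions made for illustration; this is not a description of ICPSR's actual monitoring tools. Under these assumptions a 99.5% monthly goal allows 216 minutes (about 3.6 hours) of downtime, and overlapping faults in different components are merged so that any faulted component counts once against the whole system.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct, days=30):
    """Minutes of downtime permitted in a period at a given availability."""
    total_minutes = days * MINUTES_PER_DAY
    return total_minutes * (100.0 - availability_pct) / 100.0


def merged_downtime(outages):
    """Total minutes during which at least one component was down.

    `outages` is a list of (start_minute, end_minute) pairs, one per
    component fault. Overlapping faults are counted only once.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(outages):
        if current_end is None or start > current_end:
            # Close out the previous merged interval, if any.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def availability_pct(downtime_minutes, days=30):
    """Whole-system availability for the period, as a percentage."""
    total_minutes = days * MINUTES_PER_DAY
    return 100.0 * (1.0 - downtime_minutes / total_minutes)


if __name__ == "__main__":
    for target in (99.0, 99.5, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% allows {minutes:.1f} minutes down per 30-day month")

    # Hypothetical faults: search and download overlap, analysis is separate.
    faults = [(0, 30), (20, 50), (100, 110)]
    down = merged_downtime(faults)
    print(f"{down} minutes down, {availability_pct(down):.2f}% available")
```

Counting any single component fault against the whole system makes the goal stricter than per-component uptime would be, which is worth keeping in mind when comparing the 99.5% figure with numbers published elsewhere.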

So we chose DR instead of HA.

Looking back, my sense is that we made the right decision.  In practice we seem to hit our 99.5% availability goal most months, and because we did not tie up our software and systems development resources on rebuilding the delivery system to guarantee HA, we were able to design and build systems like our Restricted Contract System, Secure Data Environment, and Virtual Data Enclave.  Of course, when we need to perform a major bit of maintenance like last weekend where it is important that we continue to point www.icpsr.umich.edu at the production system rather than the replica, it always makes me wonder about the HA alternative.

Monday, January 16, 2012

Provenance metadata and the OAIS Receive Submission

The FLAME (File-Level Archival Management Engine) project continues to articulate functional requirements for the software system.  So far the process has looked something like this:

  1. Select one of the functional areas of OAIS
  2. Drill down into one of the sub-functions within that area
  3. Enumerate a list of high-level statements that should be true for that sub-function
  4. Translate those high-level statements into medium-level specifications for the software
For example, we tackled one such area before the recent holiday break:
  1. Ingest
  2. Receive Submission
  3. The producer provided basic provenance information at deposit
Of course, this raises the question:  What do we consider "basic provenance information" at the time of deposit?  What information can we collect from the deposited content, and what information do we need to collect from the person performing the deposit?

Here is the draft list we created:
  a. FLAME should enable the ability to transfer digital content to ICPSR through web-based file upload
    i. Uploading files should NOT require MyData authentication
    ii. The act of uploading files serves as a signature for the transfer
  b. FLAME should enable the ability to document receipt of content transferred to ICPSR through non-electronic means
    i. Date package arrived (required)
    ii. Shipping company (required)
    iii. Tracking ID number (required)
    iv. Other details about the shipment (optional)
  c. FLAME should capture the following provenance information from the content provider:
    i. Self-reported identity of the content provider, or identity from MyData profile (for electronic transfer)
      1. Name of depositor (required)
    ii. Self-reported contact information for the depositor, or from MyData profile
      1. Email address (required)
      2. Telephone number (optional)
      3. Mailing address (optional)
    iii. Self-reported descriptive provenance information from the depositor
      1. Name or title of deposit (required)
      2. Summary or description of the deposit (optional)
      3. Name of organization that sponsored the research, or "not applicable" (required)
      4. Number or ID of the grant or contract, or "not applicable" (required)
  d. FLAME should capture the following provenance information from the files after each content transfer:
    i. Date and time at which each file is received
    ii. Checksum of each file
    iii. MIME type of each file
    iv. Original name of each file
    v. Packaging information (e.g., file was part of a Zip archive)
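The file-level items in the last group of the list could be captured along the following lines. This is only a sketch under stated assumptions, not FLAME's design: the function name and record keys are hypothetical, SHA-256 is an assumed choice of checksum (the list does not name an algorithm), and the MIME type is guessed from the file name with Python's standard `mimetypes` module rather than by inspecting the content.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def file_provenance(path, original_name=None, package=None):
    """Build a provenance record for one received file.

    `original_name` is the name the depositor gave the file, if it differs
    from the name on disk. `package` notes any container the file arrived
    in, for example the name of a Zip archive.
    """
    path = Path(path)
    name = original_name or path.name

    # Assumed checksum algorithm: SHA-256, read in 1 MiB chunks.
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)

    # Guess from the file name only; unknown types fall back to a
    # generic binary type.
    mime_type, _encoding = mimetypes.guess_type(name)

    return {
        "received_at": datetime.now(timezone.utc).isoformat(),
        "checksum_sha256": digest.hexdigest(),
        "mime_type": mime_type or "application/octet-stream",
        "original_name": name,
        "package": package,
    }
```

A content-based detector would be needed for files with missing or misleading extensions; the extension-based guess here just keeps the sketch within the standard library.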

What do you think basic provenance information should include?  Does our list look like it captures everything one could reasonably expect to collect at the time of deposit?

Friday, January 13, 2012

TRAC: A3.9: Self-assessment and certification

A3.9 Repository commits to a regular schedule of self-assessment and certification and, if certified, commits to notifying certifying bodies of operational changes that will change or nullify its certification status.

A repository cannot self-certify because an objective, external measurement using a consistent and repeatable certification process is needed to ensure and demonstrate that the repository meets and will likely continue to meet preservation requirements. Therefore, certification is the best indicator that the repository meets its requirements, fulfills its role, and adheres to appropriate standards. The repository must demonstrate that it integrates certification preparation and response into its operations and planning.

Evidence: Completed, dated audit checklists from self-assessment or objective audit; certificates awarded for certification; presence in a certification register (when available); timetable or budget allocation for future certification. 



Like a few of the other A-group TRAC requirements, this one really operates at the uppermost level of the organization, and so it is difficult to address it from the IT perspective.

HOWEVA..... One barrier to implementing a regular certification cycle is a set of fundamental questions:

Where do I find a list of consultants or analysts that can grant "TRAC certification" to my repository? 
Which organization sanctions those consultants and analysts? 
What does it mean - precisely - to be "TRAC certified?" 
Are there different levels of TRAC certification, much like FISMA levels? 
If I'm already FISMA certified, does that automatically grant TRAC certification for certain items (especially in section C)?


And so on.

It seems like there is a business opportunity here.

For instance, if ICPSR asserted that it was now in the business of reviewing TRAC requirements for organizations, and a team of ICPSR analysts would either certify your data archive as TRAC compliant or would identify clear action items required to become compliant, would that be a useful thing?  Or would other organizations rise up to say, "Hey, who are you, ICPSR, to be granting certifications?"

How should this work?

Wednesday, January 11, 2012

December 2011 deposits at ICPSR

Chart.

# of files   # of deposits   File format
98           13              application/msword
67           4               application/octet-stream
207          25              application/pdf
192          8               application/vnd.ms-excel
44           5               application/x-sas
86           13              application/x-spss
4            4               application/x-zip
2            1               image/jpeg
2            1               image/tiff
12           1               message/rfc822 7bit
7            7               text/html
23           2               text/plain; charset=iso-8859-1
69           5               text/plain; charset=unknown
468          13              text/plain; charset=us-ascii
7            2               text/rtf
1            1               text/x-c++; charset=us-ascii
1            1               text/x-c; charset=unknown
10           3               text/x-c; charset=us-ascii
4            1               text/xml

Lots of plain text and Excel this month, and not so much from the conventional stats packages.  The usual set of C and C++ bogons that are undoubtedly plain text.  And a large number of files where we could not identify the content (octet-stream) which tells me that we either received lots of binary data, or we are starting to see a new format that our MIME detector can't figure out.
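For readers wondering how the two count columns relate, here is a small Python sketch of one way such a tally could be produced. The input shape, a list of (deposit ID, MIME type) pairs, and the function name are assumptions for illustration and not ICPSR's actual reporting code. Each file adds one to its format's file count, while a deposit is counted once per format no matter how many files of that format it contains.

```python
from collections import defaultdict


def tally_formats(files):
    """Summarize files by MIME type.

    `files` is an iterable of (deposit_id, mime_type) pairs, one per file.
    Returns a dict mapping mime_type -> (number of files, number of
    distinct deposits containing at least one file of that type).
    """
    file_counts = defaultdict(int)
    deposits = defaultdict(set)
    for deposit_id, mime_type in files:
        file_counts[mime_type] += 1
        deposits[mime_type].add(deposit_id)
    return {
        mime_type: (file_counts[mime_type], len(deposits[mime_type]))
        for mime_type in sorted(file_counts)
    }


if __name__ == "__main__":
    # Hypothetical sample: two deposits, four files.
    sample = [
        ("dep-1", "application/pdf"),
        ("dep-1", "application/pdf"),
        ("dep-2", "application/pdf"),
        ("dep-2", "application/x-spss"),
    ]
    for mime_type, (n_files, n_deposits) in tally_formats(sample).items():
        print(f"{n_files:>4} {n_deposits:>4} {mime_type}")
```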

Monday, January 9, 2012

ICPSR web availability through December 2011

Web availability in December was looking very, very good through most of the month.  We had seen only a single noteworthy event the entire month, and that resulted only in a few minutes of downtime.  (As happens from time to time, a member site was scraping our web pages, presumably to collect the metadata we publish.  And while professional scrapers like Google, Yahoo, and the other search engines scrape gently and non-intrusively, this is not often the case with less experienced scrapers.)

Of course, December is always a tricky month here at ICPSR.  Snow storms.  Ice storms.  Power outages.  I can't remember the last time that my entire team was able to take off the entire week between Christmas and New Year's (like the rest of the U-M) without having to come into the office to troubleshoot a problem.

And this year was no different.

We started to see sporadic up/down alerts from the U-M network monitoring system on the morning of December 30.  It looked like our production web server was working OK overall, but having some problems.  When we tried to load the home page from home, the page wouldn't load.  And when we tried to log in (via ssh) from home, the connection timed out.  It looked as if everything was down even though the monitoring system said it was OK.

We found we could log into other systems on campus, and then use those as a launch pad to get to ICPSR.  All of our systems were up, but none seemed reachable from systems off campus.  This explained why the U-M monitoring system didn't throw more alarms earlier.

Then we noticed this:
http://status.its.umich.edu/outage.php?id=73300
(I think this link works even from off-campus.)

We then worked with the campus network engineers to draw their attention to the problem that was affecting us.  Unfortunately it was kind of helpful to have the ICPSR web site be unavailable from off-campus as a test case; we would know the network was fixed when the web site was available again.

All in all not a horrible month for availability, but we moved from 99.9% on Dec 29 to 99.5% by the end of Dec 30.

Friday, January 6, 2012

TRAC: A3.8: Information integrity measurements

A3.8 Repository commits to defining, collecting, tracking, and providing, on demand, its information integrity measurements. 

The repository must develop or adapt appropriate measures for ensuring the integrity of its holdings. The mechanisms to measure integrity will evolve as technology evolves, but currently include examples such as the use of checksums at ingest and throughout the preservation process. The chain of custody for all of its digital content from the point of deposit forward must be explicit, complete, correct, and current. The repository must demonstrate that the content it has matches the content it received, e.g., with an implemented registry function that documents content from submission onward. Losses associated with migration and other preservation actions should also be documented and made available to relevant stakeholders. (See C1.5 and C1.6.)

If protocols, rules, and mechanisms are embedded in the repository software, there should be some way to demonstrate the implementation of integrity measurements.

Evidence: An implemented registry system; a definition of the repository’s integrity measurements; documentation of the procedures and mechanisms for integrity measurements; an audit system for collecting, tracking, and presenting integrity measurements; procedures for responding to results of integrity measurements that indicate digital content is at risk; policy and workflow documentation. 



ICPSR operates very differently than a conventional archive, and it really shows when one looks at this TRAC requirement.

A typical workflow for us looks like this:

  1. Receive some content in formats like SAS and Word
  2. Preserve that content "as is" at the bit-level
  3. Completely re-do all of the data and documentation, preserving the intellectual content (modulo disclosure concerns), but reorganizing it all
  4. Produce normalized and ready-to-use content based on the re-do
  5. Preserve the normalized content forever
So at the file-level we track all of the original deposits and all of the content we produce, and we test the integrity of each file every week.  Since my team inherited the responsibility to manage archival storage in 2006, I've never seen a problem that wasn't traced back to a transient error that took place as content was being copied into archival storage, and that wasn't solved when the ICPSR staff member re-ran the copy.
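A weekly integrity test of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than ICPSR's actual tooling: the manifest format (a mapping from file path to the checksum recorded at ingest), the function names, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(manifest):
    """Compare each file against the checksum recorded for it.

    `manifest` maps a file path to the hex digest stored when the file
    entered archival storage. Returns a list of (path, problem) pairs;
    an empty list means every file still matches.
    """
    failures = []
    for path, expected in manifest.items():
        if not Path(path).is_file():
            failures.append((path, "missing"))
        elif sha256_of(path) != expected:
            failures.append((path, "checksum mismatch"))
    return failures
```

Recording the checksum before the copy into archival storage, and verifying immediately afterward, is what would surface the kind of transient copy error described above on the first check rather than weeks later.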

We also track the chain of custody at the aggregate level, assigning each "deposit" and "study" to both a workgroup and an individual, and by linking deposits to studies (and vice-versa).  We have internal systems to manage both deposits and studies, and they include mechanisms whereby a data manager can edit metadata, assign key dates, and enter diary entries, not unlike a trouble ticket or help desk system.

Wednesday, January 4, 2012

Systems Architect Senior job posting @ ICPSR

We've posted another job opening on my team at ICPSR.  I'll include the text from the job description below, but the short-lived link to the U-M job site is http://umjobs.org/job_detail/65035/systems_architect_senior.

In brief we're looking for someone with deep experience building production operational environments for web applications who can apply those skills to the ICPSR environment.  We currently have a mix of stuff running on real hardware at ICPSR and virtual hardware in Amazon's EC2, and we have a mix of legacy Perl CGI code and newer Java-based web applications.  Steve Burling (who retires at the end of January) and I have been the main architects of the environment, but it has grown so much over the past 5-10 years that it has become a full-time job.

Here are the details:


Systems Architect Senior

Job Summary

The Inter-university Consortium for Political and Social Research (ICPSR), the world's largest archive of digital social science data, is now accepting applications for a Systems Architect Senior. ICPSR is a unit within the Institute for Social Research at the University of Michigan. ICPSR's data are the foundation for thousands of research articles, reports, and books. Findings from these data are put to use by scholars, policy analysts, policy makers, the media, and the public. This position reports to the Assistant Director, Technology Operations, Computer and Network Services.

Responsibilities*

This position is responsible for the design, implementation, maintenance, and regular management of ICPSR's web systems development, staging, production, and disaster recovery operational environments. This consists of several distinct platforms, including local physical hardware and virtual systems hosted in Amazon's Elastic Computing Cloud (EC2). The successful candidate will also work closely with the Assistant Director, Software Development, Computer and Network Services to define and implement functional requirements.

One, this position will select, install, and manage integrated development environment (IDE) software on developer workstations, and the underlying software repository. The incumbent systems are Eclipse and CVS, respectively.

Two, this position will select, install, manage, and maintain the testing, staging, and production platform environments used by ICPSR to deploy and test new web applications. The incumbent web application server is Apache Tomcat, sometimes run as a stand-alone web server and sometimes as a client to Apache Httpd. The incumbent server platform is a mix of local, physical servers and Elastic Computing Cloud (EC2) instances running within Amazon Web Services (AWS). ICPSR has interest in exploring a more complete, cloud-based web application platform such as AWS Elastic Beanstalk.

Three, this position will manage the AWS-hosted replica of ICPSR's production web environment. This includes building and maintaining tools that synchronize software, static content, and database content between the production environment and the replica environment.

Four, the over-arching responsibility of this position is to maintain and improve ICPSR's capacity for delivering high-availability, high-performance web-based services, managing the tension between the desire to have well-defined, documented, predictable deployments and business processes and the desire to have fast moving, fluid, and flexible deployments and business processes.

Required Qualifications*

BS in Computer Science, Computer Engineering, or at least eight years of experience with designing and managing complex web application hosting environments
Two or more years of experience with J2EE application servers (such as Tomcat)
Two or more years of experience with virtualization products or services, such as Amazon Web Services
In-depth expertise with RedHat Enterprise Linux 5 and 6
In-depth knowledge of networking principles and network support
In-depth knowledge of Web technologies (Apache, Tomcat)
Experience operating an RDBMS (Oracle, MySQL)
Experience working with monitoring tools, version control software (CVS, Subversion, Perforce), and build tools (Make, Ant)
Strong understanding of web application architectures
Enthusiastic self-starter who works well with other team members
Excellent inter-personal skills with the ability to communicate clearly to peers, vendors, customers, and colleagues

Desired Qualifications*

MS in Computer Science or Computer Engineering
At least five years of experience with J2EE application servers (such as Tomcat)
Experience with Veritas products (Veritas Netbackup)
Storage experience (EMC)
Experience as an Infrastructure Engineer in a high availability environment
Expertise working with monitoring tools, version control software (CVS, Subversion, Perforce), and build tools (Make, Ant)

Underfill Statement

This position may be underfilled at a lower classification depending on the qualifications of the selected candidate.

U-M EEO/AA Statement

The University of Michigan is an equal opportunity/affirmative action employer.

Monday, January 2, 2012

Tech@ICPSR takes another holiday

Dear Loyal Readers:

We return to our normal antics later this week.

Signed,

Tech@ICPSR