Wednesday, February 29, 2012

Sixteen products or one?

A recent conversation with Nathan Adams, ICPSR's Assistant IT Director for Software Development, got me thinking about this....

It's no secret that ICPSR uses a package called Survey Documentation and Analysis (SDA) from UC Berkeley as our on-line analysis system.  But people may be surprised to learn that this one product forms the underpinnings of more than a dozen closely related ICPSR on-line analysis products.

One, Anonymous Analysis : This is where we make a dataset available via SDA and there is no authentication required.

Two, Authenticated Analysis : One must authenticate using MyData, Google, or Facebook.

Three, Member Analysis :  One must authenticate and also be using a computer located on the campus (even virtually) of a member institution.

Four, Private Analysis : One must authenticate and the identity used must be a member of a previously created group of identities.

Five through eight, Secure Analysis : Like any of the options above, but where the raw, proprietary, binary data files reside on a separate server, and where the ICPSR web server accesses the content via HTTPS rather than through the filesystem.

Nine through sixteen, Non-disclosed Analysis : Like any of the eight options above, but where SDA's disclosure.txt controls have been used to attempt to prevent unintentional disclosure.

So sixteen different combinations!  And it is easy to imagine even more cropping up in the months ahead.
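The count of sixteen falls out of multiplying three independent choices: four access levels, two places the data can live, and disclosure controls on or off (4 x 2 x 2 = 16). A minimal sketch of that arithmetic follows. The labels paraphrase the list above and are illustrative only; they are not SDA or ICPSR configuration names.

```python
from itertools import product

# Illustrative labels paraphrasing the list above (not real SDA/ICPSR settings).
ACCESS_LEVELS = ["anonymous", "authenticated", "member", "private"]
STORAGE = ["local filesystem", "separate server over HTTPS"]
DISCLOSURE_CONTROLS = [False, True]


def enumerate_products():
    """Return every (disclosure, storage, access) combination.

    The access level varies fastest, so items 1-4 are the plain variants,
    5-8 the "secure" ones, and 9-16 the ones with disclosure controls.
    """
    return list(product(DISCLOSURE_CONTROLS, STORAGE, ACCESS_LEVELS))


if __name__ == "__main__":
    combos = enumerate_products()
    for number, (disclosure, storage, access) in enumerate(combos, start=1):
        controls = "with" if disclosure else "without"
        print(f"{number:2d}. {access} analysis, data on {storage}, "
              f"{controls} disclosure controls")
    print(len(combos), "combinations")
```

Adding one more independent yes/no option would double the count to thirty-two, which is why this kind of growth gets expensive quickly.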

My experience is that one ends up with sixteen different online analysis "products" when things grow organically over time, when things evolve due to small tweaks in response to requests like, "Hey, could we use SDA for this, but with just one small change ..... ?"

It is easy to see how it happens.  But when things grow over time like this, they end up suffering from a profound lack of design, and end up costing more to maintain.  They are fragile.  They break when you change things, like the hardware.  Or the OS.  Or the NAS.  Or the authentication scheme.  Or the oil in your car.

So probably time to pull back a bit, pull together a team of content owners, and start asking some questions.

If we were going to start fresh today with an on-line analysis system, what should we build?

What sort of access controls are needed to prevent bad guys from using it?

What sort of disclosure mitigation capabilities are required to prevent accidents from happening?

To which populations might we need to restrict access?

What does the user experience look like?  Is this geared for the novice or for the expert-in-a-hurry?  Or do we have multiple audiences and so need to build more than one experience?

Time to design.

Friday, February 24, 2012

TRAC: A5.1: Deposit agreements

A5.1 If repository manages, preserves, and/or provides access to digital materials on behalf of another organization, it has and maintains appropriate contracts or deposit agreements.

Repositories, especially those with third-party deposit arrangements, should guarantee that relevant contracts, licenses, or deposit agreements express rights, responsibilities, and expectations of each party. Contracts and formal deposit agreements should be countersigned and current.

When the relationship between depositor and repository is less formal (i.e., a faculty member depositing work in an academic institution’s preservation repository), documentation articulating the repository’s capabilities and commitments should be provided to each depositor.

Repositories engaged in Web archiving may find this requirement difficult because of how Web-based information is harvested/captured for long-term preservation. This kind of data is rarely acquired with contracts or deposit agreements. By its very nature, digital information on the Web is perceived to belong to “everyone and no one.” Some repositories capture, manage, and preserve access to this material without written permission from the content creators. Others go through the very time-consuming and costly process of contacting content owners before capturing and ingesting information. Regardless of process, repositories harvesting and ingesting Web-based materials must articulate their rights issues within publicly accessible policies, and have mechanisms to respond to content owners if the repository’s rights to collect and preserve certain information are challenged.

Ideally, these agreements will be tracked, linked, managed, and made accessible in a contracts database.

Evidence: Deposit agreements; policies on third-party deposit arrangements; contracts; definitions of service levels; Web archiving policies; procedure for reviewing and maintaining agreements, contracts, and licenses.

The ICPSR Deposit System implements this TRAC requirement.  The system makes the terms of deposit clear, and collects an electronic signature from the depositor.  ICPSR keeps the agreements in perpetuity, even if the depositor later decides that s/he would rather not use ICPSR as a repository.

Wednesday, February 22, 2012

Going Google

The University of Michigan is rolling out Google Apps for Education throughout 2012.  A few of us are in the early (1.0) pilot population, and this group made the jump from a variety of legacy University of Michigan email and calendar systems on January 16, 2012.  I first reported on this new initiative late last year, and it's now time for an update.

I should note that I have been using Google's productivity tools outside my professional life for many years, and so there is not much of a learning curve.  I think this will also be true of some of the broader population at UMich, but will not be true universally.  And I should also note that I had been using a second Gmail account for my professional life too for the past 2-3 years.  The main driver for me was storage space.  While I'm not a huge fan of Outlook and Exchange, the service operated by ICPSR's parent organization - the Institute for Social Research - was always solid.  However, the killer was that the allowable quota for mail was very low (400MB by default), and so I found it frustrating to always be shuffling email off into either the Trash Can or into PST mailboxes.  It was especially rough when it came time to search for something.

And so the move from a consumer Gmail account that I use for work to a Google Apps Gmail account that I use for work has been a small change.  The change from Exchange to Google Calendar for managing meetings has been a bigger change.  On the plus side I'm finding it much easier to manage a single, coherent picture for meeting invitations; I had been trying to manage everything inside of Exchange before.  However, I probably receive 100 meeting invitations for every one I generate myself, and so I haven't had to spend much time and effort ensuring that meetings I create on my Google Calendar are ending up on the ISR Exchange server intact. In fact, most of the headaches I experience with calendaring are related to cases where someone generates an invite within the ISR Exchange server, but does not include anything in the "body" of the invite.  If I try to "read" the invite on a mobile device (e.g., Safari on an iPad), the meeting invite shows up as an empty message.  And so I then track down a "real" computer to see what the meeting invite is all about.

My main take-away so far is that moving from Exchange to Google would best be done (1) quickly, and (2) all at once.  My sense is that we early adopters will continue to face a few headaches like the ones above until the rest of the organization moves to Google in 3-6 months.

Monday, February 20, 2012

Archival Storage @ ICPSR

I gave this presentation to a group of students who came to ICPSR to hear our Life of a Dataset show.

It contains a couple of pictures from the old ICPSR data warehouse (literally a warehouse) which has been torn down and replaced with a sprawling Costco complex.

And there is also a nice little graphic that shows where we're making copies of things we move into Archival Storage.

Friday, February 17, 2012

TRAC: A4.5: Managing revenue

A4.5 Repository commits to monitoring for and bridging gaps in funding.

The repository must recognize the possibility of gaps between funding and the costs of meeting the repository’s commitments to its stakeholders. It commits to bridging these gaps by securing funding and resource commitments specifically for that purpose; these commitments can come either from the repository itself or parent organizations, as applicable. Even with effective business planning procedures in place, any repository with long-term commitments will likely face some kind of resource gap in the future. The repository must provide essentially an insurance buffer as a first—and hopefully effective —  line of defense, obviating the need to invoke a succession plan except in extreme situations, such as the repository ceasing operations permanently.

Evidence: Fiscal and fiduciary policies, procedures, protocols, requirements; budgets and financial analysis documents; fiscal calendars; business plan(s); any evidence of active monitoring and preparedness.

This is another short entry since this requirement is pretty far beyond the scope of IT...

ICPSR's organizational documents require us to maintain something like a 90-day window of cash flow in our "checking account."  Every so often these funds will be referenced, particularly during the annual budgeting process, and my sense is that these are exactly the types of monies referred to above.

And, again, a lot of ICPSR's fiscal discipline and operations are rooted in being part of the University of Michigan and the Institute for Social Research.

Wednesday, February 15, 2012

Stop using Word!

I felt a rant come over me this week....

Hey, you!  Yes, you!  You know I'm talking to you.

I tried to look at that meeting agenda you sent me.  You know, the one that you sent as an email attachment, and where the attachment is a Word document. But because I was using a web browser to access my Exchange email account, the browser wouldn't show me the attachment unless I right-clicked on it and saved it first.

And I was reading this on an iPad!  Sheesh.

So, you know, that meant that I didn't look at the agenda until this morning when I got back into the office.  Oh sure I could have made the time to boot up a Windows machine to try to get to web mail that way, but really.

So when I did open this thing, I found this agenda:

  1. Introductions
  2. Problem statement
  3. Brainstorm solutions

And this couldn't just go into the message as text?

Oh, wait, here comes another one.....

You over there!  Yes, you, the guilty-looking one.

I was looking for the policy on Widget Transformations yesterday on our CMS.  I knew we had a policy because you made me go to all of those policy meetings.  (I wasn't sure at the time that we even needed such a policy, but there you go.)

So I was searching and browsing and clicking and scrolling.  And searching.  And searching.  No luck.

So I finally got so frustrated I asked Fred if he knew where it was.  He found it right away.  (He had it bookmarked.  Good ol' Fred.)

It was a Word document.

Don't you know that our CMS is really lame and it doesn't search these things?

But the worst part is when I opened the policy document.  I figured that it must have been done in Word since it had a lot of extra fancy content.  But here's what I found:

Widget Transformation Policy

It is the policy of ICPSR that no widgets should ever be transformed.

And that was it.

This couldn't just go into the CMS as text?

Whew.  Feeling much better now.

But I wish I had a nickel for every time I couldn't find something or read something because it was in Word, and where it was something very plain and very simple.

Monday, February 13, 2012


According to our web server logs, there aren't many of you who have been using SSLv2 when navigating to a URL that begins with "https" on the ICPSR web site.  And that's good.  Because this is a protocol with known flaws in how it protects information.

The ICPSR technology team disabled SSLv2 on its production web server last week.  Based on the logs we keep, it looks like this change will affect few, if any, web site visitors.
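For readers who want to make the same check on their own server, here is a rough sketch of how requests might be tallied by protocol. The log layout is an assumption made for illustration (client address, then protocol, then request); it is not ICPSR's actual log format, and the sample lines are made up.

```python
from collections import Counter


def tally_protocols(lines):
    """Count requests per protocol.

    Assumes (hypothetically) that the protocol name is the second
    whitespace-separated field on each log line.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


# Made-up sample lines in the assumed layout, using documentation addresses.
SAMPLE_LOG = [
    '192.0.2.10 TLSv1 "GET /index.html HTTP/1.1"',
    '192.0.2.11 SSLv3 "GET /index.html HTTP/1.1"',
    '192.0.2.12 TLSv1 "GET /search HTTP/1.1"',
    '192.0.2.13 SSLv2 "GET /index.html HTTP/1.0"',
]

if __name__ == "__main__":
    counts = tally_protocols(SAMPLE_LOG)
    total = sum(counts.values())
    for protocol, count in counts.most_common():
        print(f"{protocol}: {count} of {total} requests")
```

If the SSLv2 count is at or near zero over a representative period, disabling the protocol should go unnoticed by nearly everyone.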

Sunday, February 12, 2012

ICPSR network maintenance

This morning's network maintenance at the University of Michigan is running longer than expected.  ICPSR staff rolled service from the production systems to our replica in Amazon's cloud at 4:55am EST today.  The U-M data networking team now estimates that the work will be completed by 10:30am EST.  Once ICPSR confirms that the maintenance is complete, we will roll service back to the production systems.

You can track the maintenance using the same link we are -

Friday, February 10, 2012

TRAC: A4.4: Managing investments and risk

A4.4 Repository has ongoing commitment to analyze and report on risk, benefit, investment, and expenditure (including assets, licenses, and liabilities).

The repository must commit to at least these categories of analysis and reporting, and maintain an appropriate balance between them. The repository should be able to demonstrate that it has identified and documented these categories, and actively manages them, including identifying and responding to risks, describing and leveraging benefits, specifying and balancing investments, and anticipating and preparing for expenditures.

Evidence: Risk management documents that identify perceived and potential threats and planned or implemented responses (a risk register); technology infrastructure investment planning documents; cost benefit analyses; financial investment documents and portfolios; requirements for and examples of licenses, contracts, and asset management; evidence of revision based on risk.

I don't think we spend enough time and focus at ICPSR paying attention to this sort of analysis.  I think this is true for all areas, including my own technology area.

I think the reasons we don't do this more are not unique to ICPSR.  People are too busy working on the next deliverable to ship, and are so caught up in the fray, there isn't time to step back, take a breath, and look at the big picture.  Technology assets are parceled out to a variety of grants and contracts, each with wildly different purposes, tasks, and deliverables.  And so while the management team can ensure that there is some consistency in design, development, and operating environment, it can feel daunting to try to figure out the Future World for such a disparate set of activities.

The culture, too, is very geared to nailing down the next grant or contract, and so there is usually a process for fitting the new work into the overall framework of the organization rather than seeking out specific projects that will fit a need already identified in a plan.  That said, if the goal is very broad ("build muscle with video assets") then there is indeed an effort to find those projects that help meet that goal.

It may be the case that it is time again for ICPSR to take a crank on a very forward-looking mission statement and strategic plan.  And maybe a follow-on to such an activity would be deploying a process that regularly (quarterly?  monthly?) revisits those documents to measure risk, benefit, investment, and expenditure rather than activity, accomplishments, and the like.

Wednesday, February 8, 2012

January 2012 deposits at ICPSR

January looks like it was a very busy month at ICPSR:

# of files  # of deposits  File format
20  2  F 0x07 video/h264
7  4  application/msword application/msword
9  2  text/plain; charset=iso-8859-1
14217  text/plain; charset=us-ascii
2  1  text/plain; charset=utf-8
1  1  text/x-c; charset=us-ascii
3  2  text/x-mail; charset=unknown

Two items are noteworthy.

One is that we moved a few key systems from older 32-bit machines running older versions of RHEL to new 64-bit machines running RHEL 6.  As it turns out, the magic database that the file utility uses on RHEL 6 is in a new format, and did not work well with our local additions (aka localmagic and localmagic.mime for Linux folks).  So my belief is that our file-based format detector threw up its hands more often than usual, and this accounts for the over 1700 unknown (application/octet-stream) format types last month.  I think these are good candidates for a follow-up scan to correct the results.
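To illustrate the failure mode, here is a toy sketch of signature-based detection with a fallback to application/octet-stream, plus a helper that picks out the files worth re-scanning. It is not the file utility and not ICPSR's localmagic rules; the three signatures are just well-known leading bytes (PDF, PNG, ZIP), and the function names are hypothetical.

```python
# Toy signature table: (leading bytes, MIME type). A real magic database
# holds far more rules; these three are only for illustration.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),
]
UNKNOWN = "application/octet-stream"


def detect_format(data: bytes) -> str:
    """Return the MIME type of the first matching signature, else UNKNOWN."""
    for prefix, mime in SIGNATURES:
        if data.startswith(prefix):
            return mime
    return UNKNOWN


def rescan_candidates(results: dict) -> list:
    """Given {filename: detected type}, list the files that came back unknown."""
    return sorted(name for name, mime in results.items() if mime == UNKNOWN)


if __name__ == "__main__":
    # Hypothetical file names and contents.
    samples = {
        "report.pdf": b"%PDF-1.4 ...",
        "figure.png": b"\x89PNG\r\n\x1a\n...",
        "mystery.dat": b"\x00\x01\x02\x03",
    }
    results = {name: detect_format(data) for name, data in samples.items()}
    print(results)
    print("re-scan:", rescan_candidates(results))
```

When the rules that would have matched are missing or unreadable, everything falls through to the default, which is the pattern a spike in application/octet-stream results suggests.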

Two, lots of images.  I know that we are getting a lot of video and images as part of our Bill and Melinda Gates Foundation MET and MET Extension projects, but I also know that none of the files above is from that project.  So where is all of this coming from?  One big deposit....

Network maintenance - Sunday morning (EST) Feb 12, 2012

The University of Michigan central IT organization, ITS, will be upgrading the network gear that connects ICPSR's building to the campus data network.  The work is scheduled to start at 6am (EST) on the morning of Feb 12, 2012 and should take between one and two hours.

The ICPSR IT team will redirect traffic from the production system to our cloud replica during the maintenance period.  The replica runs in Amazon's cloud and features services such as search, analyze, and download, but purposefully does not enable features such as deposit.

Between the cut-over to the replica, the network maintenance, and the fallback to the production system, access will likely be a little rocky next Sunday morning.  Like a freeway during construction, if it is possible to take a detour around ICPSR's web site on Sunday morning, that's the safest route.  But if you find that you need to download some data or use the site, the replica will be available.

Monday, February 6, 2012

Job posting - again

We're posting a job description for a senior software developer for a third time.  If there is a recession in the IT business in SE Michigan, somebody forgot to tell our pool of potential applicants.

This position is very much like other software developer positions at ICPSR.

In practice there is a blend of business analysis, system design, software development, and second-line on-going support for the stuff you write.  Building stuff in Java to run under Tomcat is a must.  Experience with Oracle, Eclipse, and one or more frameworks is very useful.

The main support for this position is our Bill and Melinda Gates Foundation MET Extension project.  This is a two-year grant to build systems that will deliver video content to researchers in the social sciences and education.  We've had a lot of success - like a 0% failure rate - at hiring people to work on project X, and then moving them over to project Y a few years down the road.  Project X in this case is MET Extension; project Y is unknown.  But so was the MET Extension project 10 months ago...

If you have interest in the position and would like more info, please feel free to drop me a note.  And here's a short-lived, but direct, link to the job

Friday, February 3, 2012

TRAC: A4.3: Transparency

A4.3 Repository’s financial practices and procedures are transparent, compliant with relevant accounting standards and practices, and audited by third parties in accordance with territorial legal requirements.

The repository cannot just claim transparency, it must show that it adjusts its business practices to keep them transparent, compliant, and auditable. Confidentiality requirements may prohibit making information about the repository’s finances public, but the repository should be able to demonstrate that it is as transparent as it needs to be and can be within the scope of its community.

Evidence: Demonstrated dissemination requirements for business planning and practices; citations to and/or examples of accounting and audit requirements, standards, and practice; evidence of financial audits already taking place.

ICPSR publishes an annual report which contains a financial statement, and it also completes many, many regular reports because it is part of the Institute for Social Research at the University of Michigan.

Wednesday, February 1, 2012

January 2012 web availability


If you click on the image, it'll expand into something more easily read. But don't.

January 2012 was not a good month for ICPSR's web delivery system.  We missed our monthly goal of 99%+ availability.  Not by a lot, but still missed it.
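To put the 99% goal in perspective: January has 31 days, or 44,640 minutes, so a 99% target leaves a budget of about 446 minutes (roughly seven and a half hours) of downtime for the whole month. A small sketch of that arithmetic follows. The function names are hypothetical, and the 500-minute figure in the example is invented for illustration, not our measured downtime.

```python
def minutes_in_month(days_in_month: int) -> int:
    """Total minutes in a month of the given length."""
    return days_in_month * 24 * 60


def downtime_budget_minutes(goal_percent: float, days_in_month: int) -> float:
    """Downtime allowed in a month while still meeting the availability goal."""
    return minutes_in_month(days_in_month) * (100.0 - goal_percent) / 100.0


def availability_percent(downtime_minutes: float, days_in_month: int) -> float:
    """Availability for the month given total downtime in minutes."""
    total = minutes_in_month(days_in_month)
    return 100.0 * (total - downtime_minutes) / total


if __name__ == "__main__":
    budget = downtime_budget_minutes(99.0, 31)
    print(f"Budget at 99% for a 31-day month: {budget:.1f} minutes")
    # Invented example value, only to show a month that narrowly misses.
    example = availability_percent(500, 31)
    print(f"500 minutes of downtime gives {example:.2f}% availability")
```

The budget is small enough that a handful of short outages on top of one planned maintenance window can use it up.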

Many of the problems this month are related to an upgrade we made to our delivery infrastructure.  We replaced a five-year-old 32-bit machine with a new 64-bit machine with significantly more memory and processing power.  And we upgraded from an older version of Tomcat to the latest release.

The maintenance itself accounted for less than an hour of downtime, but various issues related to the upgrade brought additional short periods of downtime.  A "death by a thousand cuts" sort of thing.

The good news is that the reliability and stability of the system now seem better, and the upgrade has exposed a couple of software flaws that we have been able to correct.  It also has eliminated the last of the 32-bit systems from ICPSR, and so we are less likely to run into problems where we (accidentally) try to run 64-bit binaries and libraries on 32-bit machines.