Monday, October 31, 2011

Tech@ICPSR returns from Gartner

Tech@ICPSR is still recovering from a week-long trip to the 2011 North American Gartner Symposium/ITxpo.

The event had nearly 10,000 technology leaders, CIOs, and other folks from the world of IT. The original game plan was to generate a few short blog posts about the event while attending the symposium, but no such luck.  Kudos to those intrepid bloggers who can both attend an event all day and then blog about it all night.

The major themes at Gartner this year were:  cloud, social, video, big data, and managing the explosive growth in the amount of content that is being produced and retained.  ICPSR has its toes in all of these, and so it felt like the program was a very good fit.

In some ways the event was disappointing in that the take-aways were obvious ones.  But in other ways it was rewarding because it confirmed that ICPSR is on the right track in many of these areas, such as making use of the cloud to solve certain problems.  And so while a speaker, a Gartner analyst, may have been encouraging the audience members to conduct an experiment with the cloud, ICPSR is already using the cloud as part of its production operations.

The sessions on "social" were the most interesting.  Again, it feels like ICPSR is already making use of social media in some ways, but it also feels like there is a vast, untapped opportunity to use the content that "social media" generates in social science research.  Is there an opportunity for ICPSR to partner with another organization to create, curate, preserve, and disseminate datasets derived from Twitter feeds?  Facebook walls?

Photo credit:  http://farm5.static.flickr.com/4038/5145453784_861aaacc04_m.jpg

Friday, October 28, 2011

TRAC: A2.1: The right type of staff and skills

A2.1 Repository has identified and established the duties that it needs to perform and has appointed staff with adequate skills and experience to fulfill these duties.

The repository must identify the competencies and skill sets required to operate the repository over time and demonstrate that the staff and consultants have the range of requisite skills—e.g., archival training, technical skills, and legal expertise.

Evidence: A staffing plan; competency definitions; job description; development plans; plus evidence that the repository reviews and maintains these documents as requirements evolve.



I see three main areas of evidence to support this requirement.

One is that ICPSR has been in operation for fifty years, and it continues to win contracts and grants to preserve and disseminate social science research data and documentation.  No organization can operate for fifty years if it does not have a team capable of delivering success.

Another bit of evidence appears on ICPSR's organization chart.  I don't believe we publish it for the world to see, but we do maintain a copy on our intranet site.  The org chart shows the areas, teams, and people one needs to curate content successfully.  Data managers?  Check.  Metadata specialists?  Check.  Technology?  Check.  Administrative functions?  Check.  And, of course, specialists in digital preservation policy and standards.

Finally, there is also evidence in the body of job descriptions one would find on the University of Michigan jobs site (if its content was preserved!).  One can see how job titles, job functions, and skillsets have evolved over the years as technology, best practices, and types of content have also evolved.

Wednesday, October 26, 2011

Using DuraCloud for Archiving and Preservation

I'll be joining Michele Kimpton, CEO of DuraSpace, on a webinar next Wednesday (November 2, 2011).  Our topic is DuraCloud, and how one can use this cloud-based service as part of one's digital preservation strategy.

I think sometimes people will view the cloud as an alternative to keeping and maintaining local copies, but at ICPSR we're using the cloud as an easy-to-manage storage location to supplement more conventional locations, such as local NAS storage and the University of Michigan's "Value Storage" service.

Here is a copy of the invite that went out via email:


DuraSpace

You are invited to attend the following event:

Using DuraCloud for Archiving & Preservation
Wednesday, November 2, 2011
1:00p.m. - 2:00p.m. Eastern Standard Time

Presented By:
Michele Kimpton, DuraSpace Chief Executive Officer & DuraCloud Project Director &
Bryan Beecher, Director of Computing & Network Services, Interuniversity Consortium for Political and Social Research (ICPSR)


Having a hard time keeping up with current preservation and archiving practices?
Are you finding the task of archiving your content complicated, costly and confusing?
Then you need to join us for a free webinar that details how DuraCloud can be part of your preservation and archiving solution.

This webinar will discuss how to use DuraCloud as a component of your archiving and preservation strategy. An overview of the service will include what it is, how it works, and the benefits it has to offer. Additionally, Bryan Beecher, Director of Computing & Network Services at ICPSR, will present ICPSR's preservation and archiving strategy. Bryan will share how DuraCloud and other methods have been implemented to meet ICPSR's preservation and archiving goals.

If you are interested in attending please thoroughly complete the registration process (below) to receive your unique login url. Be sure to SAVE the return email you receive from Infinite Conferencing as it will include your unique login information.
 

Below is the call-in information for the event.
Via Skype (Free, World): Dial +9900827047086940 
Via Phone (Toll, US): Dial +1(201)793-9022 Enter Room Number: 7086940


The maximum capacity for this web seminar is 99 participants. The event will be recorded and slides will be available for viewing after the event at http://duraspace.org/web_seminars.

Please contact Kristi Searle at ksearle@duraspace.org with any questions.

**Please be aware of Infinite Conferencing System Requirements:
- Internet connection speed of 128 kbps or higher is recommended
- Microsoft Windows XP, Vista, Windows 7, or Server 2003
- Internet Explorer 6.0 SP2, 7.0, 8.0 & 9.0, Firefox 3.0x/3.5, 4, 5 and Chrome 12 browsers
- Apple Mac with Intel CPU, Mac OS X 10.5/10.6, Safari 4.x, 5.x or Firefox 3.x, Java 1.5+
- Linux, Unix, or Solaris with Mozilla 1.0+
- Cookies and Scripting enabled in browser

Friday, October 21, 2011

TRAC: A1.2: Succession planning

A1.2 Repository has an appropriate, formal succession plan, contingency plans, and/or escrow arrangements in place in case the repository ceases to operate or the governing or funding institution substantially changes its scope. 

Part of the repository’s perpetual-care promise is a commitment to identify appropriate successors or arrangements should the need arise. Consideration needs to be given to this responsibility while the repository or data is viable—not when a crisis occurs—to avoid irreparable loss. Organizationally, the data in a repository can be at risk regardless of whether the repository is run by a commercial organization or a government entity (national library or archives). At government-managed repositories and archives, a change in government that significantly alters the funding, mission, collecting scope, or staffing of the institution may put the data at risk. These risks are similar to those faced by commercial and research-based repositories and should minimally be addressed by succession plans for significant collections within the greater repository.

A formal succession plan should include the identification of trusted inheritors, if applicable, and the return of digital objects to depositors with adequate prior notification, etc. If a formal succession plan is not in place, the repository should be able to point to indicators that would form the basis of a plan, e.g., partners, commitment statements, likely heirs. Succession plans need not specify handoff of entire repository to a single organization if this is not feasible. Multiple inheritors are possible so long as the data remains accessible.

Evidence: Succession plan(s); escrow plan(s); explicit and specific statement documenting the intent to ensure continuity of the repository, and the steps taken and to be taken to ensure continuity; formal documents describing exit strategies and contingency plans; depositor agreements.
 



ICPSR has formalized succession planning and contingency planning via its commitment to Data-PASS, the Data Preservation Alliance for the Social Sciences.  As noted on the Data-PASS web portal:

Organizations join the Data-PASS partnership for several reasons. Membership in Data-PASS helps insure against preservation loss. Data-PASS safeguards the collections of its members through transfer protocols, succession planning, and live replication of collections. If a member organization requires off-site replication of its collections, the partnership will provide it. And if a member organization is no longer institutionally capable of preserving and disseminating a collection, the collection can be preserved and disseminated through the partnership.

This commitment helps ensure that content currently held and managed by ICPSR will continue to be available even if ICPSR ceases to exist.

Wednesday, October 19, 2011

The RCS becomes the DARS

ICPSR first launched its Restricted Contract System (RCS) more than two years ago.  Since that initial launch we've learned a lot:  who actually uses the system to apply for access to data; how they experience the system; how ICPSR contract administrators use the system; and, how to build in workflow to make it a smoother experience for all parties.

We relaunched the RCS last week, but with a new name:  the ICPSR Data Access Request System (DARS).  I suspect a lot of us will continue to call it the RCS; it is the same system underneath, but with a very different look and flow.

The DARS home page is the same as that of the old RCS, and the most typical way to reach the system for the first time is from the home page of a study.  If a version of a study is available through a data-use agreement, then a link appears on its home page, and clicking that link navigates the visitor to the DARS.

Once there, the visitor can initiate the data-use agreement process, going through the same general steps as before.  However, it is now much clearer when the agreement process has been completed and the ball is in ICPSR's court for review.  We've also worked hard to distinguish between elements of the agreement that are essential (i.e., if one changes, then the agreement must be reviewed and signed again) and those that are more tangential (i.e., if one changes, ICPSR will be notified, but the agreement need not be signed again).

One element of the redesign is an explicit acknowledgement that this system may be used for any data-use request, and is not limited to only restricted-use requests.  We based this change on feedback from an internal team of reviewers who thought that the system should be able to work for any type of content that requires an agreement, even if it isn't particularly sensitive or confidential.

This design also recognizes that the applicant using the system may not necessarily be the PI who is requesting access to the data.  (In fact, we suspect that most applicants are not the PI.)  We therefore built views and rules to make it easier for, say, a population center data coordinator who may be working on several requests for several PIs to get a better view of status across all requests.

Monday, October 17, 2011

Tech@ICPSR heads to Gartner

Tech@ICPSR is attending the Gartner Symposium/ITxpo this week in sunny Orlando.  Rather than writing a single blog post summarizing the event, I'll try to post a series of very short posts throughout the week as I attend different sessions.  (Something bigger than a tweet, but not as long as a typical blog post here.)

Friday, October 14, 2011

TRAC: A1.1: Mission statement

A1.1 Repository has a mission statement that reflects a commitment to the long-term retention of, management of, and access to digital information.

The mission statement of the repository must be clearly identified and accessible to depositors and other stakeholders and contain an explicit long-term commitment.

Evidence: Mission statement for the repository; mission statement for the organizational context in which the repository sits; legal or legislative mandate; regulatory requirements.



Here is our mission statement: 

ICPSR provides leadership and training in data access, curation, and methods of analysis for a diverse and expanding social science research community.

The important word in the context of TRAC requirement A1.1 is curation.  My recollection is that as we were crafting the mission statement, it originally contained many, many more words, and one of the most time-consuming tasks was for the group to pare it down.  I think the term curation was meant to imply not passive storage of content, but active management of our content, beginning with its receipt in the deposit system, and then onward.

Along with a mission statement we also have a strategic plan (which is getting a little long in the tooth now that it is 2011), and the commitment to long-term retention and management of digital information also appears in several places therein.

Wednesday, October 12, 2011

A very brief introduction to FLAME - ICPSR's File-Level Archival Management Engine

We've closed the books on 2011 Q3 and have moved on to the list of priorities for Q4.  One of the top priorities is a new project called FLAME (File-Level Archival Management Engine).

The goal of the project is to re-tool ICPSR's primary technology infrastructure so that it is file-oriented rather than "study"-oriented.  This is essential to ICPSR's future for two main reasons.

One, more and more of our content doesn't fit nicely into ICPSR's classic "survey data and codebook" model.  We're starting to handle content like classroom observation sessions (video) and open-ended textual content (qualitative data), and even some of our existing content (TIGER/Line files, Census 2000 summary files, CCEERC reports) does not fit into the current object model without much contortion.

Two, the file-level is a much better fit for mapping business functions to the Open Archival Information System (OAIS) reference model (pink book), and for conforming to best practices, such as the Trustworthy Repositories Audit and Certification checklist.  For example, if we want to be able to demonstrate trustworthiness when it comes to the mapping from a file we deliver on the web site to a file we have in archival storage to a file that was deposited, we need to collect information and manage content at the file level.
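To make the file-level idea a little more concrete, here is a minimal sketch in Python of the kind of mapping described above: a file delivered on the web site links back to a file in archival storage, which links back to the file that was deposited. The class name FileRecord, its fields, the stage labels, and the sample identifiers are all hypothetical, invented for illustration; nothing here reflects FLAME's actual design.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FileRecord:
    """One managed file (hypothetical layout, for illustration only)."""
    file_id: str
    stage: str                          # "deposit", "archive", or "web"
    derived_from: Optional[str] = None  # file_id of the file it came from


def trace_to_deposit(file_id: str, records: Dict[str, FileRecord]) -> List[FileRecord]:
    """Follow derived_from links from a file back to the original deposit."""
    chain: List[FileRecord] = []
    seen = set()
    current: Optional[str] = file_id
    while current is not None:
        if current in seen:
            # Guard against a malformed chain that loops back on itself.
            raise ValueError(f"cycle in provenance chain at {current}")
        seen.add(current)
        record = records[current]
        chain.append(record)
        current = record.derived_from
    return chain


# Hypothetical sample records: one deposited file, its archival copy,
# and the copy delivered on the web site.
records = {
    "dep-1": FileRecord("dep-1", "deposit"),
    "arc-1": FileRecord("arc-1", "archive", derived_from="dep-1"),
    "web-1": FileRecord("web-1", "web", derived_from="arc-1"),
}

for record in trace_to_deposit("web-1", records):
    print(record.stage, record.file_id)
```

The point of the sketch is only that each link in the chain is recorded per file, so the question "where did this delivered file come from?" can be answered without reference to a study-level container.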

I've been looking at the wiki for Archivematica, a site that I learned about from Nancy McGovern.  They've created a use-case and one or more related microservices for many of the boxes and connectors in the OAIS reference model.  I like the idea of linking the software directly to the OAIS reference model like this, and I'm intending to make great use of the Archivematica work to help us here.


Clip art credit: http://www.flickr.com/photos/bycp/5690269952/sizes/s/in/photostream/

Monday, October 10, 2011

September 2011 deposits

The report for September:

# of files   # of deposits   File format
         1               1   application/msaccess
        37              14   application/msword
        13               2   application/octet-stream
       125              28   application/pdf
        37              15   application/vnd.ms-excel
         7               4   application/x-sas
        77              27   application/x-spss
        14               4   application/x-stata
         1               1   application/x-zip
         4               1   message/rfc8220117bit
         4               4   text/html
         2               2   text/plain; charset=iso-8859-1
         4               4   text/plain; charset=unknown
       196              30   text/plain; charset=us-ascii
         4               3   text/rtf
         1               1   text/x-c; charset=unknown
         5               4   text/x-c; charset=us-ascii

Pretty typical formats, and pretty normal volumes.  A few deserve investigation (octet-stream), and a few plain text files have been tagged as C source code (as usual).
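For anyone curious how a report like this could be tallied, here is a minimal sketch in Python. It assumes each deposited file is recorded as a (deposit ID, MIME type) pair; that record layout, the function name tally_formats, and the sample values are hypothetical, and this is not the code that produced the report above.

```python
from collections import defaultdict


def tally_formats(files):
    """Count files and distinct deposits for each file format.

    `files` is an iterable of (deposit_id, mime_type) pairs; this
    layout is an assumption made for illustration.
    """
    file_counts = defaultdict(int)
    deposit_ids = defaultdict(set)
    for deposit_id, mime_type in files:
        file_counts[mime_type] += 1
        deposit_ids[mime_type].add(deposit_id)
    # One row per format: (number of files, number of deposits, format),
    # sorted by format name like the report.
    return [(file_counts[m], len(deposit_ids[m]), m) for m in sorted(file_counts)]


# Hypothetical sample data, not taken from the September report.
sample = [
    (101, "application/pdf"),
    (101, "application/pdf"),
    (102, "application/pdf"),
    (102, "application/x-spss"),
]

for n_files, n_deposits, mime_type in tally_formats(sample):
    print(f"{n_files:>10} {n_deposits:>15}   {mime_type}")
```

Counting deposits with a set means that a deposit containing several files of the same format is counted only once for that format, which is why the file count is never smaller than the deposit count.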

Friday, October 7, 2011

ICPSR wins grant from the Bill and Melinda Gates Foundation

ICPSR and partners at the University of Michigan received a grant from the Bill and Melinda Gates Foundation recently.  You can find the official link here.

The link says that the grant is to house and make available to qualified researchers the data collected by the Measures of Effective Teaching project, and that means that you'll soon see yet another ICPSR-operated web portal.

In addition to building and operating the portal, the project also requires us to process, preserve, and deliver quantitative data related to the Measures of Effective Teaching (MET) project, which is right in ICPSR's wheelhouse, of course.  The really new element for us, though, is the collection of video and "artifacts" related to the video.

ICPSR will use its existing Restricted Contract System (RCS) to screen applicants who want to access the video collection.  If approved the applicant will be able to access a video streaming server to view the videos in the collection.  The applicant will also be able to access the quantitative data in our Virtual Data Enclave (VDE).

The video collection is large compared to our current holdings of survey and government data.  My sense is that our collection will pretty much double in size, approaching 20TB total.  That's very big for us, but not nearly as big as some collections of video.


Wednesday, October 5, 2011

Web availability through September 2011

ICPSR web availability through 9/2011
Web site availability was good, but not great, in September.  We've found that our Solr search query process is the most fragile piece of the infrastructure, and it got "stuck" on Sunday evening, 9/25.  Usually these are easy-to-correct faults; we just restart the tomcat instance hosting the Solr search query service.  But on this particular night the on-call missed the page, and the U-M Network Operations Center (NOC) did not open a ticket and phone the on-call, and so the outage lasted closer to 90 minutes.

During that time the web site was still usable, of course, and lots of functions would have worked normally (viewing pages, downloading studies, using SDA, etc.).  But we start our "unavailability counter" whenever any part of the infrastructure is unavailable.  My apologies if you were trying to search our catalog at that time.
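To put the outage in perspective, here is a small worked calculation in Python. It assumes, purely for illustration, that the roughly 90-minute search outage was the only downtime counted in September (a 30-day month); the function name availability_percent is hypothetical, and the result is not the official figure for the month.

```python
def availability_percent(downtime_minutes, days_in_month):
    """Percentage of the month during which nothing was counted as down."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


# 90 minutes of downtime out of 30 * 24 * 60 = 43,200 minutes.
print(f"{availability_percent(90, 30):.2f}%")  # prints "99.79%"
```

Because the counter starts whenever any part of the infrastructure is unavailable, a figure computed this way understates how usable the site was for visitors who were not searching.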

Our analysis is that the virtual machine is running out of memory on our current (but old) web server.  We have a new 64-bit machine with significantly more memory available, and we've been prepping it to take over for the old machine.  In the process of building the new machine we've been upgrading versions of Red Hat, tomcat, java, and many other key elements, and this has made the going a bit slower than usual, but it should give us a machine with better software.  And software that doesn't need to be upgraded right away (I hope!).

Monday, October 3, 2011

All Things Confidential

Tech@ICPSR will be at the University of Michigan's Michigan Union this Thursday to participate in the 2011 biennial meeting of our Organization Representatives from across the world of higher-education.  We'll be batting lead-off for the All Things Confidential session at 9am EDT.

If you cannot attend the meeting in person, you can still attend virtually.