Friday, December 30, 2011

TRAC: A3.7: Transparency

A3.7 Repository commits to transparency and accountability in all actions supporting the operation and management of the repository, especially those that affect the preservation of digital content over time.

Transparency is the best assurance that the repository operates in accordance with accepted standards and practice. Transparency is essential to accountability, and both are achieved through active, ongoing documentation. The repository should be able to document its efforts to make information about its development, implementation, evolution, and performance available and accessible to relevant stakeholders. The usual means of communication an organization uses to provide significant news and updates to stakeholders should suffice for meeting this requirement.

Evidence: Comprehensive documentation that is readily accessible to stakeholders; unhindered access to content and associated information within repository.

This might be an easy requirement for ICPSR to meet. Since we're an "active archive" rather than a "dark archive" (as some would say), we have regular interaction with our designated community. This takes place in the form of downloads from our web site, mailings and blog posts about ICPSR news and events, regular meetings with our Council and our Organizational Representatives, and many other channels.

ICPSR's internal workings are documented in great detail in both paper form and on our Intranet. These documents serve to train new employees, students, and interns during the summer. And since there is always room for improvement, we are working to improve transparency as we develop FLAME, our next-generation, file-level archive management system.

Wednesday, December 28, 2011

Starting the FLAME

In an earlier post I described a major new project at ICPSR called FLAME. FLAME is the File-Level Archival Management Engine, and will become the new repository technology platform ICPSR uses to curate and preserve content. As the name implies, the main molecule of information upon which FLAME will operate is a "file" which is different than the main molecules used at ICPSR today, the "deposit" and the "study." In the big picture the activities at ICPSR will not change much: we will still collect social science research data, curate them, preserve them, and make them available in a wide variety of formats and modes. But when one looks at the details, an awful lot will change.

So when one is going to change everything, where does one start?

Fortunately we have a ready-made starting point with the Open Archival Information System (OAIS) reference model. While this does not give us a blueprint of what to build, it does give us a model to use as we construct our blueprints. I believe this is very much what the folks at Archivematica have done.

So the question becomes: How do we translate a high-level reference model that contains functions such as Receive Submission to the low-level blueprints one needs to reconfigure process and build software? What kind of web applications do I need for Receive Submission? What should they do? Should that box that contains the submitter's identity be an email address? A text string? An ORCID?

So how to start?

One of my colleagues, Nancy McGovern, suggested we brainstorm 6-12 medium-level statements for each of the functions in the OAIS reference model. We started with Receive Submission, and indeed generated 12 statements. (The analogue at Archivematica is Receipt of SIP.) One example is:

The producer provided basic provenance information at deposit

If the metaphor for building FLAME is building a house, then OAIS plays the role of high-level best practices. The statements (like the one above) play the role of floor plans and elevations: the things to which most people can relate and about which they can make decisions. So this is moving in the right direction, but we're still lacking the blueprints.

The next step is to take a statement like the one above and turn it into requirements for software (and for process). One example requirement that flows from the statement above is:

FLAME should capture the following provenance information from the files after each content transfer:

i. Date and time at which each file is received

ii. Checksum of each file

iii. MIME type of each file

iv. Original name of each file

v. Packaging information (e.g., file was part of a Zip archive)
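To make the requirement concrete, here is a minimal Python sketch of what capturing items i through v might look like. It is an illustration only, not FLAME code: the names ProvenanceRecord and capture_provenance are hypothetical, the MIME type is guessed from the file name rather than from the content, Zip is the only packaging format handled, and SHA-256 is just a placeholder default since the choice of checksum is still an open question.

```python
import hashlib
import mimetypes
import zipfile
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


@dataclass
class ProvenanceRecord:
    """Hypothetical record holding items i-v from the requirement."""
    received_at: str            # i.   date and time received (UTC, ISO 8601)
    checksum: str               # ii.  hex digest of the file content
    checksum_algorithm: str     #      which algorithm produced the digest
    mime_type: str              # iii. guessed from the file name only
    original_name: str          # iv.  name as supplied by the producer
    packaged_in: Optional[str]  # v.   enclosing Zip archive, if any


def _record(name: str, data: bytes, received_at: str,
            algorithm: str, packaged_in: Optional[str]) -> ProvenanceRecord:
    guessed, _ = mimetypes.guess_type(name)
    return ProvenanceRecord(
        received_at=received_at,
        checksum=hashlib.new(algorithm, data).hexdigest(),
        checksum_algorithm=algorithm,
        mime_type=guessed or "application/octet-stream",
        original_name=name,
        packaged_in=packaged_in,
    )


def capture_provenance(path: Path, algorithm: str = "sha256") -> List[ProvenanceRecord]:
    """Return one record per file in a transfer (assumed: a file or a Zip)."""
    received_at = datetime.now(timezone.utc).isoformat()
    if zipfile.is_zipfile(path):
        records = []
        with zipfile.ZipFile(path) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue  # directories carry no content to checksum
                records.append(_record(info.filename, archive.read(info),
                                       received_at, algorithm, path.name))
        return records
    return [_record(path.name, path.read_bytes(), received_at, algorithm, None)]
```

Making the algorithm a parameter keeps the sketch neutral on the MD5 versus SHA-1 versus something-else question; a real implementation would also want to read large files in chunks rather than all at once.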

We can then discuss these low-level requirements with stakeholders, such as the acquisitions team, and with the technology team, such as a software developer who may have additional questions (e.g., "Well, what sort of checksum do you want - MD5, SHA-1, or something else?").

Right now we are working through the details of Receive Submission, and the next few stops on the roadmap will likely be in Ingest as well. We're documenting both the high-level statements and the low-level requirements in a Drupal CMS that we use as our Intranet.

Monday, December 26, 2011

Tech@ICPSR takes a holiday

As far as the University of Michigan is concerned, today is Christmas Day. (I know this. It says so on my timesheet.) Tech@ICPSR had too much egg nog and is taking the day off.

Friday, December 23, 2011

TRAC: A3.6: Change logs

A3.6 Repository has a documented history of the changes to its operations, procedures, software, and hardware that, where appropriate, is linked to relevant preservation strategies and describes potential effects on preserving digital content.

The repository must document the full range of its activities and developments over time, including decisions about the organizational and technological infrastructure. If the repository uses software to document this history, it should be able to demonstrate this tracking.

Evidence: Policies, procedures, and results of changes that affect all levels of the repository: objects, aggregations of objects; object-level preservation metadata; repository’s records retention strategy document.

I'll focus on the technology pieces of this story.

The technology team maintains a change log of major (and not-so-major) system changes. Moving a chunk of storage from Network Attached Storage appliance A to B? It's in the log. Upgrading the hardware of the web server we use to stage new content and software? It's in the log. Updating business productivity software to enforce newly declared business rules that affect how it should work? It's in the log.

We do a pretty good job overall recording technology changes in a well-known, recorded space. (Our Intranet is hosted in the Drupal CMS by the U-M central IT organization.) Of course, there is always room for improvement, but the big stuff gets documented. For instance, I know that we don't always record changes to desktop workstations (e.g., Windows patches) in the change log, even though we do generate an announcement via email.

Wednesday, December 21, 2011

Amazon's new AWS icons

Below is a completely unreadable schematic of ICPSR's replica of its web infrastructure in Amazon's cloud. This is just one way that we're using the cloud, and Amazon in particular.

Click the picture to enlarge.

Two weeks ago, Amazon released a nice set of icons for use in common drawing and presentation software. The set contains an icon for all of the Amazon Web Services (AWS) services and types of infrastructure, and it also contains generic, gray icons for non-AWS elements. I used the icons to create the nice schematic above.

The diagram is based on one of the examples Amazon includes in the PPTX-format set of icons. I needed to delete a few services and servers that we don't use (e.g., Route 53 for DNS). The diagram shows the ICPSR machine room on the left, and the three main systems that deliver our production web service: a big web server, an even bigger database server, and an even bigger still EMC storage appliance. We synchronize the content from these systems into corresponding systems in the AWS cloud.

We use EC2 instances in the US-East region to host our replica. Unlike physical hardware where we sometimes host multiple IP addresses on a single machine, we maintain a one-to-one mapping between virtual machines and IP addresses in EC2. And so one physical web server in ICPSR's machine room ends up as a pair of virtual servers in Amazon's cloud.

We initiate a failover by changing the DNS A (address) record for www.icpsr.umich.edu and www.cceerc.org. This change can take place on either a physical DNS server located at ICPSR or a virtual DNS server located in AWS. The time-to-live (TTL) is very low, only 300 seconds, and so once we initiate the failover procedure, web browsers will start using the replica very soon. (However, we have noticed that long-lived processes which do not regularly refresh name-to-address resolution for URLs, like crawlers, take much longer to fail over.)
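The difference between browsers and crawlers can be illustrated with a toy model in Python. This is a simulation only, not how any particular resolver or crawler is implemented: the CachingClient class is hypothetical, the addresses come from the reserved documentation ranges rather than ICPSR's real ones, and time is passed in by hand so nothing touches the network.

```python
from dataclasses import dataclass
from typing import Callable, Optional

TTL_SECONDS = 300  # the TTL mentioned in the post


@dataclass
class CachedAnswer:
    address: str
    fetched_at: float


class CachingClient:
    """Toy client that caches one name-to-address answer.

    honor_ttl=True models a client that re-resolves once the TTL expires;
    honor_ttl=False models a long-lived process that resolves once and
    keeps the answer for as long as it runs.
    """

    def __init__(self, lookup: Callable[[], str],
                 honor_ttl: bool = True, ttl: float = TTL_SECONDS):
        self._lookup = lookup
        self._honor_ttl = honor_ttl
        self._ttl = ttl
        self._cached: Optional[CachedAnswer] = None

    def resolve(self, now: float) -> str:
        expired = (
            self._cached is not None
            and self._honor_ttl
            and now - self._cached.fetched_at >= self._ttl
        )
        if self._cached is None or expired:
            self._cached = CachedAnswer(self._lookup(), now)
        return self._cached.address


# Stand-in for the authoritative A record (documentation-range addresses).
record = {"address": "192.0.2.10"}
browser = CachingClient(lambda: record["address"])
crawler = CachingClient(lambda: record["address"], honor_ttl=False)

browser.resolve(now=0)
crawler.resolve(now=0)
record["address"] = "198.51.100.20"    # failover: the A record changes

print(browser.resolve(now=200))        # still the old address (TTL not yet expired)
print(browser.resolve(now=300))        # the replica's address
print(crawler.resolve(now=3600))       # still the old address an hour later
```

In this model the TTL bounds how long a well-behaved client can keep using the old address, while the client that never re-resolves only picks up the replica after it restarts.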

The replica supports most of the common services on the production ICPSR web site, such as search, download, analyze online, etc, but it does not support services where someone submits content to us, such as the Deposit System.

It is important to note that our replica is intended as a disaster recovery (DR) solution, not a high availability solution. That is, the purpose of the replica is to allow ICPSR to recover quickly from failure, and to avoid a long (e.g., multi-day) period of unavailability. The replica design is not at all a solution for a high-availability web site, one that would never be down even for a second. It would take a significant investment to change the architecture of ICPSR's delivery platform to meet such a requirement.

Monday, December 19, 2011

ICPSR is now hiring!

ICPSR has posted a job description for a software developer to work on a new grant we've received from the Bill and Melinda Gates Foundation. We call it the "MET Extension" project, and the grant was just awarded in November. It's a two-year grant.

The main deliverable of the grant is to extend the technology platform that we're building as part of another BMGF grant - the "MET" project (which is also a two-year grant and was awarded in August 2011). Both projects are largely video-oriented, building systems to stream video content safely and securely to approved researchers.

The lead developer on the original MET project is a gent named Cole Whiteman. Cole will be familiar to many in the local Ann Arbor tech community, and has also given presentations about ICPSR to groups in venues well beyond the borders of Washtenaw County. The person in this position will work closely with Cole.

Careful readers will note that the position is "term limited," which means that we've made it very explicit that the funding for this position stops at the end of 2013. That said, we've been hiring developers steadily over the past seven years as we've added grants to our portfolio, and we haven't had to eliminate any of those positions yet. Despite the economy in Michigan, and beyond all expectations, business is still booming at ICPSR.

Here is a link to the position on the U-M jobs site: http://umjobs.org/job_detail/64682/software_developer_senior

And because that link will break once the job posting goes inactive, here is my "permalink" to the position:

Software Developer Senior

Job Summary

A unit of the Institute for Social Research (ISR) at the University of Michigan, the Inter-university Consortium for Political and Social Research (ICPSR) is an international membership organization, with over 500 institutions from around the world, providing access to empirical social science data for research and teaching. The atmosphere is relaxed and collegial with a work environment that fosters communication and networking, incorporates a diverse staff with varying skills and experiences and offers family-friendly flexibility with work hours.

The University of Michigan (U-M) is currently under contract to archive and disseminate (video and quantitative) data from the first phase of the Measuring Effective Teaching (MET) project, sponsored by the Bill and Melinda Gates Foundation. That project collected quantitative data and classroom video from over 3,000 teacher volunteers during the 2009-10 and 2010-11 school years. Data from that project are being archived and distributed to the social and educational research community through the hosting of a MET Longitudinal Database at the University's Inter-university Consortium for Political and Social Research, the world's largest social science data archive. ICPSR seeks a Software Developer Senior to assist with this project.

**Please note this is a terminal appointment, with an anticipated end date of 12/01/2013.**

Essential responsibilities of this position include: coordination of software development activities with other ICPSR development projects; estimation of task level details and associated delivery timeframes; source code control and version management; release management and coordination with ICPSR staff; documentation production and management; training materials production and management; and, software support and trouble-shooting. Finally, the person in this position will be expected to freshen, broaden, and deepen their professional and technical skills via regular participation in professional development activities such as training, seminars, and tutorials.

NOTE: Part of this job may require some work outside normal working hours to analyze and correct critical problems that arise in ICPSR's 24 hours per day operational environment.

Desired Qualifications*

--Bachelor Degree in Computer Science or Computer Engineering, or the equivalent education and experience is required
--5 or more years of professional software development experience using Java / J2EE
--RDBMS vendor (Oracle, Microsoft, or MySQL) certification preferable
--Sun Java Developer certification preferable
--Extensive knowledge of XML, XSLT, JSON, REST, and SOAP is required
--5 or more years of professional business analyst experience requirement
--Linux systems usage; Windows XP or Vista usage, including common applications such as Word, Excel and Outlook

U-M EEO/AA Statement

The University of Michigan is an equal opportunity/affirmative action employer.

Friday, December 16, 2011

TRAC: A3.5: Seeking and acting on feedback

A3.5 Repository has policies and procedures to ensure that feedback from producers and users is sought and addressed over time.

The repository should be able to demonstrate that it is meeting explicit requirements, that it systematically and routinely seeks feedback from stakeholders to monitor expectations and results, and that it is responsive to the evolution of requirements.

Evidence: A policy that requires a feedback mechanism; a procedure that addresses how the repository seeks, captures, and documents responses to feedback; documentation of workflow for feedback (i.e., how feedback is used and managed); quality assurance records.

I think the market economy that keeps ICPSR in business is the very best evidence that the organization seeks input from its community, and applies that feedback to its operations, content selection, preservation strategies, and nearly every element of its business.

In practice we can see many different types of feedback mechanisms: contract renewals; annual membership renewals; biennial Organizational Representative meetings; regular ICPSR Council meetings; and, regular participation at all sorts of public forums about social science research data, digital preservation, technology, etc. It also happens electronically via social media, a helpdesk where a real person answers the phone and emails, and feedback pages on the main web portal.

In some ways it feels as if this TRAC requirement is aimed at organizations that might be funded by one community, like a national government, but used by a very different community, such as research. In a scenario where the consumers and the payers are different, it is indeed critical that there be some mechanism to collect input, or the repository could enter a kind of "zombie" state where it ceases to serve its community effectively, but the funding organization continues to fund the repository nonetheless.

That said, I do think there is room for improvement in this area for ICPSR. In particular, I think there is a great opportunity to work more closely with the individual data producers, engaging them in the curation process, and making the workflow - from deposit through eventual release - more transparent.

Wednesday, December 14, 2011

Google Music keeps the tunes playing

I started using the new Google Music production service. I hadn't explored Google's previous offering, the Music Beta, all that much, but decided the time was right to dip a toe into the water.

The service has a lot of similarities to iTunes, of course, except one's library is in the cloud rather than on a PC (assuming one isn't using Apple's iCloud). Google gives one free space to store 20k songs. I'm using about 1% of that quota so far.

I like the idea of having a copy of our music in the cloud as an additional backup (or preservation copy), and it is also nice being able to use a standard browser window to manage and play the music. One complaint I have about iTunes is that because it is conventional desktop software, one has to update it from time to time. And this is somewhat more burdensome if one has to switch from a "standard" type of login on Windows to one with administrative rights, and then switch back again.

Google provides a tool which will copy music from one's existing storehouse (mine was an iTunes library). The tool worked well for this purpose, and it did NOT require any administrative rights on my home WinXP (I know, I know) to download, install, and execute. I started the copy one evening, and some 400 songs had been copied into Google Music by the morning. One feature request: It would be fabulous if the Music Manager tool would pull songs directly from a CD.

On the back-end I wonder if Google is using some form of de-duplication to minimize the amount of storage it needs to provision for this service? It must be the case that there would be great overlap between music collections, particularly with the most popular songs, artists, albums, etc. Google does such a good job of squeezing storage efficiency out of GMail; I would expect them to do the same for their music service.
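For readers who haven't run into de-duplication before, here is a minimal Python sketch of the general idea using content-addressed storage. It says nothing about what Google actually does; the DedupStore class and its methods are hypothetical, and it only catches byte-for-byte identical files, so two different rips of the same song would still be stored twice.

```python
import hashlib
from typing import Dict


class DedupStore:
    """Toy store that keeps one copy of each distinct byte string."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}                # digest -> content
        self._libraries: Dict[str, Dict[str, str]] = {}   # user -> title -> digest

    def add(self, user: str, title: str, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)   # stored only if not already present
        self._libraries.setdefault(user, {})[title] = digest
        return digest

    def get(self, user: str, title: str) -> bytes:
        return self._blobs[self._libraries[user][title]]

    def stored_bytes(self) -> int:
        """Bytes actually kept on disk."""
        return sum(len(blob) for blob in self._blobs.values())

    def logical_bytes(self) -> int:
        """Bytes the users believe they have stored."""
        return sum(len(self._blobs[digest])
                   for library in self._libraries.values()
                   for digest in library.values())
```

The gap between logical_bytes() and stored_bytes() is the saving; the more overlap between collections, the bigger that gap gets.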

Monday, December 12, 2011

ICPSR web availability through November 2011

Now that's what I'm talking about! (Click the image to see a more readable chart.)

ICPSR FY 2012 web availability through November 2011

After some shamefully low availability numbers in September and October, we've rebounded nicely in November (over 99.9%).

We saw two main problems in the month.

One was a short outage where our search engine (Solr) faulted and required a restart. We think this is due to a memory leak in Solr, and we are hoping that we can avoid the problem more completely once we move from our older 32-bit web hardware to our new 64-bit machine. I'm hoping this happens before the end of the calendar year.

The other outage was due (we think) to a campus power blip that seemed to cause a fault with our EMC storage appliance. While ICPSR never lost power, and while the machine room has a large, new UPS system, we speculate that the EMC got confused when it lost contact with UMROOT Windows Domain Controllers across campus due to the network path fluctuating. The problem solved itself after 15 minutes, and it was the only anomaly that was coincident with the EMC hanging.
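As a back-of-the-envelope check on what a number like 99.9% means, here is a small Python sketch. The function names are hypothetical, and the 15-minute figure below is used only as an example input, not as a statement of our total downtime for the month.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct: float, days: int) -> float:
    """Minutes of downtime allowed in a period at a given availability."""
    return days * MINUTES_PER_DAY * (1 - availability_pct / 100)


def availability_pct(downtime_minutes: float, days: int) -> float:
    """Availability, as a percentage, for a given amount of downtime."""
    return 100 * (1 - downtime_minutes / (days * MINUTES_PER_DAY))


# November has 30 days, i.e. 43,200 minutes.
print(downtime_budget_minutes(99.9, days=30))  # roughly 43 minutes
print(availability_pct(15, days=30))           # a single 15-minute outage
```

So staying above 99.9% in a 30-day month leaves room for about 43 minutes of downtime in total, and one 15-minute outage on its own uses up roughly a third of that.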

Friday, December 9, 2011

TRAC: A3.4: Formal, periodic review

A3.4 Repository is committed to formal, periodic review and assessment to ensure responsiveness to technological developments and evolving requirements.

Long-term preservation is a shared and complex responsibility. A trusted digital repository contributes to and benefits from the breadth and depth of community-based standards and practice. Regular review is a requisite for ongoing and healthy development of the repository. The organizational context of the repository should determine the frequency of, extent of, and process for self-assessment. The repository must also be able to provide a specific set of requirements it has defined, is maintaining, and is striving to meet. (See also A3.9.)

Evidence: A self-assessment schedule, timetables for review and certification; results of self-assessment; evidence of implementation of review outcomes.

Steve Abrams from the California Digital Library gave an interesting talk earlier this year about the notion of applying a Neighborhood Watch metaphor to digital archives. You can find a PDF of the slideshow here.

This is a nice paradigm, and it fits well with some of the work ICPSR is doing with its Data-PASS partners. We're using the Stanford Lots of Copies Keep Stuff Safe (LOCKSS) software in a Private LOCKSS Network (PLN) to build a distributed archival storage network. And in addition to the PLN, we have also built tools to verify the integrity of the PLN and its content. We call this additional layer the SAFE-Archive, and the development has been led by the Odum Institute at the University of North Carolina.

I also see ICPSR assess itself regularly in response to opportunities to expand its reach thematically or technologically. For example, as ICPSR enters the world of digital preservation for video as part of two recent grants from the Bill and Melinda Gates Foundation, it is driven to re-evaluate how it manages content.

I'm not sure that these types of activities are as formal as the TRAC requirement might like, and so the action item might look more like a documentation project rather than adding a new activity into ICPSR's standard operating procedures.

Wednesday, December 7, 2011

November 2011 deposits at ICPSR

Chart? Chart.[1]

# of files	# of deposits	File format
1	1	application/msaccess
37	14	application/msword
14	3	application/octet-stream
161	24	application/pdf
16	1	application/postscript
440	10	application/vnd.ms-excel
1	1	application/vnd.ms-powerpoint
106	1	application/x-arcview
53	1	application/x-dbase
10	2	application/x-dosexec
1	1	application/x-executable, dynamically linked (uses shared libs), not stripped
1	1	application/x-rar
20	9	application/x-sas
8	1	application/x-sharedlib, not stripped
4	1	application/x-shellscript
9667	33	application/x-spss
23	4	application/x-stata
5	3	application/x-zip
12	1	image/gif
1	1	image/jpeg
2	1	image/x-xpm
2	2	message/rfc822
397	12	text/html
11	8	text/plain; charset=iso-8859-1
13	8	text/plain; charset=unknown
1482	42	text/plain; charset=us-ascii
16	3	text/rtf
4	1	text/x-c++; charset=us-ascii
8	2	text/x-c; charset=us-ascii
1	1	text/xml

A relatively heavy month for deposits, this November 2011. The number of SPSS files deposited is really impressive, and looks to be the result of a small number of deposits, but with many data files. Quite a bit of Excel too; more than SAS and Stata combined.

We also have the usual fishy-looking items that have been auto-detected as C or C++ source code, and which are actually text/plain (I suspect). If the setups for a stat package contain the right types of comments in the right places, the file utility is easy to fool.
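The comparisons above are easy to check against the table. A quick Python sketch, with the four relevant rows copied in by hand (the variable names are just for illustration):

```python
# (number of files, number of deposits) for four rows of the table above
rows = {
    "application/vnd.ms-excel": (440, 10),
    "application/x-sas": (20, 9),
    "application/x-spss": (9667, 33),
    "application/x-stata": (23, 4),
}

excel_files = rows["application/vnd.ms-excel"][0]
sas_and_stata_files = rows["application/x-sas"][0] + rows["application/x-stata"][0]
spss_files, spss_deposits = rows["application/x-spss"]

print(excel_files > sas_and_stata_files)   # Excel versus SAS and Stata combined
print(spss_files / spss_deposits)          # average SPSS files per deposit
```

Excel's 440 files easily beat the 43 from SAS and Stata together, and the SPSS files average out to nearly 300 per deposit.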

The dBase and ArcView files are an interesting add to this month's listing. We don't see too many of those.

[1] This is a very small homage to mgoblog.

Monday, December 5, 2011

Collaborators, not depositors

ICPSR should stop accepting deposits.

Instead ICPSR should be recruiting collaborators.

To be sure ICPSR receives a great deal of its content via US Government agencies who have decided to outsource the digital preservation of their content to a trustworthy repository like ICPSR. In this case the relevant contract, grant, or inter-agency agreement makes it clear what content will be coming to ICPSR to be curated and preserved. In some cases the agency has little interest in depositing content ("Isn't that what we pay you for?"), and so the formal act of depositing content falls to the ICPSR staff anyway.

However, we also receive a considerable volume of content through our web portal where the depositor is external. Sometimes we have worked hard to acquire the content, and the deposit is one milestone on a very long road, but other times the content comes to us unsolicited. (I like to call these "drive-by deposits.")

In some cases the depositor is quite eager and able to help ICPSR with much of the curation work: drafting rich descriptive metadata; organizing survey data and documentation into coherent groups; packaging other types of content into logical bundles (such as with our Publication-Related Archive); and, reviewing the data for possible disclosure risks. Depositors may have access to resources like graduate students who can help with these tasks, and if the depositor is also the data producer, then s/he has valuable, unique insight into the data and documentation. Unfortunately ICPSR is not well poised to tap into that expertise and those resources.

What would it take to get there?

ICPSR could separate the transactional step of submitting content (i.e., file upload concurrent with signature) from the iterative step of preparing metadata applicable to the submitted content. In fact, one could even prepare metadata well before the submission transaction if the data producer had the interest and resources to prepare that information, but was not quite ready to share the data yet. And, it would be equally permissible to submit the data for preservation and sharing, and then build the metadata slowly during the weeks and months following the upload.

If the data producer could also export the metadata in machine actionable formats, say, DDI XML for content which maps well to the classic "study" object that ICPSR has curated and preserved for decades, then there may be additional value to the producer. And introducing the structure that comes along with an XML schema like DDI might also be valuable to the producer in terms of thinking about and organizing the documentation, even for his/her own use.

In this world the ICPSR deposit system becomes a much shorter, much simpler web application. And the ICPSR data management infrastructure would need to be opened up -- but with serious access controls -- so that content providers could access, create, and revise their documentation and metadata. But the best thing about this world is that ICPSR gains a lot of collaborators, some who would be quite eager to work with us, I think.

Friday, December 2, 2011

TRAC: A3.3: Permission to preserve

A3.3 Repository maintains written policies that specify the nature of any legal permissions required to preserve digital content over time, and repository can demonstrate that these permissions have been acquired when needed.

Because the right to change or alter digital information is often restricted by law to the creator, it is important that digital repositories address the need to be able to work with and potentially modify digital objects to keep them accessible over time. Repositories should have written policies and agreements with depositors that specify and/or transfer certain rights to the repository enabling appropriate and necessary preservation actions to take place on the digital objects within the repository.

Because legal negotiations can take time, potentially slowing or preventing the ingest of digital objects at risk, a digital repository may take in or accept digital objects even with only minimal preservation rights using an open-ended agreement and address more detailed rights later. A repository’s rights must at least limit the repository’s liability or legal exposure that threatens the repository itself. A repository does not have sufficient control of the information if the repository itself is legally at risk.

Evidence: Deposit agreements; records schedule; digital preservation policies; records legislation and policies; service agreements.

ICPSR has a standard agreement that it uses for all deposits. This agreement grants ICPSR the non-exclusive right to replicate the content for preservation purposes and to deliver the content on our web site. This language resides inside of our Deposit Form web application.

This works very well for deposits that come from a known source, such as a government agency with whom we have an agreement to preserve and deliver content, or an individual researcher with whom we have been corresponding. In this case we have a good sense for who the depositor is, the role they play with regard to the data, and the mechanisms by which we can contact him/her.

Things become a bit messier with what I will call a "drive-by deposit." This is an unsolicited, unexpected deposit, and in this case the depositor agrees to give us permission to make copies of the content for digital preservation purposes and to deliver the content via our web portal. That said, ICPSR does not require strong identities to execute a deposit, and so one could ask the question: How does ICPSR know that the depositor himself/herself has the authority to grant us rights to preserve and redistribute the content?