A3.7 Repository commits to transparency and accountability in all actions supporting the
operation and management of the repository, especially those that affect the preservation
of digital content over time.
Transparency is the best assurance that the repository operates in accordance with accepted standards and
practice. Transparency is essential to accountability, and both are achieved through active, ongoing
documentation. The repository should be able to document its efforts to make information about its
development, implementation, evolution, and performance available and accessible to relevant
stakeholders. The usual means of communication an organization uses to provide significant news and
updates to stakeholders should suffice for meeting this requirement.
Evidence: Comprehensive documentation that is readily accessible to stakeholders; unhindered access to
content and associated information within repository.
This might be an easy requirement for ICPSR to meet. Since we're an "active archive" rather than a "dark archive" (as some would say), we have regular interaction with our designated community. This takes place in the form of downloads from our web site, mailings and blog posts about ICPSR news and events, regular meetings with our Council and our Organizational Representatives, and many other channels.
ICPSR's internal workings are documented in great detail, both on paper and on our Intranet. These documents serve to train new employees, as well as the students and interns who join us each summer. And since there is always room for improvement, we are working to increase transparency as we develop FLAME, our next-generation, file-level archive management system.
News and commentary about new technology-related projects under development at ICPSR
Wednesday, December 28, 2011
Starting the FLAME
In an earlier post I described a major new project at ICPSR called FLAME. FLAME is the File-Level Archival Management Engine, and it will become the new repository technology platform ICPSR uses to curate and preserve content. As the name implies, the main molecule of information upon which FLAME will operate is the "file," which differs from the main molecules used at ICPSR today: the "deposit" and the "study." In the big picture the activities at ICPSR will not change much: we will still collect social science research data, curate them, preserve them, and make them available in a wide variety of formats and modes. But when one looks at the details, an awful lot will change.
So when one is going to change everything, where does one start?
Fortunately we have a ready-made starting point with the Open Archival Information System (OAIS) reference model. While this does not give us a blueprint of what to build, it does give us a model to use as we construct our blueprints. I believe this is very much what the folks at Archivematica have done.
So the question becomes: How do we translate a high-level reference model that contains functions such as Receive Submission into the low-level blueprints one needs to reconfigure process and build software? What kind of web applications do I need for Receive Submission? What should they do? Should that box that contains the submitter's identity be an email address? A text string? An ORCID?
So how to start?
One of my colleagues, Nancy McGovern, suggested we brainstorm 6-12 medium-level statements for each of the functions in the OAIS reference model. We started with Receive Submission, and indeed generated 12 statements. (The analogue at Archivematica is Receipt of SIP.) One example is:
The producer provided basic provenance information at deposit
If the metaphor for building FLAME is building a house, then OAIS plays the role of high-level best practices. The statements (like above) play the role of floor plans and elevations; those things to which most people can relate and make decisions. So this is moving in the right direction, but we're still lacking the blueprints.
The next step is to take a statement like the one above and turn it into requirements for software (and for process). One example requirement that flows from the statement above is:
FLAME should capture the following provenance information from the files after each content transfer:
i. Date and time at which each file is received
ii. Checksum of each file
iii. MIME type of each file
iv. Original name of each file
v. Packaging information (e.g., file was part of a Zip archive)
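To make the requirement concrete, here is a sketch of what capturing those five fields might look like. The function and field names are my own illustration, not FLAME's actual interface (which is still being designed), and MD5 stands in for whatever checksum algorithm the team ultimately chooses:

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path

def capture_provenance(path, packaging=None):
    """Record the five provenance fields for one received file.

    A sketch only: field names are illustrative, and MD5 is a
    placeholder pending the checksum decision (MD5 vs. SHA-1 vs. other).
    """
    p = Path(path)
    data = p.read_bytes()
    return {
        "received_at": datetime.now(timezone.utc).isoformat(),  # i. date/time
        "checksum_md5": hashlib.md5(data).hexdigest(),          # ii. checksum
        "mime_type": mimetypes.guess_type(p.name)[0]            # iii. MIME type
                     or "application/octet-stream",
        "original_name": p.name,                                # iv. original name
        "packaging": packaging,  # v. e.g. "member of deposit-123.zip"
    }
```

A record like this, written at the moment of transfer, is exactly the sort of artifact a software developer and the acquisitions team can argue about productively.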
Right now we are working through the details of Receive Submission, and the next few stops on the roadmap will likely be in Ingest as well. We're documenting both the high-level statements and the low-level requirements in a Drupal CMS that we use as our Intranet.
Monday, December 26, 2011
Tech@ICPSR takes a holiday
As far as the University of Michigan is concerned, today is Christmas Day. (I know this. It says so on my timesheet.) Tech@ICPSR had too much egg nog and is taking the day off.
Friday, December 23, 2011
TRAC: A3.6: Change logs
A3.6 Repository has a documented history of the changes to its operations, procedures,
software, and hardware that, where appropriate, is linked to relevant preservation
strategies and describes potential effects on preserving digital content.
The repository must document the full range of its activities and developments over time, including decisions about the organizational and technological infrastructure. If the repository uses software to document this history, it should be able to demonstrate this tracking.
Evidence: Policies, procedures, and results of changes that affect all levels of the repository: objects, aggregations of objects; object-level preservation metadata; repository’s records retention strategy document.
I'll focus on the technology pieces of this story.
The technology team maintains a change log of major (and not-so-major) system changes. Moving a chunk of storage from Network Attached Storage appliance A to B? It's in the log. Upgrading the hardware of the web server we use to stage new content and software? It's in the log. Updating business productivity software to enforce newly declared business rules that affect how it should work? It's in the log.
We do a pretty good job overall recording technology changes in a well-known, recorded space. (Our Intranet is hosted in the Drupal CMS by the U-M central IT organization.) Of course, there is always room for improvement, but the big stuff gets documented. For instance, I know that we don't always record changes to desktop workstations (e.g., Windows patches) in the change log, even though we do generate an announcement via email.
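The entries themselves are simple. A minimal sketch of the shape of one, assuming an append-only log (the real log lives in our Drupal Intranet, and the field names here are illustrative); note the field that links a change to its preservation impact, which is what TRAC A3.6 asks for:

```python
from datetime import date

def log_change(log, system, summary, preservation_impact=None):
    """Append one dated change-log entry to an in-memory log.

    The preservation_impact field records, where appropriate, how the
    change affects preserved content -- the linkage A3.6 calls for.
    """
    entry = {
        "date": date.today().isoformat(),
        "system": system,
        "summary": summary,
        "preservation_impact": preservation_impact,
    }
    log.append(entry)
    return entry

changelog = []
log_change(changelog, "NAS-A",
           "Moved archive volume from appliance A to appliance B",
           preservation_impact="physical location of stored objects changed")
log_change(changelog, "staging-web", "Upgraded server hardware")
```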
Labels:
archival storage,
digital preservation,
infrastructure,
trac
Wednesday, December 21, 2011
Amazon's new AWS icons
Below is a completely unreadable schematic of ICPSR's replica of its web infrastructure in Amazon's cloud. This is just one way that we're using the cloud, and Amazon in particular.
Two weeks ago, Amazon released a nice set of icons for use in common drawing and presentation software. The set contains an icon for each of the Amazon Web Services (AWS) services and types of infrastructure, and it also contains generic, gray icons for non-AWS elements. I used the icons to create the schematic above.
The diagram is based on one of the examples Amazon includes in the PPTX-format set of icons. I needed to delete a few services and servers that we don't use (e.g., Route 53 for DNS). The diagram shows the ICPSR machine room on the left, and the three main systems that deliver our production web service: a big web server, an even bigger database server, and an even bigger still EMC storage appliance. We synchronize the content from these systems into corresponding systems in the AWS cloud.
We use EC2 instances in the US-East region to host our replica. Unlike physical hardware where we sometimes host multiple IP addresses on a single machine, we maintain a one-to-one mapping between virtual machines and IP addresses in EC2. And so one physical web server in ICPSR's machine room ends up as a pair of virtual servers in Amazon's cloud.
We initiate a failover by changing the DNS A (address) record for www.icpsr.umich.edu and www.cceerc.org. This change can take place on either a physical DNS server located at ICPSR or a virtual DNS server located in AWS. The time-to-live (TTL) is very low, only 300 seconds, so once we initiate the failover procedure, web browsers will start using the replica very soon. (However, we have noticed that long-lived processes which do not regularly refresh name-to-address resolution, such as crawlers, take much longer to fail over.)
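The mechanics are simple enough to sketch. The hostnames below are real, but the addresses and zone structure are illustrative placeholders, not our actual records; the point is that the TTL bounds how long a well-behaved resolver can keep serving the old address after the swap:

```python
# Illustrative zone data -- addresses are RFC 5737 documentation
# ranges, not ICPSR's real records.
ZONE = {
    "www.icpsr.umich.edu": {"a": "192.0.2.10", "ttl": 300},  # production
    "www.cceerc.org":      {"a": "192.0.2.11", "ttl": 300},
}
REPLICA = {
    "www.icpsr.umich.edu": "198.51.100.10",  # EC2 replica
    "www.cceerc.org":      "198.51.100.11",
}

def fail_over(zone, replica):
    """Point each A record at its replica address.

    Returns the worst-case propagation delay: a compliant resolver may
    serve the cached (old) address for up to one full TTL after the change.
    """
    for name, addr in replica.items():
        zone[name]["a"] = addr
    return max(rec["ttl"] for rec in zone.values())

worst_case_seconds = fail_over(ZONE, REPLICA)
```

With a 300-second TTL, clients that honor DNS caching rules converge on the replica within five minutes; the crawlers mentioned above misbehave precisely because they resolve once and never again.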
The replica supports most of the common services on the production ICPSR web site, such as search, download, analyze online, etc, but it does not support services where someone submits content to us, such as the Deposit System.
It is important to note that our replica is intended as a disaster recovery (DR) solution, not a high availability solution. That is, the purpose of the replica is to allow ICPSR to recover quickly from failure, and to avoid a long (e.g., multi-day) period of unavailability. The replica design is not at all a solution for a high-availability web site, one that would never be down even for a second. It would take a significant investment to change the architecture of ICPSR's delivery platform to meet such a requirement.
Monday, December 19, 2011
ICPSR is now hiring!
ICPSR has posted a job description for a software developer to work on a new grant we've received from the Bill and Melinda Gates Foundation. We call it the "MET Extension" project, and the grant was just awarded in November. It's a two-year grant.
The main deliverable of the grant is to extend the technology platform that we're building as part of another BMGF grant - the "MET" project (which is also a two-year grant and was awarded in August 2011). Both projects are largely video-oriented, building systems to stream video content safely and securely to approved researchers.
The lead developer on the original MET project is a gent named Cole Whiteman. Cole will be familiar to many in the local Ann Arbor tech community, and has also given presentations about ICPSR to groups in venues well beyond the borders of Washtenaw County. The person in this position will work closely with Cole.
Careful readers will note that the position is "term limited," which means that we've made it very explicit that the funding for this position stops at the end of 2013. That said, we've been hiring developers steadily over the past seven years as we've added grants to our portfolio, and we haven't had to eliminate any of those positions yet. Against all expectations, given the economy in Michigan, business is still booming at ICPSR.
Here is a link to the position on the U-M jobs site: http://umjobs.org/job_detail/64682/software_developer_senior
And because that link will break once the job posting goes inactive, here is my "permalink" to the position:
Software Developer Senior
Job Summary
A unit of the Institute for Social Research (ISR) at the University of Michigan, the Inter-university Consortium for Political and Social Research (ICPSR) is an international membership organization, with over 500 institutions from around the world, providing access to empirical social science data for research and teaching. The atmosphere is relaxed and collegial, with a work environment that fosters communication and networking, draws on a diverse staff with varying skills and experiences, and offers family-friendly flexibility in work hours.
The University of Michigan (U-M) is currently under contract to archive and disseminate (video and quantitative) data from the first phase of the Measuring Effective Teaching (MET) project, sponsored by the Bill and Melinda Gates Foundation. That project collected quantitative data and classroom video from over 3,000 teacher volunteers during the 2009-10 and 2010-11 school years. Data from that project are being archived and distributed to the social and educational research community through the hosting of a MET Longitudinal Database at the University's Inter-university Consortium for Political and Social Research, the world's largest social science data archive. ICPSR seeks a Software Developer Senior to assist with this project.
**Please note this is a terminal appointment, with an anticipated end date of 12/01/2013.**
Essential responsibilities of this position include: coordination of software development activities with other ICPSR development projects; estimation of task level details and associated delivery timeframes; source code control and version management; release management and coordination with ICPSR staff; documentation production and management; training materials production and management; and, software support and trouble-shooting. Finally, the person in this position will be expected to freshen, broaden, and deepen their professional and technical skills via regular participation in professional development activities such as training, seminars, and tutorials.
NOTE: Part of this job may require some work outside normal working hours to analyze and correct critical problems that arise in ICPSR's 24 hours per day operational environment.
Desired Qualifications*
--A Bachelor's degree in Computer Science or Computer Engineering, or the equivalent education and experience, is required
--5 or more years of professional software development experience using Java / J2EE
--RDBMS vendor (Oracle, Microsoft, or MySQL) certification preferable
--Sun Java Developer certification preferable
--Extensive knowledge of XML, XSLT, JSON, REST, and SOAP is required
--5 or more years of professional business analyst experience required
--Linux systems usage; Windows XP or Vista usage, including common applications such as Word, Excel and Outlook
U-M EEO/AA Statement
The University of Michigan is an equal opportunity/affirmative action employer.
Friday, December 16, 2011
TRAC: A3.5: Seeking and acting on feedback
A3.5 Repository has policies and procedures to ensure that feedback from producers and
users is sought and addressed over time.
The repository should be able to demonstrate that it is meeting explicit requirements, that it systematically and routinely seeks feedback from stakeholders to monitor expectations and results, and that it is responsive to the evolution of requirements.
Evidence: A policy that requires a feedback mechanism; a procedure that addresses how the repository seeks, captures, and documents responses to feedback; documentation of workflow for feedback (i.e., how feedback is used and managed); quality assurance records.
I think the market economy that keeps ICPSR in business is the very best evidence that the organization seeks input from its community, and applies that feedback to its operations, content selection, preservation strategies, and nearly every element of its business.
In practice we can see many different types of feedback mechanisms: contract renewals; annual membership renewals; biennial Organizational Representative meetings; regular ICPSR Council meetings; and regular participation at all sorts of public forums about social science research data, digital preservation, technology, etc. It also happens electronically via social media, a helpdesk where a real person answers phone calls and emails, and feedback pages on the main web portal.
In some ways it feels as if this TRAC requirement is aimed at organizations that might be funded by one community, like a national government, but used by a very different community, such as research. In a scenario where the consumers and the payers are different, it is indeed critical that there be some mechanism to collect input, or the repository could enter a kind of "zombie" state where it ceases to serve its community effectively, but the funding organization continues to fund the repository nonetheless.
That said, I do think there is room for improvement in this area for ICPSR. In particular, I think there is a great opportunity to work more closely with the individual data producers, engaging them in the curation process, and making the workflow - from deposit through eventual release - more transparent.
Labels:
archival storage,
digital preservation,
infrastructure,
trac
Wednesday, December 14, 2011
Google Music keeps the tunes playing
I started using the new Google Music production service. I hadn't explored Google's previous offering, the Music Beta, all that much, but decided the time was right to dip a toe into the water.
The service has a lot of similarities to iTunes, of course, except one's library is in the cloud rather than on a PC (assuming one isn't using Apple's iCloud). Google gives one free space to store 20k songs. I'm using about 1% of that quota so far.
I like the idea of having a copy of our music in the cloud as an additional backup (or preservation copy), and it is also nice being able to use a standard browser window to manage and play the music. One complaint I have about iTunes is that because it is conventional desktop software, one has to update it from time to time. And this is somewhat more burdensome if one has to switch from a "standard" type of login on Windows to one with administrative rights, and then switch back again.
Google provides a tool which will copy music from one's existing storehouse (mine was an iTunes library). The tool worked well for this purpose, and it did NOT require any administrative rights on my home WinXP (I know, I know) to download, install, and execute. I started the copy one evening, and some 400 songs had been copied into Google Music by the morning. One feature request: It would be fabulous if the Music Manager tool would pull songs directly from a CD.
On the back-end I wonder if Google is using some form of de-duplication to minimize the amount of storage it needs to provision for this service. It must be the case that there is great overlap between music collections, particularly for the most popular songs, artists, and albums. Google does such a good job of squeezing storage efficiency out of GMail; I would expect them to do the same for their music service.
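Content-addressed storage is the usual way to get that kind of de-duplication. This is pure speculation about Google's back-end, of course, but the idea is easy to sketch: store each unique blob once under its hash, and let each user's library hold only pointers:

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: identical files are stored once.

    A speculative sketch of the technique, not a claim about how
    Google Music actually works.
    """
    def __init__(self):
        self.blobs = {}     # sha256 digest -> bytes, stored at most once
        self.library = {}   # (user, title) -> digest pointer

    def add(self, user, title, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)   # new bytes only if unseen
        self.library[(user, title)] = digest
        return digest

store = DedupStore()
track = b"...the same popular mp3 bytes..."
store.add("alice", "Popular Song", track)
store.add("bob", "Popular Song", track)   # second upload costs a pointer, not a copy
```

For the most popular tracks, the marginal storage cost of the millionth copy is a few bytes of bookkeeping rather than another few megabytes of audio.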
Monday, December 12, 2011
ICPSR web availability through November 2011
Now that's what I'm talking about! (Click the image to see a more readable chart.)
After some shamefully low availability numbers in September and October, we've rebounded nicely in November (over 99.9%).
We saw two main problems in the month.
One was a short outage where our search engine (Solr) faulted and required a restart. We think this is due to a memory leak in Solr, and we are hoping that we can avoid the problem more completely once we move from our older 32-bit web hardware to our new 64-bit machine. I'm hoping this happens before the end of the calendar year.
The other outage was due (we think) to a campus power blip that seemed to cause a fault with our EMC storage appliance. While ICPSR never lost power, and while the machine room has a large, new UPS system, we speculate that the EMC got confused when a fluctuating network path caused it to lose contact with the UMROOT Windows Domain Controllers across campus. The problem resolved itself after about 15 minutes, and that loss of contact was the only anomaly coincident with the EMC hanging.
ICPSR FY 2012 web availability through November 2011
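For what it's worth, the availability math behind the chart is simple: uptime divided by total time. A sketch using the roughly 15-minute EMC hang described above and an assumed 10-minute duration for the Solr restart (the Solr figure is my guess, not a measured value):

```python
# Availability = uptime / total time, for a 30-day November.
MINUTES_IN_NOVEMBER = 30 * 24 * 60  # 43,200 minutes

outages = {
    "solr restart": 10,      # assumed duration, minutes
    "emc storage hang": 15,  # the post cites roughly 15 minutes
}

downtime = sum(outages.values())
availability = 100 * (MINUTES_IN_NOVEMBER - downtime) / MINUTES_IN_NOVEMBER
print(f"{availability:.3f}%")  # 99.942%
```

Even 25 minutes of downtime in a month keeps us comfortably above the 99.9% line; the shameful September and October numbers required outages measured in hours.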
Friday, December 9, 2011
TRAC: A3.4: Formal, periodic review
A3.4 Repository is committed to formal, periodic review and assessment to ensure
responsiveness to technological developments and evolving requirements.
Long-term preservation is a shared and complex responsibility. A trusted digital repository contributes to and benefits from the breadth and depth of community-based standards and practice. Regular review is a requisite for ongoing and healthy development of the repository. The organizational context of the repository should determine the frequency of, extent of, and process for self-assessment. The repository must also be able to provide a specific set of requirements it has defined, is maintaining, and is striving to meet. (See also A3.9.)
Evidence: A self-assessment schedule, timetables for review and certification; results of self-assessment; evidence of implementation of review outcomes.
Steve Abrams from the California Digital Library gave an interesting talk earlier this year about the notion of applying a Neighborhood Watch metaphor to digital archives. You can find a PDF of the slideshow here.
This is a nice paradigm, and it fits well with some of the work ICPSR is doing with its Data-PASS partners. We're using the Stanford Lots of Copies Keep Stuff Safe (LOCKSS) software in a Private LOCKSS Network (PLN) to build a distributed archival storage network. And in addition to the PLN, we have also built tools to verify the integrity of the PLN and its content. We call this additional layer the SAFE-Archive, and the development has been led by the Odum Institute at the University of North Carolina.
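The core of that integrity verification is a fixity audit: compare each replica's checksum against a manifest of known-good values and flag any copy that disagrees. A toy sketch (the data layout and names are invented for illustration, not SAFE-Archive's actual design):

```python
import hashlib

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def audit(manifest: dict, replicas: dict) -> list:
    """Compare each replica's copy against the manifest; return failures."""
    failures = []
    for path, expected in manifest.items():
        for site, files in replicas.items():
            got = files.get(path)
            if got is None or sha256_of(got) != expected:
                failures.append((site, path))
    return failures

# Toy example: one object, one good copy, one corrupted copy.
payload = b"survey responses, wave 1"
manifest = {"study-1234/data.txt": sha256_of(payload)}
replicas = {
    "icpsr": {"study-1234/data.txt": payload},
    "odum": {"study-1234/data.txt": b"corrupted bits"},
}
print(audit(manifest, replicas))  # flags only the bad copy at "odum"
```

The Neighborhood Watch idea is essentially this loop run across institutions: each site audits the others' copies, so no single archive has to be trusted to check itself.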
I also see ICPSR assess itself periodically in response to opportunities to expand its reach thematically or technologically. For example, entering the world of digital preservation for video as part of two recent grants from the Bill and Melinda Gates Foundation has driven ICPSR to re-evaluate how it manages content.
I'm not sure that these types of activities are as formal as the TRAC requirement might like, and so the action item might look more like a documentation project rather than adding a new activity into ICPSR's standard operating procedures.
Labels:
archival storage,
digital preservation,
infrastructure,
trac
Wednesday, December 7, 2011
November 2011 deposits at ICPSR
Chart? Chart.[1]
A relatively heavy month for deposits, this November 2011. The number of SPSS files deposited is really impressive, and look to be the result of a small number of deposits, but with many data files. Quite a bit of Excel too; more than SAS and Stata combined.
We also have the usual fishy-looking items that have been auto-detected as C or C++ source code but are actually text/plain (I suspect). If the setup files for a stat package contain the right types of comments in the right places, the file utility is easy to fool.
The dBase and ArcView files are an interesting addition to this month's listing. We don't see too many of those.
[1] This is a very small homage to mgoblog.
# of files | # of deposits | File format |
1 | 1 | application/msaccess |
37 | 14 | application/msword |
14 | 3 | application/octet-stream |
161 | 24 | application/pdf |
16 | 1 | application/postscript |
440 | 10 | application/vnd.ms-excel |
1 | 1 | application/vnd.ms-powerpoint |
106 | 1 | application/x-arcview |
53 | 1 | application/x-dbase |
10 | 2 | application/x-dosexec |
1 | 1 | application/x-executable, dynamically linked (uses shared libs), not stripped |
1 | 1 | application/x-rar |
20 | 9 | application/x-sas |
8 | 1 | application/x-sharedlib, not stripped |
4 | 1 | application/x-shellscript |
9667 | 33 | application/x-spss |
23 | 4 | application/x-stata |
5 | 3 | application/x-zip |
12 | 1 | image/gif |
1 | 1 | image/jpeg |
2 | 1 | image/x-xpm 7bit |
2 | 2 | message/rfc822 7bit |
397 | 12 | text/html |
11 | 8 | text/plain; charset=iso-8859-1 |
13 | 8 | text/plain; charset=unknown |
1482 | 42 | text/plain; charset=us-ascii |
16 | 3 | text/rtf |
4 | 1 | text/x-c++; charset=us-ascii |
8 | 2 | text/x-c; charset=us-ascii |
1 | 1 | text/xml |
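The C/C++ misdetections in the table happen because file(1) keys on surface features like `/* ... */` comment syntax. One way to catch them, sketched below, is a second-pass heuristic that demands more C-like structure before accepting the label. This is purely an illustration, not what ICPSR's pipeline actually does:

```python
import re

def looks_like_c(text: str) -> bool:
    """Crude second-pass check: real C source usually has includes,
    braces, and semicolon-terminated statements, not just comments."""
    has_include = bool(re.search(r"^\s*#\s*include\b", text, re.M))
    braces = text.count("{") + text.count("}")
    semicolons = text.count(";")
    return has_include or (braces >= 2 and semicolons >= 2)

# A stat-package setup file that file(1) might flag as C source
# because of its leading comment block.
spss_setup = "/* SPSS setup for study 1234 */\nDATA LIST FILE='da1234.txt'.\n"
c_source = "#include <stdio.h>\nint main(void) { return 0; }\n"
print(looks_like_c(spss_setup), looks_like_c(c_source))  # False True
```

A heuristic like this would reclassify the suspect text/x-c rows above as text/plain while leaving genuine source code alone.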
Monday, December 5, 2011
Collaborators, not depositors
ICPSR should stop accepting deposits.
Instead ICPSR should be recruiting collaborators.
To be sure, ICPSR receives a great deal of its content from US Government agencies that have decided to outsource the digital preservation of their content to a trustworthy repository like ICPSR. In these cases the relevant contract, grant, or inter-agency agreement makes it clear what content will be coming to ICPSR to be curated and preserved. In some cases the agency has little interest in depositing content ("Isn't that what we pay you for?"), and so the formal act of depositing content falls to the ICPSR staff anyway.
However, we also receive a considerable volume of content through our web portal where the depositor is external. Sometimes we have worked hard to acquire the content, and the deposit is one milestone on a very long road, but other times the content comes to us unsolicited. (I like to call these "drive-by deposits.")
In some cases the depositor is quite eager and able to help ICPSR with much of the curation work: drafting rich descriptive metadata; organizing survey data and documentation into coherent groups; packaging other types of content into logical bundles (such as with our Publication-Related Archive); and, reviewing the data for possible disclosure risks. Depositors may have access to resources like graduate students who can help with these tasks, and if the depositor is also the data producer, then s/he has valuable, unique insight into the data and documentation. Unfortunately ICPSR is not well poised to tap into that expertise and those resources.
What would it take to get there?
ICPSR could separate the transactional step of submitting content (i.e., file upload concurrent with signature) from the iterative step of preparing metadata applicable to the submitted content. In fact, one could even prepare metadata well before the submission transaction if the data producer had the interest and resources to prepare that information, but was not quite ready to share the data yet. And, it would be equally permissible to submit the data for preservation and sharing, and then build the metadata slowly during the weeks and months following the upload.
If the data producer could also export the metadata in machine actionable formats, say, DDI XML for content which maps well to the classic "study" object that ICPSR has curated and preserved for decades, then there may be additional value to the producer. And introducing the structure that comes along with an XML schema like DDI might also be valuable to the producer in terms of thinking about and organizing the documentation, even for his/her own use.
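A machine-actionable export along those lines could be as simple as serializing the study-level metadata to XML. The sketch below uses a few DDI-flavored element names (titl, IDNo, abstract) but is deliberately minimal; real DDI instances are namespaced and far richer, and the study dictionary here is invented:

```python
import xml.etree.ElementTree as ET

def study_to_ddi(study: dict) -> str:
    """Serialize a minimal, DDI-flavored codeBook. Element names are
    simplified for illustration; real DDI is richer and namespaced."""
    codebook = ET.Element("codeBook")
    stdy = ET.SubElement(codebook, "stdyDscr")
    citation = ET.SubElement(stdy, "citation")
    title_stmt = ET.SubElement(citation, "titlStmt")
    ET.SubElement(title_stmt, "titl").text = study["title"]
    ET.SubElement(title_stmt, "IDNo").text = study["id"]
    info = ET.SubElement(stdy, "stdyInfo")
    ET.SubElement(info, "abstract").text = study["abstract"]
    return ET.tostring(codebook, encoding="unicode")

xml = study_to_ddi({
    "id": "ICPSR-00000",
    "title": "Example Survey, 2011",
    "abstract": "A placeholder study description.",
})
print(xml)
```

The structure an XML schema imposes is the point: even this stub forces the producer to separate a citation from an abstract, which is a first step toward organized documentation.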
In this world the ICPSR deposit system becomes a much shorter, much simpler web application. And the ICPSR data management infrastructure would need to be opened up -- but with serious access controls -- so that content providers could access, create, and revise their documentation and metadata. But the best thing about this world is that ICPSR gains a lot of collaborators, some who would be quite eager to work with us, I think.
Friday, December 2, 2011
TRAC: A3.3: Permission to preserve
A3.3 Repository maintains written policies that specify the nature of any legal permissions
required to preserve digital content over time, and repository can demonstrate that these
permissions have been acquired when needed.
Because the right to change or alter digital information is often restricted by law to the creator, it is important that digital repositories address the need to be able to work with and potentially modify digital objects to keep them accessible over time. Repositories should have written policies and agreements with depositors that specify and/or transfer certain rights to the repository enabling appropriate and necessary preservation actions to take place on the digital objects within the repository.
Because legal negotiations can take time, potentially slowing or preventing the ingest of digital objects at risk, a digital repository may take in or accept digital objects even with only minimal preservation rights using an open-ended agreement and address more detailed rights later. A repository’s rights must at least limit the repository’s liability or legal exposure that threatens the repository itself. A repository does not have sufficient control of the information if the repository itself is legally at risk.
Evidence: Deposit agreements; records schedule; digital preservation policies; records legislation and policies; service agreements.
ICPSR has a standard agreement that it uses for all deposits. This agreement grants ICPSR the non-exclusive right to replicate the content for preservation purposes and to deliver the content on our web site. This language resides inside our Deposit Form web application.
This works very well for deposits that come from a known source, such as a government agency with whom we have an agreement to preserve and deliver content, or an individual researcher with whom we have been corresponding. In this case we have a good sense for who the depositor is, the role they play with regard to the data, and the mechanisms by which we can contact him/her.
Things become a bit messier with what I will call a "drive-by deposit." This is an unsolicited, unexpected deposit, and in this case the depositor agrees to give us permission to make copies of the content for digital preservation purposes and to deliver the content via our web portal. That said, ICPSR does not require strong identities to execute a deposit, and so one could ask: how does ICPSR know that the depositor actually has the authority to grant us rights to preserve and redistribute the content?
Labels:
archival storage,
digital preservation,
infrastructure,
trac
Wednesday, November 30, 2011
ICPSR and the cloud
How is ICPSR using "the cloud"?
I've been getting this question a lot lately, and it feels like it's time to answer it in a blog post.
From a functional standpoint ICPSR is using the cloud for identity and authentication, content delivery, archival storage, and data producer relationship management. And if I include services based at the University of Michigan, I might also include data curation, and customer relationship management.
From a vendor standpoint here's a roster of some of the organizations with whom we're doing business, and how their piece of the cloud helps us run our business.
A typical transaction on the ICPSR web site looks like this: Search. Select content for download. Create an ICPSR-specific identity. Authenticate using that identity. Download the content. Do not return to ICPSR for at least a year.
Given that the ICPSR-specific identities are weak (i.e., web site visitors create them by entering an arbitrary email address and password), and given that the identity is often used only once, it seemed like a good idea to eliminate the need to create such an identity. We don't need strong identities, but we do need identities that would be available to anyone. Technologies like OpenID, Facebook Connect, and the like seemed promising, but who wants to build infrastructure that talks to all of them?
Janrain does.
We use Janrain Engage as one part of our identity and authentication strategy. Janrain acts as a third party between the content provider (ICPSR) and the identity providers. And so when someone needs to log in to ICPSR's portal, they see a screen that looks something like this:
So there's no need to create an account and password at ICPSR. And if someone does return later, they don't have to log in to our site if they've already logged in to their identity provider's site. (This is Single Sign-On or SSO.)
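The shape of that brokered login can be modeled in a few lines: the portal never sees the user's password, only a one-time token it exchanges with the broker for an identity. This is a toy model; the function names and token format are invented, not Janrain's actual API:

```python
# Toy model of brokered third-party login. The broker (Janrain, in our
# case) sits between the portal and the identity providers.
BROKER_TOKENS = {}  # token -> identity, held by the broker

def broker_login(provider: str, remote_user: str) -> str:
    """User authenticates at their identity provider; broker mints a
    one-time token for the portal."""
    token = f"tok-{len(BROKER_TOKENS)}"
    BROKER_TOKENS[token] = {"provider": provider, "user": remote_user}
    return token

def portal_callback(token: str) -> dict:
    """Portal exchanges (and consumes) the token for the identity."""
    return BROKER_TOKENS.pop(token)

token = broker_login("google", "researcher@example.edu")
identity = portal_callback(token)
print(identity["user"])  # researcher@example.edu
```

The key property is in the last two lines: the portal only ever handles the token and the resulting identity, never the credentials, and the token cannot be replayed once consumed.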
We're hosting several web portals in Amazon's cloud. We're using Amazon's Infrastructure as a Service (IaaS) to stand up Linux systems in the Amazon Elastic Compute Cloud (EC2) that are identical to the images we host locally. We back the instances with Elastic Block Store (EBS) volumes so that the content persists when we need to terminate and restart a computing instance.
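The reason for backing instances with EBS can be shown with a toy model: instance-local ("ephemeral") state dies with the instance, while an attached volume survives a terminate-and-relaunch cycle. The classes here are stand-ins for illustration, not the AWS API:

```python
# Toy model of EC2 instance lifecycle vs. EBS volume persistence.
class Volume:
    """Stands in for an EBS volume: outlives any one instance."""
    def __init__(self):
        self.data = {}

class Instance:
    """Stands in for an EC2 instance with ephemeral local storage."""
    def __init__(self, volume=None):
        self.ephemeral = {}  # dies with the instance
        self.volume = volume

vol = Volume()
web1 = Instance(volume=vol)
web1.ephemeral["cache"] = "rebuildable"
web1.volume.data["portal-content"] = "must persist"

del web1                     # terminate the instance
web2 = Instance(volume=vol)  # relaunch, reattach the same volume
print(web2.volume.data)      # {'portal-content': 'must persist'}
```

Anything we would mind losing goes on the volume; anything on instance storage has to be rebuildable from scratch at launch time.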
We also host a replica of our on-site delivery system in Amazon's cloud for disaster recovery (DR) purposes. We find that we have the opportunity to "test" this replica at least once per year when ICPSR's headquarters loses power for several hours due to high winds, ice storms, or other acts of nature.
The Amazon service has been very reliable overall (despite a few highly publicized events), and certainly more reliable than our own on-site facilities. We also like that we can scale resources up and down very quickly, and that we have clear costs associated with the infrastructure. (Anyone at an institution of higher learning who has tried to calculate the actual cost of electricity used knows what I mean.)
I've posted many times about our relationship with DuraCloud, and how we're using it as a mechanism for storing archival copies in the cloud. In many ways DuraCloud fulfills a role similar to that of Janrain Engage by providing a layer of abstraction between ICPSR's technical infrastructure and that of multiple service providers. In this case we manage one vendor and one set of bills, but have the ability to store content in the cloud storage service of multiple providers (Amazon, Rackspace, Microsoft).
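The abstraction DuraCloud provides can be sketched as a write-through layer over several back ends, so each archival copy lands with more than one provider. The provider names below match the post; the classes and API are invented for illustration:

```python
# Sketch of a one-interface, many-back-ends replicating store.
class MemoryProvider:
    """Stands in for one cloud storage provider."""
    def __init__(self, name):
        self.name = name
        self.objects = {}

    def store(self, key, data):
        self.objects[key] = data

class ReplicatingStore:
    """Write-through layer: one call stores a copy with every provider."""
    def __init__(self, providers):
        self.providers = providers

    def store(self, key, data):
        for p in self.providers:
            p.store(key, data)

cloud = ReplicatingStore([MemoryProvider("amazon"),
                          MemoryProvider("rackspace"),
                          MemoryProvider("microsoft")])
cloud.store("study-1234/archive.zip", b"...")
print([p.name for p in cloud.providers
       if "study-1234/archive.zip" in p.objects])
```

From ICPSR's side there is one interface and one bill, but the preservation copies are spread across independent providers, which is exactly the property we want for archival storage.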
The acquisitions team at ICPSR keeps an eye on grants funded by places like the National Science Foundation and the National Institutes of Health. If a grant looks like it may be producing data, the team makes a note to contact the principal investigator (PI). The goal is to have a conversation with the PI to see if there will indeed be data produced, and to see if it might be a good fit for ICPSR's holdings. If so, we then try to convince the PI that depositing the content with ICPSR would be good for everyone (more data citations for the data producer; more re-use of the data by other researchers; etc.).
We had been using a home-built application to manage this content, but we found it to be a losing battle. There was never enough money or time to build the types of relationship management reporting systems that the acquisition team wanted. And so rather than trying to build a better mousetrap, we decided to rent a better mousetrap by moving the content into a professional contact/customer relationship management (CRM) system. Like Salesforce.
The University of Michigan central IT organization (ITS) also delivers a handful of services that I would consider "the cloud" even though they do not package and market them that way. File storage, trouble ticketing, and Drupal-hosting are all available from ITS, and they all look like cloud services to us because we pay for only what we use, we can scale them up and down reasonably quickly, and we do not have to deploy any local hardware or software to use them.
Monday, November 28, 2011
Hi Ho, a Googling we will go!
The University of Michigan announced (on Halloween! - I hope this is not a trick!) that it will be adopting Google as its collaboration platform. The roll-out will happen over the course of the next year, and includes tools such as Gmail, Sites, Docs, Calendar, Blogger, and more.
I am delighted.
I've been using Google's Blogger technology (obviously) for some time to publish the Tech@ICPSR blog, and use Google Docs for almost any project where I would have used Microsoft Office in the past. (I do still use PowerPoint from time-to-time if I need something fancy-schmancy, and don't have the time to conceptualize it as a Prezi instead.)
The biggest win for ICPSR, however, is with Gmail and Calendar.
When I arrived at ICPSR in 2002 we were running our own IMAP-based service with Eudora as the supported client. And by supported I mean that we installed the free "hey look at these ads" version on each person's machine. Off-site access was the responsibility of the individual, although we did hook it up to a campus webmail front-end eventually. We were running MeetingMaker as our supported calendar client. And by supported I mean that we installed the client on everyone's machine, but no one used it.
Sometime in 2005 or so we realized that it wasn't much fun running email and calendar services, and we also noted that we were already paying for an enterprise mail/calendar system that our parent organization, the Institute for Social Research (ISR), operated on the Exchange platform. And so we dumped Eudora and MeetingMaker and started using the Microsoft stack instead.
I was delighted.
However......
I soon experienced the harsh realities of life in the Microsoft stack. Small mailbox quotas. Feature-poor webmail experience. Mailbox "archives" living in one-off files on my PC or file server. And have you ever tried to find the full email headers in a piece of email stored on an Exchange server? And as in our days of running Eudora and MeetingMaker, we continued to be isolated from the rest of campus, since our Exchange system was local to ISR and not part of a campus-wide solution.
The honeymoon had ended.
I solved the problem for myself (sort of) by maintaining my "internal to ISR" meetings and email on the ISR Exchange server, but moving my "external" meetings and email to Google. That is, I changed the U-M address book so that my bryan (at) umich.edu email address was routed to Gmail rather than the Exchange server. And so when it comes to communicating with the world outside of the ISR, I have a rich email experience that works well in any web browser, superb mail searching, and despite not deleting a single piece of non-spam email in nearly three years, I have used less than 20% of my mail quota. At this rate, I will need to delete my first email in 2024. Nice. Of course, the problem is that I now check email and calendars in two places: MS Exchange (for my ISR world) and Google (for everything else).
And so I am looking forward to the day in 2012 when it all dovetails back together and there is just one place to check my mailbox and calendar again.
I am delighted.
I've been using Google's Blogger technology (obviously) for some time to publish the Tech@ICPSR blog, and use Google Docs for almost any project where I would have used Microsoft Office in the past. (I do still use PowerPoint from time to time if I need something fancy-schmancy, and don't have the time to conceptualize it as a Prezi instead.)
The biggest win for ICPSR, however, is with Gmail and Calendar.
When I arrived at ICPSR in 2002 we were running our own IMAP-based service with Eudora as the supported client. And by supported I mean that we installed the free "hey look at these ads" version on each person's machine. Off-site access was the responsibility of the individual, although we did hook it up to a campus webmail front-end eventually. We were running MeetingMaker as our supported calendar client. And by supported I mean that we installed the client on everyone's machine, but no one used it.
Sometime in 2005 or so we realized that it wasn't much fun running email and calendar services, and we also noted that we were already paying for an enterprise mail/calendar system that our parent organization, the Institute for Social Research (ISR), operated on the Exchange platform. And so we dumped Eudora and MeetingMaker and started using the Microsoft stack instead.
I was delighted.
However...
I soon experienced the harsh realities of life in the Microsoft stack. Small mailbox quotas. A feature-poor webmail experience. Mailbox "archives" living in one-off files on my PC or a file server. And have you ever tried to find the full email headers in a piece of email stored on an Exchange server? And, as in our days of running Eudora and MeetingMaker, we continued to be isolated from the rest of campus, since our Exchange system was local to ISR and not part of a campus-wide solution.
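(On a plain IMAP server, by contrast, the full headers are a single FETCH away. A rough sketch using Python's standard imaplib and email modules; the host, credentials, and message number here are placeholders, not a real ICPSR configuration.)

```python
import imaplib
from email.parser import BytesParser

def fetch_headers(host, user, password, msg_num="1"):
    """Fetch just the RFC 822 headers of one message over IMAP."""
    conn = imaplib.IMAP4_SSL(host)          # placeholder host
    conn.login(user, password)              # placeholder credentials
    conn.select("INBOX", readonly=True)
    # BODY.PEEK[HEADER] returns the raw headers without marking
    # the message as read.
    _, data = conn.fetch(msg_num, "(BODY.PEEK[HEADER])")
    conn.logout()
    return BytesParser().parsebytes(data[0][1], headersonly=True)

# The parsing step alone, with no server required:
raw = b"Received: from mail.example.org\r\nSubject: hello\r\n\r\n"
msg = BytesParser().parsebytes(raw, headersonly=True)
print(msg["Subject"])  # hello
```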
The honeymoon had ended.
I solved the problem for myself (sort of) by maintaining my "internal to ISR" meetings and email on the ISR Exchange server, but moving my "external" meetings and email to Google. That is, I changed the U-M address book so that my bryan (at) umich.edu email address was routed to Gmail rather than the Exchange server. And so when it comes to communicating with the world outside of the ISR, I have a rich email experience that works well in any web browser, superb mail searching, and despite not deleting a single piece of non-spam email in nearly three years, I have used less than 20% of my mail quota. At this rate, I will need to delete my first email in 2024. Nice. Of course, the problem is that I now check email and calendars in two places: MS Exchange (for my ISR world) and Google (for everything else).
And so I am looking forward to the day in 2012 when it all dovetails back together and there is just one place to check my mailbox and calendar again.
Friday, November 25, 2011
TRAC: A3.2: Written policies and procedures
A3.2 Repository has procedures and policies in place, and mechanisms for their review,
update, and development as the repository grows and as technology and community
practice evolve.
The policies and procedures of the repository must be complete, written or available in a tangible form, remain current, and must evolve to reflect changes in requirements and practice. The repository must demonstrate that a policy and procedure audit and maintenance is in place and regularly applied. Policies and procedures should address core areas, including, for example, transfer requirements, submission, quality control, storage management, disaster planning, metadata management, access, rights management, preservation strategies, staffing, and security. High-level documents should make organizational commitments and intents clear. Lower-level documents should make day-to-day practice and procedure clear. Versions of these documents must be well managed by the repository (e.g., outdated versions are clearly identified or maintained offline) and qualified staff and peers must be involved in reviewing, updating, and extending these documents. The repository should be able to document the results of monitoring for relevant developments; responsiveness to prevailing standards and practice, emerging requirements, and standards that are specific to the domain, if appropriate; and similar developments. The repository should be able to demonstrate that it has defined "comprehensive documentation" for the repository. See Appendix 3: Minimum Required Documents for more information.
Evidence: Written documentation in the form of policies, procedures, protocols, rules, manuals, handbooks, and workflows; specification of review cycle for documentation; documentation detailing review, update, and development mechanisms. If documentation is embedded in system logic, functionality should demonstrate the implementation of policies and procedures.
A deliverable for a recent contract was something the client called an information system security plan. Our understanding was that in past contracts this was always understood to be a short document (2-3 pages) that summarized ICPSR's IT systems, and described the measures taken by ICPSR to protect them from unauthorized use. No big deal, right?
However...
In this most recent contract the security plan requirements changed: rather than a brief summary document, the deliverable was now twofold.
The first deliverable consisted of a document showing the Federal Information Processing Standards (FIPS) categorization of the risks associated with our IT systems. This document was based on a standard known as FIPS Publication 199. It turns out that this methodology and level of documentation are relatively lightweight.
In brief, one asserts one of three impact levels (Low, Moderate, High). There was never a question of asserting High, and so the choice was whether to select Low or Moderate. We worked with the University of Michigan's central IT security office, and based on the type of data preserved at ICPSR, they recommended that we select Low.
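FIPS 199 assigns an impact level to each of three security objectives (confidentiality, integrity, availability) for each information type, and the overall system categorization is the "high-water mark": the highest level asserted anywhere. A minimal sketch of that rule in Python; the information types below are illustrative, not ICPSR's actual assessment:

```python
# FIPS 199 "high-water mark": the overall security categorization is
# the highest impact level asserted for any objective of any
# information type the system handles.
RANK = {"LOW": 0, "MODERATE": 1, "HIGH": 2}

def categorize(info_types):
    """info_types maps a name to its per-objective impact levels;
    returns the overall (high-water mark) impact level."""
    overall = "LOW"
    for objectives in info_types.values():
        for level in objectives.values():
            if RANK[level] > RANK[overall]:
                overall = level
    return overall

# Hypothetical information types for a public data archive:
system = {
    "public-use data": {"confidentiality": "LOW", "integrity": "LOW", "availability": "LOW"},
    "web server logs": {"confidentiality": "LOW", "integrity": "LOW", "availability": "LOW"},
}
print(categorize(system))  # LOW
```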
The second part required us to document the security controls, defined by the National Institute of Standards and Technology (NIST), that correspond to a FIPS 199 categorization of Low risk. This standard is described in NIST Special Publication 800-53, and it requires a very high level of documentation. (The standard is very heavy on policy and documentation, but very light on measurement and audit, which some critics consider a major flaw in the approach.)
Our NIST 800-53 security control documentation ran nearly 200 pages(!), and this page count does not include documents that are required by, but external to, 800-53. For example, if 800-53 requires one to assert that there is a policy on topic X, one need not include the policy itself within the 800-53 security control documentation, but one does need to write the policy on topic X (if it does not already exist). And so between the 800-53 controls and the external documents, our guess is that the total ran well over 250 pages.
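One way to keep that sprawl manageable is to track, control by control, whether the external policy a control references has actually been written. A hypothetical sketch; the control IDs follow the real 800-53 family-numbering convention, but the policy inventory is invented:

```python
# Flag 800-53 controls whose referenced external policy document is
# still missing. The "-1" control in each family asserts that a
# written policy exists; the inventory below is hypothetical.
controls = {
    "AC-1": "Access Control Policy",
    "MP-1": "Media Protection Policy",
    "IR-1": "Incident Response Policy",
}
written_policies = {"Access Control Policy", "Incident Response Policy"}

missing = {cid: doc for cid, doc in controls.items()
           if doc not in written_policies}
for cid, doc in sorted(missing.items()):
    print(f"{cid}: still need to write '{doc}'")
# MP-1: still need to write 'Media Protection Policy'
```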
And so we are very well supplied with policies and procedures, and we even have the documentation to prove it now.
Labels:
archival storage,
digital preservation,
infrastructure,
trac
Wednesday, November 23, 2011
InfoWorld Geek IQ Test - 2011
I took the 2011 InfoWorld geek IQ test. I knew the answers to some of the more techie questions, especially those related to networking (CIDR, DNS), but didn't do so well on the pop-culture items. I got a 65, which places me between Geek dilettante and Marketing Executive.
I haven't decided yet whether I'm happy or ashamed.
Monday, November 21, 2011
Firing clients
Seth Godin has published another gem: The Unreasonable Customer.
In this post he argues that while there are certain circumstances where maintaining a relationship with an unreasonable customer is justified, in many cases it makes no sense. This is spot-on advice.
Some clients are demanding, of course, but some are demanding in very constructive, very actionable ways. The client who pushes ICPSR, say, to deliver content in more interesting, more innovative ways may be difficult, but ultimately makes ICPSR a stronger organization with better services.
But the client who makes demands which are unreasonable, and which take resources away from better serving the other clients only weakens the organization. Instead of making services better, the organization struggles hopelessly to appease the unreasonable client. Resources and time are lost. Staff become exhausted and disillusioned. Morale sinks.
In the olden days of working in the telecom industry in the mid-1990s, I remember a case where a train had derailed and torn up a bunch of fiber near the Washington, DC area. A handful of our clients had consequently lost their network connections. Our company was doing the right things: We informed the clients about the problem, and we were keeping close watch on the fiber restoration project, pushing the supplier (I think it might have been MCI) to give our circuits the top priority. While no one was delighted to be without their Internet connection, they understood that the cause was beyond our control, and that they had made the decision to purchase only a single Internet connection from a single company. (Clients who needed very, very high availability would routinely purchase multiple Internet connections from multiple providers.)
One client, however, refused to let the team work through the problem. This client wasn't interested in service restoration; this client wanted to take out all of his frustration on the team. "You're incompetent!" "You should all be fired!" "This is unacceptable!"
I tried to calm the client. Maybe we could set up something short-term over a dial-up line? And maybe long-term the right solution is to have more than one Internet connection so that if another train derails (this seemed to happen way more than one would expect) or there is a natural disaster, you'll still have your Internet connectivity?
Nothing worked. It was clear that this one client didn't want help; he wanted a punching bag.
So we fired him.
"You're right. It sounds like we're just not the right provider for you. We can't meet your expectations. We won't waste any more of your time trying to restore your service. We'll need you to send back the router, or we will have to bill you for it. Best wishes, and good luck with your next provider."
That did more for morale than the last ten company picnics and holiday parties combined.
I honestly don't remember if we did end up firing the client, or if just the threat ended his hysterics. But it definitely changed the relationship, and it proved to the team that we wouldn't let unreasonable people stop them from doing good work.
Friday, November 18, 2011
TRAC: A3.1: Designated community
A3.1 Repository has defined its designated community(ies) and associated knowledge
base(s) and has publicly accessible definitions and policies in place to dictate how its
preservation service requirements will be met.
The definition of the designated community(ies) (producer and user community) is arrived at through the planning processes used to create the repository and define its services. The definition will be drawn from various sources ranging from market research to service-level agreements for producers to the mission or scope of the institution within which the repository is embedded.
Meeting the needs of the designated community—the expected understandability of the information, not just access to it—will affect the digital object management, as well as the technical infrastructure of the overall repository. For appropriate long-term planning, the repository or organization must understand and institute policies to support these needs.
For a given submission of information, the repository must make clear the operational definition of understandability that is associated with the corresponding designated community(ies). The designated community(ies) may vary from one submission to another, as may the definition of understandability that establishes the repository’s responsibility in this area. This may range from no responsibility, if bits are only to be preserved, to the maintenance of a particular level of use, if understanding by the members of the designated community(ies) is determined outside the repository, to a responsibility for ensuring a given level of designated community(ies) human understanding, requiring appropriate Representation Information.
The documentation of understandability will typically include a definition of the applications the designated community(ies) will use with the information, possibly after transformation by repository services. For example, if a designated community is defined as readers of English with access to widely available document rendering tools, and if this definition is clearly associated with a given set of Content Information and Preservation Description Information, then the requirement is met.
Examples of designated community definitions include:
- General English-reading public educated to high school and above, with access to a Web Browser (HTML 4.0 capable).
- For GIS data: GIS researchers—undergraduates and above—having an understanding of the concepts of Geographic data and having access to current (2005, USA) GIS tools/computer software, e.g., ArcInfo (2005).
- Astronomer (undergraduate and above) with access to FITS software such as FITSIO, familiar with astronomical spectrographic instruments.
- Student of Middle English with an understanding of TEI encoding and access to an XML rendering environment.
- Variant 1: Cannot understand TEI
- Variant 2: Cannot understand TEI and no access to XML rendering environment
- Variant 3: No understanding of Middle English but does understand TEI and XML
- Two groups: the publishers of scholarly journals and their readers, each of whom have different rights to access material and different services offered to them.
Documentation for this TRAC requirement can be found in ICPSR's mission statement (published on our web portal) and in our deposit agreements.
Labels:
archival storage,
digital preservation,
infrastructure,
trac
Wednesday, November 16, 2011
DuraCloud Archiving and Preservation Webinar
Shameless self-promotion alert...
The nice folks at DuraSpace have published the audio and video from the recent webinar that Michele Kimpton (CEO, DuraSpace) and I gave on DuraCloud.
Michele spends the first 5-10 minutes talking about the business case behind DuraCloud, and then I spend about 30 minutes talking about ICPSR and how we came to use DuraCloud to store a copy of our archival holdings.
Monday, November 14, 2011
A dangerous combination
Do you like irony?
It turns out that the University of Michigan, like many other organizations, has decided to move its travel-and-expense reporting into the cloud, and has therefore adopted Concur.
I think the University made a good decision to put this in the cloud and to use a hosted, Software-as-a-Service (SaaS) solution. Using an existing service makes much more sense than building our own software. How could the U-M build a better application than a company that makes its living doing exactly this sort of thing?
Now, this isn't to say that I am a huge fan of Concur (or at least how it has been implemented at U-M). I don't find the workflow or interface to be all that intuitive, and there are a couple of things that really trip me up all the time. For example, when entering the name of someone, sometimes I am supposed to enter their LAST name and sometimes I am supposed to enter their FIRST name, and I can never remember which to enter. (Cue sad music.)
But the really challenging part about using this cloud service is when I use it to pay for cloud services. (Cue ironic music.)
Each month I get a bill from Amazon. And DuraCloud. And another one from DuraCloud (because we use more space than our membership allows). And another one from Amazon. (Two different projects with different credit cards and different pools of machines.) And Salesforce. And....
So each month I print the invoice to PDF. And I fetch the receipt from my university credit card, and PDF that too. And then I bundle them together in an expense report in Concur. And that's when the trouble starts: How do I classify the expense?
This is almost certainly not the fault of the Concur software, of course. The problem is in the controlled vocabulary of "expense types" that the U-M has plugged into the system. Not one is a good fit for paying cloud providers. And so I pick one from the choices I do have.
Computer maintenance?
Computer rental?
Memberships (especially for the DuraSpace one, which is indeed a membership)?
Other?
My expense report is reviewed by at least four different people (two within ICPSR, at least one within our parent organization, the Institute for Social Research, and at least one at the U-M central Business and Finance unit). If any one of them believes that I have selected the wrong expense type, the report returns to me, and I then must resubmit it. The good news is that I don't have to reload the invoice or receipt, and so the process is relatively simple.
But for those of you about to implement Concur or another expense and travel reporting system, please add a new expense category for your IT managers: Cloud computing services.
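In the meantime, the workaround amounts to a lookup table. A toy sketch in Python; the vendor names come from this post, but the mapping itself is my own guess at "least wrong," not U-M policy:

```python
# Map each cloud invoice onto the closest expense type in the
# controlled vocabulary. "Cloud computing services" is the category
# we wish existed; until then, everything gets shoehorned.
EXPENSE_TYPE = {
    "Amazon Web Services": "Computer rental",
    "DuraCloud":           "Memberships",   # DuraSpace really is a membership
    "Salesforce":          "Other",
}

def classify(vendor):
    # Fall back to "Other" for any vendor not in the table.
    return EXPENSE_TYPE.get(vendor, "Other")

print(classify("DuraCloud"))  # Memberships
print(classify("Heroku"))     # Other
```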
Labels:
amazon web services,
cloud computing,
duracloud,
duraspace,
fun
Friday, November 11, 2011
TRAC: A2.3: Keeping ahead of the curve
A2.3 Repository has an active professional development program in place that provides
staff with skills and expertise development opportunities.
Technology will continue to change, so the repository must ensure that its staff’s skill sets evolve, ideally through a lifelong learning approach to developing and retaining staff. As the requirements and expectations pertaining to each functional area evolve, the repository must demonstrate that staff are prepared to face new challenges.
Evidence: Professional development plans and reports; training requirements and training budgets, documentation of training expenditures (amount per staff); performance goals and documentation of staff assignments and achievements, copies of certificates awarded.
ICPSR does a very good job organization-wide at encouraging staff to develop professionally. This manifests itself in several different ways: attending workshops and seminars; taking courses to learn new skills and abilities; and, most impressively, funding continuing education for staff who want to pursue a degree (often a graduate degree). In the technology shop we take advantage of the generous professional development budget in many ways.
The operations team (managed by Asmat Noori) is always bringing new technology into ICPSR: new storage systems, new backup systems, new versions of software and operating systems, and more. And so there is a recurring need for people to attend training workshops so that they can manage new technologies effectively and efficiently. Further, some types of training - such as SANS security training - cut across many different technologies and need to be renewed every few years; the training budget supports this type of activity as well.
The software development team (managed by Nathan Adams) also makes regular use of professional development. In this case the new technology is usually a software system or development-system component, not a new type of hardware. And we will sometimes bring a trainer on-site to deliver a course to our entire team rather than sending people away to a class. Nathan also has one staff member who is taking advantage of the tuition package offered by ICPSR and is enrolled in the University of Michigan's School of Information Master's degree program.
Labels: archival storage, digital preservation, infrastructure, trac
Wednesday, November 9, 2011
ICPSR's Secure Data Environment overview
Jenna Tyson is a graphic artist on staff at ICPSR. Over the past few years Jenna has helped me out with displays for poster sessions, transforming the mediocre layout I produce into a true work of art. I've posted some of her work here in the past.
I asked Jenna if she could create a logo for our Secure Data Environment (SDE), and above you can see the one that I liked best. I leave it as an exercise to the reader to decide if the terrified individual in the picture is a defeated intruder or a frustrated ICPSR data curator.
The blog contains several posts that go into some detail about the software and security components behind the SDE, but I'm not sure that I ever posted a high-level description to set context, scope, and purpose. And so along with Jenna's logo, I present the story behind the SDE.
The ICPSR Secure Data Environment (SDE) is a protected work area that uses technology and process to protect sensitive social science research data from accidental or deliberate disclosure. The SDE employs commonly used security technologies such as firewalls, ActiveDirectory group policies, and network segmentation to minimize unwanted access between the SDE and the outside world. Further, it relies on work processes that strictly control when data may be moved between the SDE and external locations.
Data enter the SDE through ICPSR's deposit system. Depositors upload their content to a web application on our public web portal, where it is encrypted. An automated process "sweeps" content from the portal several times per hour, moving it to the SDE, where it is then decrypted. The content resides on a special-purpose EMC Network Attached Storage (NAS) appliance which serves ICPSR's SDE. The appliance uses private IP address space which is routed only within the University of Michigan enterprise network, and is also protected by a firewall. Further, NAS shares are exported only to specific machines and only to specific ActiveDirectory groups.
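The sweep can be pictured with a small sketch. The directory layout, the `.enc` suffix convention, and the move-as-decrypt step are all illustrative assumptions here, not ICPSR's actual implementation; the real process would decrypt with a private key held inside the SDE.

```python
import shutil
from pathlib import Path

def sweep(portal_dir: Path, sde_dir: Path) -> list:
    """Move newly deposited files from the public portal into the SDE.

    Decryption is represented by dropping the .enc suffix; the real
    system would decrypt the payload at this step.
    """
    moved = []
    for item in sorted(portal_dir.glob("*.enc")):
        target = sde_dir / item.stem  # "decrypt": strip the .enc suffix
        shutil.move(str(item), str(target))
        moved.append(target.name)
    return moved
```

Running this on a cron-like schedule several times per hour would approximate the behavior described above: content never lingers on the public portal, and it only ever exists unencrypted inside the SDE.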
ICPSR data managers must be located on the University of Michigan enterprise network to access the SDE. (They may use the University of Michigan VPN client to access the network from remote locations, and this requires strong authentication and implements strong encryption.) Data managers run a simple utility to "log in" to the SDE. Once logged into the SDE they are assigned a disposable virtual Windows 7 desktop system which persists content only to the ICPSR SDE NAS. Any content stored on the virtual desktop itself is destroyed once the image is terminated.
Data curators are not allowed to access the Internet or email within the SDE, and they do not have access to local system ports (e.g., USB). Clipboards are NOT shared between the SDE and the local machine, and so there is no ability to "cut and paste" between the two environments. It is possible, of course, for data curators to take notes from what they see on the screen, and to take screen snapshots, but ICPSR management considers these to be acceptable risks.
Data curators may release data from the SDE via two mechanisms.
One, they may submit completed work for release via an internal work system called turnover. This queues material for placement in archival storage, and also queues related material for release on the web site. A release manager reviews all content before allowing it on the web site.
Two, they may submit unfinished work for transfer outside of the SDE. In this case a request appears in the inbox of the data curator's supervisor who may then review the request, and then accept or reject it. If accepted the content is available to the data curator through a simple file retrieval mechanism, and the transfer is logged.
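The second mechanism is essentially a request-review-log workflow, which could be modeled along these lines. The class and method names are hypothetical; the real turnover and transfer systems surely track far more state than this sketch does.

```python
from dataclasses import dataclass, field

@dataclass
class TransferRequest:
    curator: str
    path: str
    status: str = "pending"  # pending -> accepted / rejected

@dataclass
class SupervisorInbox:
    requests: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def submit(self, curator: str, path: str) -> TransferRequest:
        """A curator asks to move unfinished work out of the SDE."""
        req = TransferRequest(curator, path)
        self.requests.append(req)
        return req

    def review(self, req: TransferRequest, accept: bool) -> None:
        """The supervisor accepts or rejects; accepted transfers are logged."""
        req.status = "accepted" if accept else "rejected"
        if accept:
            self.log.append((req.curator, req.path))
```

The important properties are the ones the post calls out: nothing leaves without a human review, and every approved transfer leaves an audit trail.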
ICPSR has contracted the services of a "white hat" ethical hacker to assess the security vulnerabilities on the SDE. ICPSR has already implemented small changes within the SDE based on preliminary reports from the contractor.
Monday, November 7, 2011
October 2011 deposits at ICPSR
Time again for the monthly report of new deposits at ICPSR. Here is the snapshot from October 2011:
The volumes are a bit higher this month, especially the number of files; at least some of the deposits must have contained an unusually large number of files.
In addition to the usual suspects - plain text, stat package formats, MS Word, PDF - we have a very large number of unidentified files this month (2,400+ application/octet-stream), and a small number of less common formats (images, Photoshop).
# of files | # of deposits | File format |
4 | 3 | application/msaccess |
46 | 3 | application/msoffice |
2496 | 31 | application/msword |
2415 | 7 | application/octet-stream |
489 | 45 | application/pdf |
93 | 16 | application/vnd.ms-excel |
6 | 1 | application/vnd.wordperfect |
15 | 2 | application/x-dbase |
2 | 1 | application/x-empty |
1130 | 14 | application/x-sas |
1867 | 31 | application/x-spss |
1193 | 14 | application/x-stata |
1 | 1 | image/gif |
1 | 1 | image/jpeg |
2 | 2 | image/png |
4 | 1 | image/tiff |
1 | 1 | image/x-photoshop |
10 | 5 | message/rfc8220117bit |
151 | 2 | text/html |
2 | 2 | text/html; charset=us-ascii |
114 | 8 | text/plain; charset=iso-8859-1 |
50 | 8 | text/plain; charset=unknown |
5047 | 33 | text/plain; charset=us-ascii |
67 | 1 | text/plain; charset=utf-8 |
4 | 3 | text/rtf |
66 | 7 | text/x-c++; charset=us-ascii |
1 | 1 | text/x-c++; charset=utf-8 |
211 | 7 | text/x-c; charset=us-ascii |
98 | 6 | text/xml |
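A first-pass tally like the table above can be sketched with the Python standard library, though extension-based guessing alone would mislabel many files; producing rows like application/x-sas or resolving the 2,400+ application/octet-stream entries requires content-based detection (e.g., libmagic), which this sketch does not attempt.

```python
import mimetypes
from collections import Counter

def tally_formats(filenames):
    """Count deposited files per guessed MIME type.

    mimetypes only inspects file extensions; files it cannot identify
    are lumped into application/octet-stream, just as in the table.
    """
    counts = Counter()
    for name in filenames:
        mime, _encoding = mimetypes.guess_type(name)
        counts[mime or "application/octet-stream"] += 1
    return counts
```

Tracking per-deposit counts as well (the table's second column) would just mean keying the tally on (deposit_id, mime) pairs instead.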
Friday, November 4, 2011
TRAC: A2.2: The right quantity of staff and skills
A2.2 Repository has the appropriate number of staff to support all functions and services.
Staffing for the repository must be adequate for the scope and mission of the archiving program. The repository should be able to demonstrate an effort to determine the appropriate number and level of staff that corresponds to requirements and commitments. (These requirements are related to the core functionality covered by a certification process. Of particular interest to repository certification is whether the organization has appropriate staff to support activities related to the long-term preservation of the data.) The accumulated commitments of the repository can be identified in deposit agreements, service contracts, licenses, mission statements, work plans, priorities, goals, and objectives. Understaffing or a mismatch between commitments and staffing indicates that the repository cannot fulfill its agreements and requirements.
Evidence: Organizational charts; definitions of roles and responsibilities; comparison of staffing levels to commitments and estimates of required effort.
This is an interesting question for an organization like ICPSR. My colleague Nancy McGovern drew a useful distinction the other day between digital preservation (where ICPSR spends some of its resources, but not the lion's share) and data curation (where we spend a significant quantity of resources).
My sense is that the effort required to perform a base level of digital preservation on our content - plain text survey data and PDF- and XML-format documentation - is relatively small, and even if ICPSR found itself operating on a minimal budget without any of its topical archives or special projects, there would be an adequate number of staff to manage the archival holdings, review fixity reports, and execute migrations of content from format to format, or from location to location.
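The fixity review mentioned above is one of the lighter-weight of those baseline activities. As a sketch (the manifest format and function name are assumptions; this is not a description of ICPSR's actual tooling), it amounts to comparing stored files against previously recorded digests:

```python
import hashlib
from pathlib import Path

def verify_fixity(manifest, root: Path):
    """Compare stored files against recorded SHA-256 digests.

    manifest maps relative paths to expected hex digests. Returns the
    paths whose current digest no longer matches, or which are missing.
    """
    failures = []
    for relpath, expected in manifest.items():
        f = root / relpath
        if not f.is_file():
            failures.append(relpath)
            continue
        actual = hashlib.sha256(f.read_bytes()).hexdigest()
        if actual != expected:
            failures.append(relpath)
    return failures
```

An empty failure list is the routine, happy outcome; anything else triggers restoration from one of the replica locations.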
In our present configuration, we can see a close correlation between required effort and resources. This most often manifests itself as line-items in individual project budgets. But it also shows up on the organizational chart when one sees specific organizational units at ICPSR which have a clear digital preservation mission or component.
Labels: archival storage, digital preservation, infrastructure, trac
Wednesday, November 2, 2011
ICPSR Web availability through October 2011
Ick.
This is not a good trend.
Our overall availability (i.e., all components are working properly) sank below 99.5% again in October. The main culprit was a nearly two-hour period on October 12, 2011 when a series of common alerts turned out to have an uncommon cause. The on-call systems engineer went through our usual series of steps to bring the service back online, and while the steps seemed to help at first, it was clear that the fix was just temporary and more diagnostic work was necessary. This series of events also happened at an inopportune time, just as many of us were in transit between the office and home (and then back to the office again).
We also had a problem with our search engine technology (Solr) late in the month, and that contributed another 46 minutes to our unavailability. (Other components were working fine, but search was not.)
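The availability arithmetic itself is simple. Assuming roughly 115 minutes for the October 12 incident (the "nearly two-hour period") plus the 46 minutes of search downtime - both figures are my reading of the text, not exact logs - those two incidents alone bring a 31-day month to about 99.64%, which suggests additional brief outages, not itemized here, account for the dip below 99.5%.

```python
def availability(downtime_minutes, days_in_month):
    """Percentage of the month during which all components worked."""
    total = days_in_month * 24 * 60
    return 100.0 * (total - sum(downtime_minutes)) / total
```

For example, `availability([115, 46], 31)` comes out to roughly 99.64.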
My apologies to those of you who were trying to get some work done on our site last month, and got bit by either of these problems.
Monday, October 31, 2011
Tech@ICPSR returns from Gartner
Tech@ICPSR is still recovering from a week-long trip to the 2011 North American Gartner Symposium/ITxpo.
The event had nearly 10,000 technology leaders, CIOs, and other folks from the world of IT. The original game plan was to generate a few short blog posts about the event while attending the symposium, but no such luck. Kudos to those intrepid bloggers who can both attend an event all day and then blog about it all night.
The major themes at Gartner this year were: cloud, social, video, big data, and managing the explosive growth in the amount of content that is being produced and retained. ICPSR has its toes in all of these, and so it felt like the program was a very good fit.
In some ways the event was disappointing in that there were few obvious take-aways. But in other ways it was rewarding because it confirmed that ICPSR is on the right track in many of these areas, such as making use of the cloud to solve certain problems. And so while the speaker, a Gartner analyst, may have been encouraging the audience members to conduct an experiment with the cloud, ICPSR is already using the cloud as part of its production operations.
The sessions on "social" were the most interesting. Again, it feels like ICPSR is already making use of social media in some ways, but it also feels like there is a vast, untapped opportunity to use the content that "social media" generates in social science research. Is there an opportunity for ICPSR to partner with another organization to create, curate, preserve, and disseminate datasets derived from Twitter feeds? Facebook walls?
Photo credit: http://farm5.static.flickr.com/4038/5145453784_861aaacc04_m.jpg
Friday, October 28, 2011
TRAC: A2.1: The right type of staff and skills
A2.1 Repository has identified and established the duties that it needs to perform and has appointed staff with adequate skills and experience to fulfill these duties.
The repository must identify the competencies and skill sets required to operate the repository over time and demonstrate that the staff and consultants have the range of requisite skills—e.g., archival training, technical skills, and legal expertise.
Evidence: A staffing plan; competency definitions; job descriptions; development plans; plus evidence that the repository reviews and maintains these documents as requirements evolve.
I see three main areas of evidence to support this requirement.
One is that ICPSR has been in operation for fifty years, and it continues to win contracts and grants to preserve and disseminate social science research data and documentation. No organization can operate for fifty years if it does not have a team capable of delivering success.
Another bit of evidence appears on ICPSR's organization chart. I don't believe we publish it for the world to see, but we do maintain a copy on our intranet site. The org chart shows the areas, teams, and people one needs to curate content successfully. Data managers? Check. Metadata specialists? Check. Technology? Check. Administrative functions? Check. And, of course, specialists in digital preservation policy and standards.
Finally, there is also evidence in the body of job descriptions one would find on the University of Michigan jobs site (if its content was preserved!). One can see how job titles, job functions, and skillsets have evolved over the years as technology, best practices, and types of content have also evolved.
Labels: archival storage, digital preservation, infrastructure, trac
Wednesday, October 26, 2011
Using DuraCloud for Archiving and Preservation
I'll be joining Michele Kimpton, CEO of DuraSpace, on a webinar next Wednesday (November 2, 2011). Our topic is DuraCloud, and how one can use this cloud-based service as part of one's digital preservation strategy.
I think sometimes people will view the cloud as an alternative to keeping and maintaining local copies, but at ICPSR we're using the cloud as an easy-to-manage storage location to supplement more conventional locations, such as local NAS storage and the University of Michigan's "Value Storage" service.
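The supplement-not-replace pattern boils down to copying content to an additional location and verifying fixity on arrival. Here is a minimal sketch, with a local directory standing in for the DuraCloud target (the function name is hypothetical; a cloud target would swap `shutil.copy2` for an upload call, but the checksum-before-and-after pattern is the same):

```python
import hashlib
import shutil
from pathlib import Path

def replicate(src: Path, dest_dir: Path) -> str:
    """Copy a file to a supplemental storage location and verify the copy.

    Returns the MD5 digest so the caller can record it for later
    fixity audits against the replica.
    """
    digest = hashlib.md5(src.read_bytes()).hexdigest()
    dest = dest_dir / src.name
    shutil.copy2(src, dest)
    if hashlib.md5(dest.read_bytes()).hexdigest() != digest:
        raise IOError("fixity mismatch replicating " + src.name)
    return digest
```

The same recorded digest can then be checked against every replica - local NAS, Value Storage, cloud - on whatever audit schedule the preservation plan calls for.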
Here is a copy of the invite that went out via email:
Friday, October 21, 2011
TRAC: A1.2: Succession planning
A1.2 Repository has an appropriate, formal succession plan, contingency plans, and/or escrow arrangements in place in case the repository ceases to operate or the governing or funding institution substantially changes its scope.
Part of the repository’s perpetual-care promise is a commitment to identify appropriate successors or arrangements should the need arise. Consideration needs to be given to this responsibility while the repository or data is viable—not when a crisis occurs—to avoid irreparable loss. Organizationally, the data in a repository can be at risk regardless of whether the repository is run by a commercial organization or a government entity (national library or archives). At government-managed repositories and archives, a change in government that significantly alters the funding, mission, collecting scope, or staffing of the institution may put the data at risk. These risks are similar to those faced by commercial and research-based repositories and should minimally be addressed by succession plans for significant collections within the greater repository.
A formal succession plan should include the identification of trusted inheritors, if applicable, and the return of digital objects to depositors with adequate prior notification, etc. If a formal succession plan is not in place, the repository should be able to point to indicators that would form the basis of a plan, e.g., partners, commitment statements, likely heirs. Succession plans need not specify handoff of entire repository to a single organization if this is not feasible. Multiple inheritors are possible so long as the data remains accessible.
Evidence: Succession plan(s); escrow plan(s); explicit and specific statement documenting the intent to ensure continuity of the repository, and the steps taken and to be taken to ensure continuity; formal documents describing exit strategies and contingency plans; depositor agreements.
ICPSR has formalized succession planning and contingency planning via its commitment to Data-PASS, the Data Preservation Alliance for the Social Sciences. As noted on the Data-PASS web portal:
Organizations join the Data-PASS partnership for several reasons. Membership in Data-PASS helps insure against preservation loss. Data-PASS safeguards the collections of its members through transfer protocols, succession planning, and live replication of collections. If a member organization requires off-site replication of its collections, the partnership will provide it. And if a member organization is no longer institutionally capable of preserving and disseminating a collection, the collection can be preserved and disseminated through the partnership.
This commitment helps ensure that content currently held and managed by ICPSR will continue to be available even if ICPSR ceases to exist.
Labels:
archival storage,
digital preservation,
infrastructure,
trac