Wednesday, June 8, 2011

New technology for our Research Connections property


This Monday ICPSR deployed a major update to the content and metadata curation system that sits behind our Research Connections web site.  Most of the new technology is behind the scenes.

The old curation system used an older search technology that ICPSR never really embraced:  Oracle Text Search.  Our experience with this technology was very negative.  My sense is that when we first began using the technology in 2005-2006, it was not stable, and had not been tested rigorously across the most common platforms.  We found it difficult to open trouble tickets and cases with Oracle, and even when we were successful at that, we found that they were slow to provide a fix.

We replaced most of our Oracle Text Search in 2009 and 2010, moving to the Lucene search engine from the Apache Software Foundation.  Our experience there was very different, and somewhat ironic:  the level of support and quality of software was much higher for "unowned" open source software than it was for a commercial product from a vendor.  And now we have been able to replace the search we use in the curation system with Lucene too.

Moving to Lucene also allowed us to decommission a large corpus of kludges we had put in place to make Oracle Text Search work.  For example, we found that the Oracle Text Search parser did not do a very good job indexing PDF-format documents; it would silently fail, and so our index was never complete and correct.  So we built a system - it could have been designed by Rube Goldberg himself - which continually watched for new PDF documents to appear, converted them to text, updated the Oracle Text Search index, checked the index for correctness, and then moved the new index into production.  And then it started the cycle again.  No one will miss this piece of software.
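
For the curious, one pass of that watch, convert, index, check, and promote cycle might be sketched in Python roughly as follows.  This is an illustration only, not the retired code:  the function name run_cycle, the pdf_to_text converter argument, and the use of a plain dictionary in place of a real search index are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Dict


def run_cycle(
    pdf_dir: Path,
    production: Dict[str, str],
    pdf_to_text: Callable[[Path], str],
) -> Dict[str, str]:
    """Run one pass of the watch/convert/index/check/promote cycle.

    `production` maps a PDF file name to its extracted text and stands in
    for the live search index (an assumption for this sketch). The return
    value is the index that should be live after this pass.
    """
    # 1. Watch: find PDFs on disk that are not yet in the index.
    on_disk = sorted(pdf_dir.glob("*.pdf"))
    new_pdfs = [p for p in on_disk if p.name not in production]
    if not new_pdfs:
        return production

    # 2. Convert each new PDF to text, and
    # 3. update a staging copy of the index, leaving production untouched.
    staging = dict(production)
    for pdf in new_pdfs:
        staging[pdf.name] = pdf_to_text(pdf)

    # 4. Check: every PDF on disk must be indexed with non-empty text.
    expected = {p.name for p in on_disk}
    complete = expected <= set(staging) and all(
        staging[name].strip() for name in expected
    )

    # 5. Promote the staging index only if the check passed.
    return staging if complete else production
```

A caller would invoke run_cycle in a loop with a short sleep between passes, supplying whatever PDF-to-text converter is on hand.  Building into a staging copy and promoting it only after the check is what guards against the silent failures described above.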
