Monday, October 25, 2010
DuraCloud pilot update - October 2010
The collection we selected for pilot purposes is a subset of our archival storage that contains preservation copies of our public-use datasets and documentation. This is big but not too big (1TB or so), and contains a nice mix of formats: plain text, XML, PDF, and TIFF, among others. At the time of this post, we have 72,134 files copied into DuraCloud.
I've been using their Java-based command-line utility, the synctool, to synchronize some of our content with DuraCloud. I found it useful to wrap the utility in a small shell script so that I do not need to specify as many command-line arguments each time I invoke it. I tend to use sixteen threads to synchronize content rather than the default three; that places a heavy load on our machine here, but it leads to much faster synchronization. The synctool assumes an interactive user, and has a very basic interface for checking status.
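The wrapper is nothing fancy. Here is a sketch of the idea; note that the jar location, content path, and every flag name below are illustrative assumptions, not the synctool's actual options, so check the tool's own help output before running anything like this.

```shell
#!/bin/sh
# Hypothetical wrapper around the DuraCloud synctool.
# SYNC_JAR, CONTENT_DIR, and the -t flag are all assumptions for
# illustration -- verify the real option names against the tool's help.

SYNC_JAR="$HOME/duracloud/synctool.jar"   # hypothetical install location
CONTENT_DIR="/archive/public-use"         # hypothetical content root
THREADS=16                                # default is 3; 16 syncs faster

CMD="java -jar $SYNC_JAR -t $THREADS $CONTENT_DIR"
echo "would run: $CMD"
# Once the flags are verified for your version, replace the echo with: $CMD
```

The point is just to bake the boring arguments into one place so invoking a sync is a single short command.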
Overall I like the synctool but wish that it had an option that did not assume an interactive user; something I could run out of cron like I often do with rsync. Because the underlying storage platform (S3) limits the size of individual objects, synctool is not able to copy some of our larger files. I wish synctool would "chunk up" the files into more manageable pieces, and sync them for me. One reason I don't use raw S3 for storage is because of this size limitation; instead I like to spend a little more money and attach an Elastic Block Store (EBS) volume (whose snapshots are stored in S3) to a running instance, and then use the filesystem to hide the limitation. Then I can just use standard tools, like rsync, to copy very large files into the cloud.
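The chunking I'm wishing for can be approximated today with standard tools. A minimal sketch with split(1): cut a large file into fixed-size pieces, then reassemble and verify. The sizes here are tiny so the sketch runs anywhere; in practice you would pick a chunk size just under the per-object limit.

```shell
# Chunk a file into fixed-size pieces, reassemble, and verify the copy.
tmp=$(mktemp -d)
head -c 100000 /dev/urandom > "$tmp/bigfile"

split -b 32k "$tmp/bigfile" "$tmp/bigfile.chunk."   # chunk.aa, chunk.ab, ...
cat "$tmp"/bigfile.chunk.* > "$tmp/rebuilt"         # shell glob order matches split's suffix order

if cmp -s "$tmp/bigfile" "$tmp/rebuilt"; then
  echo "reassembled copy matches original"
fi
```

Because split names chunks with lexicographically ordered suffixes, a plain shell glob reassembles them in the right order; the cmp at the end is the same sanity check you would want after pulling chunks back out of the cloud.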
The DuraCloud folks have been great collaborators: extremely responsive, extremely helpful; just a joy to work with. They've told me about a pair of upcoming features that I'm keen to test.
One, their fixity service will be revamped in the 0.7 release. It'll have fewer options and features, but will be much easier to use. I'm eager to see how this compares to a low-tech approach I use for our archival storage: weekly filesystem scans + MD5 calculations compared to values stored in a database.
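That low-tech approach looks roughly like the following sketch. The real version stores checksums in a database rather than a flat manifest file, and runs weekly over the whole archive; here a manifest file and a toy archive stand in for both.

```shell
# Low-tech fixity check: record MD5s once, then re-scan and diff.
# Uses md5sum from GNU coreutils.
tmp=$(mktemp -d)
mkdir "$tmp/archive"
echo "dataset one" > "$tmp/archive/ds1.txt"
echo "dataset two" > "$tmp/archive/ds2.txt"

# Baseline manifest (the stand-in for values stored in a database).
( cd "$tmp/archive" && find . -type f -exec md5sum {} + | sort ) > "$tmp/manifest"

# The "weekly scan": recompute and compare; any difference is a fixity failure.
( cd "$tmp/archive" && find . -type f -exec md5sum {} + | sort ) > "$tmp/scan"
if diff -q "$tmp/manifest" "$tmp/scan" >/dev/null; then
  echo "fixity OK"
else
  echo "fixity FAILURE"
fi
```

Sorting both listings makes the comparison insensitive to filesystem traversal order, so the diff only fires when a checksum (or the set of files) actually changes.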
Two, their replicate-on-demand service is coming, and ICPSR will be the first (I think) test case to replicate its content from S3 to Azure's storage service. I have not had the opportunity to use Microsoft's cloud services at all, and am looking forward to seeing how Azure performs.