Wednesday, April 18, 2012

Great FLAMEing file identification service

Some parts of the FLAME project will lend themselves to a microservices approach.  Like cloud computing, "microservices" is a trendy, useful concept without a crystal-clear definition.  My take is that a microservice performs one small but useful bit of work, and can be swapped in and out of an overall architecture at the component level.  It needs to have very clear inputs and outputs, and it cannot contain any "secret sauce" that isn't part of its functional role.

Do not try this street magic at home.
One common activity at ICPSR is automated file identification.  Historically we've done this with the venerable UNIX utility file, using a magic database that we have modified heavily, particularly for the formats we see most often.  We also post-process the output from file when we need handling beyond the capabilities of the magic database (e.g., making decisions based on the name or extension of the file).
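To make the post-processing idea concrete, here is a minimal sketch in Python of one extension-based decision.  The override table and the function name refine_mime are hypothetical, made up for illustration; they are not the rules we actually run.

```python
import os

# Hypothetical extension-based overrides, for illustration only.
# The real post-processing rules are not shown in this post.
EXTENSION_OVERRIDES = {
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}

# Generic answers from magic that are worth second-guessing with the file name.
GENERIC_MIME_TYPES = {"application/zip", "application/octet-stream"}


def refine_mime(magic_mime, filename):
    """Return a more specific MIME type when magic's answer is generic.

    magic_mime is a string such as "application/zip; charset=binary".
    The charset parameter is dropped from the result.
    """
    base_type = magic_mime.split(";", 1)[0].strip().lower()
    extension = os.path.splitext(filename)[1].lower()
    if base_type in GENERIC_MIME_TYPES and extension in EXTENSION_OVERRIDES:
        return EXTENSION_OVERRIDES[extension]
    return base_type
```

The point of the design is that the magic answer wins unless it is one of a few known-generic types, so a misleading file name cannot override a confident identification.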

Managing the magic database is not for the faint of heart.  (Try updating the Vorbis section.)  And this management has gotten both harder -- RHEL 6 uses a new format for its magic database which is incompatible with RHEL 5 -- and easier -- the new format eliminates the pesky magic.mime database.  However, we've gotten reasonably competent at managing magic and have come to rely on it for file format identification.

In support of the FLAME project we even created a little web service that takes a file's content and its name as input, and delivers a little snippet of XML as the output.  The XML contains the "human readable" answer from our magic database and the "MIME type" too.  This is our first FLAME-inspired web service.
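For the curious, a rough sketch of how such an XML snippet could be assembled in Python follows.  The element names (wsifile, ifile, ifilemime, uploadInfo) come from the service's output; the function name build_wsifile_response is an assumption, and this is not the code behind the actual service.

```python
import xml.etree.ElementTree as ET


def build_wsifile_response(description, mime_type, upload_type):
    """Build a wsifile-style XML document as a string.

    description -- the "human readable" answer from the magic database
    mime_type   -- the MIME answer from the magic database
    upload_type -- the content type the client declared for the upload
    """
    root = ET.Element("wsifile")
    ET.SubElement(root, "ifile").text = description
    ET.SubElement(root, "ifilemime").text = mime_type
    ET.SubElement(root, "uploadInfo").text = upload_type
    # ElementTree escapes &, < and > in the text nodes for us.
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="utf-8"?>' + body
```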

If you'd like to try it, you can use your favorite form-capable URL transfer utility to do so.  Here's an example where I have run curl on one of our RHEL machines:


dhcp-bryan:; curl -F "file=@uuid-comparison.xlsx;filename=uuid-comparison.xlsx" www.icpsr.umich.edu/cgi-bin/wsifile
<?xml version="1.0" encoding="utf-8"?><wsifile><ifile>Microsoft Excel</ifile><ifilemime>application/zip; charset=binary</ifilemime><uploadInfo>application/octet-stream</uploadInfo></wsifile>

feeding in an Excel file as the input, and another with a plain text file:

dhcp-bryan:; curl -F "file=@/etc/resolv.conf;filename=resolv.conf" www.icpsr.umich.edu/cgi-bin/wsifile
<?xml version="1.0" encoding="utf-8"?><wsifile><ifile>ASCII text</ifile><ifilemime>text/plain; charset=us-ascii</ifilemime><uploadInfo>application/octet-stream</uploadInfo></wsifile>

and an interesting MS Word file:

dhcp-bryan:; curl -F "file=@2011-03CouncilPandAminutes.doc;filename=2011-03CouncilPandAminutes.doc" www.icpsr.umich.edu/cgi-bin/wsifile
<?xml version="1.0" encoding="utf-8"?><wsifile><ifile>CDF V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 1200, Number of Characters: 0, Name of Creating Application: Aspose.Words for Java 4.0.3.0, Number of Pages: 1, Revision Number: 1, Security: 0, Template: Normal.dot, Number of Words: 0</ifile><ifilemime>application/msword; charset=binary</ifilemime><uploadInfo>application/octet-stream</uploadInfo></wsifile>
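If you would rather consume the response from a program than read it by eye, a sketch along these lines should work in Python.  The function names parse_wsifile and split_description are made up for this example, and splitting the description on ", " is a guess that happens to fit the Word output above; it would break on a value that itself contains a comma followed by a space.

```python
import xml.etree.ElementTree as ET


def parse_wsifile(xml_text):
    """Pull the three fields out of a wsifile response (str or bytes)."""
    if isinstance(xml_text, str):
        xml_text = xml_text.encode("utf-8")
    root = ET.fromstring(xml_text)
    if root.tag != "wsifile":
        raise ValueError("unexpected root element: %s" % root.tag)
    return {
        "description": root.findtext("ifile", default=""),
        "mime": root.findtext("ifilemime", default=""),
        "upload_info": root.findtext("uploadInfo", default=""),
    }


def split_description(description):
    """Split a magic description into (format name, flags, key/value fields).

    Assumes the parts are separated by ", " and that "key: value" parts
    use ": " as the separator. Parts without ": " are kept as flags.
    """
    parts = description.split(", ")
    format_name = parts[0]
    flags = []
    fields = {}
    for part in parts[1:]:
        key, separator, value = part.partition(": ")
        if separator:
            fields[key] = value
        else:
            flags.append(part)
    return format_name, flags, fields
```

Run against the Word example, this pulls out details such as the creating application and the code page as separate fields.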

Feel free to try it out, and to post reactions and suggestions here.
