Friday, May 4, 2007

format identification

AONS version 2, like the first version, will use Droid and JHove for format identification. There is a bit of a catch-22 with format identification tools and obsolescence risk:

  1. People have, as far as I am aware, mainly written identifiers for widely used formats
  2. Formats which are most "at risk" will be less widely used and probably won't have identifiers
So there is a good chance that neither Droid nor JHove will be able to identify important, yet lesser used, formats. If important yet obscure files in a repository may go unidentified, the result of a repository scan should include the location, number and all known signifiers (maybe extensions?) of these unidentified files. A very useful application (maybe AONS 2.X or maybe AONS 3.X) would also allow a user to view individual unidentified files.
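As a rough idea of what that part of the scan result could look like, here is a minimal Python sketch. It is not AONS code: the function name `summarise_unidentified` and the shape of the result are my own assumptions, and the only "signifier" it uses is the file extension.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def summarise_unidentified(paths):
    """Group the locations of unidentified files by lower-cased extension.

    `paths` is an iterable of repository locations (assumed to be
    POSIX-style path strings) that no identifier could recognise.
    """
    by_extension = defaultdict(list)
    for path in paths:
        # Files without any extension are grouped under "(none)".
        ext = PurePosixPath(path).suffix.lower() or "(none)"
        by_extension[ext].append(path)
    return {
        "total": sum(len(locations) for locations in by_extension.values()),
        "by_extension": dict(by_extension),
    }


summary = summarise_unidentified(
    ["/repo/a/plan.DWG", "/repo/b/site.dwg", "/repo/c/README"]
)
```

Grouping by extension keeps the count, the locations and the one signifier we always have in a single structure, which a later version could use to let a user drill down to individual files.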

Knowing that our repository crawl will give a metric for unidentified files, we should ensure that this metric is included when assessing the risk of obsolete documents within a repository: anyone viewing the obsolescence information for a repository with a high number of unidentified files should factor that in.
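One simple way to surface the metric is to show the share of unidentified files next to the obsolescence information. The sketch below is only an illustration: the function names and the 10% warning threshold are assumptions of mine, not anything decided for AONS.

```python
def unidentified_share(total_files, unidentified_files):
    """Return the fraction of crawled files that could not be identified."""
    if total_files == 0:
        return 0.0
    return unidentified_files / total_files


def risk_report_header(total_files, unidentified_files, warn_above=0.10):
    """Build a one-line summary to display above the obsolescence report.

    `warn_above` is an arbitrary, assumed threshold for adding a caution.
    """
    share = unidentified_share(total_files, unidentified_files)
    if share > warn_above:
        note = "treat the obsolescence figures with caution"
    else:
        note = "coverage looks reasonable"
    return (
        f"{unidentified_files} of {total_files} files unidentified "
        f"({share:.0%}): {note}"
    )


header = risk_report_header(200, 50)
```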

Also, I'm a little worried about the reliability of JHove and Droid for file format identification. Looking at this post on the DSpace wiki makes me wonder whether extension-based identification would be an imperfect yet doable solution.

I think what I'll do is a bit of a combination approach:
  1. Crawler gets a format resource as well as basic metadata about it (like location, full name, extension, mime type etc)
  2. We have a list of format identifiers which will, for each, attempt to further identify the file (maybe determine version?) by using both the previous metadata and the actual file itself
  3. Once all "identifiers" have run against the file, the final metadata should be at least as good as it was when we began the identification
  4. With this potentially improved metadata in hand, we'll try and look for internal formats (those in the deployed AONS system) which match
Should a match be found between a repository format resource and an internal AONS format, we'll be able to link in the obsolescence information we have for that format. Where no link is made, we will give the unidentified object, and any others with the same metadata, the highest risk assessment possible.
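To make the steps above a bit more concrete, here is a minimal Python sketch of the pipeline. Everything in it is hypothetical: the `FormatResource` fields, the two toy identifiers, the 0.0 to 1.0 risk scale and the idea of keying internal formats by mime type are assumptions for illustration. In the real thing the identifiers would wrap Droid and JHove rather than check a header by hand.

```python
from dataclasses import dataclass, replace
from typing import Optional

HIGHEST_RISK = 1.0  # assumed risk scale: 0.0 (safe) to 1.0 (highest)


@dataclass(frozen=True)
class FormatResource:
    """Basic metadata the crawler collects (hypothetical fields)."""
    location: str
    extension: str
    mime_type: Optional[str] = None
    version: Optional[str] = None


def pdf_by_extension(resource, head):
    """Toy identifier that uses only the crawler metadata."""
    if resource.extension == "pdf":
        return {"mime_type": "application/pdf"}
    return {}


def pdf_version_from_header(resource, head):
    """Toy identifier that looks at the file itself (its first bytes)."""
    if head.startswith(b"%PDF-"):
        return {"mime_type": "application/pdf",
                "version": head[5:8].decode("ascii")}
    return {}


def run_identifiers(resource, head, identifiers):
    """Run every identifier in turn and merge what each one finds.

    Only fields that are still empty get filled in, so the metadata
    can never end up worse than what the crawler started with.
    """
    for identify in identifiers:
        found = identify(resource, head)
        updates = {
            key: value for key, value in found.items()
            if value is not None
            and hasattr(resource, key)
            and getattr(resource, key) is None
        }
        resource = replace(resource, **updates)
    return resource


def assess_risk(resource, internal_risks):
    """Look up the matching internal format; no match means highest risk."""
    return internal_risks.get(resource.mime_type, HIGHEST_RISK)


identifiers = [pdf_by_extension, pdf_version_from_header]
internal_risks = {"application/pdf": 0.2}  # made-up value

known = run_identifiers(
    FormatResource("/repo/x.pdf", "pdf"), b"%PDF-1.4\n", identifiers
)
unknown = run_identifiers(
    FormatResource("/repo/y.xyz", "xyz"), b"\x00\x01", identifiers
)
```

Filling only empty fields is the bluntest way to keep the "at least as good" guarantee. It also means an identifier can never correct a wrong value, so a real implementation would probably need some notion of confidence.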

I'm also going to try to put in place relatively easy methods for repository owners to create basic mappings between extensions and mime types, so that niche formats are easily identified.
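A mapping like that could be as simple as a lookup table consulted before any general-purpose guess. The sketch below assumes the owner's mappings arrive as a plain dictionary; the `nch` extension and its mime type are made up, and the fallback uses `mimetypes.guess_type` from the Python standard library.

```python
import mimetypes
from pathlib import PurePosixPath


def guess_mime(filename, owner_mappings):
    """Return a mime type for `filename`, preferring the owner's mappings.

    `owner_mappings` maps lower-case extensions (without the dot) to mime
    types. Falls back to the standard library's extension table, and
    returns None when nothing matches.
    """
    ext = PurePosixPath(filename).suffix.lower().lstrip(".")
    if ext in owner_mappings:
        return owner_mappings[ext]
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


# Hypothetical niche format registered by a repository owner.
owner_mappings = {"nch": "application/x-example-niche"}
niche = guess_mime("survey/field-data.NCH", owner_mappings)
```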
