format identification
AONS version 2, like the first version, will use Droid and JHove for format identification. There is a bit of a catch-22 with regard to format identification tools and obsolescence/risk:
- People have, as far as I am aware, mainly written identifiers for widely used formats
- Formats which are most "at risk" will be less widely used and probably won't have identifiers
Since our repository crawl will give a metric for unidentified files, we should make sure this metric is included when assessing the risk of obsolete documents within a repository. The unidentified files are exactly the ones most likely to be in at-risk formats, so a repository with a high number of them should factor that in when viewing the obsolescence information.
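As a rough illustration of how the unidentified count could sit next to the obsolescence figures, here is a minimal Python sketch. The function name, the field names and the idea of reporting a lower and upper bound are my own assumptions for illustration, not something AONS does today.

```python
def summarise_risk(total_files, unidentified_files, at_risk_identified):
    """Summarise obsolescence risk for one repository crawl.

    Hypothetical helper: every name here is illustrative only.
    - total_files: all files seen by the crawl
    - unidentified_files: files no identifier could recognise
    - at_risk_identified: identified files whose format is flagged as at risk
    """
    if total_files <= 0:
        raise ValueError("total_files must be positive")
    if unidentified_files + at_risk_identified > total_files:
        raise ValueError("counts exceed total_files")

    return {
        # Share of the repository we know nothing about.
        "unidentified_ratio": unidentified_files / total_files,
        # Risk if none of the unidentified files are at risk.
        "at_risk_lower_bound": at_risk_identified / total_files,
        # Risk if every unidentified file turns out to be at risk.
        "at_risk_upper_bound": (at_risk_identified + unidentified_files) / total_files,
    }
```

A wide gap between the two bounds is the signal that the obsolescence numbers for that repository should be read with caution.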
Also, I'm a little worried about the reliability of JHove and Droid for file format identification. Looking at this post on the DSpace wiki makes me wonder whether extension-based identification would be an imperfect yet doable solution.
I think what I'll do is a bit of a combination approach:
- The crawler gets a format resource as well as basic metadata about it (location, full name, extension, MIME type, etc.)
- We have a list of format identifiers, each of which will attempt to further identify the file (maybe determine its version?) using both the existing metadata and the actual file itself
- Once all "identifiers" have run against the file, the final metadata should be at least as good as it was when we began the identification
- With this potentially improved metadata in hand, we'll try and look for internal formats (those in the deployed AONS system) which match
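The combination approach above could be sketched roughly as follows. This is Python purely for illustration; the class, the function names and the single PDF identifier are all hypothetical and are not part of AONS, Droid or JHove. The merge step is what enforces the "at least as good as when we began" rule: an identifier can fill in or refine a field, but it can never blank one out.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class FormatMetadata:
    # Basic metadata the crawler is assumed to supply.
    location: str
    full_name: str
    extension: str
    mime_type: Optional[str] = None
    version: Optional[str] = None


# An identifier looks at the metadata so far plus the file content
# and returns (possibly improved) metadata.
Identifier = Callable[[FormatMetadata, bytes], FormatMetadata]


def merge(before: FormatMetadata, after: FormatMetadata) -> FormatMetadata:
    """Take new values from `after`, but never lose a value `before` had."""
    return replace(
        before,
        mime_type=after.mime_type or before.mime_type,
        version=after.version or before.version,
    )


def pdf_header_identifier(meta: FormatMetadata, content: bytes) -> FormatMetadata:
    """Example identifier: PDF files begin with '%PDF-' followed by a version."""
    if content.startswith(b"%PDF-"):
        version = content[5:8].decode("ascii", errors="replace")
        return replace(meta, mime_type="application/pdf", version=version)
    return meta


def run_identifiers(
    meta: FormatMetadata, content: bytes, identifiers: Iterable[Identifier]
) -> FormatMetadata:
    """Run every identifier in turn, merging each result into the metadata."""
    for identify in identifiers:
        try:
            candidate = identify(meta, content)
        except Exception:
            # A failing identifier must not make the metadata worse.
            continue
        meta = merge(meta, candidate)
    return meta
```

Note that this sketch only guards against losing information. An identifier that overwrites a correct MIME type with a wrong one would still get through, so some notion of confidence per identifier would probably be needed in a real implementation.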
I'm also going to try to put in place relatively easy ways for repository owners to create basic mappings between extensions and MIME types, so that niche formats are easily identified.
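A minimal sketch of what such an owner-supplied mapping might look like, again in Python for illustration only. The function name, the mapping structure and the example MIME type string are assumptions; the fallback uses Python's standard `mimetypes` module simply as a stand-in for whatever default table the real system would ship with.

```python
import mimetypes
from pathlib import PurePosixPath
from typing import Mapping, Optional


def guess_mime(filename: str, owner_mappings: Mapping[str, str]) -> Optional[str]:
    """Resolve a MIME type from a file name.

    Mappings supplied by the repository owner (extension -> MIME type,
    extensions in lower case without the dot) take priority over the defaults.
    """
    extension = PurePosixPath(filename).suffix.lower().lstrip(".")
    if extension in owner_mappings:
        return owner_mappings[extension]
    # Fall back to the standard library's built-in table.
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed


# Hypothetical mapping a repository owner might add for a niche format.
owner_mappings = {"sav": "application/x-spss-sav"}
print(guess_mime("survey.SAV", owner_mappings))
```

Files whose extension is in neither table come back as `None`, which is what would feed the unidentified-files count.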