Hi Jörg,

On Apr 25, 2012, at 10:27 AM, Joerg Ehrlich wrote:

>
>> I am not strongly supportive of changing the HashMap internal
>> representation in Metadata out.
>> A couple of things I like about the HashMap:
>>
>> * It's simple.
>> * It doesn't require dependency on any external libraries and helps keep
>> tika-core minimal.
>>
>> Wouldn't it be possible for example to simply have XMP be something that
>> sits on top of the Metadata object?
>
> There are definitely a lot of different ways we could implement the metadata
> handling in Tika.
> But having XMP as the underlying data model of whatever implementation we are
> going to choose, has the following rationale from my perspective:
>
> Right now Tika is just providing a limited, common set of metadata for all
> supported file formats. We already said that this is fine and should stay
> that way, which also means clients can continue to use the current simple
> API. But there are clients and use cases which would like to have access to
> more than the currently supported limited set of metadata and also have
> semantic information travelling with it (i.e. Namespace information).

Agreed, that's fine by me too, I'm +1 to support those clients.

> One example for extended metadata interest are video workflows where you see
> more and more temporal metadata being used which is quite structured and
> complex (compared to a simple property like a title). The same is true for
> face recognition metadata that all current image applications (and also
> camera devices) are already writing into the assets.
> An example for the importance of semantic information: At Adobe we already
> have to worry about something as simple as the "creation date". Because it
> could be the date the asset has been written to the hard disk I am currently
> looking at, it could be the date the original creator has written it on his
> hard disk, it could also be the date the art work has been digitized (i.e.
> scanned) or it could be the date the work shown on the digital image has been
> created. Namespaces provide that information.

Yep agreed. We have the same issues at NASA too for all sorts of planetary,
Earth science, astrophysics, and other data :) Metadata is super important, and
Tika's support for it is definitely basic at best. It needs to be improved.

> Oh, and copyright information is also pretty sensitive when it comes to
> semantics :)
> A namespace registry as it is provided by the XMP library is in this case
> pretty handy, because storing information in just prefixes is easy, but also
> dangerous as they are just variables.
>
> I would argue that it is difficult to store such data faithfully with a
> simple Hashmap. And having two data models storing data is pretty error prone.

I'm not convinced that it's difficult to store data faithfully in a hash map.
You can encode all sorts of information in field keys (including namespaces).
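
As a rough illustration of that last point, here is a minimal sketch of how a
namespace could be carried inside a plain HashMap key. It is not Tika's actual
API: the class NamespacedKeys, its methods, and the "{namespaceURI}localName"
key layout are all assumptions made up for this example. The key embeds the
full namespace URI rather than a prefix, since prefixes are just variables.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Tika): packs a namespace URI and a local name into one map key. */
final class NamespacedKeys {
    private NamespacedKeys() {}

    /** Builds a key of the assumed form "{namespaceURI}localName". */
    static String encode(String namespaceUri, String localName) {
        if (namespaceUri.indexOf('}') >= 0) {
            throw new IllegalArgumentException("namespace URI must not contain '}'");
        }
        return "{" + namespaceUri + "}" + localName;
    }

    /** Returns the namespace URI of an encoded key, or "" if the key carries none. */
    static String namespaceOf(String key) {
        int end = key.indexOf('}');
        return (key.startsWith("{") && end > 0) ? key.substring(1, end) : "";
    }

    /** Returns the local name of an encoded key; a key without a namespace is returned unchanged. */
    static String localNameOf(String key) {
        int end = key.indexOf('}');
        return (key.startsWith("{") && end > 0) ? key.substring(end + 1) : key;
    }

    public static void main(String[] args) {
        // Two different "creation dates" that stay distinct because of their namespaces.
        // The property names and values here are illustrative only.
        String xmpNs = "http://ns.adobe.com/xap/1.0/";
        String photoshopNs = "http://ns.adobe.com/photoshop/1.0/";

        Map<String, String> metadata = new HashMap<>();
        metadata.put(encode(xmpNs, "CreateDate"), "2012-04-25T10:27:00Z");
        metadata.put(encode(photoshopNs, "DateCreated"), "1998-07-04");
        metadata.put("title", "Example"); // a simple, un-namespaced key still works

        String key = encode(xmpNs, "CreateDate");
        System.out.println(namespaceOf(key) + " / " + localNameOf(key) + " = " + metadata.get(key));
    }
}
```

Existing clients that only use simple keys such as "title" are unaffected by
this layout. What it does not provide is anything a namespace registry or a
structured data model would add, such as nested or typed values.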
We discussed this a long time ago in Tika (I think Bertrand reported it when
we were in the Incubator):

https://issues.apache.org/jira/browse/TIKA-61

The discussion then was that XMP would be something that we could use to help
drive it, but I'm just saying I don't think it's the HashMap that's the
limitation here. Why couldn't we simply add a new module at the tika-* level,
called tika-xmp? At the very least, it would be the least intrusive way of
exploring some of these ideas.

>
> The XMP library would add a dependency and size to Tika-Core but it is really
> just the data model and a parser/serializer for the XML/RDF, so the footprint
> is small.
> Additionally if you want to provide XMP output from Tika you need to have
> something like the XMP library to manage and serialize the data, because it
> would be too painful to write a decent XML/RDF serializer again.

I'm +1 to try out some of these ideas, but think that doing it in a tika-xmp
module at the same level as e.g., tika-core, tika-server, etc., might be less
intrusive.

Thanks for discussing this with me.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++