Hi Jörg,

Great summary! I would be in favor of option #2 as well, with the caveat that if we take it slow, I think there is a way to keep the client/API impact small by using deprecations and the other techniques you suggested.
Looking forward to your participation!

Cheers,
Chris

On Apr 5, 2012, at 5:58 AM, Joerg Ehrlich wrote:

> Hi everyone,
>
> I am an engineer in the XMP/Metadata team at Adobe and we would like to
> leverage Tika in current projects for metadata extraction (and mimetype
> detection).
> Our current systems primarily use the XMP data model to manage and interact
> with metadata.
> As far as I can see, the support for the XMP data model and also for standard
> metadata schemas/namespaces (like IPTC, Exif, etc.) in Tika is pretty
> suboptimal as of today.
> But instead of wrapping Tika in own layers of code in our systems, we feel
> that it would be more useful to contribute to the project instead going
> forward.
>
> I have had a deeper look in Tika and how to improve the metadata/XMP output
> of it.
> I saw that you have a bug for XMP already (TIKA-756), which I would probably
> use to submit any patches related to that.
> But I am currently unsure what the best approach would be to do the mapping
> to XMP and I would like to hear your opinion on it before starting any work.
>
> Let me quickly summarize if I have understood the basic metadata concept
> correctly:
>
> 1. Each parser fills a Metadata map which is a simple key-value list
>    where values can also be multi-values
>
> 2. Mostly the keys for the Metadata map are taken from fixed lists
>    which are defined as interfaces in the Metadata class
>
> 3. Those keys are usually Property objects, where the Property class
>    also serves as a static list which registers every property that is created
>    in the Metadata interfaces. This Property class resembles the XMP data model
>    to some extent but does not store e.g. any hierarchical information. And it
>    leaves every client the choice to store property names with prefixes or not.
>
> 4. Any metadata outputter just iterates over the Metadata map and could
>    query the Property list for additional information.
>
> 5. In case of the XMP outputter (XMPContentHandler) only those
>    properties are outputted which are stored with a prefix in the Property list.
>
>
> I see two potential ways to improve the situation:
>
>
> 1. Have a fixed mapping table for each mime type which would be used in
>    XMPContentHandler to map from the Metadata map to the XMP data model. Such
>    mapping tables would be pretty ugly as each parser produces different
>    metadata maps and there is no consistent way to handle them. This option
>    would be least invasive for other clients of Tika but would also be a real
>    hack and would not really improve the metadata situation in Tika in general.
>
> 2. Try to improve the Key interface lists of Metadata class and adjust
>    all parsers accordingly. This could be done by adding new keys with prefixes
>    and keeping/deprecating the existing ones to not disturb existing clients.
>    Similar to what is proposed for the DublinCore namespace in TIKA-859 and
>    TIKA-842.
>    This would be more invasive but would offer the opportunity to really improve
>    the metadata situation. I already saw a couple of places in the code that
>    clearly break existing standards. But there are also examples where mapping
>    might have to be done to different properties at the same time: If you look
>    at the mapping of GPS data from Exif, this is currently mapped to W3C
>    vocabulary in Tika. But in XMP this mapping is defined otherwise [1] by CIPA
>    (the EXIF standardization committee). So probably both mappings have to be
>    supported.
>
> I personally would prefer option two. What do you think?
> Looking forward to working with you guys.
> Regards
> Jörg
>
> [1] http://www.cipa.jp/english/hyoujunka/kikaku/cipa_e_kikaku_list.html
>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
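
To make option 2 from the quoted proposal a bit more concrete, here is a rough sketch in plain Java of how a prefixed key could be added next to a deprecated legacy key. It deliberately does not use Tika's own classes: `SimpleMetadata`, `TITLE`, `DC_TITLE`, `setTitle` and `xmpKeys` are hypothetical names made up for illustration, and the "has a namespace prefix" check (a colon in the key) is only an assumption about how an XMP-style outputter might select properties.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class MetadataKeyMigration {

    /** Hypothetical stand-in for a multi-valued key-value metadata map (not Tika's Metadata class). */
    static final class SimpleMetadata {
        private final Map<String, List<String>> values = new LinkedHashMap<>();

        void add(String key, String value) {
            values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }

        List<String> get(String key) {
            List<String> found = values.get(key);
            return found == null ? Collections.<String>emptyList() : found;
        }

        Set<String> names() {
            return values.keySet();
        }
    }

    /** Legacy, unprefixed key: kept so existing clients keep working, but marked deprecated. */
    @Deprecated
    static final String TITLE = "title";

    /** New key carrying a namespace prefix (hypothetical name, Dublin Core style). */
    static final String DC_TITLE = "dc:title";

    /** Parser side: during the transition, write the value under both the new and the legacy key. */
    static void setTitle(SimpleMetadata metadata, String title) {
        metadata.add(DC_TITLE, title);
        metadata.add(TITLE, title);
    }

    /** Outputter side: keep only keys that carry a prefix (assumed here to mean "contains a colon"). */
    static List<String> xmpKeys(SimpleMetadata metadata) {
        List<String> prefixed = new ArrayList<>();
        for (String name : metadata.names()) {
            if (name.indexOf(':') > 0) {
                prefixed.add(name);
            }
        }
        return prefixed;
    }

    public static void main(String[] args) {
        SimpleMetadata metadata = new SimpleMetadata();
        setTitle(metadata, "Annual Report");
        System.out.println("All keys:      " + metadata.names());
        System.out.println("Prefixed keys: " + xmpKeys(metadata));
    }
}
```

In this sketch old clients can still read the legacy key while an XMP-style outputter only sees the prefixed one. The same double-write idea could cover the Exif GPS case from the proposal, where one source value may need to appear under both the W3C and the CIPA-defined property.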