Re: Metadata situation and XMP support in Tika

Mattmann, Chris A (388J) Thu, 05 Apr 2012 07:21:03 -0700

Hi Jörg,

Great summary! I would be in favor of option #2 as well. If we take it slow, 
I think we can limit the client/API impact by using deprecations and the 
other techniques you suggested.
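
To make the deprecation idea concrete, here is a rough sketch of how a legacy 
unprefixed key could stay readable while parsers move to a prefixed one. None 
of this is Tika's actual code: the class AliasingMetadata, the key names and 
the alias table are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Tika's actual Metadata class: a multi-valued
// key-value store in which a legacy unprefixed key aliases a prefixed key.
class AliasingMetadata {
    // Old key kept so existing clients still compile; illustrative name only.
    @Deprecated
    static final String TITLE = "title";
    // New namespaced key that an XMP writer could map directly.
    static final String DC_TITLE = "dc:title";

    // Legacy key -> prefixed replacement (assumed mapping for illustration).
    private static final Map<String, String> ALIASES = Map.of(TITLE, DC_TITLE);

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Resolve a legacy key to its prefixed form; other keys pass through.
    private static String canonical(String key) {
        return ALIASES.getOrDefault(key, key);
    }

    void add(String key, String value) {
        values.computeIfAbsent(canonical(key), k -> new ArrayList<>()).add(value);
    }

    List<String> getValues(String key) {
        return values.getOrDefault(canonical(key), List.of());
    }

    public static void main(String[] args) {
        AliasingMetadata md = new AliasingMetadata();
        md.add(DC_TITLE, "Report");              // a parser writes the new key
        System.out.println(md.getValues(TITLE)); // an old client reads the old key
    }
}
```

Running main prints [Report]: the old key still resolves, so the old constant 
could be removed later without breaking clients in the meantime.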


Looking forward to your participation!

Cheers,
Chris

On Apr 5, 2012, at 5:58 AM, Joerg Ehrlich wrote:

> Hi everyone,
> 
> I am an engineer in the XMP/Metadata team at Adobe and we would like to 
> leverage Tika in current projects for metadata extraction (and mimetype 
> detection).
> Our current systems primarily use the XMP data model to manage and interact 
> with metadata.
> As far as I can see, the support for the XMP data model and also for standard 
> metadata schemas/namespaces (like IPTC, Exif, etc.) in Tika is pretty 
> suboptimal as of today.
> But instead of wrapping Tika in our own layers of code in our systems, we 
> feel that it would be more useful to contribute to the project going 
> forward.
> 
> I have had a closer look at Tika and at how to improve its metadata/XMP 
> output.
> I saw that you have a bug for XMP already (TIKA-756), which I would probably 
> use to submit any patches related to that.
> But I am currently unsure what the best approach would be to do the mapping 
> to XMP and I would like to hear your opinion on it before starting any work.
> 
> Let me quickly summarize the basic metadata concept to check that I have 
> understood it correctly:
> 
> 1. Each parser fills a Metadata map, which is a simple key-value list 
> whose values can also be multi-valued.
> 
> 2. The keys for the Metadata map are mostly taken from fixed lists 
> that are defined as interfaces in the Metadata class.
> 
> 3. Those keys are usually Property objects. The Property class also 
> serves as a static list that registers every property created in the 
> Metadata interfaces. It resembles the XMP data model to some extent but 
> does not store, e.g., any hierarchical information, and it leaves every 
> client the choice of storing property names with or without prefixes.
> 
> 4. Any metadata outputter just iterates over the Metadata map and can 
> query the Property list for additional information.
> 
> 5. The XMP outputter (XMPContentHandler) outputs only those properties 
> that are stored with a prefix in the Property list.
> 
> 
> I see two potential ways to improve the situation:
> 
> 
> 1. Have a fixed mapping table for each MIME type, which would be used in 
> XMPContentHandler to map from the Metadata map to the XMP data model. Such 
> mapping tables would be pretty ugly, as each parser produces different 
> metadata maps and there is no consistent way to handle them. This option 
> would be the least invasive for other clients of Tika but would also be a 
> real hack and would not really improve the metadata situation in Tika in 
> general.
> 
> 2. Try to improve the key interface lists of the Metadata class and adjust 
> all parsers accordingly. This could be done by adding new keys with prefixes 
> and keeping/deprecating the existing ones so as not to disturb existing 
> clients, similar to what is proposed for the DublinCore namespace in 
> TIKA-859 and TIKA-842.
> This would be more invasive but would offer the opportunity to really improve 
> the metadata situation. I already saw a couple of places in the code that 
> clearly break existing standards. But there are also cases where a value 
> might have to be mapped to several properties at the same time: the GPS data 
> from Exif, for example, is currently mapped to the W3C vocabulary in Tika, 
> but for XMP this mapping is defined differently [1] by CIPA (the EXIF 
> standardization committee). So probably both mappings have to be supported.
> 
> I personally would prefer option two. What do you think?
> Looking forward to working with you guys.
> Regards
> Jörg
> 
> [1] http://www.cipa.jp/english/hyoujunka/kikaku/cipa_e_kikaku_list.html
> 
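
For reference, the model summarized in the quoted points 1-5, and the dual GPS 
mapping mentioned under option 2, can be sketched roughly as below. This is a 
standalone illustration and does not use Tika's real classes: 
PrefixFilterSketch and its methods are hypothetical, and the key strings 
"geo:lat" and "exif:GPSLatitude" are only stand-ins for the W3C-style and 
XMP Exif-style properties.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the model in the quoted points 1-5;
// it does not use Tika's real Metadata or Property classes.
class PrefixFilterSketch {
    // Point 1: a key-value map whose values may be multi-valued.
    static Map<String, List<String>> newMetadata() {
        return new LinkedHashMap<>();
    }

    static void add(Map<String, List<String>> md, String key, String value) {
        md.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Point 5: an XMP-style outputter keeps only keys shaped "prefix:name".
    static Map<String, List<String>> xmpVisible(Map<String, List<String>> md) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : md.entrySet()) {
            if (e.getKey().indexOf(':') > 0) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    // Dual mapping: one extracted GPS value stored under two vocabularies.
    // Real vocabularies may format the value differently; ignored here.
    static void addLatitude(Map<String, List<String>> md, String latitude) {
        add(md, "geo:lat", latitude);          // W3C-style key (illustrative)
        add(md, "exif:GPSLatitude", latitude); // XMP Exif-style key (illustrative)
    }

    public static void main(String[] args) {
        Map<String, List<String>> md = newMetadata();
        add(md, "Author", "A. Writer"); // unprefixed: dropped by the filter
        add(md, "dc:subject", "tika");
        add(md, "dc:subject", "xmp");   // second value for the same key
        addLatitude(md, "34.2");
        System.out.println(xmpVisible(md).keySet());
    }
}
```

Running main prints [dc:subject, geo:lat, exif:GPSLatitude]. The unprefixed 
"Author" entry never reaches the XMP output, which is the gap that prefixed 
keys would close.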


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
