RE: Metadata situation and XMP support in Tika

Joerg Ehrlich Tue, 24 Apr 2012 09:03:21 -0700

Oops, forgot the links...

-----Original Message-----

Hi Nick,

Yes, I agree that Tika should support unified access to common metadata 
properties like title, description, keywords, creator, rating, etc. So there 
should be clear semantics for those common properties regardless of the 
underlying implementation in the various metadata containers. And access to 
these properties can be, or should be, as simple as "Metadata.title".
On the other hand, if you think about Tika being used in business workflows 
where clients really care about the underlying semantics and the 
file-format-specific metadata, you might need something more powerful and 
flexible to access and manage metadata. 
And I also agree that the latter should be possible without sacrificing the 
former. 

On a side note:
While the idea of "Someone who understands the format works out how to map the 
file format's metadata onto a common set" is very compelling and sounds easy to 
do, in reality this can get very complicated. And if people have big businesses 
depending on such mappings, they tend to have different opinions about what the 
right way is. That's why we have organizations like the "Metadata Working 
Group" [1] or the W3C "Media Annotation Working Group" [2] trying to clean up 
the mess that has evolved over the last decades in this area.
And the moment you start writing metadata back into files, you will also start 
running into all sorts of complications if you have done too much 
simplification in the read case. But that is not a problem for Tika right now. 
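
To illustrate the write-back concern, here is a minimal sketch of a read-side 
mapping that simplifies too much. The field names ("caption", "comment", 
"description") and the readSimplified helper are assumptions made up for this 
example and do not come from any Tika parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LossyMappingSketch {
    /**
     * Hypothetical read-side mapping: collapses two format-specific
     * fields onto a single common key.
     */
    static Map<String, String> readSimplified(Map<String, String> nativeFields) {
        Map<String, String> common = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : nativeFields.entrySet()) {
            String key = e.getKey();
            if (key.equals("caption") || key.equals("comment")) {
                // Both native fields land on the same common key; the
                // first one seen wins and the other value is dropped.
                common.putIfAbsent("description", e.getValue());
            } else {
                common.put(key, e.getValue());
            }
        }
        return common;
    }

    public static void main(String[] args) {
        Map<String, String> nativeFields = new LinkedHashMap<>();
        nativeFields.put("caption", "Sunset over the bay");
        nativeFields.put("comment", "Shot on a tripod");
        System.out.println(readSimplified(nativeFields));
    }
}
```

From the simplified map alone, a writer can no longer tell whether 
"description" belongs in the native caption field or in the comment field, and 
the second value is already gone.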

I agree with Ray that the current implementation can support both approaches to 
make metadata accessible.
While the metadata map can be used to offer easy access to the common set of 
properties, an XMP output could be used to offer a more extensive, flexible and 
semantically clearer access to a file's metadata.
I agree with Ray that the common set of keys in the Metadata map should 
inherit/alias from well known, standard namespaces like Dublin Core. That's why 
I said the Tika parsers should read metadata using the standard namespaces and 
properties. This would also make the mapping in the parsers clearer for 
developers that want to change something. Currently you always have to guess 
where something is mapped to.
In general, I'd recommend Dublin Core and the semantics of the ISO part of XMP - 
which builds on top of DC - for common and file format neutral Tika properties 
that are offered to clients.
And I agree with Ray that having all metadata interfaces be part of the 
Metadata class is more confusing than helpful for clients.
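
As a rough sketch of how both kinds of access could sit on one store: the 
class below keeps values under namespace-qualified keys and offers a small 
alias table for the common set. TwoViewsSketch, its methods, and the alias 
table are assumptions for illustration only, not Tika's Metadata API; the keys 
"dc:title", "dc:creator" and "xmp:Rating" are ordinary Dublin Core and XMP 
property names.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class TwoViewsSketch {
    // Full store, keyed by namespace-qualified names as a parser might fill it.
    private final Map<String, String> qualified = new LinkedHashMap<>();

    // Hypothetical alias table: common key -> namespace-qualified key.
    private static final Map<String, String> COMMON = new LinkedHashMap<>();
    static {
        COMMON.put("title", "dc:title");
        COMMON.put("creator", "dc:creator");
    }

    void put(String qualifiedKey, String value) {
        qualified.put(qualifiedKey, value);
    }

    /** Simple access: look up a common key through the alias table. */
    String common(String key) {
        String target = COMMON.get(key);
        return target == null ? null : qualified.get(target);
    }

    /** Extensive access: read-only view of every qualified property. */
    Map<String, String> all() {
        return Collections.unmodifiableMap(qualified);
    }

    public static void main(String[] args) {
        TwoViewsSketch md = new TwoViewsSketch();
        md.put("dc:title", "Annual Report");
        md.put("xmp:Rating", "4");
        System.out.println(md.common("title")); // resolved via the alias
        System.out.println(md.all());           // everything, with namespaces
    }
}
```

Clients that only want "the title" use the common lookup, while clients that 
care about the exact namespace read the qualified view.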

I am about to put an architectural metadata roadmap on the Tika Wiki for 
further discussion.
There I want to illustrate a couple of ideas I have also been discussing with 
Jukka so far and the steps we see on a roadmap that should help us to improve 
the metadata situation for Tika.

Regards
Jörg 

[1] http://metadataworkinggroup.com/specs/
[2] http://www.w3.org/TR/2012/REC-mediaont-10-20120209/

-----Original Message-----
From: Ray Gauss II [mailto:ray.ga...@alfresco.com]
Sent: Tuesday, April 24, 2012 15:10
To: dev@tika.apache.org
Subject: Re: Metadata situation and XMP support in Tika

I think the aliasing approach supports both use cases nicely, i.e.:

Metadata.java:
...
   Property TITLE = DublinCore.DC_TITLE; ...

Users then only have to concern themselves with "give me the metadata that best 
fits the idea of Title, as defined by Tika", and do not even have to know about 
DublinCore, but can dig into details of the implementation as needed.
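
A minimal, self-contained sketch of that aliasing idea follows. The Property, 
DublinCore and Metadata types here are simplified stand-ins written for this 
illustration, not Tika's actual classes.

```java
import java.util.HashMap;
import java.util.Map;

class AliasSketch {
    /** Hypothetical stand-in for a namespaced property key. */
    static final class Property {
        final String name;
        Property(String name) { this.name = name; }
    }

    /** Hypothetical holder for Dublin Core keys. */
    static final class DublinCore {
        static final Property DC_TITLE = new Property("dc:title");
    }

    /** Hypothetical metadata container exposing a format-neutral alias. */
    static final class Metadata {
        // The alias: Tika's idea of "title" is defined by Dublin Core.
        static final Property TITLE = DublinCore.DC_TITLE;

        private final Map<String, String> values = new HashMap<>();

        void set(Property p, String value) { values.put(p.name, value); }
        String get(Property p) { return values.get(p.name); }
    }

    public static void main(String[] args) {
        Metadata m = new Metadata();
        // A parser writes with the explicit, standard property ...
        m.set(DublinCore.DC_TITLE, "Annual Report");
        // ... and a client reads through the alias without knowing the standard.
        System.out.println(m.get(Metadata.TITLE));
    }
}
```

Because the alias and the standard property are the same object, parsers and 
clients stay in agreement without the client ever naming DublinCore.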

This separation is less of a concern in the particular case of DublinCore since 
it is such a basic, broad, and widely accepted standard, but for other 
standards direct inclusion in the Metadata interface makes less sense.  
For example, at the moment we're essentially asking users to say "give me the 
metadata that best fits the idea of Keywords, as defined by MSOffice" which 
doesn't make a lot of sense when dealing with something like images.  If we 
aliased:

Metadata.java:
...
   Property KEYWORDS = MSOffice.MS_KEYWORDS; ...

we're back to the intended "give me the metadata that best fits the idea of 
Keywords, as defined by Tika".  In this case, DublinCore.DC_SUBJECT is probably 
a much better standard to alias keywords from than MSOffice, but I'm just 
sticking to the current mappings for this example.

Ray

On Apr 24, 2012, at 7:43 AM, Nick Burch wrote:

> On Fri, 13 Apr 2012, Joerg Ehrlich wrote:
>> I think it would be more clear if parsers/clients would use the namespace or 
>> standard properties explicitly instead of the metadata one. But your idea of 
>> having a set of "standard" properties available in the Metadata class would 
>> be a good help for clients who don't care which "title" or "author" they 
>> read. They could just say "Metadata.title" instead of "DublinCore.title".
> 
> One thing to bear in mind is that we've tried to hide the differences in 
> formats' metadata from end users of Tika. You shouldn't need to know if a 
> format calls it "description" or "subject" or "title" or "dc:title" or 
> "WhatItsAllAbout". Someone who understands the format works out how to map 
> the file format's metadata onto a common set. End users can then say "give me 
> the metadata that best fits the idea of Title, as defined by Dublin Core" and 
> they get something back. The intricacies of the file formats are hidden from 
> them, they get clean and consistent metadata back.
> 
> I certainly see there are cases when someone may want the full set of 
> metadata back from a file, in quite a low-level way, but we should 
> make sure we don't lose the ability of users to say "give me the 
> title of that document, no matter what the format stores it as" that 
> we currently have.
> 
> Nick
