Re: [metadata] roadmap proposal available on the wiki

Mattmann, Chris A (388J) Thu, 26 Apr 2012 10:26:53 -0700

Hi Guys,

One comment RE: the below too -- this is precisely where I see
Any23 coming into play and why there is a strong relationship
between it and Tika:


http://incubator.apache.org/any23/

I'm the current Champion for the project and the Tika PMC is 
sponsoring the podling. Please check it out b/c some of the below
may make sense to go into Any23, a downstream Tika consumer.

Thanks!

Cheers,
Chris

On Apr 26, 2012, at 10:13 AM, Joerg Ehrlich wrote:

> Yes, that is exactly my biggest concern. 
> Another nice example is region metadata, such as the output of face 
> detection (taken from the MWG guidance V2):
> <mwg-rs:Regions rdf:parseType="Resource">
>      <mwg-rs:AppliedToDimensions stDim:w="4288" stDim:h="2848" stDim:unit="pixel"/>
>      <mwg-rs:RegionList>
>        <rdf:Bag>
>          <rdf:li rdf:parseType="Resource">
>            <mwg-rs:Area stArea:x="0.5" stArea:y="0.5" stArea:w="0.06" stArea:h="0.09" stArea:unit="normalized"/>
>            <mwg-rs:Type>Face</mwg-rs:Type>
>            <mwg-rs:Title>John Doe</mwg-rs:Title>
>          </rdf:li>
>       ...
> 
> And I also definitely meant to keep the current metadata class API, while 
> doing a best-guess mapping to the internal structural data representation 
> which would at least work pretty well for the common set of properties.
> 
> But as Chris said, let's get started with step 1 and then for step 2 we can 
> start with an extra XMP module for the XMP output. I will update the wiki 
> tomorrow.
> Thanks for taking the time to discuss this.
> 
> Regards
> Jörg
> 
> 
> -----Original Message-----
> From: Ray Gauss II [mailto:ray.ga...@alfresco.com] 
> Sent: Thursday, 26 April 2012 18:03
> To: dev@tika.apache.org
> Subject: Re: [metadata] roadmap proposal available on the wiki
> 
> I think that, besides the namespaces, one of the issues Jörg is trying to 
> tackle is structured metadata, and the extra time and effort he refers to 
> is the serialization of structured data to and from a HashMap.
> 
> For example I may have metadata similar to:
> 
> Contact 1
> |-- First Name
> |-- Last Name
> |-- Email
> |-- Address
>    |-- Street
>    |-- City
>    ...
> Contact 2
> |-- First Name
> |-- Last Name
> |-- Email
> |-- Address
>    |-- Street
>    |-- City
>    ...
> 
> which could be modeled in a HashMap<String, String[]>, but would be better 
> handled by a structured store, be that XMP or something else.
> 
> We could consider replacing the underlying HashMap in Metadata with a 
> structured store while still leaving methods like:
>   public String Metadata.get(Property property)
> intact, making a best guess when the requested property is within a 
> structure, and then add methods like:
>   public Object Metadata.getStructured(Property property)
> for when a user wants the entire structured object.
> 
> That approach should be able to maintain backwards compatibility for existing 
> implementations and allow for structured and namespaced metadata.
> 
> Just a thought,
> 
> Ray
> 
> 
> On Apr 26, 2012, at 11:37 AM, Mattmann, Chris A (388J) wrote:
> 
>> Hi Jörg,
>> 
>> Thanks for your email, comments below:
>> 
>> On Apr 26, 2012, at 3:35 AM, Joerg Ehrlich wrote:
>> 
>>> Hi Chris,
>>> 
>>> Those are all valid points and I agree that you could do everything with a 
>>> Hashmap. 
>>> Having the parsers fill the Metadata class and its Hashmap with all needed 
>>> information which is then consumed by an XMP component sitting on top of 
>>> Tika-Core is definitely an interesting solution which would keep Tika-Core 
>>> clean of any dependencies and make it possible to introduce new XMP-related 
>>> APIs in the least intrusive way.
>>> But from my point of view it is also about how much time and effort you 
>>> would like to spend implementing and testing code in the Metadata class 
>>> when you have something tested and stable that is already available for 
>>> exactly that purpose. 
>> 
>> Well I think our Metadata object is fairly well tested and implemented 
>> atm, so I'm not sure what extra time and effort we're talking about 
>> here? The only extra time and effort I see is in adding this XMP extension 
>> to it.
>> 
>>> Another thought that just comes to my mind is that a lot of file formats 
>>> already use XMP as one or even the only metadata container and you would 
>>> then end up filling the metadata map with the data from the file's XMP and 
>>> converting it back to XMP later on, compared to just being able to parse it 
>>> as is and having most of the metadata available right away. 
>> 
>> Yep, in tika-xmp (new module) this might be less efficient, but it will 
>> preserve a lot of familiarity for folks who are used to maintaining the 
>> existing Metadata object internals, models, etc.
>> 
>> Anyways, feel free to push forward, I am just letting you know I am 
>> against changing the internals of the Metadata model, at least at the 
>> moment :) At the same time your enthusiasm is great and all I can say 
>> is you are doing great and push forward and we'll see where we get...
>> 
>> Cheers,
>> Chris
> 


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
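
A minimal Java sketch of the idea Ray describes in the quoted thread above: a
structured store behind a flat get(), plus a getStructured() accessor. This is
not Tika's Metadata class. The class name StructuredMetadataSketch, the String
keys used in place of Property, the "first leaf found" best-guess rule, and the
path syntax produced by flatten() are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch only; names and behavior are assumptions, not Tika API. */
public class StructuredMetadataSketch {

    // A node is a String leaf, a Map<String, Object>, or a List<Object>.
    private final Map<String, Object> root = new LinkedHashMap<>();

    /** Stores a leaf or a nested structure under a top-level name. */
    public void set(String name, Object value) {
        root.put(name, value);
    }

    /** Returns the entire structured object, or null if the name is absent. */
    public Object getStructured(String name) {
        return root.get(name);
    }

    /** Flat, backwards-compatible lookup: best guess is the first leaf, depth-first. */
    public String get(String name) {
        return firstLeaf(root.get(name));
    }

    private static String firstLeaf(Object node) {
        if (node == null) {
            return null;
        }
        if (node instanceof Map) {
            for (Object child : ((Map<?, ?>) node).values()) {
                String leaf = firstLeaf(child);
                if (leaf != null) {
                    return leaf;
                }
            }
            return null;
        }
        if (node instanceof List) {
            for (Object child : (List<?>) node) {
                String leaf = firstLeaf(child);
                if (leaf != null) {
                    return leaf;
                }
            }
            return null;
        }
        return node.toString();
    }

    /** Flattens the tree into path-keyed strings, as a plain map-backed model would need. */
    public Map<String, String> flatten() {
        Map<String, String> out = new LinkedHashMap<>();
        flattenInto("", root, out);
        return out;
    }

    private static void flattenInto(String path, Object node, Map<String, String> out) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                String childPath = path.isEmpty() ? e.getKey().toString() : path + "/" + e.getKey();
                flattenInto(childPath, e.getValue(), out);
            }
        } else if (node instanceof List) {
            List<?> items = (List<?>) node;
            for (int i = 0; i < items.size(); i++) {
                flattenInto(path + "[" + i + "]", items.get(i), out);
            }
        } else if (node != null) {
            out.put(path, node.toString());
        }
    }

    public static void main(String[] args) {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("Street", "1 Main St");
        address.put("City", "Springfield");
        Map<String, Object> contact = new LinkedHashMap<>();
        contact.put("First Name", "John");
        contact.put("Address", address);
        List<Object> contacts = new ArrayList<>();
        contacts.add(contact);

        StructuredMetadataSketch md = new StructuredMetadataSketch();
        md.set("Contact", contacts);
        System.out.println(md.get("Contact"));  // flat best-guess value
        System.out.println(md.flatten());       // path-keyed view of the same data
    }
}
```

The flatten() method shows the bookkeeping a plain map-backed model pushes onto
callers (inventing and parsing path keys), while get() shows how existing flat
lookups could keep working on top of a structured store.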
