Hi Guys,

One comment RE: the below too -- this is precisely where I see Any23 coming
into play and why there is a strong relationship between it and Tika:

http://incubator.apache.org/any23/

I'm the current Champion for the project and the Tika PMC is sponsoring the
podling. Please check it out b/c some of the below may make sense to go into
Any23, a downstream Tika consumer.

Thanks!

Cheers,
Chris

On Apr 26, 2012, at 10:13 AM, Joerg Ehrlich wrote:

> Yes, that is exactly my biggest concern.
> Another nice example is regional metadata like from a face detection (taken
> from MWG guidance V2):
> <mwg-rs:Regions rdf:parseType="Resource">
>   <mwg-rs:AppliedToDimensions stDim:w="4288" stDim:h="2848"
>       stDim:unit="pixel"/>
>   <mwg-rs:RegionList>
>     <rdf:Bag>
>       <rdf:li rdf:parseType="Resource">
>         <mwg-rs:Area stArea:x="0.5" stArea:y="0.5" stArea:w="0.06"
>             stArea:h="0.09" stArea:unit="normalized"/>
>         <mwg-rs:Type>Face</mwg-rs:Type>
>         <mwg-rs:Title>John Doe</mwg-rs:Title>
>       </rdf:li>
> ...
>
> And I also definitely meant to keep the current metadata class API, while
> doing a best-guess mapping to the internal structural data representation
> which would at least work pretty well for the common set of properties.
>
> But as Chris said, let's get started with step 1 and then for step 2 we can
> start with an extra XMP module for the XMP output. I will update the wiki
> tomorrow.
> Thanks for taking the time to discuss this.
>
> Regards
> Jörg
>
>
> -----Original Message-----
> From: Ray Gauss II [mailto:ray.ga...@alfresco.com]
> Sent: Thursday, 26 April 2012 18:03
> To: dev@tika.apache.org
> Subject: Re: [metadata] roadmap proposal available on the wiki
>
> I think besides the namespaces, one of the issues Jörg is trying to tackle is
> the structured metadata and the extra time and effort referred to is dealing
> with serialization of structured data to and from a hashmap.
>
> For example I may have metadata similar to:
>
> Contact1
>   |-- First Name
>   |-- Last Name
>   |-- Email
>   |-- Address
>         |-- Street
>         |-- City
>   ...
> Contact 2
>   |-- First Name
>   |-- Last Name
>   |-- Email
>   |-- Address
>         |-- Street
>         |-- City
>   ...
>
> which could be modeled in a HashMap<String, String[]>, but would be better
> handled by a structured store, be that XMP or something else.
>
> We could consider replacing the underlying Hashmap in Metadata with a
> structured store while still leaving methods like:
>     public String Metadata.get(Property property)
> intact but could then make a best guess when the requested property is
> within a structure, then add methods like:
>     public Object Metadata.getStructured(Property property)
> when a user wants the entire structured object.
>
> That approach should be able to maintain backwards compatibility for existing
> implementations and allow for structured and namespaced metadata.
>
> Just a thought,
>
> Ray
>
>
> On Apr 26, 2012, at 11:37 AM, Mattmann, Chris A (388J) wrote:
>
>> Hi Jörg,
>>
>> Thanks for your email, comments below:
>>
>> On Apr 26, 2012, at 3:35 AM, Joerg Ehrlich wrote:
>>
>>> Hi Chris,
>>>
>>> Those are all valid points and I agree that you could do everything with a
>>> Hashmap.
>>> Having the parsers fill the Metadata class and its Hashmap with all needed
>>> information which is then consumed by an XMP component sitting on top of
>>> Tika-Core is definitely an interesting solution which would keep Tika-Core
>>> clean of any dependencies and give the ability to introduce new XMP related
>>> APIs in a least intrusive way.
>>> But from my point of view it is also about how much time and effort you
>>> would like to spend implementing and testing code in the Metadata class
>>> when you have something tested and stable that is already available for
>>> exactly that purpose.
>>
>> Well I think our Metadata object is fairly well tested and implemented
>> atm, so I'm not sure what extra time and effort we're talking about
>> here? The only extra time and effort I see is in adding this XMP extension
>> to it.
>>
>>> Another thought that just comes to my mind is that a lot of file formats
>>> already use XMP as one or even the only metadata container and you would
>>> then end up filling the metadata map with the data from the file's XMP and
>>> converting it back to XMP later on, compared to just being able to parse it
>>> as is and having most of the metadata available right away.
>>
>> Yep in tika-xmp (new module) this might be less efficient, but it will
>> maintain a lot of familiarity with folks who are used to maintaining the
>> existing Metadata object internals and models/etc.
>>
>> Anyways, feel free to push forward, I am just letting you know I am
>> against changing the internals of the Metadata model, at least at the
>> moment :) At the same time your enthusiasm is great and all I can say
>> is you are doing great and push forward and we'll see where we get...
>>
>> Cheers,
>> Chris
>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
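
P.S. For anyone skimming the thread: the backwards-compatible idea Ray
describes in the quoted mail (keep a flat get() that makes a best guess, add a
getStructured() that returns the whole structure) could look roughly like the
sketch below. This is only an illustration under stated assumptions, not
Tika's Metadata implementation: the class name StructuredMetadataSketch, the
slash-separated paths, the use of plain String keys instead of Property, and
the "first matching leaf, depth-first" best-guess rule are all made up for the
example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Tika's Metadata class: a nested-map store with a
// flat accessor (best guess) and a structured accessor (whole subtree).
final class StructuredMetadataSketch {
    // A node is either a String leaf or a nested Map<String, Object>.
    private final Map<String, Object> root = new LinkedHashMap<>();

    /** Stores a leaf at a slash-separated path, creating parent structures. */
    @SuppressWarnings("unchecked")
    void set(String path, String value) {
        String[] parts = path.split("/");
        Map<String, Object> node = root;
        for (int i = 0; i < parts.length - 1; i++) {
            Object child = node.get(parts[i]);
            if (!(child instanceof Map)) {
                // Create (or overwrite a leaf with) an intermediate structure.
                child = new LinkedHashMap<String, Object>();
                node.put(parts[i], child);
            }
            node = (Map<String, Object>) child;
        }
        node.put(parts[parts.length - 1], value);
    }

    /** Returns the node at an exact path: a String, a Map, or null. */
    Object getStructured(String path) {
        Object node = root;
        for (String part : path.split("/")) {
            if (!(node instanceof Map)) {
                return null;
            }
            node = ((Map<?, ?>) node).get(part);
        }
        return node;
    }

    /**
     * Flat accessor kept for existing callers. An exact path to a leaf wins;
     * otherwise the first leaf with that name (depth-first, insertion order)
     * is returned as the best guess.
     */
    String get(String name) {
        Object exact = getStructured(name);
        if (exact instanceof String) {
            return (String) exact;
        }
        return findFirstLeaf(root, name);
    }

    private static String findFirstLeaf(Map<?, ?> node, String name) {
        for (Map.Entry<?, ?> entry : node.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof String) {
                if (entry.getKey().equals(name)) {
                    return (String) value;
                }
            } else if (value instanceof Map) {
                String hit = findFirstLeaf((Map<?, ?>) value, name);
                if (hit != null) {
                    return hit;
                }
            }
        }
        return null;
    }
}
```

With Ray's contact example, set("Contact1/Address/City", ...) followed by
get("City") would hand back Contact1's city to a caller that only knows the
flat API, while getStructured("Contact1/Address") returns the nested map.
Whether "first match" is the right best guess is exactly the kind of thing
that would need discussion on the list.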