Thanks Nick, +1. I'll try and follow and see if I can help in places.
Cheers,
Chris

On May 16, 2012, at 5:50 AM, Nick Burch wrote:

> Hi All
>
> I've just been brainstorming with Ray Gauss, and we think we've come up with
> a way to move towards cleaner and clearer metadata property definitions
> (prefixes, properties with types etc), whilst maintaining backwards
> compatibility and avoiding too much work for parsers during the migration.
> It'll hopefully also help with the larger plan of improving the metadata, and
> make life easier for people working on that.
>
> I'll use DublinCore as an example, but it's not the only one this'll apply to.
>
> Today, we have all the keys from DublinCore imported onto the Metadata
> object, and all the parsers call eg Metadata.DESCRIPTION rather than
> DublinCore.DESCRIPTION. This is a string key, not a property, so there's no
> information on it about type etc, and it's a raw key of "description" so
> people outside of the Java space (eg tika-cli users) don't know what it is
> defined as.
>
> What I think we'd really like is for that to be a property, with type, with a
> key that includes our chosen prefix (so that tika-cli users etc know what it
> is), that doesn't break backwards compatibility until 2.0.
>
> Additionally, we want to identify which properties are common, which all
> parsers should be mapping their metadata onto (eg everything should map the
> metadata that corresponds roughly to what Dublin Core explains Description to
> be, no matter what the format calls it), in addition to any format specific
> ones (which only advanced users want).
>
> We think we have a plan!
>
> In order to avoid breaking backwards compatibility, we've looked and
> basically nothing uses the metadata key interfaces directly. Everything seems
> to use the Metadata one instead, eg Metadata.DESCRIPTION rather than
> DublinCore.DESCRIPTION. So, we think we can change the Dublin Core one,
> provided that Metadata is unchanged.
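
To make the difference between a raw string key and a typed, prefixed property concrete, here is a minimal sketch. It is not Tika's actual code: the class names, the ValueType enum and the "dc" prefix are all assumptions made up for illustration, since the mail does not say which prefix will be chosen.

```java
/** Hypothetical sketch only; these are not Tika's real classes. */
final class TypedPropertySketch {

    /** Illustrative subset of value types a property might declare. */
    enum ValueType { TEXT, DATE, INTEGER }

    /** A property carries a prefix and a type, unlike a bare string key. */
    static final class Property {
        final String prefix;
        final String localName;
        final ValueType type;

        Property(String prefix, String localName, ValueType type) {
            this.prefix = prefix;
            this.localName = localName;
            this.type = type;
        }

        /** The key a non-Java user (eg on the command line) would see. */
        String key() {
            return prefix + ":" + localName;
        }
    }

    /** New style: typed, with an assumed "dc" prefix. */
    static final Property DC_DESCRIPTION =
            new Property("dc", "description", ValueType.TEXT);

    /** Old style: a raw key with no type or prefix, kept for compatibility. */
    @Deprecated
    static final String DESCRIPTION = "description";

    public static void main(String[] args) {
        // The old key says nothing about where it comes from or its type.
        System.out.println("old key: " + DESCRIPTION);
        // The new property exposes both a self-describing key and a type.
        System.out.println("new key: " + DC_DESCRIPTION.key()
                + " (" + DC_DESCRIPTION.type + ")");
    }
}
```

The point of the sketch is only that the prefixed key is self-describing for raw-key users, while Java callers also get the type.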
>
> Step one is therefore to change all the definitions in Dublin Core to be
> proper properties. We copy over the old strings to Metadata, and @deprecate
> them (until 2.0). Everything should still work.
>
> Next, we define a class to hold the common Tika metadata properties. These
> are the ones we consider to be common across all formats, which parsers
> should be trying to populate wherever they can. (Most parsers already do
> this, eg for title or description). We'll do a few of these, but we'll need
> others to contribute to help decide the rest. These will be delegated out to
> a standard property that someone else has already defined, as we do now.
>
> With that done, we can also specify some aliases, so that when you set one
> property it can be defined to also set some others. This allows us to say
> "when you set the new Dublin Core description, for now also go and set the
> old style description". This support will also be helpful for mappings on XMP
> aware (or similar) formats, to map between their custom properties and our
> common ones.
>
> Finally, we go through the parsers and update them to set the new properties,
> rather than the old strings. They'll maintain compatibility for all users
> (those using the Java lookups, and those using raw keys eg tika-cli), and
> when we drop that in 2.0 the parsers don't need to change.
>
> We'll be opening issues for all of these, and doing the work in small chunks
> so everyone can follow. I believe this all fits with what everyone has been
> discussing for a while, doesn't break anything, and moves us forward. Despite
> the long email, it's actually quite small changes!
>
> Nick

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
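
For anyone following along, the alias step in Nick's plan ("when you set the new description, for now also set the old style description") could look roughly like the sketch below. Everything here is hypothetical: the Property and Metadata classes are simplified stand-ins rather than Tika's real ones, and the "dc:description" key assumes a "dc" prefix that the mail does not actually specify.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of alias handling; not Tika's real classes. */
final class AliasSketch {

    /** A property with a primary key plus legacy keys to keep in sync. */
    static final class Property {
        final String key;
        final List<String> aliasKeys;

        Property(String key, String... aliasKeys) {
            this.key = key;
            this.aliasKeys = Arrays.asList(aliasKeys);
        }
    }

    /** Simplified stand-in for a metadata container. */
    static final class Metadata {
        private final Map<String, String> values = new LinkedHashMap<>();

        /** Sets the property and, for compatibility, every alias key. */
        void set(Property property, String value) {
            values.put(property.key, value);
            for (String alias : property.aliasKeys) {
                values.put(alias, value);
            }
        }

        /** Raw key lookup, as a command line user would do. */
        String get(String rawKey) {
            return values.get(rawKey);
        }
    }

    /** New property (assumed prefix) that also fills the old raw key. */
    static final Property DC_DESCRIPTION =
            new Property("dc:description", "description");

    public static void main(String[] args) {
        Metadata metadata = new Metadata();
        // A parser only sets the new property...
        metadata.set(DC_DESCRIPTION, "A short summary");
        // ...yet both the new and the old key resolve to the same value.
        System.out.println(metadata.get("dc:description"));
        System.out.println(metadata.get("description"));
    }
}
```

With this shape, dropping the old keys in 2.0 would only mean removing the alias from the property definition; the parser code calling set would stay the same.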