[ https://issues.apache.org/jira/browse/TIKA-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281073#comment-13281073 ]
Jörg Ehrlich commented on TIKA-930:
-----------------------------------

Consolidation is a good idea.

As a general comment up front, the DublinCore interface contains properties from the newer “Terms” namespace (http://dublincore.org/documents/dcmi-terms/). Please note that this newer version of DC has not been standardized yet. So the general question is whether the DC interface should use those properties. They are interesting because the new namespace introduces refinements of older properties (like “created” and “modified” instead of “date”). But such refinements are also available in already standardized namespaces like XMP.

Here is a list of recommendations for the core properties:

Creator:
Remove Author because this is already covered by Creator. Creator is the Author; there is no need to have two properties for this. Also, DublinCore.Creator should be an ordered array (as defined in the IPTC spec) instead of a simple text field.
TikaCoreProperties.CREATOR <- DublinCore.CREATOR, { Metadata.CREATOR, Office.AUTHOR, MSOffice.AUTHOR }
If Creator becomes an array, INITIAL_AUTHOR and LAST_AUTHOR are not necessarily needed anymore.

Creation date:
The original DublinCore.Date is not a specific point in time. That’s why it has never been used for a creation date in any application. And that’s why the DC organization has set up a newer namespace (see above) which introduces new date properties. But as this newer namespace is not really used yet, I propose the following:
TikaCoreProperties.CREATION_DATE <- XMP.CREATE_DATE, { DublinCore.CREATED, Office.CREATION_DATE, MSOffice.CREATION_DATE, DublinCore.DATE, Metadata.DATE }

Modification date:
I would keep TikaCoreProperties.MODIFIED because so far “modified” has been the vocabulary used for “date the asset was last saved”. Here again the DC property is a newer, non-standardized one. Removal of SAVE_DATE is good.
TikaCoreProperties.MODIFIED <- XMP.MODIFY_DATE, { Office.SAVE_DATE, MSOffice.LAST_SAVED, DublinCore.MODIFIED, Metadata.MODIFIED, "Last-Modified" }

CreatorTool:
Add CreatorTool, which is the application that created the asset. It’s different from “Creator”.
TikaCoreProperties.CREATOR_TOOL <- XMP.CREATOR_TOOL
I have provided the XMP namespace in TIKA-908.

Rating:
A rating property is being used in almost all applications today, so this should be added:
TikaCoreProperties.RATING <- XMP.RATING

Metadata date:
A lot of applications want to know whether only the metadata has changed but not the content; e.g. a movie application does not need to re-render the movie if only the title has changed. I recommend adding this property.
TikaCoreProperties.METADATA_DATE <- XMP.METADATA_DATE

Geo coordinates:
Almost all camera devices today use the EXIF namespace to capture geolocation information. I would recommend using the EXIF properties as primary ones and the W3C ones as secondary ones, and renaming the Geographic interface to something like W3CGeographic.
TikaCoreProperties.LATITUDE <- EXIF.GPS_LATITUDE, { W3CGeographic.LATITUDE }
The same for Longitude and Altitude.

Copyright:
In the future all needed core copyright properties should be added, but as this issue is about consolidating existing properties, this can be tracked in a follow-up issue.

> Consolidation of Some Tika Core Properties
> ------------------------------------------
>
> Key: TIKA-930
> URL: https://issues.apache.org/jira/browse/TIKA-930
> Project: Tika
> Issue Type: Improvement
> Components: metadata
> Affects Versions: 1.2
> Reporter: Ray Gauss II
>
> There are a few properties in TikaCoreProperties which overlap and I think we
> should minimize ambiguity by consolidating them into a single composite
> property with the clearest name, the most general specification referenced as
> its primary property, and the others and deprecated strings as its
> secondaries.
>
> Here's the proposed pseudo-code for the changes:
>
> Remove TikaCoreProperties.SUBJECT
> TikaCoreProperties.KEYWORDS <- DublinCore.SUBJECT, { Office.KEYWORDS,
> MSOffice.KEYWORDS, Metadata.SUBJECT }
>
> Remove TikaCoreProperties.DATE
> TikaCoreProperties.CREATION_DATE <- DublinCore.DATE, { Office.CREATION_DATE,
> MSOffice.CREATION_DATE, Metadata.DATE }
>
> Remove TikaCoreProperties.MODIFIED
> TikaCoreProperties.SAVE_DATE <- DublinCore.MODIFIED, { Office.SAVE_DATE,
> MSOffice.LAST_SAVED, Metadata.MODIFIED, "Last-Modified" }
>
> and an example of the Java changes:
>
> {code:title=TikaCoreProperties.java *Before*}
> /**
>  * @see DublinCore#SUBJECT
>  */
> public static final Property SUBJECT =
>     Property.composite(DublinCore.SUBJECT,
>         new Property[] { Property.internalText(Metadata.SUBJECT) });
>
> /**
>  * @see Office#KEYWORDS
>  */
> public static final Property KEYWORDS =
>     Property.composite(Office.KEYWORDS,
>         new Property[] { Property.internalTextBag(MSOffice.KEYWORDS) });
> {code}
> would become
> {code:title=TikaCoreProperties.java *After*}
> /**
>  * @see DublinCore#SUBJECT
>  * @see Office#KEYWORDS
>  */
> public static final Property KEYWORDS =
>     Property.composite(DublinCore.SUBJECT,
>         new Property[] {
>             Office.KEYWORDS,
>             Property.internalTextBag(MSOffice.KEYWORDS),
>             Property.internalText(Metadata.SUBJECT)
>         });
> {code}
>
> Since this would require a bit of refactoring for parsers that use the
> properties being removed I thought it best to get some feedback before
> working up a full patch.
>
> Does this seem like a reasonable approach?
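
For readers unfamiliar with the "primary <- { secondaries }" notation used throughout this thread, the following is a minimal, self-contained sketch of the idea. It does not use Tika's classes: the `Composite` type, the `set` helper, and the key strings are hypothetical stand-ins chosen for illustration. It assumes that setting a composite property stores the value under its primary key and mirrors it to each secondary key, so that code still reading a deprecated key keeps working.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompositePropertySketch {

    /** Hypothetical stand-in for a composite property: one primary key plus secondary keys. */
    public static final class Composite {
        public final String primary;
        public final List<String> secondaries;

        public Composite(String primary, String... secondaries) {
            this.primary = primary;
            this.secondaries = Arrays.asList(secondaries);
        }
    }

    // Mirrors the shape of the proposed KEYWORDS mapping.
    // The key strings are illustrative placeholders, not Tika's real metadata names.
    public static final Composite KEYWORDS =
            new Composite("dc:subject", "office:keywords", "msoffice:keywords", "subject");

    /** Stores the value under the primary key and copies it to every secondary key. */
    public static void set(Map<String, String> metadata, Composite property, String value) {
        metadata.put(property.primary, value);
        for (String key : property.secondaries) {
            metadata.put(key, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        set(metadata, KEYWORDS, "tika, metadata");

        // A consumer that only knows a deprecated key still finds the value.
        System.out.println(metadata.get("subject"));
        System.out.println(metadata.keySet());
    }
}
```

Under this assumption, a parser would set only the consolidated property, while the choice of primary (DublinCore, XMP, or EXIF in the proposals above) decides which key is treated as canonical.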