[ 
https://issues.apache.org/jira/browse/TIKA-930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13281073#comment-13281073
 ] 

Jörg Ehrlich commented on TIKA-930:
-----------------------------------

Consolidation is a good idea.
As a general comment up front, the DublinCore interface contains properties 
from the newer “Terms” namespace (http://dublincore.org/documents/dcmi-terms/). 
Please note that this newer version of DC has not been standardized yet, so 
the general question is whether the DC interface should use those properties. 
They are interesting because the new namespace introduces refinements of older 
properties (like “created” and “modified” instead of “date”), but such 
refinements are also available in already standardized namespaces like XMP. 

Here is a list of recommendations for the core properties:

Creator:
Remove Author because this is already covered by Creator.
Creator is the Author, so there is no need to have two properties for this. 
DublinCore.Creator should also be an ordered array (as defined in the IPTC 
spec) instead of a simple text field.
TikaCoreProperties.CREATOR <- DublinCore.CREATOR, { Metadata.CREATOR, 
Office.AUTHOR, MSOffice.AUTHOR}

If Creator becomes an array, INITIAL_AUTHOR or LAST_AUTHOR are not necessarily 
needed anymore.
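
A minimal sketch of why an ordered array could make INITIAL_AUTHOR and LAST_AUTHOR redundant. It assumes the array order reflects the sequence in which creators were recorded, which this comment does not spell out. The class name CreatorList and its methods are hypothetical and not part of Tika.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for an ordered, multi-valued creator property. */
final class CreatorList {
    private final List<String> creators = new ArrayList<>();

    /** Appends a creator; insertion order is preserved. */
    void add(String name) {
        creators.add(name);
    }

    /** All creators, in the order they were added (read-only view). */
    List<String> all() {
        return Collections.unmodifiableList(creators);
    }

    /** First entry, standing in for an "initial author"; null if empty. */
    String first() {
        return creators.isEmpty() ? null : creators.get(0);
    }

    /** Last entry, standing in for a "last author"; null if empty. */
    String last() {
        return creators.isEmpty() ? null : creators.get(creators.size() - 1);
    }
}
```

Under that ordering assumption, first() and last() give what the two separate author properties would otherwise hold.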

Creation date:
The original DublinCore.Date is not a specific point in time. That’s why it has 
never been used for a creation date in any application. And that’s why the DC 
organization has set up a newer namespace (see above) which introduces new date 
properties. 
But as this newer namespace is not really used yet, I propose the following:
TikaCoreProperties.CREATION_DATE <- XMP.CREATE_DATE, { DublinCore.CREATED, 
Office.CREATION_DATE, MSOffice.CREATION_DATE, DublinCore.DATE, Metadata.DATE }
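
One possible reading of the "<-" notation used here is a primary key followed by a set of secondary keys in braces. The sketch below models that with plain Java maps, assuming that a write to the composite is mirrored to every secondary key so that callers using the older names still see the value. CompositeKey is a hypothetical stand-in, not Tika's Property class, and any key strings passed to it are illustrative placeholders.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for a composite property: one primary key plus secondaries. */
final class CompositeKey {
    final String primary;
    final List<String> secondaries;

    CompositeKey(String primary, String... secondaries) {
        this.primary = primary;
        this.secondaries = Arrays.asList(secondaries);
    }

    /** Writes the value under the primary key and mirrors it to each secondary key. */
    void set(Map<String, String> metadata, String value) {
        metadata.put(primary, value);
        for (String key : secondaries) {
            metadata.put(key, value);
        }
    }

    /** Reads the value through the primary key only. */
    String get(Map<String, String> metadata) {
        return metadata.get(primary);
    }
}
```

With this model, changing which specification is primary (for example XMP instead of DublinCore) only changes the first constructor argument; the older names keep working as secondaries.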

Modification date:
I would keep TikaCoreProperties.MODIFIED because so far “modified” has been the 
established term for “date the asset was last saved”. Here again, the DC 
property is a newer, non-standardized one. Removal of SAVE_DATE is good.
TikaCoreProperties.MODIFIED <- XMP.MODIFY_DATE, { Office.SAVE_DATE, 
MSOffice.LAST_SAVED, DublinCore.MODIFIED, Metadata.MODIFIED, "Last-Modified" }

CreatorTool:
Add CreatorTool, which is the application that created the asset. It is 
different from “Creator”.
TikaCoreProperties.CREATOR_TOOL <- XMP.CREATOR_TOOL
I have provided the XMP namespace in TIKA-908.

Rating:
A rating property is used in almost all applications today, so this should be 
added:
TikaCoreProperties.RATING <- XMP.RATING

Metadata date:
Many applications want to know whether only the metadata has changed and not 
the content; e.g. a movie application does not need to re-render the movie if 
only the title has been changed. I recommend adding this property.
TikaCoreProperties.METADATA_DATE <- XMP.METADATA_DATE
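
A minimal sketch of how a consumer might use the two dates together. It assumes the writing application leaves the modification date untouched when only metadata changes. RenderCheck, Action and decide are hypothetical names used for illustration only.

```java
import java.time.Instant;

/** Hypothetical helper that decides how much work a change requires. */
final class RenderCheck {
    enum Action { NONE, REFRESH_METADATA, RERENDER }

    /**
     * @param lastProcessed when the application last processed the asset
     * @param modifyDate    when the asset content was last saved
     * @param metadataDate  when any metadata of the asset was last changed
     */
    static Action decide(Instant lastProcessed, Instant modifyDate, Instant metadataDate) {
        if (modifyDate.isAfter(lastProcessed)) {
            return Action.RERENDER;          // content changed since last processing
        }
        if (metadataDate.isAfter(lastProcessed)) {
            return Action.REFRESH_METADATA;  // only metadata (e.g. the title) changed
        }
        return Action.NONE;                  // nothing changed
    }
}
```

Without a separate metadata date, the second case cannot be distinguished from the first, and the application would have to re-render to be safe.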

Geo coordinates:
Almost all camera devices today use the EXIF namespace to capture geolocation 
information. I would recommend using the EXIF properties as the primary ones 
and the W3C ones as secondary ones, and renaming the Geographic interface to 
something like W3CGeographic.
TikaCoreProperties.LATITUDE <- EXIF.GPS_LATITUDE, {W3CGeographic.LATITUDE}
The same for Longitude and Altitude.

Copyright:
In the future, all needed core copyright properties should be added, but as 
this issue is about consolidating existing properties, this can be tracked in 
a follow-up issue.

                
> Consolidation of Some Tika Core Properties
> ------------------------------------------
>
>                 Key: TIKA-930
>                 URL: https://issues.apache.org/jira/browse/TIKA-930
>             Project: Tika
>          Issue Type: Improvement
>          Components: metadata
>    Affects Versions: 1.2
>            Reporter: Ray Gauss II
>
> There are a few properties in TikaCoreProperties which overlap and I think we 
> should minimize ambiguity by consolidating them into a single composite 
> property with the clearest name, the most general specification referenced as 
> its primary property, and the others and deprecated strings as its 
> secondaries.
> Here's the proposed pseudo-code for the changes:
> Remove TikaCoreProperties.SUBJECT
> TikaCoreProperties.KEYWORDS <- DublinCore.SUBJECT, { Office.KEYWORDS, 
> MSOffice.KEYWORDS, Metadata.SUBJECT }
> Remove TikaCoreProperties.DATE
> TikaCoreProperties.CREATION_DATE <- DublinCore.DATE, { Office.CREATION_DATE, 
> MSOffice.CREATION_DATE, Metadata.DATE }
> Remove TikaCoreProperties.MODIFIED
> TikaCoreProperties.SAVE_DATE <- DublinCore.MODIFIED, { Office.SAVE_DATE, 
> MSOffice.LAST_SAVED, Metadata.MODIFIED, "Last-Modified" }
> and an example of the Java changes:
> {code:title=TikaCoreProperties.java *Before*}
>     /**
>      * @see DublinCore#SUBJECT
>      */
>     public static final Property SUBJECT =
>             Property.composite(DublinCore.SUBJECT,
>                     new Property[] { Property.internalText(Metadata.SUBJECT) });
>       
>     /**
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS =
>             Property.composite(Office.KEYWORDS,
>                     new Property[] { Property.internalTextBag(MSOffice.KEYWORDS) });
> {code}
> would become
> {code:title=TikaCoreProperties.java *After*}
>     /**
>      * @see DublinCore#SUBJECT
>      * @see Office#KEYWORDS
>      */
>     public static final Property KEYWORDS =
>             Property.composite(DublinCore.SUBJECT,
>                     new Property[] {
>                         Office.KEYWORDS,
>                         Property.internalTextBag(MSOffice.KEYWORDS),
>                         Property.internalText(Metadata.SUBJECT)
>                     });
> {code}
> Since this would require a bit of refactoring for parsers that use the 
> properties being removed I thought it best to get some feedback before 
> working up a full patch.
> Does this seem like a reasonable approach?
