[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503982#comment-14503982
 ] 

Nick Burch commented on TIKA-1607:
----------------------------------

Historically, we've always required that everything on Metadata be a String, 
both key and value. Properties add support for converting those Strings to and 
from more helpful types, while still allowing simple, backwards compatible 
String fetching for people who don't want that.
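
To illustrate the string-only contract described above, here is a minimal 
sketch. It is not Tika's actual Metadata class: StringMetadataSketch and its 
methods are hypothetical stand-ins, assumed only for illustration. Values are 
stored as Strings, the plain getter hands back the raw String, and a typed 
getter converts on the way out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a string-only metadata store (not Tika's API). */
final class StringMetadataSketch {
    // Every key maps to one or more String values, never to arbitrary Objects.
    private final Map<String, String[]> values = new LinkedHashMap<>();

    /** Appends a value under the key, keeping any earlier values. */
    void add(String key, String value) {
        String[] old = values.get(key);
        if (old == null) {
            values.put(key, new String[] { value });
        } else {
            String[] grown = Arrays.copyOf(old, old.length + 1);
            grown[old.length] = value;
            values.put(key, grown);
        }
    }

    /** Simple fetch: the first value as a raw String, or null if absent. */
    String get(String key) {
        String[] v = values.get(key);
        return v == null ? null : v[0];
    }

    /** Multi-valued fetch: all values for the key, or an empty array. */
    String[] getValues(String key) {
        String[] v = values.get(key);
        return v == null ? new String[0] : v.clone();
    }

    /** Typed fetch: converts the first value to an Integer, or null if absent or unparseable. */
    Integer getInt(String key) {
        String raw = get(key);
        if (raw == null) {
            return null;
        }
        try {
            return Integer.valueOf(raw);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        StringMetadataSketch metadata = new StringMetadataSketch();
        metadata.add("width", "640");
        System.out.println("raw:   " + metadata.get("width"));
        System.out.println("typed: " + metadata.getInt("width"));
    }
}
```

Callers who only want Strings never see the typed layer, which is what keeps 
this style backwards compatible.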

Based on the phone number example, this looks a lot like the stream-style 
indexed metadata that we've been discussing for video and audio, e.g. "video 
stream 1 has width 640 + height 480, video stream 2 has width 320 + height 240, 
audio stream 1 is stereo + 44.1kHz + english" etc.

Maybe we should work to finish that indexed support off? We'd then keep strings 
everywhere in the metadata, we'd keep backwards compatibility, and we'd keep 
things consistent between different styles of metadata (video, audio, phone 
etc!)
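
As a rough sketch of how indexed, string-only metadata could carry the phone 
number case: the key scheme below ("group[index].attribute") and the names 
IndexedMetadataSketch, indexedKey and putRecord are assumptions made purely for 
illustration. They are not the design from the mailing list thread, and not an 
existing Tika API. The example values are taken from the issue description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: flattening per-item attributes into indexed String keys. */
final class IndexedMetadataSketch {

    /** Builds a flat key such as "phonenumbers[0].LibPN-CountryCode" (assumed naming scheme). */
    static String indexedKey(String group, int index, String attribute) {
        return group + "[" + index + "]." + attribute;
    }

    /** Writes every attribute of one record into the flat String-to-String map. */
    static void putRecord(Map<String, String> metadata, String group, int index,
                          Map<String, String> attributes) {
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            metadata.put(indexedKey(group, index, e.getKey()), e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();

        Map<String, String> first = new LinkedHashMap<>();
        first.put("number", "+162648743476");
        first.put("LibPN-CountryCode", "US");
        first.put("LibPN-NumberType", "International");
        putRecord(metadata, "phonenumbers", 0, first);

        Map<String, String> second = new LinkedHashMap<>();
        second.put("number", "+1292611054");
        second.put("LibPN-CountryCode", "UK");
        second.put("LibPN-NumberType", "International");
        putRecord(metadata, "phonenumbers", 1, second);

        // Keys and values all remain plain Strings.
        metadata.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The same helper would apply unchanged to a "video" or "audio" group, which is 
the consistency argument made above.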

The thread "How should video files with audio be handled by parsers?" from last 
summer outlines a plan; [~rgauss] was going to try to prototype it first 
before committing.

> Introduce new HashMap<String, Object> data structure for persistence of Tika 
> Metadata
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.9
>
>
> I am currently working on implementing more comprehensive extraction and 
> enhancement of the Tika support for Phone number extraction and metadata 
> modeling.
> Right now we utilize the String[] multivalued support available within Tika 
> to persist phone numbers as 
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the 
> String[] paradigm by implementing a more abstract Collection of Objects such 
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String:  Object
> {code}
> Where Object could be a Collection<HashMap<String/Property, 
> HashMap<String/Property, String/Int/Long>>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
> (etc)] 
> {code}
> There are obvious backwards compatibility issues with this approach... 
> additionally it is a fundamental change to the core Metadata API. I hope that 
> the <String, Object> Mapping however is flexible enough to allow me to model 
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
