[ https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503982#comment-14503982 ]
Nick Burch commented on TIKA-1607:
----------------------------------

Historically, we've always required that things on Metadata be a String, both key and value. Properties provide support for converting to/from Strings to more helpful types, but allow backwards compatible and simple fetching for people who don't want that.

Based on the phone number example, this looks somewhat like the streams-style indexed metadata that we've been discussing for video and audio, e.g. "video stream 1 has width 640 + height 480, video stream 2 has width 320 + height 240, audio stream 1 is stereo + 44.1kHz + english" etc. Maybe we should work to finish that indexed support off? We'd then keep strings everywhere in the metadata, we'd keep backwards compatibility, and we'd keep things consistent between different styles of metadata (video, audio, phone etc!)

The thread "How should video files with audio be handled by parsers?" from last summer outlines a plan; [~rgauss] was going to try and prototype it first before committing.

> Introduce new HashMap<String, Object> data structure for persitsence of Tika
> Metadata
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-1607
>                 URL: https://issues.apache.org/jira/browse/TIKA-1607
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, metadata
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Critical
>             Fix For: 1.9
>
>
> I am currently working on implementing more comprehensive extraction and
> enhancement of the Tika support for Phone number extraction and metadata
> modeling.
> Right now we utilize the String[] multivalued support available within Tika
> to persist phone numbers as
> {code}
> Metadata: String: String[]
> Metadata: phonenumbers: number1, number2, number3, ...
> {code}
> I would like to propose we extend multi-valued support outside of the
> String[] paradigm by implementing a more abstract Collection of Objects such
> that we could consider and implement the phone number use case as follows
> {code}
> Metadata: String: Object
> {code}
> Where Object could be a Collection<HashMap<String/Property,
> HashMap<String/Property, String/Int/Long>>> e.g.
> {code}
> Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US),
> (LibPN-NumberType: International), (etc: etc)...), (+1292611054:
> (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...)
> (etc)]
> {code}
> There are obvious backwards compatibility issues with this approach...
> additionally it is a fundamental change to the core Metadata API. I hope that
> the <String, Object> Mapping however is flexible enough to allow me to model
> Tika Metadata the way I want.
> Any comments folks? Thanks
> Lewis

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
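
To make the comparison concrete, here is a minimal sketch of how the phone number example could stay strings-only by using indexed keys, in the spirit of the streams-style idea from the comment above. It uses only java.util and deliberately does not touch the Tika Metadata API. The class name, the flatten helper, the "number" attribute and the "<name>:<index>:<attribute>" key convention are all assumptions made for illustration; the referenced mailing list thread may settle on a different scheme.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IndexedMetadataSketch {

    // Hypothetical key convention "<name>:<index>:<attribute>"; not part of Tika.
    static String key(String name, int index, String attribute) {
        return name + ":" + index + ":" + attribute;
    }

    // Flattens a list of per-entry attribute maps into a single String -> String map,
    // so nested structure is carried by the key rather than by the value type.
    static Map<String, String> flatten(String name, List<Map<String, String>> entries) {
        Map<String, String> flat = new LinkedHashMap<>();
        for (int i = 0; i < entries.size(); i++) {
            for (Map.Entry<String, String> e : entries.get(i).entrySet()) {
                // Indices start at 1, mirroring "video stream 1" in the comment.
                flat.put(key(name, i + 1, e.getKey()), e.getValue());
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        // Values taken from the example in the issue description.
        Map<String, String> first = new LinkedHashMap<>();
        first.put("number", "+162648743476");
        first.put("LibPN-CountryCode", "US");
        first.put("LibPN-NumberType", "International");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("number", "+1292611054");
        second.put("LibPN-CountryCode", "UK");
        second.put("LibPN-NumberType", "International");

        Map<String, String> flat = flatten("phonenumbers", Arrays.asList(first, second));
        flat.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The trade-off in this sketch is that callers who only want plain String lookups keep working, while anything that needs the grouping has to parse the index back out of the key.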