Giuseppe Totaro created TIKA-1691:
-------------------------------------
Summary: Apache Tika for enabling metadata interoperability
Key: TIKA-1691
URL: https://issues.apache.org/jira/browse/TIKA-1691
Project: Tika
Issue Type: New Feature
Reporter: Giuseppe Totaro
Assignee: Giuseppe Totaro
If I am not wrong, consistent metadata across file formats is already
(partially) provided in Tika through {{TikaCoreProperties}} and, within
the context of Solr, {{ExtractingRequestHandler}} (by defining how to map
metadata fields in {{solrconfig.xml}}). However, I am working on a new
component for both schema mapping (operating on the names of metadata
properties) and instance transformation (operating on the values of metadata)
that consists, essentially, of the following changes:
* A wrapper around the {{Metadata}} object ({{MappedMetadata.java}}) that
decorates the {{set}} method (currently, line number 367 of {{Metadata.java}})
by applying the configured mapping functions before setting metadata
properties.
* Basic mapping functions ({{BasicMappingUtils.java}}): utility methods
that map a set of metadata to the target schema.
* A new {{MetadataConfig}} object that, like {{TikaConfig}}, can be
configured via an XML file (organized as shown in the following snippet) and
allows fine-grained metadata mapping by using Java reflection.
{code:xml|title=tika-metadata.xml|borderStyle=solid}
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <mappings>
    <mapping type="type/sub-type">
      <relation name="SOURCE_FIELD">
        <target>TARGET_FIELD</target>
        <expression>exclude|include|equivalent|overlap</expression>
        <function name="FUNCTION_NAME">
          <argument>ARGUMENT_VALUE</argument>
        </function>
        <cardinality>
          <source>SOURCE_CARDINALITY</source>
          <target>TARGET_CARDINALITY</target>
          <order>ORDER_NUMBER</order>
          <dependencies>
            <field>FIELD_NAME</field>
          </dependencies>
        </cardinality>
      </relation>
    </mapping>
    ...
    <mapping> <!-- This contains the fallback strategy for unknown metadata -->
      <relation>
        ...
      </relation>
    </mapping>
  </mappings>
</properties>
{code}
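The wrapper described above could look roughly like the following sketch. It is only an illustration under stated assumptions: it does not use Tika's classes, {{SimpleMetadata}} is a simplified single-valued stand-in for {{Metadata}}, and {{Relation}} and the field names used in {{main}} are hypothetical. Unknown fields are passed through unchanged here, which is just one possible fallback strategy.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these classes are part of Tika.
final class MappedMetadataSketch {

    /** Minimal single-valued stand-in for Tika's Metadata. */
    static class SimpleMetadata {
        private final Map<String, String> values = new HashMap<>();

        public void set(String name, String value) {
            values.put(name, value);
        }

        public String get(String name) {
            return values.get(name);
        }
    }

    /** One mapping rule: a target field name plus a value transformation. */
    static final class Relation {
        final String target;
        final UnaryOperator<String> function;

        Relation(String target, UnaryOperator<String> function) {
            this.target = target;
            this.function = function;
        }
    }

    /** Decorates set(): maps the field name, then transforms the value. */
    static class MappedMetadata extends SimpleMetadata {
        private final Map<String, Relation> relations;

        MappedMetadata(Map<String, Relation> relations) {
            this.relations = relations;
        }

        @Override
        public void set(String name, String value) {
            Relation relation = relations.get(name);
            if (relation == null) {
                // Assumed fallback: keep unknown fields unchanged.
                super.set(name, value);
                return;
            }
            super.set(relation.target, relation.function.apply(value));
        }
    }

    public static void main(String[] args) {
        Map<String, Relation> relations = new HashMap<>();
        relations.put("Author", new Relation("dc:creator", String::trim));
        relations.put("lang",
                new Relation("dc:language", v -> v.toLowerCase(Locale.ROOT)));

        MappedMetadata metadata = new MappedMetadata(relations);
        metadata.set("Author", "  Giuseppe Totaro ");
        metadata.set("lang", "EN");
        metadata.set("pages", "12");

        System.out.println(metadata.get("dc:creator"));  // Giuseppe Totaro
        System.out.println(metadata.get("dc:language")); // en
        System.out.println(metadata.get("pages"));       // 12
    }
}
```

In the actual component the relations would be read from the XML configuration rather than built by hand, and the functions would be resolved by name via reflection.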
The theoretical definition of metadata mapping is available in "[A survey of
techniques for achieving metadata
interoperability|http://www.researchgate.net/profile/Bernhard_Haslhofer/publication/220566013_A_survey_of_techniques_for_achieving_metadata_interoperability/links/02e7e533e76187c0b8000000.pdf]".
The paper also shows some basic examples of metadata mappings.
Currently, I am still working on some of the core functionality, but I have
already run some experiments with a small prototype.
By the way, I think that we should modify the {{add}} method so that it uses
{{set}} instead of {{metadata.put}} (currently, line number 316 of
{{Metadata.java}}). This is a trivial change (I could create a new Jira issue
for it), but it would make the method consistent with the other implementation
of {{add}} and, moreover, the methods of {{Metadata}} could be extended
more easily.
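To make the benefit concrete, here is a minimal sketch of the idea. It is not Tika's actual code: {{SimpleMetadata}} is a simplified stand-in for {{Metadata}}, and {{TrimmingMetadata}} is a hypothetical subclass. Because {{add}} routes every write through {{set}}, a subclass only has to override {{set}} to see all values.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these classes are part of Tika.
final class AddViaSetSketch {

    /** Simplified multi-valued store standing in for Tika's Metadata. */
    static class SimpleMetadata {
        private final Map<String, String[]> metadata = new HashMap<>();

        public void set(String name, String[] values) {
            metadata.put(name, values);
        }

        public void add(String name, String value) {
            String[] values = metadata.get(name);
            if (values == null) {
                set(name, new String[] {value});
            } else {
                String[] appended = Arrays.copyOf(values, values.length + 1);
                appended[values.length] = value;
                // Proposed change: call set() rather than metadata.put(),
                // so that subclasses overriding set() see this write too.
                set(name, appended);
            }
        }

        public String[] getValues(String name) {
            String[] values = metadata.get(name);
            return values == null ? new String[0] : values;
        }
    }

    /** Hypothetical subclass that trims every value on the way in. */
    static class TrimmingMetadata extends SimpleMetadata {
        @Override
        public void set(String name, String[] values) {
            String[] trimmed = new String[values.length];
            for (int i = 0; i < values.length; i++) {
                trimmed[i] = values[i].trim();
            }
            super.set(name, trimmed);
        }
    }

    public static void main(String[] args) {
        SimpleMetadata metadata = new TrimmingMetadata();
        metadata.add("keyword", " tika ");
        metadata.add("keyword", " metadata ");
        // Both values were trimmed, including the appended one.
        System.out.println(Arrays.toString(metadata.getValues("keyword")));
    }
}
```

If the append branch wrote to the map directly, the second value would bypass the override and keep its surrounding spaces.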
I would really appreciate your feedback on this proposal. If you believe
that it is a good idea, I could provide the code in a few days.
Thanks a lot,
Giuseppe