[ 
https://issues.apache.org/jira/browse/TIKA-4357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4357:
------------------------------
    Description: 
[~pwyatt] recently asked me in a DM about some weird metadata keys and 
duplicate keys in the metadata we're extracting from PDFs. There's a larger 
issue here that we should address in 4.x...these will be breaking changes.

There are several places in the codebase where we are mindlessly trusting a 
file's metadata key without namespace prefixing. This is dangerous because user 
data could overwrite metadata from Tika or do other unpleasant things.

There are other places where we were transitioning to namespace prefixes and 
left in the legacy keys without prefixes 
(https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L633).
 

In 4.x, we should look through the codebase and ensure that we are prefixing 
custom metadata keys.
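
As a rough illustration of the overwrite risk and of what prefixing buys us, here is a minimal sketch. It uses plain java.util maps rather than Tika's Metadata class, and the class name, the merge method and the "custom:" prefix value are all hypothetical, not existing Tika API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefixedMetadataSketch {

    // Hypothetical general prefix for keys taken verbatim from the file.
    static final String USER_PREFIX = "custom:";

    // Copies the parser-owned metadata, then adds every file-supplied entry
    // under the prefix so it can never collide with a parser-owned key.
    static Map<String, String> merge(Map<String, String> parserMetadata,
                                     Map<String, String> fileMetadata) {
        Map<String, String> merged = new LinkedHashMap<>(parserMetadata);
        for (Map.Entry<String, String> e : fileMetadata.entrySet()) {
            merged.put(USER_PREFIX + e.getKey(), e.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> parser = new LinkedHashMap<>();
        parser.put("Content-Type", "application/pdf");

        // A file whose own metadata tries to reuse a key the parser sets.
        Map<String, String> file = new LinkedHashMap<>();
        file.put("Content-Type", "text/plain");
        file.put("Reviewer", "someone");

        // The parser's Content-Type survives; the file's copy is kept
        // separately under custom:Content-Type.
        System.out.println(merge(parser, file));
    }
}
```

Without the prefix, the file's "Content-Type" entry would silently replace the value set by the parser.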

A related idea is that, rather than having format-specific "custom:" prefixes, we 
use a general prefix for all file formats...WDYT? For those parsers where we 
want to distinguish the raw source of the information -- I'm looking at you, PDF 
docinfo and PDF XMP! -- we could use two keys.

  was:
There are several places in the codebase where we are mindlessly trusting a 
file's metadata key without namespace prefixing. This is dangerous because user 
data could overwrite metadata from Tika or do other unpleasant things.

There are other places where we were transitioning to namespace prefixes and 
left in the legacy keys without prefixes 
(https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L633).
 

In 4.x, we should look through the codebase and ensure that we are prefixing 
custom metadata keys.

A related idea is that rather than have format specific "custom:" prefixes, we 
use a general prefix for all file formats...WDYT? For those parsers where we 
want to distinguish the raw source of the information -- I'm looking at you pdf 
docinfo and pdf xmp! -- we could use two keys.


> Ensure namespace prefixes in metadata keys in 4.x
> -------------------------------------------------
>
>                 Key: TIKA-4357
>                 URL: https://issues.apache.org/jira/browse/TIKA-4357
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>              Labels: 4x
>
> [~pwyatt] recently asked me in a DM about some weird metadata keys and 
> duplicate keys in the metadata we're extracting from PDFs. There's a larger 
> issue here that we should address in 4.x...these will be breaking changes.
> There are several places in the codebase where we are mindlessly trusting a 
> file's metadata key without namespace prefixing. This is dangerous because 
> user data could overwrite metadata from Tika or do other unpleasant things.
> There are other places where we were transitioning to namespace prefixes and 
> left in the legacy keys without prefixes 
> (https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L633).
>  
> In 4.x, we should look through the codebase and ensure that we are prefixing 
> custom metadata keys.
> A related idea is that, rather than having format-specific "custom:" prefixes, 
> we use a general prefix for all file formats...WDYT? For those parsers where 
> we want to distinguish the raw source of the information -- I'm looking at 
> you, PDF docinfo and PDF XMP! -- we could use two keys.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
