[ 
https://issues.apache.org/jira/browse/TIKA-4357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4357:
------------------------------
    Description: 
There are several places in the codebase where we blindly trust a metadata key 
read from a file and store it without a namespace prefix. This is dangerous 
because user data could overwrite metadata set by Tika or do other unpleasant 
things.

There are other places where we started transitioning to namespace prefixes but 
left the legacy unprefixed keys in place 
(https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L633).
 

In 4.x, we should look through the codebase and ensure that we are prefixing 
custom metadata keys.
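
To make the collision concrete, here is a minimal sketch. It uses a plain 
java.util.Map as a stand-in for Tika's Metadata object. The "custom:" prefix 
value, the "Content-Type" key, and the helper names are assumptions for 
illustration only, not what any parser currently does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PrefixedKeySketch {

    // Hypothetical general prefix for keys that originate in the file itself.
    static final String USER_PREFIX = "custom:";

    // Unsafe: stores the key exactly as it appears in the file.
    static void putRaw(Map<String, String> metadata, String key, String value) {
        metadata.put(key, value);
    }

    // Safer: file-supplied keys always land in their own namespace.
    static void putPrefixed(Map<String, String> metadata, String key, String value) {
        metadata.put(USER_PREFIX + key, value);
    }

    public static void main(String[] args) {
        Map<String, String> unsafe = new LinkedHashMap<>();
        unsafe.put("Content-Type", "application/pdf"); // set by the parser
        putRaw(unsafe, "Content-Type", "text/html");   // key taken from the file
        System.out.println(unsafe);                    // parser value is gone

        Map<String, String> safe = new LinkedHashMap<>();
        safe.put("Content-Type", "application/pdf");   // set by the parser
        putPrefixed(safe, "Content-Type", "text/html");
        System.out.println(safe);                      // both values survive
    }
}
```

With the prefixed variant, a key chosen by the file's author can never shadow a 
key that the parser sets.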

A related idea: rather than having format-specific "custom:" prefixes, we could 
use one general prefix for all file formats. WDYT? For those parsers where we 
want to distinguish the raw source of the information -- I'm looking at you, PDF 
docinfo and PDF XMP! -- we could use two keys.
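
One possible reading of the two-key idea, sketched under assumptions: the 
general prefix is followed by a short source label, so the same raw key coming 
from two places in one file maps to two distinct metadata keys. The prefix 
"custom:", the labels "docinfo" and "xmp", and all names below are hypothetical 
and not taken from the existing code.

```java
final class SourceAwareKeySketch {

    // Hypothetical general prefix shared by all file formats.
    static final String GENERAL_PREFIX = "custom:";

    // Hypothetical labels for where in the file a value was found.
    enum Source {
        DOCINFO("docinfo"),
        XMP("xmp");

        final String label;

        Source(String label) {
            this.label = label;
        }
    }

    // Builds a key such as "custom:docinfo:Author" from a raw key "Author".
    static String keyFor(Source source, String rawKey) {
        return GENERAL_PREFIX + source.label + ":" + rawKey;
    }

    public static void main(String[] args) {
        // The same raw key yields one metadata key per source.
        System.out.println(keyFor(Source.DOCINFO, "Author"));
        System.out.println(keyFor(Source.XMP, "Author"));
    }
}
```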

  was:
There are several places in the codebase where we are mindlessly trusting a 
file's metadata key without namespace prefixing. This is dangerous because user 
data could overwrite metadata from Tika or do other unpleasant things.

There are other places where we were transitioning to namespace prefixes and 
left in the legacy keys without prefixes 
(https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L633).
 

In 4.x, we should look through the codebase and ensure that we are prefixing 
custom metadata keys.

A related idea is that rather than have format specific "custom:" prefixes, we 
use a general prefix for all file formats...WDYT?


> Ensure namespace prefixes in metadata keys in 4.x
> -------------------------------------------------
>
>                 Key: TIKA-4357
>                 URL: https://issues.apache.org/jira/browse/TIKA-4357
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
