[ 
https://issues.apache.org/jira/browse/TIKA-4357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955254#comment-17955254
 ] 

Hudson commented on TIKA-4357:
------------------------------

UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk17 #739 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk17/739/])
TIKA-4357 -- simplify html custom metadata prefix to just html: (#2228) 
(github: 
[https://github.com/apache/tika/commit/82fd5e91d3804907b031718377fa2d62c4b94586])
* (edit) tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
* (edit) tika-core/src/main/java/org/apache/tika/metadata/HTML.java


> Ensure namespace prefixes in metadata keys in 4.x
> -------------------------------------------------
>
>                 Key: TIKA-4357
>                 URL: https://issues.apache.org/jira/browse/TIKA-4357
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>              Labels: 4x
>
> [~pwyatt] recently asked me in a DM about some weird metadata keys and 
> duplicate keys in the metadata we're extracting from PDFs. There's a larger 
> issue here that we should address in 4.x...these will be breaking changes.
> There are several places in the codebase where we are mindlessly trusting a 
> file's metadata key without namespace prefixing. This is dangerous because 
> user data could overwrite metadata from Tika or do other unpleasant things.
> There are other places where we were transitioning to namespace prefixes and 
> left in the legacy keys without prefixes 
> (https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L633).
>  
> In 4.x, we should look through the codebase and ensure that we are prefixing 
> custom metadata keys.
> A related idea is that rather than have format specific "custom:" prefixes, 
> we use a general prefix for all file formats...WDYT? For those parsers where 
> we want to distinguish the raw source of the information -- I'm looking at 
> you pdf docinfo and pdf xmp! -- we could use two keys.
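
To illustrate the collision risk described in the issue, here is a minimal sketch in Java. It deliberately uses a plain java.util.Map as a stand-in rather than Tika's own Metadata class, and the class name, the addCustom helper, the "custom:" prefix constant, and the sample keys are all hypothetical, chosen only for illustration. It is not the change made in the commit above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PrefixedMetadataSketch {
    // Hypothetical prefix for keys taken verbatim from the file.
    static final String CUSTOM_PREFIX = "custom:";

    // Copies file-supplied entries into the target map under the prefix,
    // so they cannot overwrite keys the parser itself has set.
    static void addCustom(Map<String, String> target, Map<String, String> fromFile) {
        for (Map.Entry<String, String> e : fromFile.entrySet()) {
            target.put(CUSTOM_PREFIX + e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "application/pdf"); // set by the parser

        Map<String, String> fromFile = new LinkedHashMap<>();
        fromFile.put("Content-Type", "text/plain"); // untrusted key from the file
        fromFile.put("Author", "someone");

        addCustom(metadata, fromFile);
        System.out.println(metadata);
    }
}
```

Without the prefix, the file's "Content-Type" entry would replace the parser's value. With it, both values survive under distinct keys.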



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
