[jira] [Commented] (TIKA-4357) Ensure namespace prefixes in metadata keys in 4.x

Tim Allison (Jira) Fri, 22 Nov 2024 09:12:04 -0800


    [ https://issues.apache.org/jira/browse/TIKA-4357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900455#comment-17900455 ]


Tim Allison commented on TIKA-4357:
-----------------------------------

[~pwyatt], this is, ahem, tricky. Given Tika's bread-and-butter use case of 
extracting text and metadata for search systems etc., we currently only process 
XMP in a few locations within the PDF.

By default, we process only the XMP at the document level. Advanced users who 
register an XMP parser can enable a configuration that also processes XMP at 
the page level.

In looking more closely at the source code, we also have a TODO in a comment to 
handle XMP metadata associated with images. :\

If enough people want this level of information, we should probably change the 
parameter so that all XMP is extracted as "embedded files", as we do now with 
the page-level XMP processing. Obviously, this would be a separate ticket.


> Ensure namespace prefixes in metadata keys in 4.x
> -------------------------------------------------
>
>                 Key: TIKA-4357
>                 URL: https://issues.apache.org/jira/browse/TIKA-4357
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>              Labels: 4x
>
> [~pwyatt] recently asked me in a DM about some weird metadata keys and 
> duplicate keys in the metadata we're extracting from PDFs. There's a larger 
> issue here that we should address in 4.x...these will be breaking changes.
> There are several places in the codebase where we are mindlessly trusting a 
> file's metadata key without namespace prefixing. This is dangerous because 
> user data could overwrite metadata from Tika or do other unpleasant things.
> There are other places where we were transitioning to namespace prefixes and 
> left in the legacy keys without prefixes 
> (https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L633).
>  
> In 4.x, we should look through the codebase and ensure that we are prefixing 
> custom metadata keys.
> A related idea is that rather than have format-specific "custom:" prefixes, 
> we use a general prefix for all file formats...WDYT? For those parsers where 
> we want to distinguish the raw source of the information -- I'm looking at 
> you, pdf docinfo and pdf xmp! -- we could use two keys.
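
To make the collision risk described in the issue concrete, here is a minimal
sketch of prefixing file-supplied keys before they are stored. It is not
Tika's implementation: the class MetadataKeySketch, its methods, the general
"custom:" prefix, and the two source-specific prefixes are all assumptions
made for illustration, and a plain multi-valued map stands in for Tika's
Metadata class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataKeySketch {

    // Hypothetical general prefix for any key copied verbatim from a file.
    public static final String CUSTOM_PREFIX = "custom:";

    // Hypothetical source-specific prefixes for the "two keys" idea.
    public static final String PDF_DOCINFO_PREFIX = "pdf:docinfo:custom:";
    public static final String PDF_XMP_PREFIX = "pdf:xmp:custom:";

    // Build the namespaced key; the raw key from the file is never used alone.
    public static String prefixed(String prefix, String rawKey) {
        return prefix + rawKey;
    }

    // Store a file-supplied value under its namespaced key only.
    public static void addUserValue(Map<String, List<String>> metadata,
                                    String prefix, String rawKey, String value) {
        metadata.computeIfAbsent(prefixed(prefix, rawKey), k -> new ArrayList<>())
                .add(value);
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();

        // A value set by the parser itself, under an unprefixed key.
        metadata.computeIfAbsent("Content-Type", k -> new ArrayList<>())
                .add("application/pdf");

        // A file that declares its own "Content-Type" cannot overwrite it.
        addUserValue(metadata, CUSTOM_PREFIX, "Content-Type", "text/evil");

        // The same raw key from two sources stays distinguishable.
        addUserValue(metadata, PDF_DOCINFO_PREFIX, "Reviewer", "from docinfo");
        addUserValue(metadata, PDF_XMP_PREFIX, "Reviewer", "from xmp");

        metadata.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With a single general prefix, consumers only need one rule to tell file-supplied
keys from parser-supplied ones; the source-specific prefixes trade that
simplicity for the ability to see where a value came from.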



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
