[ 
https://issues.apache.org/jira/browse/TIKA-4357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17900542#comment-17900542
 ] 

Peter Wyatt commented on TIKA-4357:
-----------------------------------

It's perfectly OK that Tika doesn't attempt to identify any and all *Metadata* 
streams, since these can occur on almost any PDF object in the DOM (almost: see 
this PDF erratum, https://github.com/pdf-association/pdf-issues/issues/403). 
The workaround for this would be to do a recursive "extract and process".
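
To illustrate what I mean by a recursive walk, here is a minimal sketch. It 
assumes a toy in-memory model where PDF dictionaries are Java Maps and PDF 
arrays are Java Lists; it is not Tika or PDFBox code, and the class and method 
names (MetadataWalker, collectMetadata) are made up for this example. The 
identity-based visited set matters because PDF object graphs contain cycles 
(for example, Parent back-references).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MetadataWalker {

    /** Collects the value of every "Metadata" entry reachable from root. */
    public static List<Object> collectMetadata(Object root) {
        List<Object> found = new ArrayList<>();
        // Identity semantics: cyclic maps cannot be hashed by content safely.
        Set<Object> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        walk(root, visited, found);
        return found;
    }

    private static void walk(Object node, Set<Object> visited, List<Object> found) {
        if (node instanceof Map<?, ?>) {
            if (!visited.add(node)) {
                return; // already seen, avoids looping on back-references
            }
            for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
                if ("Metadata".equals(e.getKey())) {
                    found.add(e.getValue()); // "extract and process" hook
                } else {
                    walk(e.getValue(), visited, found);
                }
            }
        } else if (node instanceof List<?>) {
            if (!visited.add(node)) {
                return;
            }
            for (Object item : (List<?>) node) {
                walk(item, visited, found);
            }
        }
        // Anything else (strings, numbers) is a leaf in this toy model.
    }

    public static void main(String[] args) {
        Map<String, Object> image = new HashMap<>();
        image.put("Subtype", "Image");
        image.put("Metadata", "image-xmp");

        Map<String, Object> catalog = new HashMap<>();
        Map<String, Object> page = new HashMap<>();
        page.put("Parent", catalog); // back-reference creates a cycle
        page.put("Resources", Map.of("XObject", Map.of("Im0", image)));

        List<Object> pages = new ArrayList<>();
        pages.add(page);
        catalog.put("Metadata", "doc-xmp");
        catalog.put("Pages", pages);

        System.out.println(collectMetadata(catalog).size() + " Metadata entries found");
    }
}
```

A real implementation would also need to bound the recursion depth for very 
deep object trees; this sketch ignores that.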

But for *Metadata* streams associated with images (if/when you get to this), it 
would be good to have the prefix include something like the PDF object 
identifier so these are uniquely identifiable and traceable back into the PDF.
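
As a rough sketch of the kind of key I have in mind: a PDF indirect object is 
identified by an object number (a positive integer) and a generation number (a 
non-negative integer), so both could go into the prefix. The "pdf:obj:" prefix, 
the "12_0" layout and the ObjectScopedKeys class are purely hypothetical here, 
not existing Tika names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ObjectScopedKeys {

    // Hypothetical prefix for illustration; not an existing Tika constant.
    static final String PREFIX = "pdf:obj:";

    /** Builds a metadata key that records which PDF object the value came from. */
    public static String scopedKey(int objectNumber, int generation, String xmpKey) {
        if (objectNumber <= 0 || generation < 0) {
            throw new IllegalArgumentException("invalid PDF object identifier");
        }
        return PREFIX + objectNumber + "_" + generation + ":" + xmpKey;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        // Two images carrying the same XMP property no longer collide.
        metadata.put(scopedKey(12, 0, "dc:title"), "Front cover image");
        metadata.put(scopedKey(47, 0, "dc:title"), "Back cover image");
        metadata.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

With keys like these, a value can be traced back to object 12 0 or 47 0 in the 
PDF rather than ending up as an ambiguous duplicate.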

If you want some sample PDFs with object-level *Metadata* for testing, then the 
PDF Association recently updated the PDF/VT CalPoly test suite to v1.0.2 - see 
https://pdfa.org/resource/cal-poly-pdfvt-test-suite/

> Ensure namespace prefixes in metadata keys in 4.x
> -------------------------------------------------
>
>                 Key: TIKA-4357
>                 URL: https://issues.apache.org/jira/browse/TIKA-4357
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>              Labels: 4x
>
> [~pwyatt] recently asked me in a DM about some weird metadata keys and 
> duplicate keys in the metadata we're extracting from PDFs. There's a larger 
> issue here that we should address in 4.x...these will be breaking changes.
> There are several places in the codebase where we are mindlessly trusting a 
> file's metadata key without namespace prefixing. This is dangerous because 
> user data could overwrite metadata from Tika or do other unpleasant things.
> There are other places where we were transitioning to namespace prefixes and 
> left in the legacy keys without prefixes 
> (https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java#L633).
>  
> In 4.x, we should look through the codebase and ensure that we are prefixing 
> custom metadata keys.
> A related idea is that rather than have format specific "custom:" prefixes, 
> we use a general prefix for all file formats...WDYT? For those parsers where 
> we want to distinguish the raw source of the information -- I'm looking at 
> you pdf docinfo and pdf xmp! -- we could use two keys.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
