[ 
https://issues.apache.org/jira/browse/TIKA-4444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17987917#comment-17987917
 ] 

Tim Allison commented on TIKA-4444:
-----------------------------------

Or, more clearly, with string values that name the source field of each key:

{noformat}
dc:contributor : xmp-dc-contributor
dc:creator : xmp-dc-creator
dc:description : xmp-dc-description
dc:format : application/pdf; version=1.3
dc:identifier : xmp-dc-identifier
dc:language : xmp-dc-language
dc:publisher : xmp-dc-publisher
dc:relation : xmp-dc-relation
dc:rights : xmp-dc-rights
dc:source : xmp-dc-source
dc:subject : xmp-pdf-keywords
dc:subject : xmp-dc-subject
dc:subject : pdf-keywords
dc:subject : pdf-subject
dc:title : xmp-dc-title
dc:type : xmp-dc-type
dcterms:modified : 2025-06-24T10:27:36Z
meta:keyword : xmp-pdf-keywords
pdf:PDFVersion : 1.3
pdf:charsPerPage : 4708
...
pdf:containsDamagedFont : false
pdf:containsNonEmbeddedFont : false
pdf:docinfo:creator : pdf-author
pdf:docinfo:creator_tool : pdf-creator
pdf:docinfo:keywords : pdf-keywords
pdf:docinfo:modified : 2025-06-24T10:27:36Z
pdf:docinfo:producer : pypdf-5.6.1
pdf:docinfo:subject : pdf-subject
pdf:docinfo:title : pdf-title
pdf:encrypted : false
pdf:eofOffsets : 98604
pdf:eofOffsets : 103951
pdf:hasCollection : false
pdf:hasMarkedContent : false
pdf:hasXFA : false
pdf:hasXMP : true
pdf:incrementalUpdateCount : 1
pdf:num3DAnnotations : 0
pdf:ocrPageCount : 0
pdf:overallPercentageUnmappedUnicodeChars : 0.0
pdf:producer : xmp-pdf-producer
pdf:totalUnmappedUnicodeChars : 0
pdf:unmappedUnicodeCharsPerPage : 0
...
xmp:CreateDate : 2025-02-16T17:03:17Z
xmp:CreatorTool : xmp-xmp-creator-tool
xmp:MetadataDate : 2025-02-16T17:03:17Z
xmp:ModifyDate : 2025-02-16T17:03:17Z
xmp:dc:contributor : xmp-dc-contributor
xmp:dc:creator : xmp-dc-creator
xmp:dc:description : xmp-dc-description
xmp:dc:identifier : xmp-dc-identifier
xmp:dc:language : xmp-dc-language
xmp:dc:publisher : xmp-dc-publisher
xmp:dc:relation : xmp-dc-relation
xmp:dc:rights : xmp-dc-rights
xmp:dc:source : xmp-dc-source
xmp:dc:subject : xmp-dc-subject
xmp:dc:title : xmp-dc-title
xmp:dc:type : xmp-dc-type
xmp:pdf:Keywords : xmp-pdf-keywords
xmp:pdf:PDFVersion : xmp-pdf-version
xmp:pdf:Producer : xmp-pdf-producer
xmpMM:DocumentID : xmp-xmpmm-documentid
xmpTPg:NPages : 13
{noformat}
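
A dump like the one above can be checked mechanically for keys that carry more than one value. The following is only a minimal sketch in Python, not part of Tika: the function name parse_metadata_dump and the sample string are hypothetical, and it assumes one "key : value" pair per line, as printed above.

```python
from collections import defaultdict


def parse_metadata_dump(text):
    """Group 'key : value' lines into a dict mapping key -> list of values."""
    values = defaultdict(list)
    for line in text.splitlines():
        # Split on the first " : " only; keys such as pdf:docinfo:title
        # contain colons, but without surrounding spaces.
        key, sep, value = line.partition(" : ")
        if not sep:
            continue  # skip "..." and blank lines
        values[key.strip()].append(value.strip())
    return dict(values)


# Hypothetical sample: a few lines copied from the dump above.
sample = """\
dc:subject : xmp-pdf-keywords
dc:subject : xmp-dc-subject
dc:subject : pdf-keywords
dc:subject : pdf-subject
pdf:docinfo:keywords : pdf-keywords
pdf:docinfo:subject : pdf-subject
...
xmp:dc:subject : xmp-dc-subject
xmp:pdf:Keywords : xmp-pdf-keywords
"""

metadata = parse_metadata_dump(sample)

# Report every key that appears more than once in the dump.
for key, vals in metadata.items():
    if len(vals) > 1:
        print(key, vals)
```

With the sample lines, only dc:subject is reported, which makes it easy to compare its values against the single-valued xmp:dc:subject, xmp:pdf:Keywords and pdf:docinfo:* entries.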

> PDFParser shows wrong data in xmp "dc:subject" tag
> --------------------------------------------------
>
>                 Key: TIKA-4444
>                 URL: https://issues.apache.org/jira/browse/TIKA-4444
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 3.2.0
>         Environment: * Docker container based on python:3-slim
>  * Debian 12.11
>  * Python 3.13.5
>  * openjdk 17.0.15 2025-04-15
>  * tika-server-standard-3.2.0.jar
>  * tika-server-standard-3.2.2-20250624.143628-8.jar
>  * pdfbox-app-3.0.5.jar
>  * PyPDF 5.6.1
>            Reporter: Peter Hoogendijk
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: xmp
>             Fix For: 4.0.0, 3.2.1
>
>         Attachments: lorem-ipsum.pdf, lorem-ipsum.xml
>
>
> The xmp metadata "dc:subject" tag contains the wrong data: it shows a list 
> with the data from the following tags:
>  * pdf:docinfo:subject (from the pdf metadata)
>  * pdf:docinfo:keywords (from the pdf metadata)
>  * pdf:keywords (from the xmp metadata)
> And it is missing the data from the following tags:
>  * dc:subject (from the xmp metadata)
> When looking at the XML for my test file (see attachments), the xmp metadata 
> contains the correct "dc:subject" and "pdf:keywords", but:
>  * Tika shows the wrong data in "dc:subject" (from the xmp metadata)
>  * Tika does not show "pdf:keywords" (from the xmp metadata)
>  * Tika does not show the actual "dc:subject" (from the xmp metadata)
> This has been tested with tika-server-standard-3.2.0.jar and 
> tika-server-standard-3.2.2-20250624.143628-8.jar.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)