[ https://issues.apache.org/jira/browse/TIKA-4444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17987917#comment-17987917 ]
Tim Allison commented on TIKA-4444:
-----------------------------------

Or, more clearly with string values for keys:
{noformat}
dc:contributor : xmp-dc-contributor
dc:creator : xmp-dc-creator
dc:description : xmp-dc-description
dc:format : application/pdf; version=1.3
dc:identifier : xmp-dc-identifier
dc:language : xmp-dc-language
dc:publisher : xmp-dc-publisher
dc:relation : xmp-dc-relation
dc:rights : xmp-dc-rights
dc:source : xmp-dc-source
dc:subject : xmp-pdf-keywords
dc:subject : xmp-dc-subject
dc:subject : pdf-keywords
dc:subject : pdf-subject
dc:title : xmp-dc-title
dc:type : xmp-dc-type
dcterms:modified : 2025-06-24T10:27:36Z
meta:keyword : xmp-pdf-keywords
pdf:PDFVersion : 1.3
pdf:charsPerPage : 4708
...
pdf:containsDamagedFont : false
pdf:containsNonEmbeddedFont : false
pdf:docinfo:creator : pdf-author
pdf:docinfo:creator_tool : pdf-creator
pdf:docinfo:keywords : pdf-keywords
pdf:docinfo:modified : 2025-06-24T10:27:36Z
pdf:docinfo:producer : pypdf-5.6.1
pdf:docinfo:subject : pdf-subject
pdf:docinfo:title : pdf-title
pdf:encrypted : false
pdf:eofOffsets : 98604
pdf:eofOffsets : 103951
pdf:hasCollection : false
pdf:hasMarkedContent : false
pdf:hasXFA : false
pdf:hasXMP : true
pdf:incrementalUpdateCount : 1
pdf:num3DAnnotations : 0
pdf:ocrPageCount : 0
pdf:overallPercentageUnmappedUnicodeChars : 0.0
pdf:producer : xmp-pdf-producer
pdf:totalUnmappedUnicodeChars : 0
pdf:unmappedUnicodeCharsPerPage : 0
...
xmp:CreateDate : 2025-02-16T17:03:17Z
xmp:CreatorTool : xmp-xmp-creator-tool
xmp:MetadataDate : 2025-02-16T17:03:17Z
xmp:ModifyDate : 2025-02-16T17:03:17Z
xmp:dc:contributor : xmp-dc-contributor
xmp:dc:creator : xmp-dc-creator
xmp:dc:description : xmp-dc-description
xmp:dc:identifier : xmp-dc-identifier
xmp:dc:language : xmp-dc-language
xmp:dc:publisher : xmp-dc-publisher
xmp:dc:relation : xmp-dc-relation
xmp:dc:rights : xmp-dc-rights
xmp:dc:source : xmp-dc-source
xmp:dc:subject : xmp-dc-subject
xmp:dc:title : xmp-dc-title
xmp:dc:type : xmp-dc-type
xmp:pdf:Keywords : xmp-pdf-keywords
xmp:pdf:PDFVersion : xmp-pdf-version
xmp:pdf:Producer : xmp-pdf-producer
xmpMM:DocumentID : xmp-xmpmm-documentid
xmpTPg:NPages : 13
{noformat}

> PDFParser shows wrong data in xmp "dc:subject" tag
> --------------------------------------------------
>
>                 Key: TIKA-4444
>                 URL: https://issues.apache.org/jira/browse/TIKA-4444
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>   Affects Versions: 3.2.0
>         Environment: * Docker container based on python:3-slim
> * Debian 12.11
> * Python 3.13.5
> * openjdk 17.0.15 2025-04-15
> * tika-server-standard-3.2.0.jar
> * tika-server-standard-3.2.2-20250624.143628-8.jar
> * pdfbox-app-3.0.5.jar
> * PyPDF 5.6.1
>            Reporter: Peter Hoogendijk
>            Assignee: Tilman Hausherr
>            Priority: Major
>              Labels: xmp
>             Fix For: 4.0.0, 3.2.1
>
>         Attachments: lorem-ipsum.pdf, lorem-ipsum.xml
>
>
> The xmp metadata "dc:subject" tag contains the wrong data: it shows a list
> with the data from the following tags:
> * pdf:docinfo:subject (from the pdf metadata)
> * pdf:docinfo:keywords (from the pdf metadata)
> * pdf:keywords (from the xmp metadata)
> And it is missing the data from the following tags:
> * dc:subject (from the xmp metadata)
> When looking at the XML for my testfile (see attachments) the xmp metadata
> contains the correct "dc:subject" and "pdf:keywords" but:
> * Tika shows the wrong data in "dc:subject" (from the xmp metadata)
> * Tika does not show "pdf:keywords" (from the xmp metadata)
> * Tika does not show the actual "dc:subject" (from the xmp metadata)
> This has been tested with tika-server-standard-3.2.0.jar and
> tika-server-standard-3.2.2-20250624.143628-8.jar



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
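
To see at a glance which source fields ended up in "dc:subject", a dump in the "key : value" form shown in the comment can be checked with a few lines of Python. This is only a minimal sketch and not part of Tika: the helper names parse_dump and trace_sources are hypothetical, and the sample lines are copied from the listing in the comment.

```python
from collections import defaultdict


def parse_dump(text):
    """Parse 'key : value' lines into a dict mapping key -> list of values."""
    metadata = defaultdict(list)
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:  # skip lines without the separator, such as "..."
            metadata[key.strip()].append(value.strip())
    return dict(metadata)


def trace_sources(metadata, target="dc:subject"):
    """For each value under `target`, list the other keys that hold the same value."""
    return {
        value: sorted(
            key
            for key, values in metadata.items()
            if key != target and value in values
        )
        for value in metadata.get(target, [])
    }


# Sample lines copied from the dump in the comment (hypothetical helper input).
SAMPLE = """\
dc:subject : xmp-pdf-keywords
dc:subject : xmp-dc-subject
dc:subject : pdf-keywords
dc:subject : pdf-subject
meta:keyword : xmp-pdf-keywords
pdf:docinfo:keywords : pdf-keywords
pdf:docinfo:subject : pdf-subject
xmp:dc:subject : xmp-dc-subject
xmp:pdf:Keywords : xmp-pdf-keywords
"""

sources = trace_sources(parse_dump(SAMPLE))
for value, keys in sources.items():
    print(f"{value} <- {', '.join(keys)}")
```

With the placeholder string values used in the test file, each "dc:subject" entry maps back to the field it was copied from, which makes it easy to confirm whether the XMP "dc:subject" value is present.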