[jira] [Commented] (TIKA-4442) PDFParser does not list all metadata extracted by PDFBox

Peter Hoogendijk (Jira) Mon, 23 Jun 2025 22:45:22 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17985774#comment-17985774
 ]


Peter Hoogendijk commented on TIKA-4442:
----------------------------------------

I'm working with confidential files, so even my own testing is rather 
restricted. As I need more PDF files for testing myself, I'll generate 
some with PDFBox and share them with you. It would be nice if Tika's 
PDFParser supported the same metadata as PyPDF2 (see 
[https://pypdf2.readthedocs.io/en/3.x/modules/XmpInformation.html]). I did not 
check the PDF specs to see whether that implementation is complete, but it at 
least offers the entries I need.
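
For illustration, the six entries from the issue description can be read from 
a raw XMP packet with the Python standard library alone. This is only a minimal 
sketch: the sample packet and the extract_dublin_core() helper are hypothetical, 
it is not how PyPDF2, PDFBox or Tika do it, and it ignores properties written 
as attributes on rdf:Description.

```python
import xml.etree.ElementTree as ET

# Namespaces used by Dublin Core properties inside an XMP packet.
DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The six Dublin Core entries reported as missing in this issue.
MISSING_ENTRIES = ("identifier", "language", "publisher", "relation", "source", "type")

# Hypothetical XMP packet, made up for illustration only.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:identifier>urn:example:doc-1</dc:identifier>
   <dc:language><rdf:Bag><rdf:li>en</rdf:li><rdf:li>nl</rdf:li></rdf:Bag></dc:language>
   <dc:publisher><rdf:Bag><rdf:li>Example Publisher</rdf:li></rdf:Bag></dc:publisher>
   <dc:source>urn:example:source</dc:source>
   <dc:type><rdf:Bag><rdf:li>Text</rdf:li></rdf:Bag></dc:type>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def extract_dublin_core(xmp_xml):
    """Return {"dc:<name>": [values]} for each listed entry found in the packet."""
    root = ET.fromstring(xmp_xml)
    found = {}
    for name in MISSING_ENTRIES:
        for element in root.iter(f"{{{DC}}}{name}"):
            # Array-valued entries wrap their values in rdf:li items.
            items = [li.text or "" for li in element.iter(f"{{{RDF}}}li")]
            if not items and element.text and element.text.strip():
                # Simple-valued entries carry the text directly.
                items = [element.text.strip()]
            found.setdefault(f"dc:{name}", []).extend(items)
    return found


if __name__ == "__main__":
    for key, values in extract_dublin_core(SAMPLE_XMP).items():
        print(key, values)
```

The sample packet deliberately omits dc:relation, so that key is simply absent 
from the result. A real test file would have to carry all six entries.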

 

I'll have a look at {{PDMetadataExtractor.extractDublin()}} myself, but since 
you already know this code, it is probably easier for you to implement the 
changes. I don't know the code yet, and because of my job I have to switch 
languages rather often. I'm always introducing typos when adding comments: 
// versus # :).

> PDFParser does not list all metadata extracted by PDFBox
> --------------------------------------------------------
>
>                 Key: TIKA-4442
>                 URL: https://issues.apache.org/jira/browse/TIKA-4442
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 3.2.0
>         Environment: * Docker container based on python:3-slim
>  * Debian 12.11
>  * Python 3.13.5
>  * openjdk 17.0.15 2025-04-15
>  * tika-server-standard-3.2.0.jar
>  * pdfbox-app-3.0.5.jar
>  * PyPDF2 3.0.1
>            Reporter: Peter Hoogendijk
>            Priority: Major
>
> While using Apache Tika to extract metadata from PDF files, I found the 
> following XMP metadata entries to be missing:
>  * dc:identifier
>  * dc:language
>  * dc:publisher
>  * dc:relation
>  * dc:source
>  * dc:type
> Python (PyPDF2) and PDFBox (as used by Tika's PDFParser) do show these XMP 
> metadata entries, so I expected Apache Tika to also extract these entries.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
