[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

Tim Allison (JIRA) Fri, 24 Jul 2015 05:16:16 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640370#comment-14640370
 ]


Tim Allison commented on TIKA-1678:
-----------------------------------

I'll try to build a test file today with the fix on PDFBOX-2896.

Because {{parseCOSString()}} is protected in PDFBox 
(o.a.pdfbox.pdfparser.BaseParser), I have to put the Tika class that subclasses 
BaseParser in o.a.pdfbox.pdfparser.  As a result, we're now getting two 
build warnings:

1) When building the tika-parsers module:
{noformat}
[WARNING] Warning building bundle 
org.apache.tika:tika-parsers:bundle:1.10-SNAPSHOT : Split package 
org/apache/pdfbox/pdfparser
Use directive -split-package:=(merge-first|merge-last|error|first) on 
Export/Private Package instruction to get rid of this warning
Package found in   [Jar:., Jar:pdfbox]
Reference from     
....m2\repository\org\apache\pdfbox\pdfbox\1.8.10\pdfbox-1.8.10.jar
Classpath          [Jar:., Jar:org.osgi.core,
.... 
{noformat}

2) When building tika-app:
{noformat}
[WARNING] tika-parsers-1.10-SNAPSHOT.jar, pdfbox-1.8.10.jar define 19 
overlappping classes:
[WARNING]   - org.apache.pdfbox.pdfparser.PDFParser$ConflictObj
[WARNING]   - org.apache.pdfbox.pdfparser.PDFObjectStreamParser
[WARNING]   - org.apache.pdfbox.pdfparser.PDFXRef
[WARNING]   - org.apache.pdfbox.pdfparser.VisualSignatureParser
[WARNING]   - org.apache.pdfbox.pdfparser.NonSequentialPDFParser
[WARNING]   - org.apache.pdfbox.pdfparser.BaseParser
[WARNING]   - org.apache.pdfbox.pdfparser.PDFStreamParser$1
[WARNING]   - org.apache.pdfbox.pdfparser.XrefTrailerResolver
[WARNING]   - org.apache.pdfbox.pdfparser.PDFStreamParser
[WARNING]   - org.apache.pdfbox.pdfparser.PDFXRefStream$NormalReference
[WARNING]   - 9 more...
[WARNING] maven-shade-plugin has detected that some .class files
[WARNING] are present in two or more JARs. When this happens, only
[WARNING] one single version of the class is copied in the uberjar.
[WARNING] Usually this is not harmful and you can skeep these
[WARNING] warnings, otherwise try to manually exclude artifacts
[WARNING] based on mvn dependency:tree -Ddetail=true and the above
[WARNING] output
{noformat}

Any recommendations for a fix?  Some options that I see:

1) Ignore the warnings (I don't like this in principle)
2) Figure out the right -split-package directive (merge-first, merge-last, 
etc.); I don't want to paper over other cases where this might happen in the 
future with other packages unrelated to this issue
3) Copy and paste the relevant code from PDFBox (that particular call isn't 
extremely long/complex, but I don't like this for maintainability/upgrades, 
etc.)
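
To give a sense of the size of option 3, here is a minimal sketch of a standalone unescaper for the body of a PDF literal string (the text between the outer parentheses). It is not PDFBox's {{parseCOSString()}} code; the class and method names are hypothetical, and it assumes the body has already been cut out of the file and that each char holds one byte value.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of Tika or PDFBox.
final class PdfLiteralStringSketch {

    /** Turns the body of a PDF literal string into raw bytes, resolving backslash escapes. */
    static byte[] unescape(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i++);
            if (c != '\\') {
                out.write(c & 0xFF);          // ordinary byte, copied through
                continue;
            }
            if (i >= body.length()) {
                break;                        // trailing backslash: nothing to escape
            }
            char e = body.charAt(i++);
            if (e >= '0' && e <= '7') {
                // Octal escape: one to three octal digits, e.g. \376 or \000
                int value = e - '0';
                int digits = 1;
                while (digits < 3 && i < body.length()
                        && body.charAt(i) >= '0' && body.charAt(i) <= '7') {
                    value = value * 8 + (body.charAt(i++) - '0');
                    digits++;
                }
                out.write(value & 0xFF);
            } else {
                switch (e) {
                    case 'n': out.write('\n'); break;
                    case 'r': out.write('\r'); break;
                    case 't': out.write('\t'); break;
                    case 'b': out.write('\b'); break;
                    case 'f': out.write('\f'); break;
                    case '\r':                // backslash + end of line: continuation
                        if (i < body.length() && body.charAt(i) == '\n') {
                            i++;
                        }
                        break;
                    case '\n':
                        break;
                    default:                  // \( \) \\ and unknown escapes
                        out.write(e & 0xFF);
                }
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The /Author value from the report, as it appears in the file
        byte[] raw = unescape("\\376\\377\\000T\\000e\\000t\\000t\\000i");
        System.out.println(raw.length);       // prints 12
    }
}
```

The sketch only produces bytes; choosing a charset for them is a separate step, and the maintainability concern in option 3 still applies to any copied logic like this.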

Other options?

[~bobpaulin], any recommendations from the OSGi side?
 

> PDF metadata extraction fails to spot UTF-16 encoded title
> ----------------------------------------------------------
>
>                 Key: TIKA-1678
>                 URL: https://issues.apache.org/jira/browse/TIKA-1678
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata
>    Affects Versions: 1.9
>            Reporter: Andrew Jackson
>            Priority: Minor
>
> When extracting metadata from PDFs, we see some odd behaviour for a minority 
> of the documents. The PDF metadata can be encoded as UTF-16 octets, but is 
> not always being decoded as such.
> A specific example is here: 
> http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf
> Which contains this (literal file content):
> {noformat}
> 443 0 obj
> <</Type/Metadata
> /Subtype/XML/Length 1978>>stream
> <?xpacket begin='ï»¿' id='W5M0MpCehiHzreSzNTczkc9d'?>
> <?adobe-xap-filters esc="CRLF"?>
> <x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 
> 1.6'>
> <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' 
> xmlns:iX='http://ns.adobe.com/iX/1.0/'>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:pdf='http://ns.adobe.com/pdf/1.3/' 
> pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 
> \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2012-07-18T15:38:01+01:00</xmp:ModifyDate>
> <xmp:CreateDate>2012-07-18T15:38:01+01:00</xmp:CreateDate>
> <xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' 
> xapMM:DocumentID='ac9f232e-d341-11e1-0000-ba905bfc4694'/>
> <rdf:Description rdf:about='ac9f232e-d341-11e1-0000-ba905bfc4694' 
> xmlns:dc='http://purl.org/dc/elements/1.1/' 
> dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li 
> xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000
>  \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x</rdf:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>\376\377\000T\000e\000t\000t\000i</rdf:li></rdf:Seq></dc:creator></rdf:Description>
> </rdf:RDF>
> </x:xmpmeta>
> <?xpacket end='w'?>
> endstream
> endobj
> 2 0 obj
> <</Producer(\376\377\000B\000u\000l\000l\000z\000i\000p\000 
> \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n)
> /CreationDate(D:20120718153801+01'00')
> /ModDate(D:20120718153801+01'00')
> /Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
> /Author(\376\377\000T\000e\000t\000t\000i)>>endobj
> {noformat} 
> Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an 
> error, but the ones encoded in the actual PDF metadata fields should be 
> extracted accurately.
> When extracted, we get:
> {noformat}
> ...
> dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> meta:author: \376\377\000T\000e\000t\000t\000i
> meta:author: Tetti
> ...
> {noformat}
> So, the author appears to be decoded correctly once, but the title is not. Is 
> the XML dc:title being used to override the PDF title field? Or is one of the 
> title fields being decoded incorrectly?
> (I accept that although this is a real PDF document from the web, it is also 
> a malformed one, so maybe there is not much to be done here.)
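
For the symptom in the quoted report, the decoding step for an info-dictionary string can be sketched as below, assuming the octal escapes have already been resolved to raw bytes. The class name is hypothetical, and ISO-8859-1 is only a rough stand-in for PDFDocEncoding, which the JDK does not provide.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or PDFBox.
final class PdfTextStringSketch {

    /** Decodes raw PDF text-string bytes, honouring a leading UTF-16BE byte order mark. */
    static String decode(byte[] raw) {
        boolean hasBom = raw.length >= 2
                && (raw[0] & 0xFF) == 0xFE
                && (raw[1] & 0xFF) == 0xFF;
        if (hasBom) {
            // Skip the two BOM bytes (\376\377) and decode the rest as UTF-16BE
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 as an approximation of PDFDocEncoding
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Bytes of the /Author value from the report: \376\377\000T\000e\000t\000t\000i
        byte[] author = {(byte) 0xFE, (byte) 0xFF, 0, 'T', 0, 'e', 0, 't', 0, 't', 0, 'i'};
        System.out.println(decode(author));   // prints "Tetti"
    }
}
```

This matches the one correctly decoded {{meta:author: Tetti}} value in the output above; it does nothing for the XMP packet, where the escaped sequences are plain XML text rather than PDF strings.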



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
