Seva Alekseyev created TIKA-2136:
------------------------------------

             Summary: External file links in PPTX misparsed
                 Key: TIKA-2136
                 URL: https://issues.apache.org/jira/browse/TIKA-2136
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.13
         Environment: Windows 7 x64, JVM 1.8.0_101
            Reporter: Seva Alekseyev
         Attachments: 81809 lab presentation.pptx

The attached document contains links to external files. Parsing it with the
Tika parser throws the following exception:

java.lang.NullPointerException
        at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfEmptyURI(PackagePartName.java:204)
        at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfInvalidPartUri(PackagePartName.java:174)
        at org.apache.poi.openxml4j.opc.PackagePartName.<init>(PackagePartName.java:85)
        at org.apache.poi.openxml4j.opc.PackagingURIHelper.createPartName(PackagingURIHelper.java:493)
        at org.apache.poi.openxml4j.opc.PackagePart.getRelatedPart(PackagePart.java:485)
        at org.apache.poi.xslf.usermodel.XSLFSlideShow.<init>(XSLFSlideShow.java:86)
        at org.apache.poi.xslf.extractor.XSLFPowerPointExtractor.<init>(XSLFPowerPointExtractor.java:62)
        at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:244)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)

The exception is thrown in the URI validator, but not because the URI fails
validation: partURI.getPath() returns null and there is no null check. The
link in the file may not be valid, but it is not malformed, and it definitely
should not prevent text extraction from the file.
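
One way getPath() can return null is sketched below. java.net.URI documents
that getPath() returns null for opaque URIs (absolute URIs whose
scheme-specific part does not start with a slash). Whether the link in the
attachment is opaque is an assumption; the report does not show the link.
The class name, the helper hasEmptyPath and the URI "file:linked.xlsx" are
made up for illustration and are not POI code.

```java
import java.net.URI;

final class PartUriCheck {
    // Hypothetical null-safe check: true when the URI has no usable path.
    static boolean hasEmptyPath(URI partUri) {
        String path = partUri.getPath(); // null for opaque URIs
        return path == null || path.isEmpty();
    }

    public static void main(String[] args) {
        URI hierarchical = URI.create("/ppt/slides/slide1.xml");
        URI opaque = URI.create("file:linked.xlsx"); // made-up link target

        System.out.println(hierarchical.getPath());     // the path string
        System.out.println(opaque.getPath());           // null, no exception here
        System.out.println(hasEmptyPath(opaque));       // true
        System.out.println(hasEmptyPath(hierarchical)); // false
    }
}
```

With a check of this shape the validator could treat the missing path as an
empty URI and report it as such, instead of dereferencing null.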



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
