Seva Alekseyev created TIKA-2136:
------------------------------------
Summary: External file links in PPTX misparsed
Key: TIKA-2136
URL: https://issues.apache.org/jira/browse/TIKA-2136
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.13
Environment: Windows 7 x64, JVM 1.8.0_101
Reporter: Seva Alekseyev
Attachments: 81809 lab presentation.pptx
The attached document contains links to external files. Parsing it with the
Tika parser throws the following exception:
java.lang.NullPointerException
    at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfEmptyURI(PackagePartName.java:204)
    at org.apache.poi.openxml4j.opc.PackagePartName.throwExceptionIfInvalidPartUri(PackagePartName.java:174)
    at org.apache.poi.openxml4j.opc.PackagePartName.<init>(PackagePartName.java:85)
    at org.apache.poi.openxml4j.opc.PackagingURIHelper.createPartName(PackagingURIHelper.java:493)
    at org.apache.poi.openxml4j.opc.PackagePart.getRelatedPart(PackagePart.java:485)
    at org.apache.poi.xslf.usermodel.XSLFSlideShow.<init>(XSLFSlideShow.java:86)
    at org.apache.poi.xslf.extractor.XSLFPowerPointExtractor.<init>(XSLFPowerPointExtractor.java:62)
    at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:244)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:86)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
The exception is thrown in the URI validator, but not because the URI fails
validation: throwExceptionIfEmptyURI fails because partURI.getPath() returns
null and there is no null check. The link in the file may not be valid, but it
is not malformed, and it certainly should not prevent text extraction from the
file.
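
For illustration only, here is a minimal sketch of how java.net.URI.getPath()
can return null for a syntactically valid URI. The link target used below is
an assumption (the actual target in the attached file is not reproduced here),
and hasUsablePath is a hypothetical helper, not part of POI or Tika. An opaque
URI, whose scheme-specific part does not start with a slash, has no path
component, so a validator that dereferences getPath() needs a null check.

```java
import java.net.URI;

public class OpaqueUriPathDemo {

    // Hypothetical helper (not POI code): true only when the URI
    // has a path component that is neither null nor empty.
    static boolean hasUsablePath(URI uri) {
        String path = uri.getPath();
        return path != null && !path.isEmpty();
    }

    public static void main(String[] args) {
        // Assumed example of an external file link: valid syntax, but opaque.
        URI external = URI.create("file:C:/data/linked.xlsx");
        // Hierarchical part name of the kind found inside an OOXML package.
        URI internal = URI.create("/ppt/slides/slide1.xml");

        System.out.println("external opaque:   " + external.isOpaque());
        System.out.println("external path:     " + external.getPath());
        System.out.println("external usable:   " + hasUsablePath(external));
        System.out.println("internal path:     " + internal.getPath());
        System.out.println("internal usable:   " + hasUsablePath(internal));
    }
}
```

With a check along these lines, the validator could reject or skip such a
target with a descriptive error instead of a NullPointerException, which
would let the caller continue extracting text.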