Andriy Budzinskyy created TIKA-1761:
---------------------------------------
Summary: Error Parsing PPT (97-2003) files with password
protection against modification which were created using Office 2013
Key: TIKA-1761
URL: https://issues.apache.org/jira/browse/TIKA-1761
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.10, 1.7
Reporter: Andriy Budzinskyy
Attachments: test-2007.ppt, test-2013.ppt
PPT documents created (or saved) in PowerPoint 97-2003 format and protected
with a password against modification using Office 2013 fail during text
extraction.
Files in the same PowerPoint 97-2003 format that were created using Office 2007 work fine.
{noformat}
java -jar tika-app-1.10.jar --text test_2003.ppt
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@22b0f5af
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:185)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:489)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:139)
Caused by: org.apache.poi.hslf.exceptions.EncryptedPowerPointFileException: PowerPoint file is encrypted. The correct password needs to be set via Biff8EncryptionKey.setCurrentUserPassword()
    at org.apache.poi.hslf.EncryptedSlideShow.<init>(EncryptedSlideShow.java:102)
    at org.apache.poi.hslf.HSLFSlideShow.read(HSLFSlideShow.java:259)
    at org.apache.poi.hslf.HSLFSlideShow.buildRecords(HSLFSlideShow.java:250)
    at org.apache.poi.hslf.HSLFSlideShow.<init>(HSLFSlideShow.java:165)
    at org.apache.tika.parser.microsoft.HSLFExtractor.parse(HSLFExtractor.java:61)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:149)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 5 more
{noformat}
I've debugged the Tika library and found that it fails on the
UserEditAtom.encryptSessionPersistIdRef property. This property is empty in
files created with Office 2007 and non-empty in files created with Office 2013.
I've defragmented PPT files as described in
https://social.msdn.microsoft.com/Forums/en-US/e33189a5-0b00-44b7-b084-f2757e9b7536/powerpoint-binary-file-format-decryption?forum=os_binaryfile
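To make the UserEditAtom.encryptSessionPersistIdRef observation concrete, here is a minimal, self-contained sketch of how the presence of that field could be checked on a raw record. It is not code from Tika or POI. The layout is an assumption based on my reading of the [MS-PPT] UserEditAtom description: an 8-byte record header with record type 0x0FF5, followed by 28 bytes of fixed fields, with encryptSessionPersistIdRef as an optional trailing 4 bytes (so the record length is 28 or 32). The class name UserEditAtomSketch and the helper buildAtom are hypothetical and exist only for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.OptionalInt;

// Sketch only: the field layout is assumed from the [MS-PPT] UserEditAtom
// description, not taken from the POI or Tika sources.
final class UserEditAtomSketch {
    static final int RT_USER_EDIT_ATOM = 0x0FF5;   // assumed record type
    static final int HEADER_LEN = 8;               // recVer/recInstance, recType, recLen
    static final int LEN_WITHOUT_ENCRYPT_REF = 28; // fixed fields only
    static final int LEN_WITH_ENCRYPT_REF = 32;    // fixed fields + optional field

    // Returns the optional encryptSessionPersistIdRef, or empty if the record omits it.
    static OptionalInt readEncryptSessionPersistIdRef(byte[] record) {
        if (record.length < HEADER_LEN) {
            throw new IllegalArgumentException("truncated record header");
        }
        ByteBuffer buf = ByteBuffer.wrap(record).order(ByteOrder.LITTLE_ENDIAN);
        int recType = buf.getShort(2) & 0xFFFF;
        int recLen = buf.getInt(4);
        if (recType != RT_USER_EDIT_ATOM) {
            throw new IllegalArgumentException("not a UserEditAtom record");
        }
        if (recLen != LEN_WITHOUT_ENCRYPT_REF && recLen != LEN_WITH_ENCRYPT_REF) {
            throw new IllegalArgumentException("unexpected record length: " + recLen);
        }
        if (record.length < HEADER_LEN + recLen) {
            throw new IllegalArgumentException("truncated record body");
        }
        if (recLen == LEN_WITHOUT_ENCRYPT_REF) {
            return OptionalInt.empty(); // what I see for the Office 2007 file
        }
        // The optional field follows the 28 bytes of fixed fields.
        return OptionalInt.of(buf.getInt(HEADER_LEN + LEN_WITHOUT_ENCRYPT_REF));
    }

    // Hypothetical helper: builds a synthetic record with all fixed fields zeroed.
    static byte[] buildAtom(boolean withEncryptRef, int encryptRef) {
        int recLen = withEncryptRef ? LEN_WITH_ENCRYPT_REF : LEN_WITHOUT_ENCRYPT_REF;
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LEN + recLen).order(ByteOrder.LITTLE_ENDIAN);
        buf.putShort((short) 0);                 // recVer and recInstance
        buf.putShort((short) RT_USER_EDIT_ATOM); // recType
        buf.putInt(recLen);                      // recLen
        if (withEncryptRef) {
            buf.putInt(HEADER_LEN + LEN_WITHOUT_ENCRYPT_REF, encryptRef);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        System.out.println(readEncryptSessionPersistIdRef(buildAtom(false, 0)));
        System.out.println(readEncryptSessionPersistIdRef(buildAtom(true, 7)));
    }
}
```

If this reading of the format is right, a non-empty value only says that the file references an encryption session. It does not by itself say whether a user-supplied password is needed to read the text.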
Is this a bug in Tika or in the POI library?
Should it be supported per Apache POI [encryption
support|https://poi.apache.org/encryption.html]?