[ https://issues.apache.org/jira/browse/TIKA-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158444#comment-13158444 ]
Nick Burch commented on TIKA-791:
---------------------------------
One thing: I'm not sure that we should be returning the same mimetype for a
regular .xlsx file and a password protected .xlsx file. One is zip based, the
other is encrypted ole2. I'd say it's similar to us not returning the same
mimetype for .tar and .tar.gz: while both are technically tar files, one is
directly tar and the other is a gzip wrapper around a tar that needs unpacking
first. In this case, the protected ooxml files need special handling before
they turn into normal ooxml files, so I don't believe we should be treating
them interchangeably.
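
To make the zip vs ole2 distinction concrete, here is a minimal sketch that tells the two containers apart by their leading bytes. It is not Tika's detector code: the class name ContainerSniffer and its sniff method are hypothetical, and only the two well-known magic signatures are taken as given.

```java
/** Sketch only: tell a ZIP container from an OLE2 container by magic bytes. */
final class ContainerSniffer {

    enum Container { ZIP, OLE2, UNKNOWN }

    // Local file header signature that opens a typical ZIP archive ("PK\003\004").
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Signature that opens an OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Classifies the container from the first bytes of a file. */
    static Container sniff(byte[] header) {
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        return Container.UNKNOWN;
    }

    // True when data begins with every byte of prefix, in order.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A regular .xlsx would come back as ZIP here and a password protected one as OLE2, which is why the two need different handling before any ooxml parsing can start.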
> Fix the detection of protected OOXML files
> ------------------------------------------
>
> Key: TIKA-791
> URL: https://issues.apache.org/jira/browse/TIKA-791
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Affects Versions: 1.1
> Environment: Windows 7 64 bit
> Reporter: Antoni Mylka
> Attachments: tika-791-ver2.zip, tika-791.zip
>
>
> TIKA-437 patch allowed Tika to work with OOXML files protected with the
> default VelvetSweatshop password. I feel there is room for improvement.
> # The POIFSContainerDetector lies when it sees such a file. It should be able
> to mark it as x-tika-ooxml
> # The OOXMLParser can't work with such a file. It should:
> ## If it's protected with the default password - it should be decrypted and
> processed normally.
> ## If it's protected with a non-default password - the file should be marked
> as protected, no weird exceptions should appear.
> Therefore I'd like to add an 'if' to POIFSContainerDetector which returns
> x-tika-ooxml, and some code to OOXMLParser, which would be similar to the
> code currently residing in OfficeParser. After this improvement both the
> OfficeParser and the OOXMLParser will treat such files in the same way.
> When I have that, I can add a hack in my application, which will say "If the
> type is x-tika-ooxml and the name-based detection is a specialization of
> ooxml, then use the name-based detection". This will be a workaround for the
> fact that in MimeTypes, magic always trumps the name. With that, the
> encrypted DOCX files will appear with the normal DOCX mimetype in my app.
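
The application-side workaround described in the issue could look roughly like the sketch below. It is an assumption-laden illustration, not code from the attached patches: OoxmlTypeResolver, typeFromName and resolve are hypothetical names, the extension table is hand-written and covers only three types, and "is a specialization of ooxml" is approximated by membership in that table rather than by a lookup in Tika's media type registry.

```java
/** Sketch only: prefer a name-based OOXML type over the generic magic-based one. */
final class OoxmlTypeResolver {

    // Generic type the issue proposes the detector should return.
    static final String GENERIC_OOXML = "application/x-tika-ooxml";

    private static final String OOXML_PREFIX =
            "application/vnd.openxmlformats-officedocument.";

    /**
     * Hypothetical name-based detection. Returns an OOXML specialization
     * for a known extension, or null when the name gives no OOXML hint.
     */
    static String typeFromName(String fileName) {
        if (fileName == null) {
            return null;
        }
        String lower = fileName.toLowerCase(java.util.Locale.ROOT);
        if (lower.endsWith(".docx")) {
            return OOXML_PREFIX + "wordprocessingml.document";
        }
        if (lower.endsWith(".xlsx")) {
            return OOXML_PREFIX + "spreadsheetml.sheet";
        }
        if (lower.endsWith(".pptx")) {
            return OOXML_PREFIX + "presentationml.presentation";
        }
        return null;
    }

    /**
     * If magic detection yields the generic OOXML type and the name points
     * to an OOXML specialization, use the name-based type. Otherwise the
     * magic-based type wins, as it does in MimeTypes.
     */
    static String resolve(String magicType, String fileName) {
        String nameType = typeFromName(fileName);
        if (GENERIC_OOXML.equals(magicType) && nameType != null) {
            return nameType;
        }
        return magicType;
    }
}
```

With this in place, resolve("application/x-tika-ooxml", "report.docx") yields the normal DOCX mimetype, while a file whose magic says something other than x-tika-ooxml keeps its magic-based type regardless of its name.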