[ 
https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894507#action_12894507
 ] 

Alex Ott commented on TIKA-447:
-------------------------------

It's better to have some flag that says "Stop if this rule matched", 
because applying every rule could hurt performance.
For zips, for example, it could look something like this:
 - rule for jar: zip-type == X1
 - rule for odf: zip-type == X2
.....

zip-type will be calculated once, on first invocation, and then re-used. None 
of the subtype rules (for jar, odf, etc.) have the "Stop here" flag, while the 
rule for ordinary zips does have it, so we'll stop only after checking all of 
the subtypes.
The same could be implemented for OLE2 and other container formats, like 
OGG, etc.
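
A rough sketch of the idea in Java, just to make the proposal concrete. 
Nothing here is existing Tika API: the names Lazy, Rule, detect and ZIP_RULES 
are made up for illustration, X1/X2 are the placeholder zip-type values from 
the list above, the MIME strings are only examples, and the supplier in main 
is a stub standing in for the real (expensive) container inspection.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

class RuleChainSketch {

    /** Computes a value on first use and caches it for every later rule. */
    static final class Lazy<T> implements Supplier<T> {
        private final Supplier<T> source;
        private T value;
        private boolean computed;
        int computations; // how often the expensive inspection actually ran

        Lazy(Supplier<T> source) {
            this.source = source;
        }

        @Override
        public T get() {
            if (!computed) {
                value = source.get();
                computed = true;
                computations++;
            }
            return value;
        }
    }

    /** One hypothetical rule; expectedZipType == null means "any zip". */
    static final class Rule {
        final String expectedZipType;
        final String mimeType;
        final boolean stopIfMatched;

        Rule(String expectedZipType, String mimeType, boolean stopIfMatched) {
            this.expectedZipType = expectedZipType;
            this.mimeType = mimeType;
            this.stopIfMatched = stopIfMatched;
        }

        boolean matches(Supplier<String> zipType) {
            String actual = zipType.get(); // null means "not a zip at all"
            if (actual == null) {
                return false;
            }
            return expectedZipType == null || expectedZipType.equals(actual);
        }
    }

    /** Subtype rules come first (no stop flag); the generic zip rule stops the chain. */
    static final List<Rule> ZIP_RULES = Arrays.asList(
            new Rule("X1", "application/java-archive", false),
            new Rule("X2", "application/vnd.oasis.opendocument.text", false),
            new Rule(null, "application/zip", true));

    static String detect(List<Rule> rules, Supplier<String> zipType) {
        String result = null;
        for (Rule rule : rules) {
            if (!rule.matches(zipType)) {
                continue;
            }
            if (result == null) {
                result = rule.mimeType; // keep the first, most specific match
            }
            if (rule.stopIfMatched) {
                break; // "Stop here": no later rules are applied
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Lazy<String> zipType = new Lazy<>(() -> "X2"); // stub inspection
        System.out.println(detect(ZIP_RULES, zipType));
        System.out.println("zip-type computed " + zipType.computations + " time(s)");
    }
}
```

In this sketch the subtype rules are listed before the generic zip rule, so 
the first match is the most specific one, and the flagged rule only ends the 
chain once every subtype has been checked. An OLE2 or OGG chain could reuse 
the same shape with its own lazily computed property.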


> Container aware mimetype detection
> ----------------------------------
>
>                 Key: TIKA-447
>                 URL: https://issues.apache.org/jira/browse/TIKA-447
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>         Attachments: TikaContainerDetection.patch
>
>
> As discussed on the dev list, Tika should ideally have a configurable way to 
> process container-based formats (e.g. zip files and ole2 files) when trying to 
> detect the correct mime type for a document.
> This needs to be configurable, because some people won't want Tika to have to 
> do all the work of parsing the whole file when they're not interested in 
> knowing exactly what's in it.
> Once we have gone to the trouble of opening and parsing the container file, 
> we should try to keep the open container around to speed up parsing of the 
> contents.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
