[ https://issues.apache.org/jira/browse/TIKA-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894520#action_12894520 ]
Nick Burch commented on TIKA-447:
---------------------------------

Using the container aware detector will give a more accurate answer generally, but at the cost of more memory use, and longer processing time. (Oh, and plus the need for various parser dependencies)

There was some reluctance on-list about making this the default, due to the memory and processing impact of opening the container, which we'll need to take notice of.

There's also the issue of making sure the detectors run in the right order, which may matter for some but not for others. Alas I don't have a good answer for the way to handle all these different needs...

> Container aware mimetype detection
> ----------------------------------
>
>                 Key: TIKA-447
>                 URL: https://issues.apache.org/jira/browse/TIKA-447
>             Project: Tika
>          Issue Type: New Feature
>          Components: mime
>    Affects Versions: 0.7
>            Reporter: Nick Burch
>         Attachments: TikaContainerDetection.patch
>
>
> As discussed on the dev list, Tika should ideally have a configurable way to
> process container based formats (eg zip files and ole2 files) when trying to
> detect the correct mime type for a document.
> This needs to be configurable, because some people won't want Tika to have to
> do all the work of parsing the whole file when they're not interested in
> knowing exactly what's in it
> Once we have gone to the trouble of opening and parsing the container file,
> we should try to keep the open container around to speed up parsing of the
> contents
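
To make the trade-off concrete, here is a rough sketch of what "configurable" container aware detection could look like. It is not the attached patch and does not use Tika's API: the class name, the method names, the containerAware flag and the "look for a word/ entry" heuristic are all made up for illustration, and it only uses java.util.zip from the JDK. The cheap magic-byte check always runs first; the costly step of opening the zip only runs when the caller asks for it and the cheap check says the file is a zip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch only: none of these names come from Tika.
class ContainerDetectionSketch {

    static final String ZIP = "application/zip";
    static final String DOCX =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

    /** Cheap check: looks only at the first four bytes (zip local file header "PK\3\4"). */
    static String detectByMagic(byte[] data) {
        boolean zip = data.length >= 4
                && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
        return zip ? ZIP : "application/octet-stream";
    }

    /** Costly check: opens the zip and walks its entries (simplified heuristic). */
    static String detectByContainer(byte[] data) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (e.getName().startsWith("word/")) {
                    return DOCX; // assumption: a word/ entry is enough to call it a .docx
                }
            }
            return ZIP;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Runs the detectors in order; the container step is opt-in. */
    static String detect(byte[] data, boolean containerAware) {
        String type = detectByMagic(data);
        if (containerAware && ZIP.equals(type)) {
            type = detectByContainer(data);
        }
        return type;
    }

    /** Test helper: builds a small in-memory zip with the given entry names. */
    static byte[] zipWith(String... names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("x".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] docx = zipWith("[Content_Types].xml", "word/document.xml");
        System.out.println("magic only:      " + detect(docx, false));
        System.out.println("container aware: " + detect(docx, true));
    }
}
```

With the flag off, a .docx-like file is reported as a plain zip and the container is never opened; with it on, the same bytes get the more specific type at the price of reading through the archive. The sketch does not attempt to keep the opened container around for the later parse.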