[
https://issues.apache.org/jira/browse/TIKA-4347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17922280#comment-17922280
]
Subbu commented on TIKA-4347:
-----------------------------
[~tallison]: I agree that making cache-to-disk the default might be costlier, but I
don't see a way to increase the size of the BoundedInputStream. Would it be fine if I
made it configurable through a property (an environment property, or whatever
mechanism Tika prefers for configuration) and created a PR?
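
To make the proposal concrete, here is a minimal sketch of reading such a limit from a JVM system property with a fallback. The property name `tika.zip.detector.markLimit` and the class `MarkLimitProperty` are hypothetical and exist only for illustration; how the value would be wired into the detector is left open.

```java
class MarkLimitProperty {
    // Hypothetical property name, used only in this sketch.
    static final String PROPERTY_NAME = "tika.zip.detector.markLimit";

    /** Parses a raw value; falls back to defaultLimit when it is absent, invalid, or not positive. */
    static int parseLimit(String raw, int defaultLimit) {
        if (raw == null || raw.trim().isEmpty()) {
            return defaultLimit;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : defaultLimit;
        } catch (NumberFormatException e) {
            return defaultLimit;
        }
    }

    /** Resolves the limit from the JVM system property, e.g. -Dtika.zip.detector.markLimit=... */
    static int resolve(int defaultLimit) {
        return parseLimit(System.getProperty(PROPERTY_NAME), defaultLimit);
    }

    public static void main(String[] args) {
        // The default passed here is an arbitrary example value.
        System.out.println("limit = " + resolve(16 * 1024 * 1024));
    }
}
```

Falling back silently on a bad value keeps detection working when the property is mistyped; failing fast would be the stricter alternative.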
> Inconsistency with DefaultZipContainerDetector (and possibly others)
> --------------------------------------------------------------------
>
> Key: TIKA-4347
> URL: https://issues.apache.org/jira/browse/TIKA-4347
> Project: Tika
> Issue Type: Bug
> Components: tika-core
> Affects Versions: 2.4.1
> Reporter: Subbu
> Priority: Major
>
> We were using Tika detection, and for aab files it was using
> DefaultZipContainerDetector and returning the type application/zip.
> We then started calling TikaInputStream.getLength() before detection as an
> empty-file check, and suddenly noticed that the content type returned for the
> same file had changed.
> While digging deeper into this, we ran the following sequence as a test:
> _detect(foo.aab)_
> _tis.getLength();_
> _detect(foo.aab)_
>
> and confirmed that the responses do change.
> The first time, Tika uses detectStreaming:
>
> [https://github.com/apache/tika/blob/41302d5d6cb96248009b4641e45d76f75cf43195/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L256]
> There it wraps the input in a BoundedInputStream and reads only up to a fixed
> limit. AAB files are typically large, and in our case the part of the zip
> visible through the BoundedInputStream did not include MANIFEST.MF, which
> JarDetector uses as detection context.
> Thus it returns application/zip.
> The second time, detection returns application/jar because, if there is a
> UnixPath, Tika uses the local file:
> [https://github.com/apache/tika/blob/41302d5d6cb96248009b4641e45d76f75cf43195/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L187]
> The local file contains MANIFEST.MF, so detection returns application/jar.
> Thus merely calling getLength() makes the responses inconsistent, and this
> could be the case in other parsers too.
> I am not sure whether increasing the BoundedInputStream limit would solve
> this, since for an end user, getting different responses depending on whether
> detection runs from a stream or from a file is inconsistent.
> I only have proprietary aab files, but any file beyond the limit will have
> this issue. I have confirmed that .ipa files are also affected.
> If we agree on how to take this forward, I would be happy to contribute a fix.
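
The truncation effect described in the report can be reproduced with the JDK alone. The following is a minimal sketch, not Tika's code: it builds an in-memory zip whose `META-INF/MANIFEST.MF` entry comes after a run of filler entries, then scans it once in full and once through only its first bytes, the way a bounded stream would expose it. The class name `TruncatedZipDemo`, the filler layout, and the 4096-byte limit are assumptions chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class TruncatedZipDemo {
    static final String MANIFEST = "META-INF/MANIFEST.MF";

    /** Builds a zip with fillerEntries incompressible 1 KiB entries followed by the manifest. */
    static byte[] buildZip(int fillerEntries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Random random = new Random(42); // fixed seed keeps the archive reproducible
        byte[] filler = new byte[1024];
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (int i = 0; i < fillerEntries; i++) {
                random.nextBytes(filler);
                zip.putNextEntry(new ZipEntry("res/file-" + i + ".bin"));
                zip.write(filler);
                zip.closeEntry();
            }
            zip.putNextEntry(new ZipEntry(MANIFEST));
            zip.write("Manifest-Version: 1.0\r\n".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Streams through at most limit bytes of the zip and reports whether the named entry was seen. */
    static boolean containsEntry(byte[] zip, int limit, String name) {
        int visible = Math.min(limit, zip.length);
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip, 0, visible))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (name.equals(entry.getName())) {
                    return true;
                }
            }
        } catch (IOException e) {
            // The visible bytes ended in the middle of an entry; treat that as end of data.
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] zip = buildZip(64);
        System.out.println("full scan sees manifest:    " + containsEntry(zip, zip.length, MANIFEST));
        System.out.println("bounded scan sees manifest: " + containsEntry(zip, 4096, MANIFEST));
    }
}
```

The same bytes give two different answers depending only on how much of the archive the scanner is allowed to read, which is the inconsistency reported above.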