[ https://issues.apache.org/jira/browse/TIKA-4347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17922280#comment-17922280 ]
Subbu commented on TIKA-4347:
-----------------------------

[~tallison]: I agree that making it cache to disk by default might be costlier, but I don't see a way to increase the size of the BoundedInputStream. Would it be fine if I made it configurable through a property (an environment property, or whatever mechanism Tika prefers for configuration) and opened a PR?

> Inconsistency with DefaultZipContainerDetector (and possibly others)
> --------------------------------------------------------------------
>
>                 Key: TIKA-4347
>                 URL: https://issues.apache.org/jira/browse/TIKA-4347
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-core
>    Affects Versions: 2.4.1
>            Reporter: Subbu
>            Priority: Major
>
> We were using Tika detection, and for aab files it was using
> DefaultZipContainerDetector and returning the type application/zip.
> We then started calling TikaInputStream.getLength() before detection to
> check for empty files, and noticed that the content type returned for the
> same file had changed.
> While digging deeper into this, we ran the following sequence as a test:
> _detect(foo.aab)_
> _tis.getLength();_
> _detect(foo.aab)_
>
> and confirmed that the responses do change.
> The first time, Tika uses detectStreaming
>
> [https://github.com/apache/tika/blob/41302d5d6cb96248009b4641e45d76f75cf43195/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L256]
> which wraps the input in a bounded input stream and reads only up to a
> certain limit. AAB files are typically large, and in our case the portion of
> the zip read through the BoundedInputStream did not contain MANIFEST.MF,
> which JarDetector relies on for its detection context.
> Thus it returns application/zip.
> The second detection returns application/jar because, if there is a
> UnixPath, Tika uses the local file
> [https://github.com/apache/tika/blob/41302d5d6cb96248009b4641e45d76f75cf43195/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L187]
> The local file includes MANIFEST.MF, so the type is returned as
> application/jar.
> Thus just calling getLength() makes the responses inconsistent, and this
> could be the case in other parsers too.
> I am not sure whether increasing the BoundedInputStream limit would solve
> this, because from an end user's point of view, getting different responses
> depending on whether detection runs on a stream or on a file is still
> inconsistent.
> I only have proprietary aab files, but any file larger than the limit will
> have this issue. I have confirmed that .ipa files are also affected.
> If we agree on how to take this forward, I would be happy to contribute a fix.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
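
For illustration, here is a minimal sketch of the mechanism described in the issue, using only java.util.zip. It is not Tika's code: the class name, the detect and buildArchive helpers, the entry names and the sizes are all hypothetical, and the manifest path META-INF/MANIFEST.MF is an assumption based on the usual jar layout. It builds an archive whose manifest is written after a large entry, then scans it once through a truncated view (standing in for the bounded stream) and once in full (standing in for the spooled local file).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical stand-in for the detection logic; not Tika's implementation.
final class TruncatedZipDetection {

    // Assumed manifest location, following the usual jar layout.
    static final String MANIFEST = "META-INF/MANIFEST.MF";

    // Builds a zip whose manifest comes AFTER a large, incompressible entry.
    static byte[] buildArchive(int payloadSize) {
        byte[] payload = new byte[payloadSize];
        new Random(42).nextBytes(payload); // random bytes do not compress well
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("base/big.bin"));
            zip.write(payload);
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry(MANIFEST));
            zip.write("Manifest-Version: 1.0\n".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Scans at most `limit` bytes of the archive, mimicking a bounded stream.
    static String detect(byte[] archive, int limit) {
        int length = Math.min(limit, archive.length);
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(archive, 0, length))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (MANIFEST.equals(entry.getName())) {
                    return "application/jar";
                }
            }
        } catch (IOException e) {
            // The truncated view ended in the middle of an entry; fall through.
        }
        return "application/zip";
    }

    public static void main(String[] args) {
        byte[] archive = buildArchive(64 * 1024);
        // Bounded read: the manifest lies beyond the first 16 KiB.
        System.out.println("bounded: " + detect(archive, 16 * 1024));
        // Full read: the manifest is reachable.
        System.out.println("full:    " + detect(archive, Integer.MAX_VALUE));
    }
}
```

With these assumed sizes, the bounded scan should fall back to application/zip while the full scan reports application/jar, which matches the behaviour reported above. Any fixed limit can be defeated by a large enough archive, which is why raising or configuring the limit narrows the inconsistency rather than removing it.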