Subbu created TIKA-4347:
---------------------------

Summary: Inconsistency with DefaultZipContainerDetector (and possibly others)
Key: TIKA-4347
URL: https://issues.apache.org/jira/browse/TIKA-4347
Project: Tika
Issue Type: Bug
Components: tika-core
Affects Versions: 2.4.1
Reporter: Subbu

We use Tika for detection. For aab files it used DefaultZipContainerDetector and returned application/zip. We then started calling TikaInputStream.getLength() before detection to check for empty files, and noticed that the content type returned for the same file had changed.

While digging deeper into this, we ran the following sequence in our tests,

_detect(foo.aab)_
_tis.getLength();_
_detect(foo.aab)_

and confirmed that the two detect calls return different responses.

The first time, Tika uses detectStreaming
[https://github.com/apache/tika/blob/41302d5d6cb96248009b4641e45d76f75cf43195/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L256]
which sets up a bounded input stream and reads only up to a fixed limit. AAB files are typically large, and in our case the zip content visible through the BoundedInputStream did not contain MANIFEST.MF, which JarDetector uses as its detection context. Detection therefore returns application/zip.

The second time, detection returns application/jar, because if there is a UnixPath, Tika uses the local file
[https://github.com/apache/tika/blob/41302d5d6cb96248009b4641e45d76f75cf43195/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L187]
The full file does contain MANIFEST.MF, so the type comes back as application/jar.

Thus just calling getLength() makes the responses inconsistent, and this could be the case in other parsers too. I am not sure that increasing the BoundedInputStream limit is the right fix: for an end user, getting different responses depending on whether detection runs on a stream or on a file would still be inconsistent.

I only have proprietary aab files, but any file beyond the limit will have this issue. We confirmed that .ipa files are also affected.

If we agree on how to take this forward, I would be happy to contribute a fix.
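
To illustrate the mechanism without sharing a proprietary file, here is a minimal sketch that uses only java.util.zip. It is not Tika's code: the class BoundedZipDemo, its methods, the entry name base/big.bin, and the 8 KiB bound are all hypothetical, and the returned type strings are simply the ones quoted in this report. It builds a small archive whose MANIFEST.MF comes after a large entry, then runs the same check once on a truncated view of the bytes (standing in for the bounded streaming path) and once on the whole archive (standing in for the file-backed path).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class BoundedZipDemo {

    static final String MANIFEST = "META-INF/MANIFEST.MF";

    // Builds an in-memory archive: one large filler entry, then the manifest.
    // Compression is disabled so the filler really occupies fillerBytes in the archive.
    public static byte[] buildArchive(int fillerBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.setLevel(Deflater.NO_COMPRESSION);
            zip.putNextEntry(new ZipEntry("base/big.bin")); // hypothetical entry name
            zip.write(new byte[fillerBytes]);
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry(MANIFEST));
            zip.write("Manifest-Version: 1.0\r\n".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    // Simplified stand-in for detection: scans entries in stream order, but only
    // sees the first `limit` bytes of the archive. Type strings are the ones
    // quoted in this report, not necessarily Tika's exact media types.
    public static String detect(byte[] archive, int limit) {
        int visible = Math.min(limit, archive.length);
        try (ZipInputStream zin =
                 new ZipInputStream(new ByteArrayInputStream(archive, 0, visible))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (MANIFEST.equals(entry.getName())) {
                    return "application/jar";
                }
            }
        } catch (IOException e) {
            // The bounded view ended in the middle of an entry; fall through
            // and report the generic container type.
        }
        return "application/zip";
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = buildArchive(64 * 1024);
        // Bounded view: the manifest lies beyond the first 8 KiB.
        System.out.println(detect(archive, 8 * 1024));
        // Full view: the manifest is reachable.
        System.out.println(detect(archive, archive.length));
    }
}
```

Under these assumptions the bounded call reports application/zip and the full call reports application/jar for the same bytes, which matches the difference we see before and after getLength().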