Subbu created TIKA-4347:
---------------------------

             Summary: Inconsistency with DefaultZipContainerDetector (and 
possibly others)
                 Key: TIKA-4347
                 URL: https://issues.apache.org/jira/browse/TIKA-4347
             Project: Tika
          Issue Type: Bug
          Components: tika-core
    Affects Versions: 2.4.1
            Reporter: Subbu


We use Tika detection. For .aab files it went through
DefaultZipContainerDetector and returned application/zip.

We then started calling TikaInputStream.getLength() before detection as an
empty-file check, and the content type returned for the same file suddenly
changed.

Digging deeper, we ran the following sequence as a test:

_detect(foo.aab)_

_tis.getLength();_

_detect(foo.aab)_

and confirmed that the two detect calls return different types.

The first call goes through detectStreaming:

[https://github.com/apache/tika/blob/41302d5d6cb96248009b4641e45d76f75cf43195/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L256]

There it wraps the input in a bounded input stream and reads only up to a
fixed limit. AAB files are typically large, and in our case the portion of the
ZIP read through the BoundedInputStream did not contain MANIFEST.MF, which
JarDetector relies on for detection.

Thus it returns application/zip.
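
To illustrate the mechanism with the JDK alone (this is not Tika's code), the
sketch below builds a ZIP whose manifest is the last entry and scans only a
prefix of it. The class name, method names, entry names and the 64 KiB limit
are all made up for the example; they are not Tika's actual values.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, JDK only; it does not call any Tika API.
class BoundedZipScan {

    // Build an in-memory ZIP: one large entry first, the manifest last.
    static byte[] buildZip(int payloadSize) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            byte[] payload = new byte[payloadSize];
            new Random(42).nextBytes(payload); // random bytes barely compress
            zos.putNextEntry(new ZipEntry("base/dex/classes.dex"));
            zos.write(payload);
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("META-INF/MANIFEST.MF"));
            zos.write("Manifest-Version: 1.0\r\n".getBytes(StandardCharsets.US_ASCII));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Walk the entries sequentially, but only within the first `limit` bytes.
    static boolean streamSeesEntry(byte[] zip, int limit, String name) {
        byte[] prefix = Arrays.copyOf(zip, Math.min(limit, zip.length));
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(prefix))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (name.equals(entry.getName())) {
                    return true;
                }
            }
        } catch (IOException truncated) {
            // The prefix ended in the middle of an entry: nothing more to see.
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] zip = buildZip(200_000);
        String manifest = "META-INF/MANIFEST.MF";
        System.out.println("bounded scan: " + streamSeesEntry(zip, 64 * 1024, manifest));
        System.out.println("full scan:    " + streamSeesEntry(zip, zip.length, manifest));
    }
}
```

With these sizes the bounded scan should not find the manifest, while the
full scan does. A detector that only sees the prefix has no way to tell the
archive apart from a plain ZIP.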

The second call returns application/jar because, once the stream has a backing
path (a UnixPath in our case), Tika uses the local file:

[https://github.com/apache/tika/blob/41302d5d6cb96248009b4641e45d76f75cf43195/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L187]


The file contains MANIFEST.MF, so detection returns application/jar.
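
For contrast, a file-based lookup is not affected by where an entry sits in
the archive, because it can consult the ZIP central directory at the end of
the file. The sketch below uses the JDK's java.util.zip.ZipFile as a stand-in;
it is not the code Tika runs, and the class, method and entry names are
invented for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, JDK only; it does not call any Tika API.
class FileZipLookup {

    // Write a temporary ZIP: one large entry first, the manifest last.
    static Path writeZip(int payloadSize) {
        try {
            Path tmp = Files.createTempFile("sample", ".zip");
            try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(tmp))) {
                byte[] payload = new byte[payloadSize];
                new Random(42).nextBytes(payload); // random bytes barely compress
                zos.putNextEntry(new ZipEntry("base/dex/classes.dex"));
                zos.write(payload);
                zos.closeEntry();
                zos.putNextEntry(new ZipEntry("META-INF/MANIFEST.MF"));
                zos.write("Manifest-Version: 1.0\r\n".getBytes(StandardCharsets.US_ASCII));
                zos.closeEntry();
            }
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Look the entry up by name; ZipFile resolves it via the central directory.
    static boolean fileHasEntry(Path zip, String name) {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            return zf.getEntry(name) != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path zip = writeZip(200_000);
        try {
            System.out.println("manifest found: " + fileHasEntry(zip, "META-INF/MANIFEST.MF"));
        } finally {
            zip.toFile().delete(); // clean up the temporary file
        }
    }
}
```

Here the manifest is found regardless of the payload size, which matches the
behaviour we see once detection runs against the local file.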

Thus just calling getLength() makes the results inconsistent, and this could
be the case in other parsers too.

I am not sure that increasing the BoundedInputStream limit would solve this:
for an end user, getting different results depending on whether detection runs
on a stream or on a file would still be inconsistent.

We only have proprietary .aab files to test with, but any file larger than the
limit will have this issue. We confirmed that .ipa files are also affected.

If we agree on how to take this forward, I would be happy to contribute a fix.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
