[jira] [Commented] (TIKA-4347) Inconsistency with DefaultZipContainerDetector (and possibly others)

Tim Allison (Jira) Wed, 13 Nov 2024 06:53:05 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-4347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17897949#comment-17897949
 ]


Tim Allison commented on TIKA-4347:
-----------------------------------

This is a trade-off. I don't know what the best solution is, and we're grateful 
for feedback.

You can force Tika to cache the file to disk (as you're doing with getLength(), 
or via a setting in the AutoDetectParserConfig). This will yield the right 
detection for these files. Or, as you said, you can increase the limit on the 
BoundedInputStream.

If we made caching to disk the default, I can guarantee some users would be 
upset that detection is now slower.

If any of the above solutions work for you, let us know. Also, let us know how 
we can improve our documentation.

> Inconsistency with DefaultZipContainerDetector (and possibly others)
> --------------------------------------------------------------------
>
>                 Key: TIKA-4347
>                 URL: https://issues.apache.org/jira/browse/TIKA-4347
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-core
>    Affects Versions: 2.4.1
>            Reporter: Subbu
>            Priority: Major
>
> We were using Tika detection, and for aab files it was using 
> DefaultZipContainerDetector and returning the type application/zip. 
> We then started calling TikaInputStream.getLength() before detection to 
> check for empty files, and suddenly noticed that the content type returned 
> for the same file had changed.
> While digging deeper into this, we ran the following sequence as a test:
> _detect(foo.aab)_
> _tis.getLength();_
> _detect(foo.aab)_
>  
> and confirmed that the responses do change.
> The first time, Tika uses detectStreaming:
>  
> [https://github.com/apache/tika/blob/41302d5d6cb96248009b4641e45d76f75cf43195/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L256]
> There it wraps the input in a BoundedInputStream and reads only up to a 
> fixed limit. AAB files are typically large, and in our case the portion of 
> the zip read through the BoundedInputStream did not include MANIFEST.MF, 
> which JarDetector uses as detection context.
> Thus it returns application/zip.
> The second time, detection returns application/jar because, if there is a 
> UnixPath, Tika uses the local file:
> [https://github.com/apache/tika/blob/41302d5d6cb96248009b4641e45d76f75cf43195/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L187]
> This file has MANIFEST.MF, so the type is returned as application/jar.
> Thus just calling getLength() makes the responses inconsistent, and this 
> could be the case in other parsers too.
> I am not sure that increasing the BoundedInputStream limit would solve this: 
> for an end user, getting different responses depending on whether detection 
> runs from a stream or from a file may still be inconsistent.
> We only have proprietary aab files, but any file beyond the limit will have 
> this issue. We confirmed that .ipa files are also affected.
> If we agree on how to take this forward, we would be happy to contribute a fix. 
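
For readers without an aab file at hand, the mechanism described in the report 
can be reproduced with the JDK alone. The following is a minimal sketch, not 
Tika's code: the class name BoundedZipScan, the entry name base/big.bin, and 
the 16 KB limit are all made up for illustration, and the two type strings 
simply follow the wording of the report. It builds a zip whose manifest sits 
behind a large entry, then scans it once in full and once through a truncated 
view, which stands in for the bounded stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class; it does not use or mirror any Tika API.
class BoundedZipScan {

    static final String MANIFEST = "META-INF/MANIFEST.MF";

    // Builds a zip with one large entry followed by the manifest.
    // Random bytes are used so that compression cannot shrink the payload.
    static byte[] buildZip(int payloadSize) {
        byte[] payload = new byte[payloadSize];
        new Random(42).nextBytes(payload);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("base/big.bin"));
            zos.write(payload);
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry(MANIFEST));
            zos.write("Manifest-Version: 1.0\r\n".getBytes(StandardCharsets.US_ASCII));
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Lists the entry names that a streaming read manages to reach.
    static List<String> entryNames(InputStream in) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            // A truncated stream fails partway through; keep what was seen so far.
        }
        return names;
    }

    // Classifies the zip using only its first `limit` bytes.
    // The type strings follow the wording of the report.
    static String detect(byte[] zip, int limit) {
        int visible = Math.min(limit, zip.length);
        List<String> names = entryNames(new ByteArrayInputStream(zip, 0, visible));
        return names.contains(MANIFEST) ? "application/jar" : "application/zip";
    }

    public static void main(String[] args) {
        byte[] zip = buildZip(64 * 1024);
        System.out.println("bounded: " + detect(zip, 16 * 1024));
        System.out.println("full:    " + detect(zip, Integer.MAX_VALUE));
    }
}
```

In this sketch the same bytes are classified differently depending only on how 
much of the stream the scan is allowed to read, which is the trade-off under 
discussion: a higher limit finds the manifest in more files but reads more 
data before giving an answer.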



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
