[ 
https://issues.apache.org/jira/browse/TIKA-4347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17922280#comment-17922280
 ] 

Subbu commented on TIKA-4347:
-----------------------------

[~tallison]: I agree that making caching to disk the default might be costlier, 
but I don't see a way to increase the size limit of the BoundedInputStream. 
Would it be fine if I made it configurable with a property, such as an 
environment property or whatever mechanism Tika prefers for configuration, and 
created a PR? 
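
To make the proposal concrete, here is a minimal sketch of how such a limit 
could be resolved from a property. It uses only the JDK and is not Tika code: 
the property name and the default value are hypothetical placeholders, and the 
actual mechanism would be whatever Tika prefers for configuration.

```java
import java.util.Properties;

/** Sketch only: resolves a streaming-detection byte limit from a property. */
public class DetectionLimitConfig {

    // Hypothetical property name; this is not an existing Tika setting.
    public static final String LIMIT_PROPERTY = "tika.zip.streamingLimitBytes";

    // Placeholder default for illustration; this is not Tika's actual limit.
    public static final int DEFAULT_LIMIT = 16 * 1024 * 1024;

    /** Returns the configured limit, or the default if unset or invalid. */
    public static int resolveLimit(Properties props) {
        String raw = props.getProperty(LIMIT_PROPERTY);
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_LIMIT;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            // Reject zero and negative values rather than disabling the bound.
            return value > 0 ? value : DEFAULT_LIMIT;
        } catch (NumberFormatException e) {
            return DEFAULT_LIMIT;
        }
    }

    public static void main(String[] args) {
        // Reads JVM system properties, e.g. -Dtika.zip.streamingLimitBytes=...
        System.out.println(resolveLimit(System.getProperties()));
    }
}
```

Falling back to the default on invalid input keeps detection bounded even when 
the property is misconfigured.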

> Inconsistency with DefaultZipContainerDetector (and possibly others)
> --------------------------------------------------------------------
>
>                 Key: TIKA-4347
>                 URL: https://issues.apache.org/jira/browse/TIKA-4347
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-core
>    Affects Versions: 2.4.1
>            Reporter: Subbu
>            Priority: Major
>
> We were using Tika detection, and for .aab files it was using 
> DefaultZipContainerDetector and returning the type application/zip. 
> We then started calling TikaInputStream.getLength() before detection to check 
> for empty files, and suddenly noticed that the content type returned for the 
> same file had changed.
> While digging deeper into this, we ran the following sequence for testing:
> _detect(foo.aab)_
> _tis.getLength();_
> _detect(foo.aab)_
>  
> and confirmed that the responses do change.
> The first time, Tika uses detectStreaming:
>  
> [https://github.com/apache/tika/blob/41302d5d6cb96248009b4641e45d76f75cf43195/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L256]
> There it sets up a BoundedInputStream and reads only up to a certain limit. 
> AAB files are typically large, and in our case the portion of the zip read 
> through the BoundedInputStream did not contain MANIFEST.MF, which JarDetector 
> uses as its detection context.
> Thus it returns application/zip.
> The second time, detection returns application/jar because, if there is a 
> UnixPath, Tika uses the local file:
> [https://github.com/apache/tika/blob/41302d5d6cb96248009b4641e45d76f75cf43195/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L187]
> The local file contains MANIFEST.MF, so the type is returned as 
> application/jar.
> Thus just calling getLength() makes the responses inconsistent, and this 
> could be the case in other parsers too.
> I am not sure that increasing the BoundedInputStream limit would solve this, 
> because for an end user, getting different responses depending on whether 
> detection runs from a stream or from a file is inconsistent.
> I have only proprietary aab files, but any file beyond the limit will have 
> this issue. I have confirmed that .ipa files are also affected.
> If we agree on how to take this forward, I would be happy to contribute a fix. 
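
Since the affected files are proprietary, the mechanism described above can be 
illustrated with a synthetic archive. The sketch below is not Tika code: it 
uses java.util.zip as a stand-in for the streaming detection path, and the 
entry names, payload size and limit are made up for illustration. It builds a 
zip whose manifest comes after a large entry, then scans it once through a 
truncated prefix and once in full.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class BoundedZipScanDemo {

    /** Builds a zip whose first entry is large and whose manifest comes last. */
    public static byte[] buildZip(int payloadSize) throws IOException {
        byte[] payload = new byte[payloadSize];
        new Random(42).nextBytes(payload); // random bytes barely compress
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("base/assets/big.bin")); // made-up name
            zos.write(payload);
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("META-INF/MANIFEST.MF"));
            zos.write("Manifest-Version: 1.0\n".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    /** Streams at most limit bytes and reports whether the manifest was seen. */
    public static boolean seesManifest(byte[] zip, int limit) {
        // Simulate a bounded stream by keeping only the first limit bytes.
        byte[] head = Arrays.copyOf(zip, Math.min(limit, zip.length));
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(head))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if ("META-INF/MANIFEST.MF".equals(entry.getName())) {
                    return true;
                }
            }
        } catch (IOException truncated) {
            // The bounded data ended before the manifest entry was reached.
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = buildZip(64 * 1024);
        System.out.println("bounded scan sees manifest: " + seesManifest(zip, 8 * 1024));
        System.out.println("full scan sees manifest:    " + seesManifest(zip, zip.length));
    }
}
```

Under these assumptions the bounded scan should miss the manifest while the 
full scan finds it, which mirrors the application/zip versus application/jar 
difference reported above.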



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
