[jira] [Commented] (TIKA-4441) InputStream is consumed by Tika.detect for certain files

Tim Allison (Jira) Wed, 25 Jun 2025 10:12:41 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17986194#comment-17986194
 ]


Tim Allison commented on TIKA-4441:
-----------------------------------

This is tricky. It was a breaking change, and I'm sorry for that.

The challenge is that we want to support a use case where we spool the entire 
stream to disk for accurate POIFS container detection. TikaInputStream allows 
that with {{getPath(-1)}}. The problem is that if there's an underlying 
stream, we can't spool the full stream to disk and then reset the underlying 
stream, so we have to set some limit. Before the last code change, the limit 
was 128MB.
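
To illustrate why a limit is needed, here is a minimal sketch of the general 
mark/reset pattern using only java.io. It is not Tika's detector code: the 
class MarkLimitSketch and the helper peek are hypothetical names made up for 
this example. The point is that a detector can only rewind the caller's 
stream if it reads no more than the limit it passed to mark().

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Illustrative only: plain java.io, not Tika's detector implementation.
final class MarkLimitSketch {

    // Hypothetical helper: reads at most markLimit bytes, then rewinds the
    // stream so the caller can still read it from the original position.
    // Returns the number of bytes that were inspected.
    static int peek(InputStream in, int markLimit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(markLimit);
        int seen = 0;
        try {
            try {
                // Never read past markLimit, otherwise reset() may fail.
                while (seen < markLimit && in.read() != -1) {
                    seen++;
                }
            } finally {
                in.reset();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[64];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println("bytes inspected: " + peek(in, 16));
        System.out.println("first byte after peek: " + in.read());
    }
}
```

In this sketch the stream is back at its first byte after peek returns. A 
detector that needs to see more than the mark limit cannot give that 
guarantee for a plain underlying stream, which is the trade-off described 
above.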

When I revert that now, everything works as expected.

I propose that we revert this in 3.x, and document this change in 4.x/main.

Users can configure the mark limit in both 3.x and 4.x with a config along 
these lines. I confirmed that this fixes the problem.
{noformat}
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <detectors>
    <detector class="org.gagravarr.tika.OggDetector"/>
    <detector class="org.apache.tika.detect.apple.BPListDetector"/>
    <detector class="org.apache.tika.detect.gzip.GZipSpecializationDetector"/>
    <detector class="org.apache.tika.detect.microsoft.POIFSContainerDetector">
      <params>
        <param name="markLimit" type="int">1200000</param>
      </params>
    </detector>
    <detector class="org.apache.tika.detect.ole.MiscOLEDetector"/>
    <detector class="org.apache.tika.detect.zip.DefaultZipContainerDetector">
      <params>
        <param name="markLimit" type="int">16777216</param>
      </params>
    </detector>
    <detector class="org.apache.tika.mime.MimeTypes"/>
  </detectors>
</properties>
{noformat}
 

> InputStream is consumed by Tika.detect for certain files
> --------------------------------------------------------
>
>                 Key: TIKA-4441
>                 URL: https://issues.apache.org/jira/browse/TIKA-4441
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 3.2.0, 3.2.1
>            Reporter: Alvaro
>            Priority: Major
>         Attachments: Test.doc, Test.ppt, Test.xls
>
>
> Hello,
> We've been using Tika version 3.1.0 to successfully detect MimeTypes from 
> files before uploading them to our S3.
> However, after upgrading to v3.2.0, we've noticed that the original 
> InputStream is being consumed entirely for certain file extensions.
> The affected extensions seem to be all for Microsoft files, pointing us to 
> the POIFSContainerDetector, which was actually changed for this release. 
> This is the list of extensions we've tested with errors: doc, docx, odt, ppt, 
> pptx, xls, xlsx
> And these ones work as before: bmp, csv, gif, jpeg, jpg, pdf, png, rtf, svg, 
> txt
>  
> Here's some code to reproduce the issue:
> {code:java}
> import java.io.IOException;
> import java.io.InputStream;
>
> import org.apache.tika.Tika;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> // Assumption: ClassPathResource is Spring's class; the report does not say.
> import org.springframework.core.io.ClassPathResource;
>
> class TikaBugReport {
>     // affected extensions: doc, docx, odt, ppt, pptx, xls, xlsx
>     public static void main(String[] args) throws IOException {
>         String fileName = "Test.docx";
>         InputStream inputStream = new ClassPathResource(fileName).getInputStream();
>         checkFileMime(inputStream, fileName);
>     }
>
>     public static void checkFileMime(InputStream inputStream, String fileName) {
>         try {
>             Tika tika = new Tika();
>             System.out.println("InputStream available bytes before processing: " + inputStream.available());
>             System.out.println("InputStream supports mark: " + inputStream.markSupported());
>             Metadata metadata = new Metadata();
>             TikaInputStream tikaInputStream = TikaInputStream.get(inputStream);
>             System.out.println("Original InputStream available bytes after TikaInputStream.get(): " + inputStream.available());
>             String mimeType = tika.detect(tikaInputStream, metadata);
>             // Debug: Check state after detection
>             System.out.println("Original InputStream available bytes after tika.detect(): " + inputStream.available());
>             System.out.println("TikaInputStream available bytes after tika.detect(): " + tikaInputStream.available());
>             if (inputStream.available() == 0) {
>                 throw new IllegalStateException("InputStream is empty after TikaInputStream creation");
>             }
>         } catch (Exception e) {
>             System.out.printf("Mime check exception for file '%s': [%s]%n", fileName, e.getMessage());
>         }
>     }
> }
> {code}
> After testing version 3.2.1, the issue is fixed for most file extensions, but 
> .doc, .ppt and .xls files are still failing. Sample files are attached.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
