[jira] (TIKA-4441) InputStream is consumed by Tika.detect for certain files

Tilman Hausherr (Jira) Wed, 25 Jun 2025 04:18:05 -0700


    [ https://issues.apache.org/jira/browse/TIKA-4441 ]



    Tilman Hausherr deleted comment on TIKA-4441:
    ---------------------------------------

was (Author: tilman):
Some intermediate results:
{code:java}
MediaTypeRegistry defaultRegistry = MediaTypeRegistry.getDefaultRegistry();
MediaType type = MediaType.OCTET_STREAM;
for (Detector detector : firstDetector.getDetectors())
{
    System.out.println("Original InputStream available bytes before tika.detect() with "
            + detector.getClass().getSimpleName() + ": " + is2.available());
    MediaType detected = detector.detect(tikaInputStream, metadata);
    System.out.println("Original InputStream available bytes after  tika.detect() with "
            + detector.getClass().getSimpleName() + ": " + is2.available()
            + ", detected: " + detected);
    if (defaultRegistry.isSpecializationOf(detected, type))
    {
        type = detected;
    }
}
System.out.println("type: " + type);
{code}

output:

Original InputStream available bytes before tika.detect() with OggDetector: 98304
Original InputStream available bytes after  tika.detect() with OggDetector: 98304, detected: application/octet-stream
Original InputStream available bytes before tika.detect() with BPListDetector: 98304
Original InputStream available bytes after  tika.detect() with BPListDetector: 98304, detected: application/octet-stream
Original InputStream available bytes before tika.detect() with GZipSpecializationDetector: 98304
Original InputStream available bytes after  tika.detect() with GZipSpecializationDetector: 98304, detected: application/octet-stream
Original InputStream available bytes before tika.detect() with POIFSContainerDetector: 98304
Original InputStream available bytes after  tika.detect() with POIFSContainerDetector: 0, detected: application/msword
Original InputStream available bytes before tika.detect() with MiscOLEDetector: 0
Original InputStream available bytes after  tika.detect() with MiscOLEDetector: 0, detected: application/x-tika-msoffice
Original InputStream available bytes before tika.detect() with DefaultZipContainerDetector: 0
Original InputStream available bytes after  tika.detect() with DefaultZipContainerDetector: 0, detected: application/octet-stream
Original InputStream available bytes before tika.detect() with MimeTypes: 0
Original InputStream available bytes after  tika.detect() with MimeTypes: 0, detected: application/x-tika-msoffice
type: application/msword

{{POIFSContainerDetector}} is the first detector after which the original stream reports 0 available bytes, so I guess we'll have a look at it.
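
For reference, a detector is expected to peek at the stream and then rewind it, so the caller sees the same bytes afterwards. Below is a minimal JDK-only sketch of that mark/reset pattern. The class and method names ({{PeekSketch}}, {{peekMagic}}) are made up for illustration; this is not Tika's actual implementation, only the behaviour the output above suggests is being violated.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical illustration, not part of Tika.
public class PeekSketch {

    // Reads up to n leading bytes, then rewinds so the caller still sees the whole stream.
    // Requires a stream that supports mark/reset (e.g. BufferedInputStream).
    public static byte[] peekMagic(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);
        try {
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int r = in.read(buf, total, n - total);
                if (r < 0) {
                    break; // stream shorter than n bytes
                }
                total += r;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            in.reset(); // undo the read, whatever happened above
        }
    }

    public static void main(String[] args) throws IOException {
        // Same size as the test file in the output above; contents are dummy zeros.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(new byte[98304]));
        System.out.println("available before peek: " + in.available());
        byte[] magic = peekMagic(in, 8);
        System.out.println("peeked " + magic.length + " bytes, available after peek: " + in.available());
    }
}
```

With a detector that behaves like this, the "before" and "after" counts should match, which is what the first three detectors in the output show.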

> InputStream is consumed by Tika.detect for certain files
> --------------------------------------------------------
>
>                 Key: TIKA-4441
>                 URL: https://issues.apache.org/jira/browse/TIKA-4441
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 3.2.0, 3.2.1
>            Reporter: Alvaro
>            Priority: Major
>         Attachments: Test.doc, Test.ppt, Test.xls
>
>
> Hello,
> We've been using Tika version 3.1.0 to successfully detect MimeTypes from 
> files before uploading them to our S3.
> However, after the v3.2.0 upgrade, we've noticed that the original InputStream is 
> being consumed entirely for certain file extensions.
> The affected extensions seem to be all for Microsoft files, pointing us to 
> the POIFSContainerDetector, which was actually changed for this release. 
> This is the list of extensions we've tested with errors: doc, docx, odt, ppt, 
> pptx, xls, xlsx
> And these ones work as before: bmp, csv, gif, jpeg, jpg, pdf, png, rtf, svg, 
> txt
>  
> Here's some code to reproduce the issue:
> {code:java}
> class TikaBugReport {
>     // affected extensions: doc, docx, odt, ppt, pptx, xls, xlsx
>     public static void main(String[] args) throws IOException {
>         String fileName = "Test.docx";
>         InputStream inputStream = new ClassPathResource(fileName).getInputStream();
>         checkFileMime(inputStream, fileName);
>     }
>     public static void checkFileMime(InputStream inputStream, String fileName) {
>         try {
>             Tika tika = new Tika();
>             System.out.println("InputStream available bytes before processing: " + inputStream.available());
>             System.out.println("InputStream supports mark: " + inputStream.markSupported());
>             Metadata metadata = new Metadata();
>             TikaInputStream tikaInputStream = TikaInputStream.get(inputStream);
>             System.out.println("Original InputStream available bytes after TikaInputStream.get(): " + inputStream.available());
>             String mimeType = tika.detect(tikaInputStream, metadata);
>             // Debug: Check state after detection
>             System.out.println("Original InputStream available bytes after tika.detect(): " + inputStream.available());
>             System.out.println("TikaInputStream available bytes after tika.detect(): " + tikaInputStream.available());
>             if (inputStream.available() == 0) {
>                 throw new IllegalStateException("InputStream is empty after TikaInputStream creation");
>             }
>         } catch (Exception e) {
>             System.out.printf("Mime check exception for file '%s': [%s]%n", fileName, e.getMessage());
>         }
>     }
> }
> {code}
> After testing version 3.2.1, the issue is fixed for most file extensions, but 
> .doc, .ppt and .xls files are still failing. Sample files are attached.
>  
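
A possible caller-side guard for the detect-then-upload scenario described above, as a rough sketch only: copy the bytes once, run detection on one view and upload from another, so a detector that drains its input cannot affect the upload. {{DetectThenUploadSketch}}, {{StreamDetector}} and {{detectOnCopy}} are hypothetical names, and the detector is a stand-in for whatever detection call is actually used. This holds the whole file in memory, so it is probably only reasonable for small uploads.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration, not part of Tika.
public class DetectThenUploadSketch {

    // Stand-in for the real detection call (assumption: something like in -> tika.detect(in)).
    public interface StreamDetector {
        String detect(InputStream in) throws IOException;
    }

    // Detected type plus a fresh, unread stream over the same bytes.
    public static final class Result {
        public final String mimeType;
        public final InputStream content;

        Result(String mimeType, InputStream content) {
            this.mimeType = mimeType;
            this.content = content;
        }
    }

    public static Result detectOnCopy(InputStream source, StreamDetector detector) throws IOException {
        // Buffer the source fully so that detection and upload read independent views.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int r;
        while ((r = source.read(chunk)) != -1) {
            out.write(chunk, 0, r);
        }
        byte[] bytes = out.toByteArray();
        String type = detector.detect(new ByteArrayInputStream(bytes));
        return new Result(type, new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws IOException {
        // Simulated misbehaving detector: drains everything it is given.
        StreamDetector greedy = in -> {
            while (in.read() != -1) {
                // discard
            }
            return "application/msword";
        };
        Result result = detectOnCopy(new ByteArrayInputStream(new byte[98304]), greedy);
        System.out.println(result.mimeType + ", bytes left for upload: " + result.content.available());
    }
}
```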



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
