Alvaro created TIKA-4441:
----------------------------

             Summary: InputStream is consumed by Tika.detect for certain files
                 Key: TIKA-4441
                 URL: https://issues.apache.org/jira/browse/TIKA-4441
             Project: Tika
          Issue Type: Bug
    Affects Versions: 3.2.0, 3.2.1
            Reporter: Alvaro
         Attachments: Test.doc, Test.ppt, Test.xls

Hello,
We've been using Tika 3.1.0 to detect the MIME types of files before uploading 
them to S3.
However, after upgrading to 3.2.0, we noticed that the original InputStream is 
consumed entirely for certain file extensions.
The affected extensions seem to be mostly Microsoft Office formats, pointing us 
to the POIFSContainerDetector, which was changed in this release.
Extensions that fail in our tests: doc, docx, odt, ppt, pptx, xls, xlsx
Extensions that work as before: bmp, csv, gif, jpeg, jpg, pdf, png, rtf, svg, txt
 
Here's some code to reproduce the issue:
{code:java}
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.springframework.core.io.ClassPathResource;

class TikaBugReport {

    // affected extensions: doc, docx, odt, ppt, pptx, xls, xlsx
    public static void main(String[] args) throws IOException {
        String fileName = "Test.docx";
        InputStream inputStream = new ClassPathResource(fileName).getInputStream();
        checkFileMime(inputStream, fileName);
    }

    public static void checkFileMime(InputStream inputStream, String fileName) {
        try {
            Tika tika = new Tika();
            System.out.println("InputStream available bytes before processing: "
                    + inputStream.available());
            System.out.println("InputStream supports mark: "
                    + inputStream.markSupported());

            Metadata metadata = new Metadata();

            // Wrap the original stream; detection runs on the wrapper
            TikaInputStream tikaInputStream = TikaInputStream.get(inputStream);
            System.out.println("Original InputStream available bytes after TikaInputStream.get(): "
                    + inputStream.available());

            String mimeType = tika.detect(tikaInputStream, metadata);
            System.out.println("Detected MIME type: " + mimeType);

            // Debug: check the state of both streams after detection
            System.out.println("Original InputStream available bytes after tika.detect(): "
                    + inputStream.available());
            System.out.println("TikaInputStream available bytes after tika.detect(): "
                    + tikaInputStream.available());
            if (inputStream.available() == 0) {
                throw new IllegalStateException("InputStream is empty after tika.detect()");
            }

        } catch (Exception e) {
            System.out.printf("Mime check exception for file '%s': [%s]%n",
                    fileName, e.getMessage());
        }
    }
}{code}
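
For reference, the behaviour the reproduction expects from detection (peek at the header, then leave the stream where it was) can be sketched with the JDK alone. This is only an illustration of the mark/reset contract, not Tika's implementation: `MarkResetSketch` and `peekHeader` are hypothetical names, and the 8-byte header size is an arbitrary assumption.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class MarkResetSketch {

    // Hypothetical stand-in for a detector: reads up to `limit` header bytes,
    // then rewinds so the caller sees the stream at its original position.
    static byte[] peekHeader(InputStream in, int limit) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        try {
            byte[] header = in.readNBytes(limit);
            in.reset();
            return header;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // 1024 zero bytes stand in for a real file (assumption for the demo)
        InputStream in = new BufferedInputStream(
                new ByteArrayInputStream(new byte[1024]));
        int before = in.available();
        byte[] header = peekHeader(in, 8);
        int after = in.available();
        System.out.println("header bytes read: " + header.length);
        System.out.println("available before=" + before + ", after=" + after);
    }
}
```

With a well-behaved peek, the available byte count is the same before and after the call, which is the property the reproduction above checks for `tika.detect()`.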


After testing with version 3.2.1, the issue is fixed for most file extensions, 
but .doc, .ppt and .xls still fail. Sample files are attached.
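
Until the remaining formats behave as before, one possible caller-side workaround is to stop sharing a single stream between detection and upload. The sketch below uses only the JDK and assumes the files are small enough to hold in memory; `IndependentStreamsSketch`, `buffer` and `drain` are hypothetical names, and `drain` merely stands in for any consumer (detection or upload) that reads its stream to the end.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class IndependentStreamsSketch {

    // Reads the source once into memory and closes it.
    // Assumes the content fits comfortably in the heap.
    static byte[] buffer(InputStream source) {
        try (InputStream in = source) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical consumer that reads its stream to the end
    // and reports how many bytes it saw.
    static int drain(InputStream in) {
        try {
            return in.readAllBytes().length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 2048 zero bytes stand in for a real file (assumption for the demo)
        byte[] content = buffer(new ByteArrayInputStream(new byte[2048]));

        // Each consumer gets its own stream over the same bytes, so one
        // draining its stream cannot affect the other.
        int seenByDetection = drain(new ByteArrayInputStream(content));
        int seenByUpload = drain(new ByteArrayInputStream(content));
        System.out.println("detection saw " + seenByDetection
                + " bytes, upload saw " + seenByUpload + " bytes");
    }
}
```

The trade-off is memory: the whole file is held in a byte array, so this only suits uploads of moderate size.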
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
