Alvaro created TIKA-4441:
----------------------------

             Summary: InputStream is consumed by Tika.detect for certain files
                 Key: TIKA-4441
                 URL: https://issues.apache.org/jira/browse/TIKA-4441
             Project: Tika
          Issue Type: Bug
    Affects Versions: 3.2.0, 3.2.1
            Reporter: Alvaro
         Attachments: Test.doc, Test.ppt, Test.xls

Hello,

We've been using Tika 3.1.0 to detect the MIME types of files before uploading them to our S3. After upgrading to 3.2.0, we noticed that the original InputStream is consumed entirely for certain file extensions. The affected extensions all seem to belong to Microsoft files, which points us to the POIFSContainerDetector, which was changed in this release.

Extensions that fail in our tests:
doc, docx, odt, ppt, pptx, xls, xlsx

Extensions that work as before:
bmp, csv, gif, jpeg, jpg, pdf, png, rtf, svg, txt

Here's some code to reproduce the issue:

{code:java}
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
// Assumed to be Spring's ClassPathResource; the report does not show its imports.
import org.springframework.core.io.ClassPathResource;

class TikaBugReport {

    // affected extensions: doc, docx, odt, ppt, pptx, xls, xlsx
    public static void main(String[] args) throws IOException {
        String fileName = "Test.docx";
        InputStream inputStream = new ClassPathResource(fileName).getInputStream();
        checkFileMime(inputStream, fileName);
    }

    public static void checkFileMime(InputStream inputStream, String fileName) {
        try {
            Tika tika = new Tika();

            // State of the original stream before Tika touches it
            System.out.println("InputStream available bytes before processing: " + inputStream.available());
            System.out.println("InputStream supports mark: " + inputStream.markSupported());

            Metadata metadata = new Metadata();
            TikaInputStream tikaInputStream = TikaInputStream.get(inputStream);
            System.out.println("Original InputStream available bytes after TikaInputStream.get(): " + inputStream.available());

            String mimeType = tika.detect(tikaInputStream, metadata);

            // Debug: Check state after detection
            System.out.println("Original InputStream available bytes after tika.detect(): " + inputStream.available());
            System.out.println("TikaInputStream available bytes after tika.detect(): " + tikaInputStream.available());

            if (inputStream.available() == 0) {
                throw new IllegalStateException("InputStream is empty after TikaInputStream creation");
            }
        } catch (Exception e) {
            System.out.printf("Mime check exception for file '%s': [%s]%n", fileName, e.getMessage());
        }
    }
}
{code}

After testing version 3.2.1, the issue is fixed for most file extensions, but .doc, .ppt and .xls files still fail. Sample files are attached.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
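
The check the reproduction performs (compare the stream's remaining bytes before and after detection) can be isolated from Tika in a JDK-only sketch. Everything below is hypothetical and for illustration: `StreamDetector`, `peekingDetector` and `drainingDetector` are stand-ins, not Tika APIs, and the sketch says nothing about how Tika's detectors are implemented. It only contrasts a detector that marks and resets the stream with one that reads it to the end. Note that `available()` is exact for in-memory streams such as `ByteArrayInputStream` but only an estimate for other stream types.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class StreamConsumptionCheck {

    /** Hypothetical stand-in for a detection call that takes a stream. */
    interface StreamDetector {
        String detect(InputStream in) throws IOException;
    }

    /**
     * Runs the detector and returns how many bytes are no longer available
     * afterwards. Relies on available(), so it is only exact for in-memory streams.
     */
    static int bytesConsumedBy(StreamDetector detector, InputStream in) throws IOException {
        int before = in.available();
        detector.detect(in);
        return before - in.available();
    }

    /** Marks the stream, peeks at up to 8 header bytes, then resets. */
    static String peekingDetector(InputStream in) throws IOException {
        in.mark(8);
        byte[] header = new byte[8];
        int n = in.read(header);
        in.reset();
        boolean looksLikeZip = n >= 2 && header[0] == 'P' && header[1] == 'K';
        return looksLikeZip ? "application/zip" : "application/octet-stream";
    }

    /** Reads the whole stream and never resets it. */
    static String drainingDetector(InputStream in) throws IOException {
        in.readAllBytes();
        return "application/octet-stream";
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "PK-not-a-real-archive".getBytes(StandardCharsets.US_ASCII);
        System.out.println("peeking detector consumed:  "
                + bytesConsumedBy(StreamConsumptionCheck::peekingDetector, new ByteArrayInputStream(data)));
        System.out.println("draining detector consumed: "
                + bytesConsumedBy(StreamConsumptionCheck::drainingDetector, new ByteArrayInputStream(data)));
    }
}
```

The behavior reported above corresponds to the second case: after detection, the caller's stream has no bytes left for the subsequent upload.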