[ https://issues.apache.org/jira/browse/TIKA-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17985773#comment-17985773 ]
Alvaro commented on TIKA-4441:
------------------------------

It was working like that in version 3.1.0. If there are any alternatives, I'm happy to try them.

> InputStream is consumed by Tika.detect for certain files
> --------------------------------------------------------
>
>                 Key: TIKA-4441
>                 URL: https://issues.apache.org/jira/browse/TIKA-4441
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 3.2.0, 3.2.1
>            Reporter: Alvaro
>            Priority: Major
>         Attachments: Test.doc, Test.ppt, Test.xls
>
>
> Hello,
> We've been using Tika version 3.1.0 to successfully detect MIME types from
> files before uploading them to S3.
> However, after upgrading to v3.2.0, we've noticed that the original
> InputStream is consumed entirely for certain file extensions.
> The affected extensions all seem to be for Microsoft files, which points us
> to the POIFSContainerDetector, which was changed in this release.
> These are the extensions that failed in our tests: doc, docx, odt, ppt,
> pptx, xls, xlsx
> And these work as before: bmp, csv, gif, jpeg, jpg, pdf, png, rtf, svg, txt
>
> Here's some code to reproduce the issue:
> {code:java}
> import java.io.IOException;
> import java.io.InputStream;
>
> import org.apache.tika.Tika;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.springframework.core.io.ClassPathResource;
>
> class TikaBugReport {
>     // affected extensions: doc, docx, odt, ppt, pptx, xls, xlsx
>     public static void main(String[] args) throws IOException {
>         String fileName = "Test.docx";
>         InputStream inputStream = new ClassPathResource(fileName).getInputStream();
>         checkFileMime(inputStream, fileName);
>     }
>
>     public static void checkFileMime(InputStream inputStream, String fileName) {
>         try {
>             Tika tika = new Tika();
>             System.out.println("InputStream available bytes before processing: " + inputStream.available());
>             System.out.println("InputStream supports mark: " + inputStream.markSupported());
>             Metadata metadata = new Metadata();
>             TikaInputStream tikaInputStream = TikaInputStream.get(inputStream);
>             System.out.println("Original InputStream available bytes after TikaInputStream.get(): " + inputStream.available());
>             String mimeType = tika.detect(tikaInputStream, metadata);
>             // Debug: Check state after detection
>             System.out.println("Original InputStream available bytes after tika.detect(): " + inputStream.available());
>             System.out.println("TikaInputStream available bytes after tika.detect(): " + tikaInputStream.available());
>             if (inputStream.available() == 0) {
>                 // The original stream has been drained by detection
>                 throw new IllegalStateException("InputStream is empty after tika.detect()");
>             }
>         } catch (Exception e) {
>             System.out.printf("Mime check exception for file '%s': [%s]%n", fileName, e.getMessage());
>         }
>     }
> }{code}
> After testing version 3.2.1, the issue is fixed for most file extensions, but
> the .doc, .ppt and .xls extensions are still failing. Sample files are
> attached.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
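
On the question of alternatives: one possible workaround, sketched here with the JDK only, is to copy the upload into memory once, run detection on a throwaway stream over that copy, and build a fresh stream from the same bytes for the S3 upload. Everything named below (`StreamReuseSketch`, `greedyDetect`, `Detected`, `detectWithoutLosingBytes`) is hypothetical and not part of Tika; `greedyDetect` is only a stand-in that drains its input the way the report describes, so the sketch runs without Tika on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamReuseSketch {

    /** Hypothetical stand-in for a detector that reads its whole input (not Tika's API). */
    static String greedyDetect(InputStream in) throws IOException {
        byte[] all = in.readAllBytes(); // drains the stream, mimicking the reported behavior
        // First four bytes of the OLE2 compound file signature used by .doc/.ppt/.xls
        boolean ole2 = all.length >= 4
                && (all[0] & 0xFF) == 0xD0 && (all[1] & 0xFF) == 0xCF
                && (all[2] & 0xFF) == 0x11 && (all[3] & 0xFF) == 0xE0;
        return ole2 ? "ole2" : "unknown";
    }

    /** Detection result plus the buffered bytes, so the content can be read again. */
    static final class Detected {
        final String type;
        final byte[] content;

        Detected(String type, byte[] content) {
            this.type = type;
            this.content = content;
        }

        /** A new, unread stream over the buffered content (for example, for the upload). */
        InputStream freshStream() {
            return new ByteArrayInputStream(content);
        }
    }

    /** Buffers the source once, then detects on a separate stream over the copy. */
    static Detected detectWithoutLosingBytes(InputStream source) throws IOException {
        byte[] content = source.readAllBytes();
        String type = greedyDetect(new ByteArrayInputStream(content));
        return new Detected(type, content);
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, 1, 2, 3, 4};
        Detected d = detectWithoutLosingBytes(new ByteArrayInputStream(sample));
        // The detector drained its own stream, but the buffered copy is intact.
        System.out.println(d.type + ", bytes still available: " + d.freshStream().available());
    }
}
```

In the reproduction code, the `greedyDetect` call would be replaced by `tika.detect(...)` on a stream over the buffered copy. This holds the whole file in memory, so it assumes uploads are reasonably small; for large files, spooling to a temporary file and reopening it would be the analogous approach.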