[ https://issues.apache.org/jira/browse/TIKA-4441 ]
Tilman Hausherr deleted comment on TIKA-4441:
---------------------------------------

    was (Author: tilman):
Some intermediate result
{code:java}
MediaTypeRegistry defaultRegistry = MediaTypeRegistry.getDefaultRegistry();
MediaType type = MediaType.OCTET_STREAM;
for (Detector detector : firstDetector.getDetectors()) {
    System.out.println("Original InputStream available bytes before tika.detect() with " + detector.getClass().getSimpleName() + ": " + is2.available());
    MediaType detected = detector.detect(tikaInputStream, metadata);
    System.out.println("Original InputStream available bytes after tika.detect() with " + detector.getClass().getSimpleName() + ": " + is2.available() + ", detected: " + detected);
    if (defaultRegistry.isSpecializationOf(detected, type)) {
        type = detected;
    }
}
System.out.println("type: " + type);
{code}
output:

Original InputStream available bytes before tika.detect() with OggDetector: 98304
Original InputStream available bytes after tika.detect() with OggDetector: 98304, detected: application/octet-stream
Original InputStream available bytes before tika.detect() with BPListDetector: 98304
Original InputStream available bytes after tika.detect() with BPListDetector: 98304, detected: application/octet-stream
Original InputStream available bytes before tika.detect() with GZipSpecializationDetector: 98304
Original InputStream available bytes after tika.detect() with GZipSpecializationDetector: 98304, detected: application/octet-stream
Original InputStream available bytes before tika.detect() with POIFSContainerDetector: 98304
Original InputStream available bytes after tika.detect() with POIFSContainerDetector: 0, detected: application/msword
Original InputStream available bytes before tika.detect() with MiscOLEDetector: 0
Original InputStream available bytes after tika.detect() with MiscOLEDetector: 0, detected: application/x-tika-msoffice
Original InputStream available bytes before tika.detect() with DefaultZipContainerDetector: 0
Original InputStream available bytes after tika.detect() with DefaultZipContainerDetector: 0, detected: application/octet-stream
Original InputStream available bytes before tika.detect() with MimeTypes: 0
Original InputStream available bytes after tika.detect() with MimeTypes: 0, detected: application/x-tika-msoffice
type: application/msword

So I guess we'll have a look at {{POIFSContainerDetector}}

> InputStream is consumed by Tika.detect for certain files
> --------------------------------------------------------
>
>                 Key: TIKA-4441
>                 URL: https://issues.apache.org/jira/browse/TIKA-4441
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 3.2.0, 3.2.1
>            Reporter: Alvaro
>            Priority: Major
>         Attachments: Test.doc, Test.ppt, Test.xls
>
>
> Hello,
> We've been using Tika version 3.1.0 to successfully detect MimeTypes from
> files before uploading them to our S3.
> However, after v3.2.0 upgrade, we've noticed that the original inputStream is
> being consumed entirely for certain file extensions.
> The affected extensions seem to be all for Microsoft files, pointing us to
> the POIFSContainerDetector, which was actually changed for this release.
> This is the list of extensions we've tested with errors: doc, docx, odt, ppt,
> pptx, xls, xlsx
> And these ones work as before: bmp, csv, gif, jpeg, jpg, pdf, png, rtf, svg,
> txt
>
> Here's some code to reproduce the issue:
> {code:java}
> class TikaBugReport {
>     // affected extensions: doc, docx, odt, ppt, pptx, xls, xlsx
>     public static void main(String[] args) throws IOException {
>         String fileName = "Test.docx";
>         InputStream inputStream = new ClassPathResource(fileName).getInputStream();
>         checkFileMime(inputStream, fileName);
>     }
>
>     public static void checkFileMime(InputStream inputStream, String fileName) {
>         try {
>             Tika tika = new Tika();
>             System.out.println("InputStream available bytes before processing: " + inputStream.available());
>             System.out.println("InputStream supports mark: " + inputStream.markSupported());
>             Metadata metadata = new Metadata();
>             TikaInputStream tikaInputStream = TikaInputStream.get(inputStream);
>             System.out.println("Original InputStream available bytes after TikaInputStream.get(): " + inputStream.available());
>             String mimeType = tika.detect(tikaInputStream, metadata);
>             // Debug: Check state after detection
>             System.out.println("Original InputStream available bytes after tika.detect(): " + inputStream.available());
>             System.out.println("TikaInputStream available bytes after tika.detect(): " + tikaInputStream.available());
>             if (inputStream.available() == 0) {
>                 throw new IllegalStateException("InputStream is empty after TikaInputStream creation");
>             }
>         } catch (Exception e) {
>             System.out.printf("Mime check exception for file '%s': [%s]%n", fileName, e.getMessage());
>         }
>     }
> }
> {code}
> After testing version 3.2.1, the issue is fixed for most file extensions, but
> .doc, .ppt and .xls extensions are still failing. Find sample files attached
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
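
For illustration only, here is a minimal sketch of the symptom described in this issue, written with plain java.io and no Tika classes. The class and method names (MarkResetSketch, consumeAll, peekAndReset) are hypothetical stand-ins and are not taken from Tika's source; the sketch only shows why available() drops to 0 when a reader drains a stream without resetting it, and stays unchanged when the reader marks, reads a bounded prefix, and resets.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class MarkResetSketch {

    // Hypothetical stand-in for a detector that drains the stream
    // and never resets it (assumption: NOT actual Tika code).
    static int consumeAll(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        int total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    // Hypothetical stand-in for a detector that marks the stream, reads at
    // most `limit` bytes, and always resets to the marked position.
    static int peekAndReset(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        try {
            byte[] buf = new byte[limit];
            int total = 0;
            int n;
            while (total < limit && (n = in.read(buf, total, limit - total)) != -1) {
                total += n;
            }
            return total;
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[98304]; // same size as the stream in the output above

        InputStream drained = new ByteArrayInputStream(data);
        consumeAll(drained);
        System.out.println("available after consumeAll: " + drained.available());

        InputStream peeked = new ByteArrayInputStream(data);
        peekAndReset(peeked, 1024);
        System.out.println("available after peekAndReset: " + peeked.available());
    }
}
```

A caller that needs the stream afterwards (for example to upload it) relies on the second pattern; the first pattern matches the 98304 to 0 drop seen in the per-detector output.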