[ https://issues.apache.org/jira/browse/TIKA-4441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17986196#comment-17986196 ]
Tim Allison commented on TIKA-4441:
-----------------------------------

{noformat}
@Test
public void testDetector() throws Exception {
    // Explicit detector list, with markLimit set on the two container detectors
    String config = """
            <?xml version="1.0" encoding="UTF-8" standalone="no"?>
            <properties>
              <detectors>
                <detector class="org.gagravarr.tika.OggDetector"/>
                <detector class="org.apache.tika.detect.apple.BPListDetector"/>
                <detector class="org.apache.tika.detect.gzip.GZipSpecializationDetector"/>
                <detector class="org.apache.tika.detect.microsoft.POIFSContainerDetector">
                  <params>
                    <param name="markLimit" type="int">120</param>
                  </params>
                </detector>
                <detector class="org.apache.tika.detect.ole.MiscOLEDetector"/>
                <detector class="org.apache.tika.detect.zip.DefaultZipContainerDetector">
                  <params>
                    <param name="markLimit" type="int">16777216</param>
                  </params>
                </detector>
                <detector class="org.apache.tika.mime.MimeTypes"/>
              </detectors>
            </properties>
            """;
    TikaConfig tikaConfig = new TikaConfig(
            new ByteArrayInputStream(config.getBytes(StandardCharsets.UTF_8)));
    Tika tika = new Tika(tikaConfig);
    try (InputStream is = new URI("https://issues.apache.org/jira/secure/attachment/13077181/Test.doc").toURL().openStream()) {
        // Buffer the attachment so the wrapped stream is a plain ByteArrayInputStream
        byte[] ba = is.readAllBytes();
        is.close();
        InputStream is2 = new ByteArrayInputStream(ba);
        TikaInputStream tikaInputStream = TikaInputStream.get(is2);
        System.out.println("InputStream available bytes before processing: " + is2.available());
        System.out.println("InputStream supports mark: " + is2.markSupported());
        System.out.println("Tika version: " + new Tika());
        String detected2 = tika.detect(tikaInputStream, new Metadata());
        System.out.println("Original InputStream available bytes after detect(): "
                + is2.available() + ", detected: " + detected2);
        // Count how many bytes can still be read from the original stream
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        int c = is2.read();
        while (c > -1) {
            baos.write(c);
            c = is2.read();
        }
        System.out.println("ACTUALLY READ " + baos.size());
    }
}
{noformat}

> InputStream is consumed by Tika.detect for certain files
> --------------------------------------------------------
>
>                  Key: TIKA-4441
>                  URL: https://issues.apache.org/jira/browse/TIKA-4441
>              Project: Tika
>           Issue Type: Bug
>     Affects Versions: 3.2.0, 3.2.1
>             Reporter: Alvaro
>             Priority: Major
>          Attachments: Test.doc, Test.ppt, Test.xls
>
>
> Hello,
> We've been using Tika version 3.1.0 to successfully detect MIME types of
> files before uploading them to our S3.
> However, after upgrading to v3.2.0, we've noticed that the original
> InputStream is consumed entirely for certain file extensions.
> The affected extensions all seem to belong to Microsoft files, which points
> us to the POIFSContainerDetector, which was changed in this release.
> These are the extensions that failed in our tests: doc, docx, odt, ppt,
> pptx, xls, xlsx
> And these work as before: bmp, csv, gif, jpeg, jpg, pdf, png, rtf, svg,
> txt
>
> Here's some code to reproduce the issue:
> {code:java}
> class TikaBugReport {
>     // affected extensions: doc, docx, odt, ppt, pptx, xls, xlsx
>     public static void main(String[] args) throws IOException {
>         String fileName = "Test.docx";
>         InputStream inputStream = new ClassPathResource(fileName).getInputStream();
>         checkFileMime(inputStream, fileName);
>     }
>
>     public static void checkFileMime(InputStream inputStream, String fileName) {
>         try {
>             Tika tika = new Tika();
>             System.out.println("InputStream available bytes before processing: " + inputStream.available());
>             System.out.println("InputStream supports mark: " + inputStream.markSupported());
>             Metadata metadata = new Metadata();
>             TikaInputStream tikaInputStream = TikaInputStream.get(inputStream);
>             System.out.println("Original InputStream available bytes after TikaInputStream.get(): " + inputStream.available());
>             String mimeType = tika.detect(tikaInputStream, metadata);
>             // Debug: Check state after detection
>             System.out.println("Original InputStream available bytes after tika.detect(): " + inputStream.available());
>             System.out.println("TikaInputStream available bytes after tika.detect(): " + tikaInputStream.available());
>             if (inputStream.available() == 0) {
>                 throw new IllegalStateException("InputStream is empty after TikaInputStream creation");
>             }
>         } catch (Exception e) {
>             System.out.printf("Mime check exception for file '%s': [%s]%n", fileName, e.getMessage());
>         }
>     }
> }{code}
> After testing version 3.2.1, the issue is fixed for most file extensions, but
> .doc, .ppt and .xls are still failing. Sample files are attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)