[ https://issues.apache.org/jira/browse/TIKA-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372063#comment-16372063 ]
Nick Burch commented on TIKA-2585:
----------------------------------

I can't immediately see a common / well-known class/interface we could accept - Spring has {{InputStreamSource}} in its core, but that's potentially a huge dependency to suck in for just one class. Various other libraries define their own {{InputStreamFactory}}, but I can't seem to find any of those in tiny libraries / the JVM core itself.

Before we have to create our own class/interface for passing this to {{TikaInputStream}}, does anyone know of a good one we can re-use?

> TikaInputStream support for resetting via a factory of InputStreams
> -------------------------------------------------------------------
>
>                 Key: TIKA-2585
>                 URL: https://issues.apache.org/jira/browse/TIKA-2585
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0, 1.17
>            Reporter: Nick Burch
>            Priority: Major
>
> As raised in the 2.0 breaking changes thread, currently the only way that
> Tika has of handling the need to fully read an InputStream multiple times is
> to use {{TikaInputStream.getFile()}}, which will spool to a temp file if not
> already file-based. (Reading a few KB is handled via buffering and
> mark/reset, but that doesn't scale to huge files.)
> In some cases, grabbing a fresh {{InputStream}} is actually cheaper than Tika
> spooling to a temp file, but a caller currently has no way of expressing that.
> So, before we make too much extra use of re-processing the whole input
> several times (e.g. for the augmenting-parsers and fallback-parsers), we should
> provide a way for callers to instead supply new {{InputStream}} instances on
> demand.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
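
For illustration only, the kind of interface asked about in the comment could be sketched with nothing but JDK classes. Everything here is an assumption made for the example: the nested {{InputStreamFactory}} interface (the name is borrowed from the comment), its {{getInputStream()}} method, and the {{countBytes}} / {{readRepeatedly}} helpers are hypothetical and are not an existing Tika or JDK API. The sketch only shows the idea from the issue description: each full pass over the data asks the caller for a fresh stream instead of spooling to a temp file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReopenableSourceSketch {

    /** Hypothetical interface; not an existing Tika or JDK type. */
    @FunctionalInterface
    interface InputStreamFactory {
        /** Returns a new stream positioned at the start of the data. */
        InputStream getInputStream() throws IOException;
    }

    /** Reads the stream to its end and returns the number of bytes seen. */
    static long countBytes(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    /**
     * Performs several full passes over the data. Each pass opens a fresh
     * stream from the factory, so no temp file or mark/reset is involved.
     */
    static long[] readRepeatedly(InputStreamFactory factory, int passes) throws IOException {
        long[] sizes = new long[passes];
        for (int i = 0; i < passes; i++) {
            try (InputStream in = factory.getInputStream()) {
                sizes[i] = countBytes(in);
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        // The caller decides how to reopen the data; here it is an in-memory array.
        InputStreamFactory factory = () -> new ByteArrayInputStream(data);
        long[] sizes = readRepeatedly(factory, 2);
        System.out.println(sizes[0] + " " + sizes[1]);
    }
}
```

A single-method interface keeps the caller's side to a lambda or method reference, which may matter if the goal is to avoid pulling in a large dependency for one type. How {{TikaInputStream}} would actually consume such a factory is left open here, as it is in the issue.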