[ https://issues.apache.org/jira/browse/TIKA-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Burch updated TIKA-2585:
-----------------------------
    Description:
As raised in the 2.0 breaking changes thread, currently the only way that Tika has of handling the need to fully read an InputStream multiple times is to use {{TikaInputStream.getFile()}}, which will spool to a temp file if not already file-based. (Reading a few KB is handled via buffering and mark/reset, but that doesn't scale for huge full files.)

In some cases, grabbing a fresh {{InputStream}} is actually cheaper than Tika spooling to a temp file, but we've no way of a caller expressing that.

So, before we make too much extra use of re-processing the whole input several times (e.g. for the augmenting-parsers and fallback-parsers), we should provide a way for callers to instead supply new {{InputStream}} instances on demand.

  was:
As raised in the 2.0 breaking changes thread, currently the only way that Tika has of handling the need to fully read an InputStream multiple times is to use `TikaInputStream.getFile()`, which will spool to a temp file if not already file-based. (Reading a few KB is handled via buffering and mark/reset, but that doesn't scale for huge full files.)

In some cases, grabbing a fresh `InputStream` is actually cheaper than Tika spooling to a temp file, but we've no way of a caller expressing that.

So, before we make too much extra use of re-processing the whole input several times (e.g. for the augmenting-parsers and fallback-parsers), we should provide a way for callers to instead supply new InputStream instances on demand.


> TikaInputStream support for resetting via a factory of InputStreams
> -------------------------------------------------------------------
>
>                 Key: TIKA-2585
>                 URL: https://issues.apache.org/jira/browse/TIKA-2585
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0, 1.17
>            Reporter: Nick Burch
>            Priority: Major
>
> As raised in the 2.0 breaking changes thread, currently the only way that
> Tika has of handling the need to fully read an InputStream multiple times is
> to use {{TikaInputStream.getFile()}}, which will spool to a temp file if not
> already file-based. (Reading a few KB is handled via buffering and
> mark/reset, but that doesn't scale for huge full files.)
> In some cases, grabbing a fresh {{InputStream}} is actually cheaper than Tika
> spooling to a temp file, but we've no way of a caller expressing that.
> So, before we make too much extra use of re-processing the whole input
> several times (e.g. for the augmenting-parsers and fallback-parsers), we should
> provide a way for callers to instead supply new {{InputStream}} instances on
> demand.
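
For illustration only, here is a rough sketch of what "supply new InputStream instances on demand" might look like from the caller's side. The issue does not define an API, so every name below (InputStreamFactory, ReopenableSource, reopen(), drain()) is a hypothetical stand-in rather than an existing or proposed Tika class, and the in-memory byte array merely stands in for a source that is cheap to reopen.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReopenableSourceSketch {

    /** Hypothetical callback: each call must return a fresh stream positioned at byte 0. */
    @FunctionalInterface
    public interface InputStreamFactory {
        InputStream getInputStream() throws IOException;
    }

    /** Hypothetical wrapper that "resets" by asking the factory for a new stream. */
    public static final class ReopenableSource implements AutoCloseable {
        private final InputStreamFactory factory;
        private InputStream current;
        private int opens;

        public ReopenableSource(InputStreamFactory factory) {
            this.factory = factory;
        }

        /** Closes the stream currently in use, if any, and returns a fresh one. */
        public InputStream reopen() throws IOException {
            close();
            current = factory.getInputStream();
            opens++;
            return current;
        }

        /** Number of streams requested from the factory so far. */
        public int timesOpened() {
            return opens;
        }

        @Override
        public void close() throws IOException {
            if (current != null) {
                current.close();
                current = null;
            }
        }
    }

    /** Stand-in for one full parser pass: reads to end of stream, returns the byte count. */
    public static long drain(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "pretend this is a huge document".getBytes(StandardCharsets.UTF_8);
        // The caller knows that reopening is cheap here, so no temp file is needed.
        try (ReopenableSource source = new ReopenableSource(() -> new ByteArrayInputStream(data))) {
            long firstPass = drain(source.reopen());   // e.g. a primary parser
            long secondPass = drain(source.reopen());  // e.g. a fallback parser
            System.out.println(firstPass + " " + secondPass + " " + source.timesOpened());
        }
    }
}
```

In this sketch each full pass gets its own stream and nothing is spooled to disk; whether that beats a temp file depends on how expensive the caller's source is to reopen, which is exactly the choice the issue wants callers to be able to express.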