Nick Burch created TIKA-2585:
--------------------------------

             Summary: TikaInputStream support for resetting via a factory of InputStreams
                 Key: TIKA-2585
                 URL: https://issues.apache.org/jira/browse/TIKA-2585
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.17, 2.0
            Reporter: Nick Burch


As raised in the 2.0 breaking changes thread, the only way Tika currently has
of fully reading an InputStream multiple times is to use
`TikaInputStream.getFile()`, which spools to a temp file if the stream is not
already file-based. (Reading the first few KB is handled via buffering and
mark/reset, but that doesn't scale to re-reading huge files in full.)
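
To illustrate the buffering and mark/reset approach mentioned above, here is a minimal sketch using only the JDK. It is not Tika's internal code; the class name `MarkResetSketch` and the helper `peek` are hypothetical names made up for this example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class, not part of Tika.
class MarkResetSketch {

    /** Reads up to n bytes from the current position, then rewinds the stream. */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n); // the stream has to buffer up to n bytes to allow the rewind
        byte[] head = new byte[n];
        int total = 0;
        while (total < n) {
            int read = in.read(head, total, n - total);
            if (read < 0) {
                break; // stream is shorter than n bytes
            }
            total += read;
        }
        in.reset(); // back to where mark() was called
        return Arrays.copyOf(head, total);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.7 rest of document".getBytes(StandardCharsets.US_ASCII);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));

        // Peek at the first few bytes, e.g. for type detection.
        String head = new String(peek(in, 4), StandardCharsets.US_ASCII);
        // The full content is still available afterwards.
        String all = new String(in.readAllBytes(), StandardCharsets.US_ASCII);

        System.out.println(head);
        System.out.println(all);
    }
}
```

The memory needed grows with the number of bytes covered by the mark, which is why this only suits small look-aheads and not a second full pass over a large file.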

In some cases, grabbing a fresh `InputStream` is actually cheaper than Tika
spooling to a temp file, but a caller currently has no way of expressing that.

So, before we make much more use of re-processing the whole input several
times (e.g. for the augmenting-parsers and fallback-parsers), we should provide
a way for callers to supply new InputStream instances on demand instead.
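
One possible shape for this, as a rough sketch only: the caller supplies a small factory, and each full pass opens its own stream from it. The interface name `InputStreamFactory`, its method `getInputStream`, and the class `FactorySketch` are assumptions made for illustration; this issue does not define an API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical callback: returns a fresh stream over the same content on each call. */
interface InputStreamFactory {
    InputStream getInputStream() throws IOException;
}

// Hypothetical illustration class, not part of Tika.
class FactorySketch {

    /** Stand-in for one full parsing pass; here it only counts the bytes read. */
    static long fullPass(InputStream in) throws IOException {
        long count = 0;
        byte[] buf = new byte[8192];
        int read;
        while ((read = in.read(buf)) != -1) {
            count += read;
        }
        return count;
    }

    /** Runs several full passes, asking the factory for a new stream each time. */
    static long[] process(InputStreamFactory factory, int passes) throws IOException {
        long[] sizes = new long[passes];
        for (int i = 0; i < passes; i++) {
            // No temp file and no mark/reset: each pass starts from a new stream.
            try (InputStream in = factory.getInputStream()) {
                sizes[i] = fullPass(in);
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "example document".getBytes(StandardCharsets.UTF_8);
        // In practice the factory might re-open a file or re-issue a request.
        InputStreamFactory factory = () -> new ByteArrayInputStream(data);

        long[] sizes = process(factory, 2);
        System.out.println(sizes[0] + " " + sizes[1]);
    }
}
```

With this shape the caller decides how a new stream is obtained, so the cost of re-reading stays with whoever knows whether re-opening is cheaper than spooling.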



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
