Hi, On Wed, Aug 31, 2011 at 2:25 PM, Michael McCandless <luc...@mikemccandless.com> wrote: > I think there are actually times when the other parse methods do close > the input, eg if the parser wraps the incoming InputStream in a > TikaInputStream, and then uses .getFile(), we copy the file contents > to a temp file and close the original InputStream right?
Right. The idea behind this is that we expect the client to call the close() method of the TikaInputStream instead of the wrapped InputStream. And since we assume that close() will (or should) eventually be called, we actually do so in advance in the getFile() method to release resources as soon as they are no longer needed (i.e. now the stream has been spooled to a temporary file). > If the incoming InputStream is wrapped as TikaInputStream but .getFile() > is not used, we in fact still close the incoming InputStream in the > afterRead() method once one of the read() methods returns -1 (EOF), if > the mark was not set? Yes, the afterRead() method is another example of trying to release the underlying stream as early as possible (i.e. when the stream has been fully exhausted and can't be rewinded anymore). Unfortunately the above assumptions actually don't always hold true. :-( See the discussion on TemporaryFiles below. > Also, separately, I'm confused by the TikaInputStream.get method that > takes a TemporaryFiles instance; the javadocs state: > > * Use this method instead of the {@link #get(InputStream)} alternative > * when you <em>don't</em> explicitly close the returned stream. The > * recommended access pattern is: > * <pre> > * TemporaryFiles tmp = new TemporaryFiles(); > * try { > * TikaInputStream stream = TikaInputStream.get(..., tmp); > * // process stream but don't close it > * } finally { > * tmp.dispose(); > * } > * </pre> > > What confuses me is the tmp.dispose() method only deletes any created > temporary files... it doesn't close any streams that had been opened > against those files? How is the temp file's stream closed in this > case (if .getFile() was called)? Are we relying on afterRead doing > the closing? Good point, that part of the logic still needs some thinking. The basic idea behind the TemporaryFiles construct is to make it easier to cast a given stream to a TikaInputStream instance in cases (like within a Parser or Detector implementation) where we shouldn't call the close() method of the given stream. If the given stream already is a TikaInputStream, we should use it as-is and rely on the caller to properly close() the stream (and thus release any related resources) when it's no longer used. But when we're given some other kind of a stream we need to create a new TikaInputStream instance that we do need to close when we're done using it. And remember that when closing such a new TikaInputStream instance, the underlying stream should *not* get closed. The relevant code would end up something complicated and error-prone like this: boolean isTikaInputStream = TikaInputStream.isTikaInputStream(stream); TikaInputStream tis; if (isTikaInputStream) { tis = TikaInputStream.get(stream); } else { tis = TikaInputStream.get(new CloseShieldInputStream(stream)); } try { // use the stream } finally { if (!isTikaInputStream) { tis.close(); } } The TemporaryFiles class was designed to address this problem in TIKA-567. The basic idea was that we'd use the TemporaryFiles object to track any extra resources (originally just temporary files) that a possible temporary TikaInputStream instance in the above case would generate. Then even if we don't call close() we could release all the resources associated with that temporary TikaInputStream with the code pattern shown in the javadoc snippet you quoted. That I think is a pretty nice improvement over the above code. Of course the problem here is that the current TemporaryFiles class only worries about temporary files, so it won't close the possible temporary stream created in getFile(). And even more worryingly the getFile() or afterRead() methods of the temporary TikaInputStream instance could still end up closing the underlying stream even though that's exactly what we're trying to avoid with this construct. BR, Jukka Zitting