On digging more into this I hit some questions/confusion: I think there are actually times when the other parse methods do close the input. E.g., if the parser wraps the incoming InputStream in a TikaInputStream and then calls .getFile(), we copy the stream's contents to a temp file and close the original InputStream, right? And if the incoming InputStream is wrapped as a TikaInputStream but .getFile() is not used, we in fact still close the incoming InputStream in the afterRead() method once one of the read() methods returns -1 (EOF), provided the mark was not set?
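
To make the EOF question concrete, here is a minimal sketch of the behavior as I understand it. This is not Tika's actual code: the class name AutoCloseOnEofStream and the isClosed() helper are made up for illustration, and I am assuming the close happens only when a read returns -1 and mark() was never called.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in for the behavior described above; NOT Tika's class.
class AutoCloseOnEofStream extends FilterInputStream {
    private boolean marked = false;  // set once mark() has been called
    private boolean closed = false;  // set once the wrapped stream is closed

    AutoCloseOnEofStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (closed) {
            return -1;  // already closed at EOF; keep reporting EOF
        }
        int b = in.read();
        afterRead(b);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (closed) {
            return -1;
        }
        int n = in.read(buf, off, len);
        afterRead(n);
        return n;
    }

    @Override
    public synchronized void mark(int readLimit) {
        in.mark(readLimit);
        marked = true;  // caller may reset() later, so do not auto-close
    }

    // Assumed rule: close the wrapped stream at EOF unless a mark was set.
    private void afterRead(int result) throws IOException {
        if (result == -1 && !marked) {
            close();
        }
    }

    @Override
    public void close() throws IOException {
        closed = true;
        in.close();
    }

    boolean isClosed() {
        return closed;
    }

    public static void main(String[] args) throws IOException {
        AutoCloseOnEofStream plain =
            new AutoCloseOnEofStream(new ByteArrayInputStream(new byte[] {1, 2}));
        while (plain.read() != -1) { /* drain to EOF */ }
        System.out.println("no mark, closed at EOF: " + plain.isClosed());

        AutoCloseOnEofStream withMark =
            new AutoCloseOnEofStream(new ByteArrayInputStream(new byte[] {1, 2}));
        withMark.mark(16);
        while (withMark.read() != -1) { /* drain to EOF */ }
        System.out.println("mark set, closed at EOF: " + withMark.isClosed());
    }
}
```

If Tika really behaves this way, then a caller who hands in an InputStream and reads it to the end gets it closed implicitly, while a caller who stops early (or whose parser set a mark) does not, which is the inconsistency I'm asking about.
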
Also, separately, I'm confused by the TikaInputStream.get method that takes a TemporaryFiles instance; the javadocs state:

     * Use this method instead of the {@link #get(InputStream)} alternative
     * when you <em>don't</em> explicitly close the returned stream. The
     * recommended access pattern is:
     * <pre>
     * TemporaryFiles tmp = new TemporaryFiles();
     * try {
     *     TikaInputStream stream = TikaInputStream.get(..., tmp);
     *     // process stream but don't close it
     * } finally {
     *     tmp.dispose();
     * }
     * </pre>

What confuses me is that the tmp.dispose() method only deletes any created temporary files; it doesn't close any streams that had been opened against those files. How is the temp file's stream closed in this case (if .getFile() was called)? Are we relying on afterRead doing the closing?

When I added an assert in TemporaryFiles.dispose (asserting that the file was in fact deleted), the assert caused various test failures on Windows, which I think is because somewhere we are not in fact closing the opened streams... I'll dig. This results in many leftover apache-tika-XXX.tmp files each time I run Tika's tests.

Even on Unix, where the delete will "work" (the OS will "delete on final close"), I still see leftover apache-tika-XXX.tmp files, which I think means somewhere we are failing to call .dispose(). It'd be nice to somehow assert, in each test's tearDown, that every created TemporaryFiles instance had its dispose method called.

Sorry for all the peppered questions... still coming up to speed / digging here,

Mike McCandless

http://blog.mikemccandless.com