On digging more into this I hit some questions/confusion:

I think there are actually times when the other parse methods do close
the input, e.g. if the parser wraps the incoming InputStream in a
TikaInputStream and then uses .getFile(), we copy the file contents
to a temp file and close the original InputStream, right?  And if the
incoming InputStream is wrapped as a TikaInputStream but .getFile() is
not used, we in fact still close the incoming InputStream in the
afterRead() method once one of the read() methods returns -1 (EOF), if
the mark was not set?
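
To make the second question concrete, here is a minimal sketch of the
close-on-EOF behavior as I understand it.  This is not Tika's code:
AutoClosingStream and TrackingStream are hypothetical stand-ins, and
only the afterRead name and the "no mark set" condition come from what
I described above.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class CloseOnEofSketch {

    /** Records whether close() was called so the effect is observable. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical wrapper: closes the wrapped stream at EOF unless mark() was called. */
    static class AutoClosingStream extends FilterInputStream {
        private boolean markSet = false;

        AutoClosingStream(InputStream in) {
            super(in);
        }

        @Override
        public synchronized void mark(int readlimit) {
            markSet = true;
            super.mark(readlimit);
        }

        @Override
        public int read() throws IOException {
            int n = super.read();
            afterRead(n);
            return n;
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            int n = super.read(b, off, len);
            afterRead(n);
            return n;
        }

        // Assumed behavior: EOF with no mark set closes the underlying stream.
        private void afterRead(int n) throws IOException {
            if (n == -1 && !markSet) {
                in.close();
            }
        }
    }

    /** Reads a small stream to EOF and reports whether the wrapped stream got closed. */
    static boolean closedAfterEof(boolean setMark) {
        TrackingStream raw = new TrackingStream(new byte[] {1, 2, 3});
        try {
            InputStream in = new AutoClosingStream(raw);
            if (setMark) {
                in.mark(16);
            }
            while (in.read() != -1) {
                // drain to EOF
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return raw.closed;
    }
}
```

If the real afterRead() works this way, a caller who reads to EOF
without marking gets the stream closed underneath them, while a caller
who set a mark does not.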

Also, separately, I'm confused by the TikaInputStream.get method that
takes a TemporaryFiles instance; the javadocs state:

     * Use this method instead of the {@link #get(InputStream)} alternative
     * when you <em>don't</em> explicitly close the returned stream. The
     * recommended access pattern is:
     * <pre>
     * TemporaryFiles tmp = new TemporaryFiles();
     * try {
     *     TikaInputStream stream = TikaInputStream.get(..., tmp);
     *     // process stream but don't close it
     * } finally {
     *     tmp.dispose();
     * }
     * </pre>

What confuses me is that tmp.dispose() only deletes any created
temporary files; it doesn't close any streams that had been opened
against those files.  How is the temp file's stream closed in this
case (if .getFile() was called)?  Are we relying on afterRead doing
the closing?

When I added an assert in TemporaryFiles.dispose (asserting that the
file was in fact deleted), it caused various test failures on
Windows, which I think is because somewhere we are not closing
the opened streams... I'll dig.  The failed deletes leave many leftover
apache-tika-XXX.tmp files each time I run Tika's tests.
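
Here is a rough standalone sketch of the effect I think I'm seeing,
using only java.io and no Tika classes.  DeleteWhileOpenSketch and its
method are hypothetical names; the "apache-tika-" prefix just mirrors
the leftover file names above.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class DeleteWhileOpenSketch {

    /**
     * Returns { deletedWhileOpen, goneAfterClose }: the result of File.delete()
     * while a stream is still open on the file, and whether the file is gone
     * once the stream has been closed and delete retried if needed.
     */
    static boolean[] deleteBeforeAndAfterClose() {
        try {
            File tmp = File.createTempFile("apache-tika-", ".tmp");
            InputStream in = new FileInputStream(tmp);
            boolean deletedWhileOpen;
            try {
                // Expected to return false on Windows (open handle blocks the
                // delete) and true on Unix (the name is simply unlinked).
                deletedWhileOpen = tmp.delete();
            } finally {
                in.close();
            }
            // With the stream closed, a retry should succeed on either platform.
            boolean deleted = deletedWhileOpen || tmp.delete();
            return new boolean[] { deletedWhileOpen, deleted && !tmp.exists() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that File.delete() reports failure only through its boolean
return value, so a dispose() that ignores it would fail silently; that
is why the assert was needed to surface the problem.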

Even on Unix, where the delete will "work" (the OS will "delete on
final close"), I still see apache-tika-XXX.tmp files leftover, which I
think means somewhere we are failing to call .dispose().  It'd be nice
to somehow assert, in each test tearDown, that any created
TemporaryFiles instance always had its dispose method called.
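
One possible shape for that check, purely as a sketch:
TrackedTemporaryFiles below is a hypothetical stand-in (not the real
TemporaryFiles class) that registers itself on creation and
unregisters in dispose(), so a tearDown could assert the live count is
back to zero.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DisposeTrackingSketch {

    /** Hypothetical stand-in for TemporaryFiles that tracks undisposed instances. */
    static class TrackedTemporaryFiles {
        // Instances that were created but have not yet been disposed
        private static final List<TrackedTemporaryFiles> LIVE =
                Collections.synchronizedList(new ArrayList<TrackedTemporaryFiles>());

        TrackedTemporaryFiles() {
            LIVE.add(this);
        }

        void dispose() {
            // The real class would delete its temp files here.
            LIVE.remove(this);
        }

        /** Number of instances created but not yet disposed. */
        static int undisposedCount() {
            return LIVE.size();
        }
    }
}
```

A test base class could then assert that
TrackedTemporaryFiles.undisposedCount() is 0 in tearDown.  The static
registry is the simplest thing that could work; it assumes tests do
not run concurrently in the same JVM.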

Sorry for all the peppered questions... still coming up to speed /
digging here,

Mike McCandless

http://blog.mikemccandless.com
