On Mon, 8 May 2023 13:46:55 GMT, Maurizio Cimadamore <mcimadam...@openjdk.org> wrote:
> This is a tricky intermittent failure on one of our async file channel test. > > While the logic of the test is a bit hard to follow, I believe the test is > supposed to work as follows: > > * in the main thread, we submit an initial async write on a buffer > * the completion handler of the async op will submit another async write > after one completes > * before/after each write we increment/decrement an atomic counter > * in the main thread waits 20 secs, to make sure the write buffers are full > and that there are some async writes actually awaiting for OS to serve them > * we then try to close the test arena - this would fail as there are pending > OS writes, and we keep the arena locked to prevent JVM crashes > * then the main test will set the `continueWriting` flag to false (which will > cause the handler to stop submitting new write requests). > * the main thread will then read what's left (so that the buffers will empty > and the OS can serve the outstanding writes) > * when there's no outstanding writes left, the main thread will close the > test scope. > > I think, after many hours spent staring the test that, when working as > intended, the test logic is correct. Each write is only submitted after the > previous one finished, and the test can only end when we see the number of > outstanding write to reach 0. For this to happen, we need to execute the > handler once when `continueWriting` is set to false (which will cause an > asymmetric decrement of the counter from the handler, which will match the > asymmetric increment outside the handler, in the main test thread). > > When trying to reduce the timeout which ensures that write buffers are full, > I started hitting the same exception as the one described in the bug report. > After some digging, I found that the exception was **not** caused (as I > thought) by some bad synchronization logic which allowed the main test to > close the arena before the handlers were actually finished with it. Instead, > the failure is caused by the test assertion which checks that the test arena > cannot be closed: > > > assertMessage(expectThrows(ISE, () -> drop.close()), "Session is acquired > by"); > > > This check is bogus: it assumes that the buffers are indeed full, and that > some OS write operation cannot progress. In that case, the underlying arena > will be kept alive (as the implementation wants to avoid a JVM crash > triggered by an OS write on an already freed region of memory). > > But, if the buffer is not full at this stage, there is nothing keeping the > test arena alive: note that the completion handler executes **after** the > arena acquire/release in `IOUtils`. So, if all write operations complete > normally, `drop::close` will actually succeed! > > At this point it's a coint toss as to whether we'll see a message because the > copletion handler tries to allocate on an already closed arena, or if we see > a message complaining about `drop::close` not failing as expected. > > While there are other ways to fix this, I think a simple fix is to actually > remove the assertion on `drop::close`. Note that if there's a bug in the > keepalive logic of the arena, the test would still fail (but with a JVM > crash). And, by avoiding a spurious call to `drop::close` we make sure that > the test always runs as intended whether or not the OS buffers are full. Great detective work, thanks! I think the fix is good. We already have other tests that try to detect acquire/release issue. ------------- Marked as reviewed by jvernee (Reviewer). PR Review: https://git.openjdk.org/jdk/pull/13866#pullrequestreview-1417085971