New submission from STINNER Victor:

On Windows, using the IOCP (proactor) event loop, I noticed race conditions when running the Trollius test suite. For example, the returncode of a process is sometimes None, which should never happen. It looks like wait_for_handle() behaves incorrectly.
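The invariant mentioned above (a process that has been waited for must have a returncode) can be checked with a short script. This is only a minimal sketch, not the Trollius test: it assumes Python 3.8 or later, where the proactor loop is the default event loop on Windows, and it uses the modern asyncio.run() and async/await API, which postdates this report. The names run_child and check_returncode are hypothetical.

```python
import asyncio
import sys


async def run_child():
    # Spawn a trivial child process and wait for it to exit.
    proc = await asyncio.create_subprocess_exec(sys.executable, "-c", "pass")
    await proc.wait()
    # Once wait() has returned, returncode is expected to be an int.
    return proc.returncode


def check_returncode():
    # Hypothetical helper: runs the child once and checks the invariant.
    returncode = asyncio.run(run_child())
    assert returncode is not None, "returncode is None after wait()"
    return returncode
```

Because the reported failure is a race, a single run of such a check is unlikely to reproduce it; it would probably have to be repeated many times on Windows.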
When I run the asyncio test suite in debug mode (PYTHONASYNCIODEBUG=1), I sometimes see the message "GetQueuedCompletionStatus() returned an unexpected event", which should never occur either.

I added debug traces. I saw that IocpProactor.wait_for_handle() later calls PostQueuedCompletionStatus() through its internal C callback (PostToQueueCallback). It looks like the callback is sometimes called even though the wait was cancelled/acked with UnregisterWait(). I didn't understand the logic between RegisterWaitForSingleObject(), UnregisterWait() and the callback.

It looks like the overlapped object created in Python ("ov = _overlapped.Overlapped(NULL)") is sometimes destroyed before PostToQueueCallback() is called. In the unit tests, it doesn't crash because a different overlapped object is created and it gets the same memory address (probably because we are lucky!).

The current implementation of wait_for_handle() has an optimization: it immediately polls the wait to check whether it has already completed. I tried to remove it, but I got some different issues. If I understood correctly, this optimization hides other bugs and reduces the probability of hitting the race condition.

wait_for_handle() is used to wait for the completion of a subprocess, so it is exercised by all unit tests that run subprocesses, but also by the test_wait_for_handle() and test_wait_for_handle_cancel() tests. I suspect that running test_wait_for_handle() or test_wait_for_handle_cancel() triggers the bug.

Note: Removing "_winapi.CloseHandle(self._iocp)" in IocpProactor.close() works around the bug. The bug appears to be an unexpected call to PostToQueueCallback(), which calls PostQueuedCompletionStatus() on an IOCP. Not closing the IOCP means that each test uses a different IOCP, so the unexpected call to PostQueuedCompletionStatus() has no effect on the following tests.

--

I rewrote some parts of the IOCP code in asyncio. Maybe I introduced this issue during the refactoring.
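The suspected sequence (a wait is cancelled, yet its callback still posts a completion packet afterwards) can be modelled without any Windows API. The sketch below is a toy model, not the asyncio implementation: FakeCompletionPort, FakeProactor and run_scenario are hypothetical names, the address 0x1000 is arbitrary, and a plain queue stands in for the IOCP.

```python
import queue


class FakeCompletionPort:
    """Toy stand-in for an IOCP: a FIFO of posted overlapped addresses."""

    def __init__(self):
        self._packets = queue.SimpleQueue()

    def post(self, address):
        self._packets.put(address)

    def get(self):
        try:
            return self._packets.get_nowait()
        except queue.Empty:
            return None


class FakeProactor:
    """Tracks pending operations by overlapped address (hypothetical model)."""

    def __init__(self, port):
        self._port = port
        self._pending = {}

    def register(self, address, name):
        self._pending[address] = name

    def unregister(self, address):
        # Models cancelling the wait and releasing the overlapped object.
        # Nothing here stops a callback that already captured the address.
        self._pending.pop(address, None)

    def poll(self):
        address = self._port.get()
        if address is None:
            return "idle"
        name = self._pending.pop(address, None)
        if name is None:
            return "unexpected event"
        return "completed " + name


def run_scenario(reuse_address):
    port = FakeCompletionPort()
    proactor = FakeProactor(port)
    address = 0x1000  # arbitrary illustrative address

    proactor.register(address, "wait A")

    def late_callback():
        # Stands in for a callback that is still pending for wait A.
        port.post(address)

    proactor.unregister(address)  # wait A is cancelled

    if reuse_address:
        # A new overlapped object happens to get the same address.
        proactor.register(address, "wait B")

    late_callback()  # the stale callback fires after the cancellation
    return proactor.poll()
```

In this model, run_scenario(False) yields "unexpected event", which resembles the debug message quoted above, while run_scenario(True) attributes the stale packet to the unrelated "wait B". Whether the real code fails in exactly this way is an assumption; the model only shows that such a sequence would be consistent with the symptoms.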
Maybe it already existed but nobody noticed it, since asyncio had fewer unit tests before.

At the beginning, I wanted to fix this crash:
https://code.google.com/p/tulip/issues/detail?id=195
"_WaitHandleFuture.cancel() crash if the wait event was already unregistered"

Later, I tried to make the code more reliable in this issue:
https://code.google.com/p/tulip/issues/detail?id=196
"_OverlappedFuture.set_result() should clear the its reference to the overlapped object"

See the Trollius 1.0.1 changelog, which lists these changes:
http://trollius.readthedocs.org/changelog.html#version-1-0-1

--

Note: The IOCP code still has parts that could be improved:

- "Investigate IocpProactor.accept_pipe() special case (don't register overlapped)"
  https://code.google.com/p/tulip/issues/detail?id=204

- "Rewrite IocpProactor.connect_pipe() with non-blocking calls to avoid non interruptible QueueUserWorkItem()"
  https://code.google.com/p/tulip/issues/detail?id=197

----------
components: Windows, asyncio
messages: 232987
nosy: gvanrossum, haypo, steve.dower, tim.golden, yselivanov, zach.ware
priority: normal
severity: normal
status: open
title: asyncio: race condition in the IOCP code (proactor event loop)
versions: Python 3.4, Python 3.5

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue23095>
_______________________________________