On 16/07/2015 21:05, Richard W.M. Jones wrote:
>
> Sorry to spoil things, but I'm still seeing this bug, although it is
> now a lot less frequent with your patch.  I would estimate it happens
> more often than 1 in 5 runs with qemu.git, and probably 1 in 200 runs
> with qemu.git + the v2 patch series.
>
> It's the exact same hang in both cases.
>
> Is it possible that this patch doesn't completely close any race?
>
> Still, it is an improvement, so there is that.
Would seem at first glance like a different bug.  Interestingly, adding
some "tracing" (qemu_clock_get_ns) makes the bug more likely: now it
reproduces in about 10 tries.  Of course :) adding other kinds of
tracing instead makes it go away again (>50 tries).

Perhaps this:

   i/o thread         vcpu thread                   worker thread
   ---------------------------------------------------------------------
   lock_iothread
   notify_me = 1
   ...
   unlock_iothread
                      lock_iothread
                      notify_me = 3
                      ppoll
                      notify_me = 1
                                                    bh->scheduled = 1
                                                    event_notifier_set
                      event_notifier_test_and_clear
   ppoll
     ^^ hang

In the exact shape above, it doesn't seem too likely to happen, but
perhaps there's another simpler case.  Still, the bug exists.

The above is not really related to notify_me.  Here the notification is
not being optimized away!  So I wonder if this one has been there forever.

Fam suggested putting the event_notifier_test_and_clear before
aio_bh_poll(), but it does not work.  I'll look more closely.

However, an unconditional event_notifier_test_and_clear is pretty
expensive.  On one hand, obviously correctness comes first.  On the
other hand, an expensive operation at the wrong place can mask the race
very easily; I'll let the fix run for a while, but I'm not sure if a
successful test really says anything useful.

Paolo