On Tue, 29 Oct 2024 at 08:42, Akihiko Odaki <akihiko.od...@daynix.com> wrote:
> On 2024/10/29 6:06, Phil Dennis-Jordan wrote:
> > On Mon, 28 Oct 2024 at 17:06, Akihiko Odaki <akihiko.od...@daynix.com> wrote:
> > > On 2024/10/28 23:13, Phil Dennis-Jordan wrote:
> > > > On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki <akihiko.od...@daynix.com> wrote:
> > > > > On 2024/10/28 22:31, Phil Dennis-Jordan wrote:
> > > > > > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan <p...@philjordan.eu> wrote:
> > > > > > > > >
> > > > > > > > > Hmm. I think if we were to use that, we would need to create a new
> > > > > > > > > QemuEvent for every job and destroy it afterward, which seems
> > > > > > > > > expensive. We can't rule out multiple concurrent jobs being
> > > > > > > > > submitted, and the QemuEvent system only supports a single producer
> > > > > > > > > as far as I can tell.
> > > > > > > > >
> > > > > > > > > You can probably sort of hack around it with just one QemuEvent by
> > > > > > > > > putting the qemu_event_wait into a loop and turning the job.done
> > > > > > > > > flag into an atomic (because it would now need to be checked outside
> > > > > > > > > the lock) but this all seems unnecessarily complicated considering
> > > > > > > > > the QemuEvent uses the same mechanism QemuCond/QemuMutex internally
> > > > > > > > > on macOS (the only platform relevant here), except we can use it as
> > > > > > > > > intended with QemuCond/QemuMutex rather than having to work against
> > > > > > > > > the abstraction.
> > > > > > > >
> > > > > > > > I don't think it's going to be used concurrently. It would be
> > > > > > > > difficult to reason even for the framework if it performs memory
> > > > > > > > unmapping/mapping/reading operations concurrently.
> > > > > >
> > > > > > I've just performed a very quick test by wrapping the job submission/wait
> > > > > > in the 2 mapMemory callbacks and the 1 readMemory callback with atomic
> > > > > > counters and logging whenever a counter went above 1.
> > > > > >
> > > > > > * Overall, concurrent callbacks across all types were common (many per
> > > > > > second when the VM is busy). It's not exactly a "thundering herd" (I
> > > > > > never saw >2) but it's probably not a bad idea to use a separate
> > > > > > condition variable for each job type. (task map, surface map, memory read)
> > > > > > * While I did not observe any concurrent memory mapping operations
> > > > > > *within* a type of memory map (2 task mappings or 2 surface mappings) I
> > > > > > did see very occasional concurrent memory *read* callbacks. These would,
> > > > > > as far as I can tell, not be safe with QemuEvents, unless we placed the
> > > > > > event inside the job struct and init/destroyed it on every callback
> > > > > > (which seems like excessive overhead).
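For anyone following along, the instrumentation for that quick test was roughly of
the following shape - a simplified sketch with made-up names (task_map_inflight,
apple_gfx_host_map_task), not the code from the series:

#include "qemu/osdep.h"
#include "qemu/atomic.h"

/* One in-flight counter per callback type: task map, surface map, memory read. */
static int task_map_inflight;

static void apple_gfx_host_map_task(void)
{
    /* qatomic_fetch_inc() returns the previous value; > 0 means another
     * callback of this type is currently between submit and completion. */
    if (qatomic_fetch_inc(&task_map_inflight) > 0) {
        fprintf(stderr, "concurrent task-map callbacks observed\n");
    }

    /* ... create the job, hand it to the QEMU thread, wait for completion ... */

    qatomic_fetch_dec(&task_map_inflight);
}

The same wrapper, repeated for the other two callback types, is what produced the
observations above.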
> > > > > I think we can tolerate that overhead. init/destroy essentially sets the
> > > > > fields in the data structure and I estimate its total size is about 100
> > > > > bytes. It is probably better than waking an irrelevant thread up. I also
> > > > > hope that keeps the code simple; it's not worthwhile adding code to
> > > > > optimize this.
> > > >
> > > > At least pthread_cond_{init,destroy} and pthread_mutex_{init,destroy}
> > > > don't make any syscalls, so yeah it's probably an acceptable overhead.
> > > >
> > > > I've just experimented with QemuEvents created on-demand and ran into
> > > > some weird deadlocks, which then made me sit down and think about it
> > > > some more. I've come to the conclusion that creating (and crucially,
> > > > destroying) QemuEvents on demand in this way is not safe.
> > > >
> > > > Specifically, you must not call qemu_event_destroy() - which transitively
> > > > destroys the mutex and condition variable - unless you can guarantee that
> > > > the qemu_event_set() call on that event object has completed.
> > > >
> > > > In qemu_event_set, the event object's value is atomically set to EV_SET.
> > > > If the previous value was EV_BUSY, qemu_futex_wake() is called. All of
> > > > this is outside any mutex, however, so apart from memory coherence (there
> > > > are barriers) this can race with the waiting thread. qemu_event_wait()
> > > > reads the event's value. If EV_FREE, it's atomically set to EV_BUSY. Then
> > > > the mutex is locked, the value is checked again, and if it's still
> > > > EV_BUSY, it waits for the condition variable, otherwise the mutex is
> > > > immediately unlocked again. If the trigger thread's qemu_event_set() flip
> > > > to EV_SET occurs between the waiting thread's two atomic reads of the
> > > > value, the waiting thread will never wait for the condition variable, but
> > > > the trigger thread WILL try to acquire the mutex and signal the condition
> > > > variable in qemu_futex_wake(), by which time the waiting thread may have
> > > > advanced outside of qemu_event_wait().
> > >
> > > Sorry if I'm making a mistake again, but the waiting thread won't set to
> > > EV_BUSY unless the value is EV_FREE on the second read so the trigger
> > > thread will not call qemu_futex_wake() if it manages to set to EV_SET
> > > before the second read, will it?
> >
> > This sequence of events will cause the problem:
> >
> > WAITER (in qemu_event_wait):
> > value = qatomic_load_acquire(&ev->value);
> > -> EV_FREE
> >
> > TRIGGER (in qemu_event_set):
> > qatomic_read(&ev->value) != EV_SET
> > -> EV_FREE (condition is false)
> >
> > WAITER:
> > qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET
> > -> cmpxchg returns EV_FREE, condition false. ev->value = EV_BUSY.
> >
> > TRIGGER:
> > int old = qatomic_xchg(&ev->value, EV_SET);
> > smp_mb__after_rmw();
> > if (old == EV_BUSY) {
> > -> old = EV_BUSY, condition true. ev->value = EV_SET
> >
> > WAITER (in qemu_futex_wait(ev, EV_BUSY)):
> > pthread_mutex_lock(&ev->lock);
> > if (ev->value == val) {
> > -> false, because value is EV_SET
> >
> > WAITER:
> > pthread_mutex_unlock(&ev->lock);
> > …
> > qemu_event_destroy(&job->done_event);
> >
> > TRIGGER (in qemu_futex_wake(ev, INT_MAX)):
> > pthread_mutex_lock(&ev->lock);
> > -> hangs, because mutex has been destroyed
>
> Thanks for clarification. This is very insightful.
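To make the failure mode concrete: the ephemeral-event pattern we had been
discussing looks roughly like this (a hypothetical sketch - names such as MapJob,
submit_and_wait and complete_job are made up, this is not code from the series),
and it is precisely the qemu_event_destroy() straight after qemu_event_wait()
that can race with a trigger thread still inside qemu_event_set():

#include "qemu/osdep.h"
#include "qemu/thread.h"

typedef struct MapJob {
    /* ... job parameters and results ... */
    QemuEvent done_event;
} MapJob;

/* PVG callback thread: create the event, hand the job over, wait, destroy. */
static void submit_and_wait(MapJob *job)
{
    qemu_event_init(&job->done_event, false);

    /* ... schedule the job on a QEMU thread, e.g. via a bottom half ... */

    qemu_event_wait(&job->done_event);
    /* UNSAFE: the QEMU thread's qemu_event_set() may still be inside
     * qemu_futex_wake(), about to lock the mutex we are destroying here. */
    qemu_event_destroy(&job->done_event);
}

/* QEMU thread: signal completion once the job has been processed. */
static void complete_job(MapJob *job)
{
    qemu_event_set(&job->done_event);
}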
> > > > This is all fine usually, BUT if you destroy the QemuEvent immediately
> > > > after the qemu_event_wait() call, qemu_futex_wake() may try to lock a
> > > > mutex that has been destroyed, or signal a condition variable which has
> > > > been destroyed. I don't see a reasonable way of making this safe other
> > > > than using long-lived mutexes and condition variables. And anyway, we
> > > > have much, MUCH bigger contention/performance issues coming from almost
> > > > everything being covered by the BQL. (If waking these callbacks can even
> > > > be considered an issue: I haven't seen it show up in profiling, whereas
> > > > BQL contention very much does.)
> > > >
> > > > I'll submit v5 of this patch set with separate condition variables for
> > > > each job type. This should make the occurrence of waking the wrong
> > > > thread quite rare, while reasoning about correctness is pretty
> > > > straightforward. I think that's good enough.
> > >
> > > What about using QemuSemaphore then? It does not seem to have the problem
> > > same with QemuEvent.
> >
> > Nowhere else in the code base uses short-lived semaphores, and while I
> > can't immediately see a risk (the mutex is held during both post and wait)
> > there might be some non-obvious problem with the approach. Internally, the
> > semaphores use condition variables. The solution using condition variables
> > directly already works, is safe, relatively easy to reason about, and does
> > not cause any performance issues. There is a tiny inefficiency about waking
> > up a thread unnecessarily in the rare case when two callbacks of the same
> > kind occur concurrently. In practice, it's irrelevant. Thanks to the
> > awkward mismatch of the PVGraphics.framework's libdispatch based approach
> > and Qemu's BQL/AIO/BH approach, we are already sending messages to other
> > threads very frequently. This isn't ideal, but not fixable without
> > drastically reducing the need to acquire the BQL across Qemu.
>
> I found several usage of ephemeral semaphores:
> h_random() in hw/ppc/spapr_rng.c
> colo_process_checkpoint() in migration/colo.c
> postcopy_thread_create() in migration/postcopy-ram.c
>
> I'm sure short-lived semaphores will keep working (or break migration in
> strange ways).
>
> > I do not think it is worth spending even more time trying to fix this part
> > of the code which isn't broken in the first place.
>
> I'm sorry to bring you to this mess, which I didn't really expect. I thought
> combining a shared pair of conditional variable and mutex and job-specific
> bools is unnecessarily complex, and having one synchronization primitive for
> each job will be simpler and will just work.

With multithreading, the devil is always in the detail! 😅 I wouldn't mind if we
were seeing genuine issues with the Mutex/Cond code, but it's fine as far as I
can tell.
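For reference, the shape of what the patches keep (and what v5 refines with one
condition variable per job type) is roughly the following - a simplified sketch
with made-up identifiers, not the literal patch code:

#include "qemu/osdep.h"
#include "qemu/thread.h"

typedef struct AppleGFXTaskJob {
    /* ... job parameters and results ... */
    bool done;
} AppleGFXTaskJob;

/* Long-lived primitives, initialised once when the device is realized. */
static QemuMutex job_mutex;
static QemuCond task_map_job_cond;   /* v5: one of these per job type */

/* PVG callback thread: block until the QEMU side has finished the job. */
static void await_job(AppleGFXTaskJob *job)
{
    qemu_mutex_lock(&job_mutex);
    while (!job->done) {
        qemu_cond_wait(&task_map_job_cond, &job_mutex);
    }
    qemu_mutex_unlock(&job_mutex);
}

/* QEMU thread: mark the job done and wake any waiter. */
static void finish_job(AppleGFXTaskJob *job)
{
    qemu_mutex_lock(&job_mutex);
    job->done = true;
    qemu_cond_broadcast(&task_map_job_cond);
    qemu_mutex_unlock(&job_mutex);
}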
The QemuEvent version wasn't really any simpler either: swapping bool done; for
QemuEvent done_event; makes the wait side longer (init/wait/destroy instead of a
single await) while the signalling side gets shorter (set instead of
lock/broadcast/unlock), and I'd guess a QemuSemaphore version would come out
about the same. Relying on how an edge case is handled - destroying the
primitive immediately after waiting on it - also potentially makes the code more
fragile in the long term, in case implementation details change. I think we've
reached the bikeshedding stage here, so I suggest that any further improvements
to this part, other than bug fixes, be deferred to future patches.

> However, there was a pitfall with QemuEvent as you demonstrated and now I
> grep QemuEvent and QemuSemaphore to find all such ephemeral usage is written
> with QemuSemaphore instead of QemuEvent. I think the critical problem here is
> that it is not documented that qemu_event_destroy() cannot be used
> immediately after qemu_event_wait(). We would not run into this situation if
> it is written. I will write a patch to add such a documentation comment.

Sounds good, I'd be happy to review it (cc me).

Thanks for the in-depth reviews on all the patches in this set! You've prompted
me to make some significant improvements, even if the experience with the job
signalling has ended up being a frustrating one. The code is otherwise
definitely better now than before. I've just posted v5 of the patch set, and I
hope we're pretty close to "good enough" now.

All the best,
Phil