On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki <akihiko.od...@daynix.com> wrote:
> On 2024/10/28 22:31, Phil Dennis-Jordan wrote:
> > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan <p...@philjordan.eu> wrote:
> >
> > > > > > Hmm. I think if we were to use that, we would need to create
> > > > > > a new QemuEvent for every job and destroy it afterward, which
> > > > > > seems expensive. We can't rule out multiple concurrent jobs
> > > > > > being submitted, and the QemuEvent system only supports a
> > > > > > single producer as far as I can tell.
> > > > > >
> > > > > > You can probably sort of hack around it with just one
> > > > > > QemuEvent by putting the qemu_event_wait into a loop and
> > > > > > turning the job.done flag into an atomic (because it would
> > > > > > now need to be checked outside the lock), but this all seems
> > > > > > unnecessarily complicated considering that QemuEvent uses the
> > > > > > same mechanism as QemuCond/QemuMutex internally on macOS (the
> > > > > > only platform relevant here), except we can use it as
> > > > > > intended with QemuCond/QemuMutex rather than having to work
> > > > > > against the abstraction.
> > > > >
> > > > > I don't think it's going to be used concurrently. It would be
> > > > > difficult to reason even for the framework if it performs
> > > > > memory unmapping/mapping/reading operations concurrently.
> > > >
> > > > I've just performed a very quick test by wrapping the job
> > > > submission/wait in the 2 mapMemory callbacks and the 1 readMemory
> > > > callback with atomic counters and logging whenever a counter went
> > > > above 1.
> > > >
> > > > * Overall, concurrent callbacks across all types were common
> > > >   (many per second when the VM is busy). It's not exactly a
> > > >   "thundering herd" (I never saw >2), but it's probably not a bad
> > > >   idea to use a separate condition variable for each job type
> > > >   (task map, surface map, memory read).
> > > > * While I did not observe any concurrent memory mapping
> > > >   operations *within* a type of memory map (2 task mappings or
> > > >   2 surface mappings), I did see very occasional concurrent
> > > >   memory *read* callbacks. These would, as far as I can tell, not
> > > >   be safe with QemuEvents, unless we placed the event inside the
> > > >   job struct and init/destroyed it on every callback (which seems
> > > >   like excessive overhead).
> > >
> > > I think we can tolerate that overhead. init/destroy essentially
> > > sets the fields in the data structure, and I estimate its total
> > > size is about 100 bytes. It is probably better than waking an
> > > irrelevant thread up. I also hope that keeps the code simple; it's
> > > not worthwhile adding code to optimize this.
> >
> > At least pthread_cond_{init,destroy} and pthread_mutex_{init,destroy}
> > don't make any syscalls, so yeah, it's probably an acceptable
> > overhead.
> >
> > I've just experimented with QemuEvents created on demand and ran into
> > some weird deadlocks, which then made me sit down and think about it
> > some more. I've come to the conclusion that creating (and crucially,
> > destroying) QemuEvents on demand in this way is not safe.
> >
> > Specifically, you must not call qemu_event_destroy() - which
> > transitively destroys the mutex and condition variable - unless you
> > can guarantee that the qemu_event_set() call on that event object has
> > completed.
> >
> > In qemu_event_set, the event object's value is atomically set to
> > EV_SET. If the previous value was EV_BUSY, qemu_futex_wake() is
> > called.
> > All of this happens outside any mutex, however, so apart from memory
> > coherence (there are barriers) it can race with the waiting thread.
> >
> > qemu_event_wait() reads the event's value. If it is EV_FREE, it is
> > atomically set to EV_BUSY. Then the mutex is locked, the value is
> > checked again, and if it is still EV_BUSY, the thread waits for the
> > condition variable; otherwise the mutex is immediately unlocked
> > again.
> >
> > If the trigger thread's qemu_event_set() flip to EV_SET occurs
> > between the waiting thread's two atomic reads of the value, the
> > waiting thread will never wait for the condition variable, but the
> > trigger thread WILL try to acquire the mutex and signal the condition
> > variable in qemu_futex_wake(), by which time the waiting thread may
> > have advanced outside of qemu_event_wait().
>
> Sorry if I'm making a mistake again, but the waiting thread won't set
> the value to EV_BUSY unless it is EV_FREE on the second read, so the
> trigger thread will not call qemu_futex_wake() if it manages to set it
> to EV_SET before the second read, will it?

This sequence of events will cause the problem:

WAITER (in qemu_event_wait):
    value = qatomic_load_acquire(&ev->value);
        -> EV_FREE

TRIGGER (in qemu_event_set):
    qatomic_read(&ev->value) != EV_SET
        -> reads EV_FREE, which is not EV_SET, so the body of the if is
           entered

WAITER:
    qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET
        -> cmpxchg returns EV_FREE, so the condition is false and we do
           not return early; ev->value is now EV_BUSY

TRIGGER:
    int old = qatomic_xchg(&ev->value, EV_SET);
    smp_mb__after_rmw();
    if (old == EV_BUSY) {
        -> old == EV_BUSY, condition true; ev->value is now EV_SET

WAITER (in qemu_futex_wait(ev, EV_BUSY)):
    pthread_mutex_lock(&ev->lock);
    if (ev->value == val) {
        -> false, because the value is EV_SET, so we don't wait

WAITER:
    pthread_mutex_unlock(&ev->lock);
    …
    qemu_event_destroy(&job->done_event);

TRIGGER (in qemu_futex_wake(ev, INT_MAX)):
    pthread_mutex_lock(&ev->lock);
        -> hangs, because the mutex has been destroyed

> > This is all fine usually, BUT if you destroy the QemuEvent
> > immediately after the qemu_event_wait() call, qemu_futex_wake() may
> > try to lock a mutex that has been destroyed, or signal a condition
> > variable which has been destroyed. I don't see a reasonable way of
> > making this safe other than using long-lived mutexes and condition
> > variables. And anyway, we have much, MUCH bigger
> > contention/performance issues coming from almost everything being
> > covered by the BQL. (If waking these callbacks can even be considered
> > an issue: I haven't seen it show up in profiling, whereas BQL
> > contention very much does.)
> >
> > I'll submit v5 of this patch set with separate condition variables
> > for each job type. This should make the occurrence of waking the
> > wrong thread quite rare, while reasoning about correctness is pretty
> > straightforward. I think that's good enough.
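
To make the v5 plan a little more concrete, the shape I have in mind is
roughly the following. This is only a sketch: AppleGFXJob, job_mutex,
the per-type condition variables and the two helper functions are
placeholder names for illustration, not the actual patch code.

/*
 * Sketch only - placeholder names, not the actual apple-gfx patch code.
 * The mutex and the per-job-type condition variables are long-lived:
 * initialised once (e.g. at device realize, with qemu_mutex_init() and
 * qemu_cond_init()) and never destroyed while callbacks can still
 * fire, so the destroy-while-waking race described above cannot occur.
 */
#include "qemu/osdep.h"
#include "qemu/thread.h"

typedef struct AppleGFXJob {
    bool done;
} AppleGFXJob;

static QemuMutex job_mutex;            /* protects the done flags       */
static QemuCond  task_map_job_cond;    /* one condition variable per    */
static QemuCond  surface_map_job_cond; /* job type, so a completion     */
static QemuCond  mem_read_job_cond;    /* rarely wakes unrelated waiters */

/* Runs on the thread that submitted the job and needs its result. */
static void apple_gfx_await_job(AppleGFXJob *job, QemuCond *cond)
{
    qemu_mutex_lock(&job_mutex);
    while (!job->done) {
        qemu_cond_wait(cond, &job_mutex);
    }
    qemu_mutex_unlock(&job_mutex);
}

/* Runs on the thread that performed the work once the job is finished. */
static void apple_gfx_complete_job(AppleGFXJob *job, QemuCond *cond)
{
    qemu_mutex_lock(&job_mutex);
    job->done = true;
    /*
     * Broadcast rather than signal: in the rare case of two jobs of the
     * same type in flight, both waiters re-check their own job->done
     * and only the right one proceeds; the other goes back to sleep.
     */
    qemu_cond_broadcast(cond);
    qemu_mutex_unlock(&job_mutex);
}

Because nothing here is ever destroyed while a completion might still be
signalling, the qemu_event_destroy() hazard above simply doesn't arise,
and the only cost of sharing a condition variable within a job type is
the occasional spurious wakeup.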