On 2024/10/29 6:06, Phil Dennis-Jordan wrote:


On Mon, 28 Oct 2024 at 17:06, Akihiko Odaki <akihiko.od...@daynix.com> wrote:

    On 2024/10/28 23:13, Phil Dennis-Jordan wrote:
     >
     >
     > On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki <akihiko.od...@daynix.com> wrote:
     >
     >     On 2024/10/28 22:31, Phil Dennis-Jordan wrote:
     >      >
     >      >
     >      > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan <p...@philjordan.eu> wrote:
     >      >
     >      >
     >      >          >      >
     >      >          >      > Hmm. I think if we were to use that, we
     >      >          >      > would need to create a new QemuEvent for
     >      >          >      > every job and destroy it afterward, which
     >      >          >      > seems expensive. We can't rule out multiple
     >      >          >      > concurrent jobs being submitted, and the
     >      >          >      > QemuEvent system only supports a single
     >      >          >      > producer as far as I can tell.
     >      >          >      >
     >      >          >      > You can probably sort of hack around it
     >      >          >      > with just one QemuEvent by putting the
     >      >          >      > qemu_event_wait into a loop and turning
     >      >          >      > the job.done flag into an atomic (because
     >      >          >      > it would now need to be checked outside
     >      >          >      > the lock) but this all seems unnecessarily
     >      >          >      > complicated considering that QemuEvent
     >      >          >      > uses the same mechanism as QemuCond/
     >      >          >      > QemuMutex internally on macOS (the only
     >      >          >      > platform relevant here), except we can use
     >      >          >      > it as intended with QemuCond/QemuMutex
     >      >          >      > rather than having to work against the
     >      >          >      > abstraction.
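
For illustration, the single-QemuEvent workaround described in the quoted
paragraph above would look roughly like this (a sketch only; the Job struct,
its atomic 'done' flag and the helper names are assumptions, not code from
the patch series; assumes "qemu/thread.h" and "qemu/atomic.h"):

    /* One long-lived event shared by all jobs, initialised once at startup. */
    static QemuEvent job_done_event;

    static void wait_for_job(Job *job)   /* Job carries an atomic 'done' flag */
    {
        while (!qatomic_load_acquire(&job->done)) {
            qemu_event_reset(&job_done_event);
            /* Re-check after the reset so a set() racing with it is not lost. */
            if (qatomic_load_acquire(&job->done)) {
                break;
            }
            qemu_event_wait(&job_done_event);
        }
    }

    static void mark_job_done(Job *job)
    {
        qatomic_store_release(&job->done, true);
        qemu_event_set(&job_done_event);
    }

As noted above, every completion wakes every waiter, which is what makes this
feel like working against the abstraction.
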
     >      >          >
     >      >          >     I don't think it's going to be used
     >      >          >     concurrently. It would be difficult to reason
     >      >          >     even for the framework if it performs memory
     >      >          >     unmapping/mapping/reading operations
     >      >          >     concurrently.
     >      >          >
     >      >          >
     >      >          > I've just performed a very quick test by wrapping
     >      >          > the job submission/wait in the 2 mapMemory
     >      >          > callbacks and the 1 readMemory callback with
     >      >          > atomic counters and logging whenever a counter
     >      >          > went above 1.
     >      >          >
     >      >          >   * Overall, concurrent callbacks across all types
     >      >          > were common (many per second when the VM is busy).
     >      >          > It's not exactly a "thundering herd" (I never
     >      >          > saw >2) but it's probably not a bad idea to use a
     >      >          > separate condition variable for each job type.
     >      >          > (task map, surface map, memory read)
     >      >          >   * While I did not observe any concurrent memory
     >      >          > mapping operations *within* a type of memory map
     >      >          > (2 task mappings or 2 surface mappings) I did see
     >      >          > very occasional concurrent memory *read* callbacks.
     >      >          > These would, as far as I can tell, not be safe with
     >      >          > QemuEvents, unless we placed the event inside the
     >      >          > job struct and init/destroyed it on every callback
     >      >          > (which seems like excessive overhead).
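
For reference, the instrumentation for such a quick test can be as small as
an in-flight atomic counter per callback type; a sketch, with an illustrative
counter name:

    static int read_memory_inflight;   /* one counter per callback type */

    /* At the top of the readMemory callback: */
    if (qatomic_fetch_inc(&read_memory_inflight) > 0) {
        fprintf(stderr, "concurrent readMemory callbacks observed\n");
    }
    /* ... submit the job and wait for its completion ... */
    qatomic_dec(&read_memory_inflight);
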
     >      >
     >      >         I think we can tolerate that overhead. init/destroy
     >      >         essentially sets the fields in the data structure and
     >      >         I estimate its total size is about 100 bytes. It is
     >      >         probably better than waking an irrelevant thread up.
     >      >         I also hope that keeps the code simple; it's not
     >      >         worthwhile adding code to optimize this.
     >      >
     >      >
     >      >     At least pthread_cond_{init,destroy} and
     >      >     pthread_mutex_{init,destroy} don't make any syscalls, so
     >      >     yeah it's probably an acceptable overhead.
     >      >
     >      >
     >      > I've just experimented with QemuEvents created on-demand and
     >      > ran into some weird deadlocks, which then made me sit down
     >      > and think about it some more. I've come to the conclusion
     >      > that creating (and crucially, destroying) QemuEvents on
     >      > demand in this way is not safe.
     >      >
     >      > Specifically, you must not call qemu_event_destroy() - which
     >      > transitively destroys the mutex and condition variable -
     >      > unless you can guarantee that the qemu_event_set() call on
     >      > that event object has completed.
     >      >
     >      > In qemu_event_set, the event object's value is atomically set
     >      > to EV_SET. If the previous value was EV_BUSY,
     >      > qemu_futex_wake() is called. All of this is outside any
     >      > mutex, however, so apart from memory coherence (there are
     >      > barriers) this can race with the waiting thread.
     >      > qemu_event_wait() reads the event's value. If EV_FREE, it's
     >      > atomically set to EV_BUSY. Then the mutex is locked, the
     >      > value is checked again, and if it's still EV_BUSY, it waits
     >      > for the condition variable, otherwise the mutex is
     >      > immediately unlocked again. If the trigger thread's
     >      > qemu_event_set() flip to EV_SET occurs between the waiting
     >      > thread's two atomic reads of the value, the waiting thread
     >      > will never wait for the condition variable, but the trigger
     >      > thread WILL try to acquire the mutex and signal the condition
     >      > variable in qemu_futex_wake(), by which time the waiting
     >      > thread may have advanced outside of qemu_event_wait().
     >
     >     Sorry if I'm making a mistake again, but the waiting thread
     >     won't set it to EV_BUSY unless the value is EV_FREE on the
     >     second read, so the trigger thread will not call
     >     qemu_futex_wake() if it manages to set it to EV_SET before the
     >     second read, will it?
     >
     >
     > This sequence of events will cause the problem:
     >
     > WAITER (in qemu_event_wait):
     > value = qatomic_load_acquire(&ev->value);
     > -> EV_FREE
     >
     > TRIGGER (in qemu_event_set):
     > qatomic_read(&ev->value) != EV_SET
     > -> EV_FREE (condition is false)
     >
     > WAITER:
     > qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET
     > -> cmpxchg returns EV_FREE, condition false.
     > ev->value = EV_BUSY.
     >
     > TRIGGER:
     >          int old = qatomic_xchg(&ev->value, EV_SET);
     >          smp_mb__after_rmw();
     >          if (old == EV_BUSY) {
     > -> old = EV_BUSY, condition true.
     > ev->value = EV_SET
     >
     > WAITER (in qemu_futex_wait(ev, EV_BUSY)):
     >      pthread_mutex_lock(&ev->lock);
     >      if (ev->value == val) {
     > -> false, because value is EV_SET
     >
     > WAITER:
     >      pthread_mutex_unlock(&ev->lock);
     >      …
     >      qemu_event_destroy(&job->done_event);
     >
     > TRIGGER (in qemu_futex_wake(ev, INT_MAX)):
     >      pthread_mutex_lock(&ev->lock);
     > -> hangs, because mutex has been destroyed
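
In code form, the racy pattern traced above is simply this (a sketch; the
submit_job() helper is hypothetical, and done_event stands for a QemuEvent
embedded in the job struct as in the experiment described earlier):

    /* Waiter (submitting thread): */
    qemu_event_init(&job->done_event, false);
    submit_job(job);                       /* hypothetical hand-off */
    qemu_event_wait(&job->done_event);     /* may return without ever blocking */
    qemu_event_destroy(&job->done_event);  /* too early: set() may still run */

    /* Trigger (completing thread): */
    qemu_event_set(&job->done_event);      /* may still touch the event after
                                            * the waiter has destroyed it */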

    Thanks for the clarification. This is very insightful.


     >
     >      >
     >      > This is all fine usually, BUT if you destroy the QemuEvent
     >      > immediately after the qemu_event_wait() call,
     >      > qemu_futex_wake() may try to lock a mutex that has been
     >      > destroyed, or signal a condition variable which has been
     >      > destroyed. I don't see a reasonable way of making this safe
     >      > other than using long-lived mutexes and condition variables.
     >      > And anyway, we have much, MUCH bigger contention/performance
     >      > issues coming from almost everything being covered by the
     >      > BQL. (If waking these callbacks can even be considered an
     >      > issue: I haven't seen it show up in profiling, whereas BQL
     >      > contention very much does.)
     >      >
     >      > I'll submit v5 of this patch set with separate condition
     >      > variables for each job type. This should make the occurrence
     >      > of waking the wrong thread quite rare, while reasoning about
     >      > correctness is pretty straightforward. I think that's good
     >      > enough.

    What about using QemuSemaphore then? It does not seem to have the
    same problem as QemuEvent.


Nothing else in the code base uses short-lived semaphores, and while I can't immediately see a risk (the mutex is held during both post and wait), there might be some non-obvious problem with the approach. Internally, the semaphores use condition variables. The solution using condition variables directly already works, is safe, relatively easy to reason about, and does not cause any performance issues. There is a tiny inefficiency in that a thread may be woken unnecessarily in the rare case when two callbacks of the same kind occur concurrently. In practice, it's irrelevant. Thanks to the awkward mismatch between PVGraphics.framework's libdispatch-based approach and QEMU's BQL/AIO/BH approach, we are already sending messages to other threads very frequently. This isn't ideal, but it's not fixable without drastically reducing the need to acquire the BQL across QEMU.
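
Roughly, the condition-variable solution looks like this (a sketch with
illustrative names such as ReadMemoryJob, not the literal patch code; the
mutex and the per-job-type condition variable are long-lived, and only the
per-job bool is short-lived):

    static QemuMutex job_mutex;          /* initialised once, never destroyed */
    static QemuCond  read_memory_cond;   /* one QemuCond per job type */

    /* Waiter (PVG callback thread): */
    static void wait_for_read_memory_job(ReadMemoryJob *job)
    {
        qemu_mutex_lock(&job_mutex);
        while (!job->done) {             /* per-job bool, protected by job_mutex */
            qemu_cond_wait(&read_memory_cond, &job_mutex);
        }
        qemu_mutex_unlock(&job_mutex);
    }

    /* Trigger (QEMU thread that finished the work): */
    static void complete_read_memory_job(ReadMemoryJob *job)
    {
        qemu_mutex_lock(&job_mutex);
        job->done = true;
        qemu_cond_broadcast(&read_memory_cond);
        qemu_mutex_unlock(&job_mutex);
    }

Nothing is destroyed per job, which is what makes this immune to the
qemu_event_destroy() race discussed above; the only cost is an occasional
spurious wakeup when two jobs of the same type overlap.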

I found several usages of ephemeral semaphores:
h_random() in hw/ppc/spapr_rng.c
colo_process_checkpoint() in migration/colo.c
postcopy_thread_create() in migration/postcopy-ram.c

I'm sure short-lived semaphores will keep working (or migration would break in strange ways).
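
Those all follow the same basic pattern, roughly (a sketch; the Job struct and
the hand-off helper are illustrative, not taken from those files):

    typedef struct Job {
        QemuSemaphore sem;
        /* job parameters/results ... */
    } Job;

    /* Submitting thread: */
    qemu_sem_init(&job->sem, 0);
    hand_off_to_worker(job);      /* hypothetical hand-off to another thread */
    qemu_sem_wait(&job->sem);     /* returns only after the worker's post */
    qemu_sem_destroy(&job->sem);

    /* Worker thread, once the job is finished: */
    qemu_sem_post(&job->sem);

Because qemu_sem_post() does all of its work under the semaphore's internal
mutex, the waiter cannot return (and destroy the semaphore) before the post
has released that mutex, so this destroy-after-wait pattern does not appear
to have the QemuEvent race.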


I do not think it is worth spending even more time trying to fix this part of the code, which isn't broken in the first place.

I'm sorry to have brought you into this mess, which I didn't really expect. I thought combining a shared pair of condition variable and mutex with job-specific bools was unnecessarily complex, and that having one synchronization primitive for each job would be simpler and would just work.

However, there was a pitfall with QemuEvent, as you demonstrated, and after grepping for QemuEvent and QemuSemaphore I find that all such ephemeral usage is written with QemuSemaphore instead of QemuEvent. I think the critical problem here is that it is not documented that qemu_event_destroy() cannot be called immediately after qemu_event_wait(). We would not have run into this situation if that were written down. I will write a patch to add such a documentation comment.
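
Something along these lines, perhaps (the wording is only a first guess):

    /*
     * Possible addition to the QemuEvent documentation in
     * include/qemu/thread.h:
     *
     * A QemuEvent must not be destroyed until every qemu_event_set() call
     * on it has returned: qemu_event_set() may still access the event's
     * internal state after it has allowed the waiter to proceed, so calling
     * qemu_event_destroy() immediately after qemu_event_wait() returns can
     * race with a concurrent qemu_event_set().  If that ordering cannot be
     * guaranteed, use QemuSemaphore or QemuCond/QemuMutex instead.
     */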
