On 2024/10/28 23:13, Phil Dennis-Jordan wrote:


On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki <akihiko.od...@daynix.com> wrote:

    On 2024/10/28 22:31, Phil Dennis-Jordan wrote:
     >
     >
     > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan <p...@philjordan.eu> wrote:
     >
     >
     >          >      >
     >          >      > Hmm. I think if we were to use that, we would need
     >          >      > to create a new QemuEvent for every job and destroy
     >          >      > it afterward, which seems expensive. We can't rule
     >          >      > out multiple concurrent jobs being submitted, and
     >          >      > the QemuEvent system only supports a single producer
     >          >      > as far as I can tell.
     >          >      >
     >          >      > You can probably sort of hack around it with just
     >          >      > one QemuEvent by putting the qemu_event_wait into a
     >          >      > loop and turning the job.done flag into an atomic
     >          >      > (because it would now need to be checked outside the
     >          >      > lock), but this all seems unnecessarily complicated
     >          >      > considering QemuEvent uses the same mechanism as
     >          >      > QemuCond/QemuMutex internally on macOS (the only
     >          >      > platform relevant here), except we can use it as
     >          >      > intended with QemuCond/QemuMutex rather than having
     >          >      > to work against the abstraction.
     >          >
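For illustration, the single-QemuEvent workaround described above (wait loop
plus atomic done flag) might look something like this; this is just a sketch
with made-up names, not code from the series:

    #include "qemu/osdep.h"
    #include "qemu/atomic.h"
    #include "qemu/thread.h"

    typedef struct AppleGFXJob {          /* hypothetical job state */
        bool done;                        /* accessed with qatomic_*, no lock */
    } AppleGFXJob;

    static QemuEvent job_event;           /* one long-lived event, all jobs */

    /* Caller thread: wait for *this* job even though completion of any job
     * sets the shared event. */
    static void wait_for_job(AppleGFXJob *job)
    {
        while (!qatomic_load_acquire(&job->done)) {
            qemu_event_reset(&job_event);
            /* Re-check after the reset so a completion that raced with the
             * reset is not slept through. */
            if (qatomic_load_acquire(&job->done)) {
                break;
            }
            qemu_event_wait(&job_event);
        }
    }

    /* Completion thread */
    static void complete_job(AppleGFXJob *job)
    {
        qatomic_store_release(&job->done, true);
        qemu_event_set(&job_event);
    }

The reset/re-check bookkeeping here is exactly the extra complication that
using QemuCond/QemuMutex directly would avoid.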
     >          >     I don't think it's going to be used concurrently. It
     >          >     would be difficult to reason even for the framework if
     >          >     it performs memory unmapping/mapping/reading operations
     >          >     concurrently.
     >          >
     >          >
     >          > I've just performed a very quick test by wrapping the job
     >          > submission/wait in the 2 mapMemory callbacks and the 1
     >          > readMemory callback with atomic counters and logging
     >          > whenever a counter went above 1.
     >          >
     >          >   * Overall, concurrent callbacks across all types were
     >          > common (many per second when the VM is busy). It's not
     >          > exactly a "thundering herd" (I never saw >2) but it's
     >          > probably not a bad idea to use a separate condition
     >          > variable for each job type (task map, surface map, memory
     >          > read).
     >          >   * While I did not observe any concurrent memory mapping
     >          > operations *within* a type of memory map (2 task mappings
     >          > or 2 surface mappings), I did see very occasional
     >          > concurrent memory *read* callbacks. These would, as far as
     >          > I can tell, not be safe with QemuEvents, unless we placed
     >          > the event inside the job struct and init/destroyed it on
     >          > every callback (which seems like excessive overhead).
     >
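A sketch of that kind of instrumentation, purely for illustration (the
callback name, signature and counter below are made up, not the actual test
code):

    #include "qemu/osdep.h"
    #include "qemu/atomic.h"

    static int read_memory_jobs_in_flight;   /* one counter per callback type */

    /* Hypothetical shape of the readMemory callback wrapper */
    static void apple_gfx_read_memory_cb(void *opaque)
    {
        if (qatomic_fetch_inc(&read_memory_jobs_in_flight) > 0) {
            fprintf(stderr, "concurrent readMemory callbacks observed\n");
        }

        /* ... submit the job to the QEMU thread and wait for it here ... */

        qatomic_dec(&read_memory_jobs_in_flight);
    }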
     >         I think we can tolerate that overhead. init/destroy
     >         essentially sets the fields in the data structure and I
     >         estimate its total size is about 100 bytes. It is probably
     >         better than waking an irrelevant thread up. I also hope that
     >         keeps the code simple; it's not worthwhile adding code to
     >         optimize this.
     >
     >
     >     At least pthread_cond_{init,destroy} and
     >     pthread_mutex_{init,destroy} don't make any syscalls, so yeah
     >     it's probably an acceptable overhead.
     >
     >
     > I've just experimented with QemuEvents created on-demand and ran into
     > some weird deadlocks, which then made me sit down and think about it
     > some more. I've come to the conclusion that creating (and crucially,
     > destroying) QemuEvents on demand in this way is not safe.
     >
     > Specifically, you must not call qemu_event_destroy() - which
     > transitively destroys the mutex and condition variable - unless you
     > can guarantee that the qemu_event_set() call on that event object has
     > completed.
     >
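Concretely, the unsafe pattern is per-job init/wait/destroy along these lines
(a sketch only; the job struct and submit_job() helper are hypothetical, not
from the patches):

    #include "qemu/osdep.h"
    #include "qemu/thread.h"

    typedef struct AppleGFXJob {              /* hypothetical per-job state */
        QemuEvent done_event;
        /* ... job parameters and results ... */
    } AppleGFXJob;

    static void submit_job(AppleGFXJob *job); /* hypothetical hand-off */

    /* Caller thread */
    static void run_job(AppleGFXJob *job)
    {
        qemu_event_init(&job->done_event, false);
        submit_job(job);
        qemu_event_wait(&job->done_event);    /* may return while the other
                                               * thread is still inside
                                               * qemu_event_set()...         */
        qemu_event_destroy(&job->done_event); /* ...so this can destroy the
                                               * mutex/cond it still uses    */
    }

    /* Completion thread */
    static void job_done(AppleGFXJob *job)
    {
        qemu_event_set(&job->done_event);
    }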
     > In qemu_event_set, the event object's value is atomically set to
     > EV_SET. If the previous value was EV_BUSY, qemu_futex_wake() is
     > called. All of this is outside any mutex, however, so apart from
     > memory coherence (there are barriers) this can race with the waiting
     > thread. qemu_event_wait() reads the event's value. If EV_FREE, it's
     > atomically set to EV_BUSY. Then the mutex is locked, the value is
     > checked again, and if it's still EV_BUSY, it waits for the condition
     > variable, otherwise the mutex is immediately unlocked again. If the
     > trigger thread's qemu_event_set() flip to EV_SET occurs between the
     > waiting thread's two atomic reads of the value, the waiting thread
     > will never wait for the condition variable, but the trigger thread
     > WILL try to acquire the mutex and signal the condition variable in
     > qemu_futex_wake(), by which time the waiting thread may have advanced
     > outside of qemu_event_wait().

    Sorry if I'm making a mistake again, but the waiting thread won't set
    the value to EV_BUSY unless it is EV_FREE on the second read, so the
    trigger thread will not call qemu_futex_wake() if it manages to set it
    to EV_SET before the second read, will it?


This sequence of events will cause the problem:

WAITER (in qemu_event_wait):
value = qatomic_load_acquire(&ev->value);
-> EV_FREE

TRIGGER (in qemu_event_set):
qatomic_read(&ev->value) != EV_SET
-> EV_FREE, so the check is true and the trigger proceeds past the early-out

WAITER:
qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET
-> cmpxchg returns EV_FREE, condition false.
ev->value = EV_BUSY.

TRIGGER:
         int old = qatomic_xchg(&ev->value, EV_SET);
         smp_mb__after_rmw();
         if (old == EV_BUSY) {
-> old = EV_BUSY, condition true.
ev->value = EV_SET

WAITER (in qemu_futex_wait(ev, EV_BUSY)):
     pthread_mutex_lock(&ev->lock);
     if (ev->value == val) {
-> false, because value is EV_SET

WAITER:
     pthread_mutex_unlock(&ev->lock);
     …
     qemu_event_destroy(&job->done_event);

TRIGGER (in qemu_futex_wake(ev, INT_MAX)):
     pthread_mutex_lock(&ev->lock);
-> hangs, because mutex has been destroyed

Thanks for the clarification. This is very insightful.



     >
     > This is all fine usually, BUT if you destroy the QemuEvent
     > immediately after the qemu_event_wait() call, qemu_futex_wake() may
     > try to lock a mutex that has been destroyed, or signal a condition
     > variable which has been destroyed. I don't see a reasonable way of
     > making this safe other than using long-lived mutexes and condition
     > variables. And anyway, we have much, MUCH bigger
     > contention/performance issues coming from almost everything being
     > covered by the BQL. (If waking these callbacks can even be considered
     > an issue: I haven't seen it show up in profiling, whereas BQL
     > contention very much does.)
     >
     > I'll submit v5 of this patch set with separate condition variables
     > for each job type. This should make the occurrence of waking the
     > wrong thread quite rare, while reasoning about correctness is pretty
     > straightforward. I think that's good enough.
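For reference, the per-job-type condition variable approach would have
roughly this shape (hypothetical names and layout, not the actual v5 code):

    #include "qemu/osdep.h"
    #include "qemu/thread.h"

    typedef struct AppleGFXState {        /* illustrative, not the real struct */
        QemuMutex job_mutex;              /* long-lived, shared by all jobs    */
        QemuCond task_map_job_cond;       /* one condition variable per type   */
        QemuCond mem_read_job_cond;
    } AppleGFXState;

    /* Caller thread: the done flag lives in the (stack-allocated) job. */
    static void wait_for_mem_read_job(AppleGFXState *s, bool *job_done)
    {
        qemu_mutex_lock(&s->job_mutex);
        while (!*job_done) {
            qemu_cond_wait(&s->mem_read_job_cond, &s->job_mutex);
        }
        qemu_mutex_unlock(&s->job_mutex);
    }

    /* Completion thread: broadcast because two jobs of a type may overlap. */
    static void complete_mem_read_job(AppleGFXState *s, bool *job_done)
    {
        qemu_mutex_lock(&s->job_mutex);
        *job_done = true;
        qemu_cond_broadcast(&s->mem_read_job_cond);
        qemu_mutex_unlock(&s->job_mutex);
    }

If two jobs of the same type do overlap, the broadcast may wake a waiter
whose own flag is still false; it simply re-checks and goes back to sleep, so
correctness does not depend on the wake-up being precise.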

What about using QemuSemaphore then? It does not seem to have the same problem as QemuEvent.
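As far as I can tell, qemu_sem_post() only manipulates the semaphore while
holding its internal mutex, and the waiter cannot return from qemu_sem_wait()
without taking that same mutex, so destroying a per-job semaphore right after
the wait should not race the way qemu_event_destroy() does. A rough sketch
(hypothetical names, not actual patch code):

    #include "qemu/osdep.h"
    #include "qemu/thread.h"

    typedef struct AppleGFXReadMemoryJob {     /* hypothetical per-job state */
        QemuSemaphore sem;
        /* ... job parameters and results ... */
    } AppleGFXReadMemoryJob;

    static void submit_job(AppleGFXReadMemoryJob *job);  /* hypothetical */

    /* Caller thread */
    static void run_read_memory_job(AppleGFXReadMemoryJob *job)
    {
        qemu_sem_init(&job->sem, 0);
        submit_job(job);
        qemu_sem_wait(&job->sem);   /* returns only once the post below no
                                     * longer touches the semaphore          */
        qemu_sem_destroy(&job->sem);
    }

    /* Completion thread */
    static void read_memory_job_done(AppleGFXReadMemoryJob *job)
    {
        qemu_sem_post(&job->sem);
    }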
