On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki <akihiko.od...@daynix.com>
wrote:

> On 2024/10/28 22:31, Phil Dennis-Jordan wrote:
> >
> >
> > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan <p...@philjordan.eu> wrote:
> >
> >
> >          >      >
> >          >      > Hmm. I think if we were to use that, we would need
> >          >      > to create a new QemuEvent for every job and destroy
> >          >      > it afterward, which seems expensive. We can't rule
> >          >      > out multiple concurrent jobs being submitted, and
> >          >      > the QemuEvent system only supports a single producer
> >          >      > as far as I can tell.
> >          >      >
> >          >      > You can probably sort of hack around it with just
> >          >      > one QemuEvent by putting the qemu_event_wait into a
> >          >      > loop and turning the job.done flag into an atomic
> >          >      > (because it would now need to be checked outside the
> >          >      > lock) but this all seems unnecessarily complicated
> >          >      > considering the QemuEvent uses the same mechanism
> >          >      > QemuCond/QemuMutex internally on macOS (the only
> >          >      > platform relevant here), except we can use it as
> >          >      > intended with QemuCond/QemuMutex rather than having
> >          >      > to work against the abstraction.
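(For reference, that single-QemuEvent workaround would look roughly like the
sketch below. The MemoryJob struct and helper names are made up purely for
illustration; this is not code from the patch set.)

    #include "qemu/osdep.h"
    #include "qemu/atomic.h"
    #include "qemu/thread.h"

    /* One long-lived event shared by all jobs;
     * qemu_event_init(&job_event, false) is called once at startup. */
    static QemuEvent job_event;

    typedef struct MemoryJob {
        bool done;                  /* set by the completion side */
        /* ... job parameters ... */
    } MemoryJob;

    /* caller side: wait for this particular job to finish */
    static void wait_for_job(MemoryJob *job)
    {
        while (!qatomic_load_acquire(&job->done)) {
            /* reset before re-checking so a concurrent set is not lost */
            qemu_event_reset(&job_event);
            if (qatomic_load_acquire(&job->done)) {
                break;
            }
            qemu_event_wait(&job_event);
        }
    }

    /* completion side, possibly completing a different job */
    static void finish_job(MemoryJob *job)
    {
        qatomic_store_release(&job->done, true);
        qemu_event_set(&job_event);   /* wakes every waiter, relevant or not */
    }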
> >          >
> >          >     I don't think it's going to be used concurrently. It
> >          >     would be difficult to reason even for the framework if
> >          >     it performs memory unmapping/mapping/reading operations
> >          >     concurrently.
> >          >
> >          >
> >          > I've just performed a very quick test by wrapping the job
> >          > submission/wait in the 2 mapMemory callbacks and the 1
> >          > readMemory callback with atomic counters and logging
> >          > whenever a counter went above 1.
> >          >
> >          >   * Overall, concurrent callbacks across all types were
> >          > common (many per second when the VM is busy). It's not
> >          > exactly a "thundering herd" (I never saw >2) but it's
> >          > probably not a bad idea to use a separate condition
> >          > variable for each job type. (task map, surface map, memory
> >          > read)
> >          >   * While I did not observe any concurrent memory mapping
> >          > operations *within* a type of memory map (2 task mappings
> >          > or 2 surface mappings) I did see very occasional concurrent
> >          > memory *read* callbacks. These would, as far as I can tell,
> >          > not be safe with QemuEvents, unless we placed the event
> >          > inside the job struct and init/destroyed it on every
> >          > callback (which seems like excessive overhead).
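(And the per-job event variant mentioned in the second bullet would be along
these lines -- again, the struct and helpers are only illustrative; as the
rest of this mail shows, destroying the event right after the wait returns is
exactly what makes it unsafe:)

    #include "qemu/osdep.h"
    #include "qemu/thread.h"

    typedef struct MemoryJob {
        QemuEvent done_event;       /* lives only as long as the job */
        /* ... job parameters ... */
    } MemoryJob;

    void submit_job(MemoryJob *job);   /* hypothetical helper */

    /* caller side */
    static void run_job(MemoryJob *job)
    {
        qemu_event_init(&job->done_event, false);
        submit_job(job);            /* hands the job to another thread */
        qemu_event_wait(&job->done_event);
        /* may race with the qemu_event_set() still running on the
         * completion thread -- see the interleaving traced below */
        qemu_event_destroy(&job->done_event);
    }

    /* completion side, on another thread */
    static void finish_job(MemoryJob *job)
    {
        qemu_event_set(&job->done_event);
    }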
> >
> >         I think we can tolerate that overhead. init/destroy
> >         essentially sets the fields in the data structure and I
> >         estimate its total size is about 100 bytes. It is probably
> >         better than waking an irrelevant thread up. I also hope that
> >         keeps the code simple; it's not worthwhile adding code to
> >         optimize this.
> >
> >
> >     At least pthread_cond_{init,destroy} and
> >     pthread_mutex_{init,destroy} don't make any syscalls, so yeah it's
> >     probably an acceptable overhead.
> >
> >
> > I've just experimented with QemuEvents created on-demand and ran into
> > some weird deadlocks, which then made me sit down and think about it
> > some more. I've come to the conclusion that creating (and crucially,
> > destroying) QemuEvents on demand in this way is not safe.
> >
> > Specifically, you must not call qemu_event_destroy() - which
> > transitively destroys the mutex and condition variable - unless you can
> > guarantee that the qemu_event_set() call on that event object has
> > completed.
> >
> > In qemu_event_set, the event object's value is atomically set to EV_SET.
> > If the previous value was EV_BUSY, qemu_futex_wake() is called. All of
> > this is outside any mutex, however, so apart from memory coherence
> > (there are barriers) this can race with the waiting thread.
> > qemu_event_wait() reads the event's value. If EV_FREE, it's atomically
> > set to EV_BUSY. Then the mutex is locked, the value is checked again,
> > and if it's still EV_BUSY, it waits for the condition variable,
> > otherwise the mutex is immediately unlocked again. If the trigger
> > thread's qemu_event_set() flip to EV_SET occurs between the waiting
> > thread's two atomic reads of the value, the waiting thread will never
> > wait for the condition variable, but the trigger thread WILL try to
> > acquire the mutex and signal the condition variable in
> > qemu_futex_wake(), by which time the waiting thread may have advanced
> > outside of qemu_event_wait().
>
> Sorry if I'm making a mistake again, but the waiting thread won't set it
> to EV_BUSY unless the value is still EV_FREE on the second read, so the
> trigger thread will not call qemu_futex_wake() if it manages to set it to
> EV_SET before the second read, will it?
>

This sequence of events will cause the problem:

WAITER (in qemu_event_wait):
value = qatomic_load_acquire(&ev->value);
-> EV_FREE

TRIGGER (in qemu_event_set):
qatomic_read(&ev->value) != EV_SET
-> reads EV_FREE, so the value is not EV_SET yet and qemu_event_set()
   carries on (no early exit)

WAITER:
qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET
-> cmpxchg returns EV_FREE, condition false.
ev->value = EV_BUSY.

TRIGGER:
        int old = qatomic_xchg(&ev->value, EV_SET);
        smp_mb__after_rmw();
        if (old == EV_BUSY) {
-> old = EV_BUSY, condition true.
ev->value = EV_SET

WAITER (in qemu_futex_wait(ev, EV_BUSY)):
    pthread_mutex_lock(&ev->lock);
    if (ev->value == val) {
-> false, because value is EV_SET

WAITER:
    pthread_mutex_unlock(&ev->lock);
    …
    qemu_event_destroy(&job->done_event);

TRIGGER (in qemu_futex_wake(ev, INT_MAX)):
    pthread_mutex_lock(&ev->lock);
-> hangs, because mutex has been destroyed
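
For completeness, the long-lived mutex/condition-variable alternative (what
v5 goes for, with one condition variable per job type) would look roughly
like this -- names are illustrative, not the actual v5 code:

    #include "qemu/osdep.h"
    #include "qemu/thread.h"

    /* Initialised once (e.g. at device realize) and never destroyed while
     * jobs can still be in flight, so there is nothing for the completion
     * side to race against. */
    static QemuMutex job_mutex;
    static QemuCond mem_read_job_cond;   /* one such cond per job type */

    typedef struct MemoryJob {
        bool done;                  /* protected by job_mutex */
        /* ... job parameters ... */
    } MemoryJob;

    /* caller side */
    static void wait_for_job(MemoryJob *job)
    {
        qemu_mutex_lock(&job_mutex);
        while (!job->done) {
            qemu_cond_wait(&mem_read_job_cond, &job_mutex);
        }
        qemu_mutex_unlock(&job_mutex);
    }

    /* completion side */
    static void finish_job(MemoryJob *job)
    {
        qemu_mutex_lock(&job_mutex);
        job->done = true;
        qemu_cond_broadcast(&mem_read_job_cond);
        qemu_mutex_unlock(&job_mutex);
    }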

>
> > This is all fine usually, BUT if you destroy the QemuEvent immediately
> > after the qemu_event_wait() call, qemu_futex_wake() may try to lock a
> > mutex that has been destroyed, or signal a condition variable which has
> > been destroyed. I don't see a reasonable way of making this safe other
> > than using long-lived mutexes and condition variables. And anyway, we
> > have much, MUCH bigger contention/performance issues coming from almost
> > everything being covered by the BQL. (If waking these callbacks can even
> > be considered an issue: I haven't seen it show up in profiling, whereas
> > BQL contention very much does.)
> >
> > I'll submit v5 of this patch set with separate condition variables for
> > each job type. This should make the occurrence of waking the wrong
> > thread quite rare, while reasoning about correctness is pretty
> > straightforward. I think that's good enough.