On Tue, 29 Oct 2024 at 08:42, Akihiko Odaki <akihiko.od...@daynix.com> wrote:

> On 2024/10/29 6:06, Phil Dennis-Jordan wrote:
> >
> >
> > On Mon, 28 Oct 2024 at 17:06, Akihiko Odaki <akihiko.od...@daynix.com> wrote:
> >
> >     On 2024/10/28 23:13, Phil Dennis-Jordan wrote:
> >      >
> >      >
> >      > On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki <akihiko.od...@daynix.com> wrote:
> >      >
> >      >     On 2024/10/28 22:31, Phil Dennis-Jordan wrote:
> >      >      >
> >      >      >
> >      >      > On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan <p...@philjordan.eu> wrote:
> >      >      >
> >      >      >
> >      >      >          >      >
> >      >      >          >      > Hmm. I think if we were to use that, we would need to create a new
> >      >      >          >      > QemuEvent for every job and destroy it afterward, which seems
> >      >      >          >      > expensive. We can't rule out multiple concurrent jobs being
> >      >      >          >      > submitted, and the QemuEvent system only supports a single
> >      >      >          >      > producer as far as I can tell.
> >      >      >          >      >
> >      >      >          >      > You can probably sort of hack around it with just one QemuEvent by
> >      >      >          >      > putting the qemu_event_wait into a loop and turning the job.done
> >      >      >          >      > flag into an atomic (because it would now need to be checked
> >      >      >          >      > outside the lock), but this all seems unnecessarily complicated
> >      >      >          >      > considering the QemuEvent uses the same QemuCond/QemuMutex
> >      >      >          >      > mechanism internally on macOS (the only platform relevant here),
> >      >      >          >      > except we can use it as intended with QemuCond/QemuMutex rather
> >      >      >          >      > than having to work against the abstraction.
> >      >      >          >
> >      >      >          >     I don't think it's going to be used concurrently. It would be
> >      >      >          >     difficult to reason about, even for the framework, if it performed
> >      >      >          >     memory unmapping/mapping/reading operations concurrently.
> >      >      >          >
> >      >      >          > I've just performed a very quick test by wrapping the job
> >      >      >          > submission/wait in the 2 mapMemory callbacks and the 1 readMemory
> >      >      >          > callback with atomic counters and logging whenever a counter went
> >      >      >          > above 1.
> >      >      >          >
> >      >      >          >   * Overall, concurrent callbacks across all types were common (many
> >      >      >          >     per second when the VM is busy). It's not exactly a "thundering
> >      >      >          >     herd" (I never saw >2) but it's probably not a bad idea to use a
> >      >      >          >     separate condition variable for each job type (task map, surface
> >      >      >          >     map, memory read).
> >      >      >          >   * While I did not observe any concurrent memory mapping operations
> >      >      >          >     *within* a type of memory map (2 task mappings or 2 surface
> >      >      >          >     mappings), I did see very occasional concurrent memory *read*
> >      >      >          >     callbacks. These would, as far as I can tell, not be safe with
> >      >      >          >     QemuEvents, unless we placed the event inside the job struct and
> >      >      >          >     init/destroyed it on every callback (which seems like excessive
> >      >      >          >     overhead).
> >      >      >
> >      >      >         I think we can tolerate that overhead. init/destroy essentially sets
> >      >      >         the fields in the data structure and I estimate its total size is
> >      >      >         about 100 bytes. It is probably better than waking an irrelevant
> >      >      >         thread up. I also hope that keeps the code simple; it's not
> >      >      >         worthwhile adding code to optimize this.
> >      >      >
> >      >      >     At least pthread_cond_{init,destroy} and pthread_mutex_{init,destroy}
> >      >      >     don't make any syscalls, so yeah, it's probably an acceptable overhead.
> >      >      >
> >      >      > I've just experimented with QemuEvents created on-demand and ran into
> >      >      > some weird deadlocks, which then made me sit down and think about it
> >      >      > some more. I've come to the conclusion that creating (and crucially,
> >      >      > destroying) QemuEvents on demand in this way is not safe.
> >      >      >
> >      >      > Specifically, you must not call qemu_event_destroy() - which
> >      >      > transitively destroys the mutex and condition variable - unless you can
> >      >      > guarantee that the qemu_event_set() call on that event object has
> >      >      > completed.
> >      >      >
> >      >      > In qemu_event_set, the event object's value is atomically set to
> >      >      > EV_SET. If the previous value was EV_BUSY, qemu_futex_wake() is
> >      >      > called. All of this is outside any mutex, however, so apart from
> >      >      > memory coherence (there are barriers) this can race with the waiting
> >      >      > thread. qemu_event_wait() reads the event's value. If EV_FREE, it's
> >      >      > atomically set to EV_BUSY. Then the mutex is locked, the value is
> >      >      > checked again, and if it's still EV_BUSY, it waits for the condition
> >      >      > variable, otherwise the mutex is immediately unlocked again. If the
> >      >      > trigger thread's qemu_event_set() flip to EV_SET occurs between the
> >      >      > waiting thread's two atomic reads of the value, the waiting thread
> >      >      > will never wait for the condition variable, but the trigger thread
> >      >      > WILL try to acquire the mutex and signal the condition variable in
> >      >      > qemu_futex_wake(), by which time the waiting thread may have advanced
> >      >      > outside of qemu_event_wait().
> >      >
> >      >     Sorry if I'm making a mistake again, but the waiting thread won't set
> >      >     the value to EV_BUSY unless it is EV_FREE on the second read, so the
> >      >     trigger thread will not call qemu_futex_wake() if it manages to set it
> >      >     to EV_SET before the second read, will it?
> >      >
> >      > This sequence of events will cause the problem:
> >      >
> >      > WAITER (in qemu_event_wait):
> >      >      value = qatomic_load_acquire(&ev->value);
> >      > -> EV_FREE
> >      >
> >      > TRIGGER (in qemu_event_set):
> >      >      qatomic_read(&ev->value) != EV_SET
> >      > -> EV_FREE (condition is false)
> >      >
> >      > WAITER:
> >      >      qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET
> >      > -> cmpxchg returns EV_FREE, condition false; ev->value = EV_BUSY.
> >      >
> >      > TRIGGER:
> >      >      int old = qatomic_xchg(&ev->value, EV_SET);
> >      >      smp_mb__after_rmw();
> >      >      if (old == EV_BUSY) {
> >      > -> old = EV_BUSY, condition true; ev->value = EV_SET
> >      >
> >      > WAITER (in qemu_futex_wait(ev, EV_BUSY)):
> >      >      pthread_mutex_lock(&ev->lock);
> >      >      if (ev->value == val) {
> >      > -> false, because value is EV_SET
> >      >
> >      > WAITER:
> >      >      pthread_mutex_unlock(&ev->lock);
> >      >      …
> >      >      qemu_event_destroy(&job->done_event);
> >      >
> >      > TRIGGER (in qemu_futex_wake(ev, INT_MAX)):
> >      >      pthread_mutex_lock(&ev->lock);
> >      > -> hangs, because mutex has been destroyed
> >
> >     Thanks for clarification. This is very insightful.
> >
> >
> >      >
> >      >      >
> >      >      > This is all fine usually, BUT if you destroy the QemuEvent
> >      >      > immediately after the qemu_event_wait() call, qemu_futex_wake() may
> >      >      > try to lock a mutex that has been destroyed, or signal a condition
> >      >      > variable which has been destroyed. I don't see a reasonable way of
> >      >      > making this safe other than using long-lived mutexes and condition
> >      >      > variables. And anyway, we have much, MUCH bigger contention/
> >      >      > performance issues coming from almost everything being covered by
> >      >      > the BQL. (If waking these callbacks can even be considered an issue:
> >      >      > I haven't seen it show up in profiling, whereas BQL contention very
> >      >      > much does.)
> >      >      >
> >      >      > I'll submit v5 of this patch set with separate condition variables
> >      >      > for each job type. This should make waking the wrong thread quite
> >      >      > rare, while reasoning about correctness stays pretty straightforward.
> >      >      > I think that's good enough.
> >
> >     What about using QemuSemaphore then? It does not seem to have the same
> >     problem as QemuEvent.
> >
> >
> > Nothing else in the code base uses short-lived semaphores, and while I
> > can't immediately see a risk (the mutex is held during both post and
> > wait), there might be some non-obvious problem with the approach.
> > Internally, the semaphores use condition variables. The solution using
> > condition variables directly already works, is safe, relatively easy to
> > reason about, and does not cause any performance issues. There is a tiny
> > inefficiency in waking up a thread unnecessarily in the rare case when two
> > callbacks of the same kind occur concurrently; in practice, it's
> > irrelevant. Thanks to the awkward mismatch between PVGraphics.framework's
> > libdispatch-based approach and QEMU's BQL/AIO/BH approach, we are already
> > sending messages to other threads very frequently. This isn't ideal, but
> > it's not fixable without drastically reducing the need to acquire the BQL
> > across QEMU.
>
> I found several usages of ephemeral semaphores:
> h_random() in hw/ppc/spapr_rng.c
> colo_process_checkpoint() in migration/colo.c
> postcopy_thread_create() in migration/postcopy-ram.c
>
> I'm sure short-lived semaphores will keep working (or break migration in
> strange ways).
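
Fair enough. For what it's worth, going by what we've established above, the
reason ephemeral QemuSemaphore usage gets away with this is that both post
and wait take the semaphore's internal mutex, so a woken waiter cannot
return (and destroy the semaphore) until the poster has dropped the lock
again. A rough sketch of a counting semaphore of that shape, purely for
illustration and not the verbatim QEMU implementation:

/* Illustrative mutex+condvar semaphore; not the actual qemu-thread code. */
typedef struct {
    QemuMutex mutex;
    QemuCond cond;
    unsigned count;
} IllustrativeSem;

static void illustrative_sem_post(IllustrativeSem *s)
{
    qemu_mutex_lock(&s->mutex);    /* poster holds the mutex...        */
    s->count++;
    qemu_cond_signal(&s->cond);    /* ...while signalling...           */
    qemu_mutex_unlock(&s->mutex);  /* ...and only then releases it.    */
}

static void illustrative_sem_wait(IllustrativeSem *s)
{
    /* The waiter can only return (and thus destroy the semaphore) after
     * re-acquiring the mutex, i.e. after the poster has released it. */
    qemu_mutex_lock(&s->mutex);
    while (s->count == 0) {
        qemu_cond_wait(&s->cond, &s->mutex);
    }
    s->count--;
    qemu_mutex_unlock(&s->mutex);
}
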
>
> >
> > I do not think it is worth spending even more time trying to fix this
> > part of the code, which isn't broken in the first place.
>
> I'm sorry to have dragged you into this mess, which I didn't really
> expect. I thought combining a shared condition variable/mutex pair with
> job-specific bools was unnecessarily complex, and that having one
> synchronization primitive per job would be simpler and would just work.
>

With multithreading, the devil is always in the detail! 😅 I wouldn't mind
if we were seeing genuine issues with the Mutex/Cond code, but it's fine as
far as I can tell. The QemuEvent version wasn't even really any simpler:
replacing "bool done;" with "QemuEvent done_event;" and turning the wait
into init/wait/destroy makes that side longer, while collapsing
lock/broadcast/unlock into set makes the other side shorter, and I guess a
QemuSemaphore version would come out about the same. Relying on how an edge
case is handled (destroying immediately after waiting) also potentially
makes the code more fragile in the long term, in case implementation
details change. I think we've reached the bikeshedding stage here, and I
suggest that any further improvements to this part, other than bug fixes,
be deferred to future patches.
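
For the archive, the shape of the Mutex/Cond pattern I'm describing is
roughly the following. This is only an illustrative sketch with made-up
names (job_mutex, task_map_job_cond, MapJob), not the actual apple-gfx
patch code:

/* Long-lived primitives, initialised once when the device is realised:
 * one mutex shared by all job types, and one condition variable per job
 * type (task map, surface map, memory read), so a broadcast only wakes
 * threads waiting for that kind of job. Names are invented. */
static QemuMutex job_mutex;
static QemuCond  task_map_job_cond;

typedef struct MapJob {
    bool done;                /* protected by job_mutex */
    /* ...job parameters and results... */
} MapJob;

/* QEMU side: called once the work for a job has been carried out. */
static void map_job_signal(MapJob *job)
{
    qemu_mutex_lock(&job_mutex);
    job->done = true;
    qemu_cond_broadcast(&task_map_job_cond);
    qemu_mutex_unlock(&job_mutex);
}

/* PVG callback thread: waits for its own job to complete. */
static void map_job_wait(MapJob *job)
{
    qemu_mutex_lock(&job_mutex);
    while (!job->done) {
        /* May wake because a different job of the same type completed;
         * the per-job 'done' flag makes that merely inefficient, not
         * incorrect. */
        qemu_cond_wait(&task_map_job_cond, &job_mutex);
    }
    qemu_mutex_unlock(&job_mutex);
}

The important property, in contrast to the ephemeral QemuEvent idea, is
that the mutex and condition variable outlive every individual job, so the
signalling side can never touch a destroyed primitive.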


> However, there was a pitfall with QemuEvent, as you demonstrated, and
> grepping for QemuEvent and QemuSemaphore now shows that all such ephemeral
> usage is written with QemuSemaphore instead of QemuEvent. I think the
> critical problem here is that it is not documented that
> qemu_event_destroy() cannot be used immediately after qemu_event_wait().
> We would not have run into this situation if that were written down. I
> will write a patch to add such a documentation comment.
>

Sounds good, I'd be happy to review it (cc me).
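
In case it helps with the wording: the anti-pattern the comment needs to
warn against is essentially the following. This is a hypothetical example
(start_async_work() just stands in for anything that ends with a
qemu_event_set() on another thread):

QemuEvent done_event;

qemu_event_init(&done_event, false);
start_async_work(&done_event);   /* other thread calls qemu_event_set() */
qemu_event_wait(&done_event);
/* UNSAFE: qemu_event_wait() can return while the setter is still inside
 * qemu_event_set(), about to lock the internal mutex / signal the condvar
 * on the mutex+condvar backend, so destroying here can pull those
 * primitives out from under it. */
qemu_event_destroy(&done_event);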

Thanks for the in-depth reviews on all the patches in this set! You've
prompted me to make some significant improvements, even if the experience
with the job signalling has ended up being a frustrating one. The code is
otherwise definitely better now than before. I've just posted v5 of the
patch set and I hope we're pretty close to "good enough" now.

All the best,
Phil
