On 2024/10/28 22:31, Phil Dennis-Jordan wrote:


On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan <p...@philjordan.eu> wrote:


         >      >
         >      > Hmm. I think if we were to use that, we would need to
         >      > create a new QemuEvent for every job and destroy it
         >      > afterward, which seems expensive. We can't rule out
         >      > multiple concurrent jobs being submitted, and the
         >      > QemuEvent system only supports a single producer as far
         >      > as I can tell.
         >      >
         >      > You can probably sort of hack around it with just one
         >      > QemuEvent by putting the qemu_event_wait into a loop and
         >      > turning the job.done flag into an atomic (because it
         >      > would now need to be checked outside the lock), but this
         >      > all seems unnecessarily complicated considering the
         >      > QemuEvent uses the same mechanism as QemuCond/QemuMutex
         >      > internally on macOS (the only platform relevant here),
         >      > except we can use it as intended with QemuCond/QemuMutex
         >      > rather than having to work against the abstraction.
         >
         >     I don't think it's going to be used concurrently. It would
         >     be difficult to reason about even for the framework if it
         >     performed memory unmapping/mapping/reading operations
         >     concurrently.
         >
         >
         > I've just performed a very quick test by wrapping the job
         > submission/wait in the 2 mapMemory callbacks and the 1
         > readMemory callback with atomic counters and logging whenever
         > a counter went above 1.
         >
         >   * Overall, concurrent callbacks across all types were common
         > (many per second when the VM is busy). It's not exactly a
         > "thundering herd" (I never saw >2), but it's probably not a
         > bad idea to use a separate condition variable for each job
         > type (task map, surface map, memory read).
         >   * While I did not observe any concurrent memory mapping
         > operations *within* a type of memory map (2 task mappings or
         > 2 surface mappings), I did see very occasional concurrent
         > memory *read* callbacks. These would, as far as I can tell,
         > not be safe with QemuEvents, unless we placed the event inside
         > the job struct and init/destroyed it on every callback (which
         > seems like excessive overhead).

        I think we can tolerate that overhead. init/destroy essentially
        sets the fields in the data structure, and I estimate its total
        size is about 100 bytes. It is probably better than waking an
        irrelevant thread up. I also hope that keeps the code simple;
        it's not worthwhile adding code to optimize this.


    At least pthread_cond_{init,destroy} and
    pthread_mutex_{init,destroy} don't make any syscalls, so yeah it's
    probably an acceptable overhead.
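(For concreteness, the per-job pattern being weighed above would look roughly like the sketch below. The struct, field and function names are made up for illustration, and the one-shot BH is just a stand-in for however the job actually reaches the main thread in the patch.)

    #include "qemu/osdep.h"
    #include "qemu/thread.h"
    #include "qemu/main-loop.h"

    /* Hypothetical per-job state; not taken from the patch. */
    typedef struct MemoryJob {
        QemuEvent done_event;
        /* ... request parameters and result fields ... */
    } MemoryJob;

    /* Runs on the main/BQL thread and performs the actual work. */
    static void process_job_bh(void *opaque)
    {
        MemoryJob *job = opaque;

        /* ... perform the mapping/read on behalf of the callback ... */

        qemu_event_set(&job->done_event);
    }

    /* Called from the PVG callback thread for every job. */
    static void submit_job_and_wait(MemoryJob *job)
    {
        qemu_event_init(&job->done_event, false);

        aio_bh_schedule_oneshot(qemu_get_aio_context(), process_job_bh, job);

        qemu_event_wait(&job->done_event);
        /* The destroy below is the step whose safety is discussed next. */
        qemu_event_destroy(&job->done_event);
    }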


I've just experimented with QemuEvents created on-demand and ran into some weird deadlocks, which then made me sit down and think about it some more. I've come to the conclusion that creating (and crucially, destroying) QemuEvents on demand in this way is not safe.

Specifically, you must not call qemu_event_destroy() - which transitively destroys the mutex and condition variable - unless you can guarantee that the qemu_event_set() call on that event object has completed.

In qemu_event_set(), the event object's value is atomically set to EV_SET. If the previous value was EV_BUSY, qemu_futex_wake() is called. All of this happens outside any mutex, however, so apart from memory coherence (there are barriers) it can race with the waiting thread. qemu_event_wait() reads the event's value; if it is EV_FREE, it is atomically set to EV_BUSY. Then the mutex is locked, the value is checked again, and if it's still EV_BUSY, the thread waits on the condition variable; otherwise the mutex is immediately unlocked again. If the trigger thread's qemu_event_set() flip to EV_SET occurs between the waiting thread's two atomic reads of the value, the waiting thread will never wait on the condition variable, but the trigger thread WILL try to acquire the mutex and signal the condition variable in qemu_futex_wake(), by which time the waiting thread may already have advanced past qemu_event_wait().

Sorry if I'm making a mistake again, but the waiting thread won't set the value to EV_BUSY unless it is EV_FREE on the second read, so the trigger thread will not call qemu_futex_wake() if it manages to set EV_SET before the second read, will it?


This is all fine usually, BUT if you destroy the QemuEvent immediately after the qemu_event_wait() call, qemu_futex_wake() may try to lock a mutex that has been destroyed, or signal a condition variable which has been destroyed. I don't see a reasonable way of making this safe other than using long-lived mutexes and condition variables. And anyway, we have much, MUCH bigger contention/performance issues coming from almost everything being covered by the BQL. (If waking these callbacks can even be considered an issue: I haven't seen it show up in profiling, whereas BQL contention very much does.)
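To make the failure mode concrete: on macOS the wake path boils down to locking the event's internal mutex and signalling its condition variable. The sketch below paraphrases the non-futex fallback rather than quoting util/qemu-thread-posix.c (names and details are simplified), and the interleaving shown is one possible ordering in which the use-after-destroy can happen, not the only one.

    #include <pthread.h>

    /* Rough paraphrase of the cond/mutex fallback QemuEvent uses on
     * platforms without futexes (such as macOS); simplified. */
    struct event_sketch {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        unsigned value;            /* EV_SET / EV_FREE / EV_BUSY */
    };

    static void futex_wake_sketch(struct event_sketch *ev)
    {
        pthread_mutex_lock(&ev->lock);      /* may execute after the waiter  */
        pthread_cond_broadcast(&ev->cond);  /* has already destroyed ev      */
        pthread_mutex_unlock(&ev->lock);
    }

    /*
     * One interleaving that goes wrong when the event is destroyed right
     * after the wait returns:
     *
     *   waiter (callback thread)           trigger (main thread)
     *   ------------------------           ---------------------
     *   qemu_event_wait():
     *     value: EV_FREE -> EV_BUSY
     *                                      qemu_event_set():
     *                                        value -> EV_SET (old == EV_BUSY,
     *                                        so a wake-up is owed)
     *     lock; value != EV_BUSY, so no
     *     cond_wait needed; unlock; return
     *   qemu_event_destroy()
     *                                      futex_wake_sketch():
     *                                        locks/signals the now-destroyed
     *                                        mutex and condition variable
     */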

I'll submit v5 of this patch set with separate condition variables for each job type. This should make waking the wrong thread quite rare, while keeping the reasoning about correctness straightforward. I think that's good enough.
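For reference, a minimal sketch of that shape - one long-lived QemuMutex/QemuCond pair per job type, plus a per-job done flag checked under the mutex - might look like this (names are illustrative, not necessarily what v5 uses):

    #include "qemu/osdep.h"
    #include "qemu/thread.h"

    /* One long-lived pair per job type (task map, surface map, memory
     * read); initialised once, e.g. at device realize, with
     * qemu_mutex_init() and qemu_cond_init(). Names are illustrative. */
    typedef struct JobWaiter {
        QemuMutex lock;
        QemuCond cond;
    } JobWaiter;

    typedef struct MemoryReadJob {
        JobWaiter *waiter;
        bool done;
        /* ... request/result fields ... */
    } MemoryReadJob;

    /* Main/BQL thread: finish the job and wake the callback thread. */
    static void job_complete(MemoryReadJob *job)
    {
        qemu_mutex_lock(&job->waiter->lock);
        job->done = true;
        qemu_cond_broadcast(&job->waiter->cond);
        qemu_mutex_unlock(&job->waiter->lock);
    }

    /* Callback thread: block until the main thread has completed the job. */
    static void job_wait(MemoryReadJob *job)
    {
        qemu_mutex_lock(&job->waiter->lock);
        while (!job->done) {
            qemu_cond_wait(&job->waiter->cond, &job->waiter->lock);
        }
        qemu_mutex_unlock(&job->waiter->lock);
    }

Because each job re-checks its own done flag under the mutex, a broadcast that happens to wake an unrelated waiter of the same type is merely a spurious wake-up rather than a correctness problem.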




