On 2024/10/28 23:13, Phil Dennis-Jordan wrote:
On Mon, 28 Oct 2024 at 15:02, Akihiko Odaki <akihiko.od...@daynix.com> wrote:
On 2024/10/28 22:31, Phil Dennis-Jordan wrote:
>
>
> On Mon, 28 Oct 2024 at 10:00, Phil Dennis-Jordan <p...@philjordan.eu> wrote:
>
>
> > >
> > > Hmm. I think if we were to use that, we would need to create a new
> > > QemuEvent for every job and destroy it afterward, which seems
> > > expensive. We can't rule out multiple concurrent jobs being
> > > submitted, and the QemuEvent system only supports a single producer
> > > as far as I can tell.
> > >
> > > You can probably sort of hack around it with just one QemuEvent by
> > > putting the qemu_event_wait into a loop and turning the job.done
> > > flag into an atomic (because it would now need to be checked outside
> > > the lock), but this all seems unnecessarily complicated considering
> > > that QemuEvent uses the same mechanism as QemuCond/QemuMutex
> > > internally on macOS (the only platform relevant here), except we can
> > > use it as intended with QemuCond/QemuMutex rather than having to
> > > work against the abstraction.
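
For illustration, the "use it as intended" QemuCond/QemuMutex approach
mentioned above would look roughly like this (names invented for the
sketch; a long-lived mutex and condition variable are assumed):

    /* initialised once, e.g. next to the device state: */
    QemuMutex job_mutex;    /* qemu_mutex_init(&job_mutex); */
    QemuCond  job_cond;     /* qemu_cond_init(&job_cond);   */

    /* waiting thread, after submitting a job with a bool 'done' flag: */
    qemu_mutex_lock(&job_mutex);
    while (!job->done) {
        qemu_cond_wait(&job_cond, &job_mutex);
    }
    qemu_mutex_unlock(&job_mutex);

    /* completing thread, once the work is finished: */
    qemu_mutex_lock(&job_mutex);
    job->done = true;
    qemu_cond_broadcast(&job_cond);
    qemu_mutex_unlock(&job_mutex);
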
> >
> > I don't think it's going to be used concurrently. It would be
> > difficult to reason about even for the framework if it performed
> > memory unmapping/mapping/reading operations concurrently.
> >
> >
> > I've just performed a very quick test by wrapping the job
> > submission/wait in the 2 mapMemory callbacks and the 1 readMemory
> > callback with atomic counters and logging whenever a counter went
> > above 1.
> >
> > * Overall, concurrent callbacks across all types were common (many
> >   per second when the VM is busy). It's not exactly a "thundering
> >   herd" (I never saw >2) but it's probably not a bad idea to use a
> >   separate condition variable for each job type (task map, surface
> >   map, memory read).
> > * While I did not observe any concurrent memory mapping operations
> >   *within* a type of memory map (2 task mappings or 2 surface
> >   mappings), I did see very occasional concurrent memory *read*
> >   callbacks. These would, as far as I can tell, not be safe with
> >   QemuEvents, unless we placed the event inside the job struct and
> >   init/destroyed it on every callback (which seems like excessive
> >   overhead).
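
A rough sketch of the counter instrumentation described above, using
QEMU's qatomic helpers (the counter name and log message are invented
for the example):

    static int read_memory_in_flight;   /* one counter per callback type */

    /* at the start of the readMemory callback: */
    if (qatomic_fetch_inc(&read_memory_in_flight) > 0) {
        fprintf(stderr, "concurrent readMemory callbacks observed\n");
    }
    /* ... submit the job and wait for its completion ... */
    qatomic_dec(&read_memory_in_flight);
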
>
> I think we can tolerate that overhead. init/destroy essentially sets
> the fields in the data structure, and I estimate its total size is
> about 100 bytes. It is probably better than waking an irrelevant thread
> up. I also hope that keeps the code simple; it's not worthwhile adding
> code to optimize this.
>
>
> At least pthread_cond_{init,destroy} and pthread_mutex_{init,destroy}
> don't make any syscalls, so yeah, it's probably an acceptable overhead.
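
For reference, the on-demand per-job event pattern being weighed here
would be roughly the following (job struct and field names are
illustrative, not taken from the patch); whether the final destroy step
is safe is exactly what the rest of the thread is about:

    typedef struct MapJob {             /* illustrative */
        /* ... job parameters and results ... */
        QemuEvent done_event;
    } MapJob;

    /* callback (waiting) thread: */
    MapJob job = { /* ... */ };
    qemu_event_init(&job.done_event, false);
    /* ... hand the job to the thread doing the actual work ... */
    qemu_event_wait(&job.done_event);
    qemu_event_destroy(&job.done_event);

    /* worker (triggering) thread, given a MapJob *job: */
    qemu_event_set(&job->done_event);
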
>
>
> I've just experimented with QemuEvents created on-demand and ran into
> some weird deadlocks, which then made me sit down and think about it
> some more. I've come to the conclusion that creating (and crucially,
> destroying) QemuEvents on demand in this way is not safe.
>
> Specifically, you must not call qemu_event_destroy() - which
> transitively destroys the mutex and condition variable - unless you can
> guarantee that the qemu_event_set() call on that event object has
> completed.
>
> In qemu_event_set(), the event object's value is atomically set to
> EV_SET. If the previous value was EV_BUSY, qemu_futex_wake() is called.
> All of this is outside any mutex, however, so apart from memory
> coherence (there are barriers) this can race with the waiting thread.
> qemu_event_wait() reads the event's value. If it is EV_FREE, it's
> atomically set to EV_BUSY. Then the mutex is locked, the value is
> checked again, and if it's still EV_BUSY, it waits for the condition
> variable; otherwise the mutex is immediately unlocked again. If the
> trigger thread's qemu_event_set() flip to EV_SET occurs between the
> waiting thread's two atomic reads of the value, the waiting thread will
> never wait for the condition variable, but the trigger thread WILL try
> to acquire the mutex and signal the condition variable in
> qemu_futex_wake(), by which time the waiting thread may have advanced
> outside of qemu_event_wait().
Sorry if I'm making a mistake again, but the waiting thread won't set the
value to EV_BUSY unless it is EV_FREE on the second read, so the trigger
thread will not call qemu_futex_wake() if it manages to set it to EV_SET
before the second read, will it?
This sequence of events will cause the problem:

WAITER (in qemu_event_wait):
    value = qatomic_load_acquire(&ev->value);
    -> EV_FREE

TRIGGER (in qemu_event_set):
    qatomic_read(&ev->value) != EV_SET
    -> EV_FREE (condition is false)

WAITER:
    qatomic_cmpxchg(&ev->value, EV_FREE, EV_BUSY) == EV_SET
    -> cmpxchg returns EV_FREE, condition false.
       ev->value is now EV_BUSY.

TRIGGER:
    int old = qatomic_xchg(&ev->value, EV_SET);
    smp_mb__after_rmw();
    if (old == EV_BUSY) {
    -> old == EV_BUSY, condition true.
       ev->value is now EV_SET.

WAITER (in qemu_futex_wait(ev, EV_BUSY)):
    pthread_mutex_lock(&ev->lock);
    if (ev->value == val) {
    -> false, because value is EV_SET

WAITER:
    pthread_mutex_unlock(&ev->lock);
    …
    qemu_event_destroy(&job->done_event);

TRIGGER (in qemu_futex_wake(ev, INT_MAX)):
    pthread_mutex_lock(&ev->lock);
    -> hangs, because the mutex has been destroyed
Thanks for clarification. This is very insightful.
>
> This is all fine usually, BUT if you destroy the QemuEvent immediately
> after the qemu_event_wait() call, qemu_futex_wake() may try to lock a
> mutex that has been destroyed, or signal a condition variable which has
> been destroyed. I don't see a reasonable way of making this safe other
> than using long-lived mutexes and condition variables. And anyway, we
> have much, MUCH bigger contention/performance issues coming from almost
> everything being covered by the BQL. (If waking these callbacks can
> even be considered an issue: I haven't seen it show up in profiling,
> whereas BQL contention very much does.)
>
> I'll submit v5 of this patch set with separate condition variables for
> each job type. This should make the occurrence of waking the wrong
> thread quite rare, while reasoning about correctness is pretty
> straightforward. I think that's good enough.
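
Sketched out, the per-job-type arrangement described here might look
something like the following (names invented; the actual v5 patch may
differ):

    typedef struct JobSync {            /* illustrative */
        QemuMutex lock;
        QemuCond  task_map_cond;        /* one condvar per job type */
        QemuCond  surface_map_cond;
        QemuCond  mem_read_cond;
    } JobSync;

    /* each waiter loops on its own job's flag, waiting on the condition
     * variable for its job type, e.g. for a memory read: */
    qemu_mutex_lock(&sync->lock);
    while (!job->done) {
        qemu_cond_wait(&sync->mem_read_cond, &sync->lock);
    }
    qemu_mutex_unlock(&sync->lock);
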
What about using QemuSemaphore then? It does not seem to have the same
problem as QemuEvent.
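
For comparison, the per-job QemuSemaphore variant suggested here would
look roughly like this (again with invented names; this only sketches
the intended usage):

    typedef struct MapJob {             /* illustrative */
        /* ... */
        QemuSemaphore done_sem;
    } MapJob;

    /* callback (waiting) thread: */
    MapJob job = { /* ... */ };
    qemu_sem_init(&job.done_sem, 0);
    /* ... hand the job to the worker ... */
    qemu_sem_wait(&job.done_sem);
    qemu_sem_destroy(&job.done_sem);

    /* worker thread, given a MapJob *job, once the job is finished: */
    qemu_sem_post(&job->done_sem);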