Re: [Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions

Emilio G. Cota Mon, 24 Aug 2015 19:30:32 -0700

On Sun, Aug 23, 2015 at 18:04:46 -0700, Paolo Bonzini wrote:
> On 23/08/2015 17:23, Emilio G. Cota wrote:
> > On some parallel workloads this gives up to a 15% speed improvement.
> > 
> > Signed-off-by: Emilio G. Cota <c...@braap.org>
> > ---
> >  include/qemu/thread-posix.h | 47 ++++++++++++++++++++++++++++++++++++++++++
> >  include/qemu/thread.h       |  6 ------
> >  util/qemu-thread-posix.c    | 50 
> > +++++----------------------------------------
> >  3 files changed, 52 insertions(+), 51 deletions(-)
(snip)
> Applied, but in the end the spinlock will probably simply use a simple
> test-and-test-and-set lock, or an MCS lock.  There is no need to use
> pthreads for this.


Agreed.

In fact in my tests I sometimes use concurrencykit [http://concurrencykit.org/] 
to
test lock alternatives (would love to be able to just add ck as a submodule
of qemu, but they do not support as many architectures as qemu does).

Note that fair locks (such as MCS) for user-space programs are not
necessarily a good idea when preemption is considered--and for usermode
we'd be forced (if we allowed MCS's to nest) to use per-lock stack variables
given that the number of threads is unbounded, which is pretty ugly.

If contention is a problem, a simple, fast spinlock combined with an exponential
backoff is already pretty good. Fairness is not a requirement (the cache
substrate of a NUMA machine isn't necessarily fair, is it?); scalability is.
If the algorithm in the guest requires fairness, the guest must use a fair lock
(e.g. MCS), and that works as intended when run natively or under qemu.

I just tested a fetch-and-swap+exp.backoff spinlock with usermode on a
program that spawns N threads and each thread performs an 2**M atomic increments
on the same variable. That is, a degenerate worst-case kind of contention.
N varies from 1 to 64, and M=15 on all runs, 5 runs per experiment:

  http://imgur.com/XpYctyT
  With backoff, the per-access latency grows roughly linearly with the number of
  cores, i.e. this is scalable. The other two are clearly superlinear.

The fastest spinlock as per ck's documentation (for uncontended cases) is
the fetch-and-swap lock. I just re-ran the usermode experiments from yesterday
with fas and fas+exp.backoff:

  http://imgur.com/OK2WZg8
  There really isn't much difference among the three candidates.
  
In light of these results I see very little against going for a solution
with exponential backoff.

Thanks,

                Emilio

Re: [Qemu-devel] [RFC 05/38] thread-posix: inline qemu_spin functions

Reply via email to