On 24/09/2015 10:32, Alvise Rigo wrote:
> The implementation heavily uses the software TLB together with a new
> bitmap that has been added to the ram_list structure which flags, on a
> per-CPU basis, all the memory pages that are in the middle of a LoadLink
> (LL), StoreConditional (SC) operation. Since all these pages can be
> accessed directly through the fast-path and alter a vCPU's linked value,
> the new bitmap has been coupled with a new TLB flag for the TLB virtual
> address which forces the slow-path execution for all the accesses to a
> page containing a linked address.
Alvise, Emilio,

I have a doubt about your patches for ll/sc emulation that I hope you
can clarify.

From 10000ft, both approaches rely on checking a flag during stores.
This flag is split between the TLB and the CPUState for Alvise's
patches (in order to exploit the existing fast-path checks), and lives
entirely in the radix tree for Emilio's.  However, the idea is the
same.

Now, the patches are okay for serial emulation, but I am not sure it's
possible to do lock-free ll/sc emulation, because there is a race.  If
we check the flag before the store, the race is as follows:

   CPU0                         CPU1
   -------------------------------------------------------
   check flag
                                load locked: set flag
                                load value (normal load on CPU)
   store
                                store conditional (normal store on CPU)

where the sc doesn't fail.  For completeness, if we check it afterwards
(which would be possible with Emilio's approach, though not with the
TLB-based one):

   CPU0                         CPU1
   -------------------------------------------------------
                                load locked: set bit
                                load value (normal load on CPU)
   store
                                store conditional (normal store on CPU)
   check flag

and again the sc doesn't fail.

Most solutions I can think of are impractical:

- hardware ll/sc in CPU1.  x86 doesn't have it.

- hardware transactional memory in CPU0, checking the bit after the
  store and aborting the transaction (I think).  HTM just doesn't
  exist.

- some kind of store-in-progress (SIP) flag that ll can test and use
  to force failure of the corresponding sc.  For example, each CPU
  could store a (last_store_address, last_store_value) tuple.  If the
  value that LL loads disagrees with any CPU's tuple, the LL would
  direct the SC to fail.  A store would look like:

      store value to last_store_value
      smp_wmb()
      store address to last_store_address
      smp_mb()
      load TLB or radix tree

  The memory barrier orders the store to the SIP flag before the load
  from the TLB, and is probably too expensive. :(

- some array of atomic global generation counts, incremented
  unconditionally on every store and checked between ll and sc.
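For concreteness, here is how the SIP store sequence above could look
in C11 atomics (a rough sketch only: the SIPSlot type and all names
are invented, and smp_wmb()/smp_mb() are mapped onto C11 fences):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-CPU SIP tuple; type and field names are
 * illustrative, not from the patches. */
typedef struct {
    _Atomic uintptr_t last_store_address;
    _Atomic uint64_t  last_store_value;
} SIPSlot;

void sip_publish_and_store(SIPSlot *self, _Atomic uint64_t *mem,
                           uintptr_t addr, uint64_t val)
{
    /* store value to last_store_value */
    atomic_store_explicit(&self->last_store_value, val,
                          memory_order_relaxed);
    /* smp_wmb() */
    atomic_thread_fence(memory_order_release);
    /* store address to last_store_address */
    atomic_store_explicit(&self->last_store_address, addr,
                          memory_order_relaxed);
    /* smp_mb(): the full fence that orders publishing the SIP tuple
     * before the TLB/radix-tree lookup -- this store/load ordering is
     * the expensive part on every store. */
    atomic_thread_fence(memory_order_seq_cst);
    /* ... consult the TLB or radix tree here, then do the store ... */
    atomic_store_explicit(mem, val, memory_order_relaxed);
}
```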
  Cacheline bounce fiesta, hence extremely slow. :(

Tell me I'm wrong. :)

If I'm right, we can still keep the opcodes and implement them with a
simple cmpxchg.  It would provide a nice generic tool to implement
atomic operations, and it will work correctly if the target has ll/sc.
However, ll/sc-on-cmpxchg (e.g., ARM-on-x86) would be susceptible to
the ABA problem.

Paolo
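P.S. To make the ABA concern concrete, here is a rough C11 sketch of
ll/sc emulated with cmpxchg (all names invented; this is not the
patches' code).  An A->B->A sequence by another CPU between the LL and
the SC leaves the cmpxchg succeeding, whereas hardware ll/sc would
fail the SC:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU state: the value observed by the last LL. */
typedef struct {
    uint32_t linked_val;
} CPULLState;

/* Emulated load-linked: remember the value we saw. */
uint32_t emu_ll(CPULLState *cpu, _Atomic uint32_t *addr)
{
    cpu->linked_val = atomic_load(addr);
    return cpu->linked_val;
}

/* Emulated store-conditional via cmpxchg: succeeds iff the location
 * still holds the value LL observed.  This is where ABA bites: if
 * another CPU stored B and then A again in between, the compare still
 * matches and the SC wrongly succeeds. */
bool emu_sc(CPULLState *cpu, _Atomic uint32_t *addr, uint32_t newval)
{
    uint32_t expected = cpu->linked_val;
    return atomic_compare_exchange_strong(addr, &expected, newval);
}
```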