On 21/06/16 17:52, Pranith Kumar wrote:
> Hi Sergey,
>
> On Mon, Jun 20, 2016 at 5:21 PM, Sergey Fedorov <serge.f...@gmail.com> wrote:
>> On 18/06/16 07:03, Pranith Kumar wrote:
>>> diff --git a/tcg/tcg.h b/tcg/tcg.h
>>> index db6a062..36feca9 100644
>>> --- a/tcg/tcg.h
>>> +++ b/tcg/tcg.h
>>> @@ -408,6 +408,20 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)
>>>  #define TCG_CALL_DUMMY_TCGV MAKE_TCGV_I32(-1)
>>>  #define TCG_CALL_DUMMY_ARG  ((TCGArg)(-1))
>>>
>>> +typedef enum {
>>> +    TCG_MO_LD_LD = 1,
>>> +    TCG_MO_ST_LD = 2,
>>> +    TCG_MO_LD_ST = 4,
>>> +    TCG_MO_ST_ST = 8,
>>> +    TCG_MO_ALL   = 0xF, // OR of all above
>> So TCG_MO_ALL specifies a so-called "full" memory barrier?
> This enum just specifies what loads and stores need to be ordered.
>
> TCG_MO_ALL specifies that we need to order both previous loads and
> stores with later loads and stores. To get a full memory barrier you
> will need to pair it with BAR_SC:
>
>     TCG_MO_ALL | TCG_BAR_SC
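(In front-end terms, then, a guest full barrier would be emitted roughly
as below; I'm assuming here that the tcg_gen_mb() helper introduced by
this series takes this combined mask as its argument:

    /* sketch: full barrier = all four ordering bits plus TCG_BAR_SC */
    tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);

just so we're sure we are talking about the same thing.)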
If we define the semantics of the flags above as they are defined for the
corresponding Sparc MEMBAR instruction mmask bits (which I think really
makes sense), then a combination of all of these flags makes a full memory
barrier which guarantees transitivity, i.e. sequential consistency. Let me
just quote the [Sparc v9 manual] regarding the MEMBAR instruction mmask
encoding:

  #StoreStore   The effects of all stores appearing prior to the MEMBAR
                instruction must be visible *to all processors* before
                the effect of any stores following the MEMBAR.

  #LoadStore    All loads appearing prior to the MEMBAR instruction must
                have been performed before the effect of any stores
                following the MEMBAR is visible *to any other processor*.

  #StoreLoad    The effects of all stores appearing prior to the MEMBAR
                instruction must be visible *to all processors* before
                loads following the MEMBAR may be performed.

  #LoadLoad     All loads appearing prior to the MEMBAR instruction must
                have been performed before any loads following the MEMBAR
                may be performed.

I'm emphasising "to all processors" and "to any other processor" here
because these expressions suggest transitivity, if I understand it
correctly.

[Sparc v9 manual] http://sparc.org/wp-content/uploads/2014/01/SPARCV9.pdf.gz

>
>>> +} TCGOrder;
>>> +
>>> +typedef enum {
>>> +    TCG_BAR_ACQ = 32,
>>> +    TCG_BAR_REL = 64,
>> I'm convinced that the only practical way to represent a standalone
>> acquire memory barrier is to order all previous loads with all
>> subsequent loads and stores. Similarly, a standalone release memory
>> barrier would order all previous loads and stores with all subsequent
>> stores. [1]
> Yes, here acquire would be:
>
>     (TCG_MO_LD_ST | TCG_MO_LD_LD) | TCG_BAR_ACQ
>
> and release would be:
>
>     (TCG_MO_ST_ST | TCG_MO_LD_ST) | TCG_BAR_REL

Could you please explain the difference between:

    (TCG_MO_LD_ST | TCG_MO_LD_LD) | TCG_BAR_ACQ

and

    (TCG_MO_LD_ST | TCG_MO_LD_LD)

and

    TCG_BAR_ACQ

or, likewise, between:

    (TCG_MO_ST_ST | TCG_MO_LD_ST) | TCG_BAR_REL

and

    (TCG_MO_ST_ST | TCG_MO_LD_ST)

and

    TCG_BAR_REL ?

(Please first consider the comments below and above.)

>
>> On the other hand, acquire or release semantics associated with a memory
>> operation itself can be directly mapped onto e.g. AArch64's Load-Acquire
>> (LDAR) and Store-Release (STLR) instructions. A standalone barrier
>> adjacent to a memory operation shouldn't be mapped this way because it
>> should provide stricter guarantees than e.g. the AArch64 instructions
>> mentioned above.
> You are right. That is why the load-acquire operation generates the
> stronger barrier:
>
>     TCG_MO_ALL | TCG_BAR_ACQ
>
> and not the acquire barrier above. Similarly for store-release.

I meant that e.g. a Sparc TSO load could be efficiently mapped to an ARMv8
or Itanium load-acquire, and a Sparc TSO store to an ARMv8 or Itanium
store-release. But a Sparc TSO load + membar #LoadLoad | #LoadStore cannot
be mapped this way, because the ARMv8 or Itanium load-acquire semantics
are weaker than the semantics of the corresponding standalone barrier.

>
>> Therefore, I advocate for a clear distinction between standalone memory
>> barriers and the implicit memory ordering semantics associated with
>> memory operations themselves.
> Any suggestions on how to make the distinction clearer? I will add a
> detailed comment like the above but please let me know if you have
> anything in mind.

My suggestion is to separate standalone memory barriers from the implicit
ordering requirements of loads/stores.
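Something like the following sketch; the TCGOrder part is what this
series already has, while the TCG_MEMOP_* flags are hypothetical names
I'm making up purely for illustration:

    /* Standalone barriers: only the Sparc v9 MEMBAR-style mmask bits. */
    typedef enum {
        TCG_MO_LD_LD = 1,
        TCG_MO_ST_LD = 2,
        TCG_MO_LD_ST = 4,
        TCG_MO_ST_ST = 8,
        TCG_MO_ALL   = 0xF,
    } TCGOrder;

    void tcg_gen_mb(TCGOrder mask);   /* standalone barrier op */

    /* Implicit ordering: hypothetical flags attached to the guest
     * load/store operation itself, with C11 memory_order semantics. */
    #define TCG_MEMOP_ACQUIRE  (1 << 8)   /* loads:  load-acquire  */
    #define TCG_MEMOP_RELEASE  (1 << 9)   /* stores: store-release */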
Then a standalone guest memory barrier instruction will translate into a
standalone TCG memory barrier operation, which will translate into a
standalone host memory barrier instruction (possibly a no-op). We can take
the semantics of the Sparc v9 MEMBAR instruction mmask bits as the
semantics of our standalone TCG memory barrier operation, as described
above. There seems to be no special "release" or "acquire" memory barrier
as a standalone instruction in our guest/host architectures. I'm convinced
that the best way to represent a [C11] acquire fence is a combined memory
barrier for load-load and load-store; the best way to represent a [C11]
release fence is a combined barrier for store-store and load-store; and
the best way to represent a [C11] sequential consistency fence is a
combined barrier with all four flags.

Orthogonally, we can attribute each guest memory load/store TCG operation
with "acquire", "release" and "sequentially consistent" flags with the
semantics as defined for the [C11] 'memory_order' enum constants. I think
we can skip "consume" semantics since it only makes sense for the ancient
Alpha architecture, which we don't support as a QEMU host; let's just
require that each load is always a consume-load.

Then e.g. an x86 or Sparc TSO regular load instruction will translate into
a TCG guest memory load operation with the acquire flag set, which will
translate into e.g. an Itanium or ARMv8 load-acquire instruction; and e.g.
an x86 or Sparc TSO regular store instruction will translate into a TCG
guest memory store operation with the release flag set, which will
translate into e.g. an Itanium or ARMv8 store-release instruction. That's
supposed to be the most efficient way to support a strongly-ordered guest
on a weakly-ordered host. Even if we disable MTTCG for such guest-host
combinations, we may still like to support user-mode emulation for them,
which can happen to be multi-threaded.

The key difference between

  (1) a regular load followed by a combined memory barrier for load-load
      and load-store, versus a load-acquire; and

  (2) a combined barrier for store-store and load-store followed by a
      regular store, versus a store-release

is that a load-acquire/store-release is always associated with the
particular address loaded/stored, whereas a standalone barrier is
supposed to order *all* corresponding loads/stores across the barrier.
Thus a standalone barrier is stronger than a load-acquire/store-release
and cannot be an efficient intermediate representation for this
semantics. See also this email:

http://thread.gmane.org/gmane.comp.emulators.qemu/420223/focus=421309

I would suggest dealing only with explicit standalone memory barrier
instruction translation in this series. After that, we could start on the
second part, implicit memory ordering requirements, which is a more
complex topic anyway.

[C11] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf

>
>> [1] http://preshing.com/20130922/acquire-and-release-fences/
>>
>>> +    TCG_BAR_SC = 128,
>> How's that different from TCG_MO_ALL?
> TCG_BAR_* tells us what ordering is enforced. TCG_MO_* tells us on what
> operations the ordering is to be enforced.

It sounds like unnecessary duplication of the same information; see the
explanation above.

Kind regards,
Sergey
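P.S. For concreteness, here is the same distinction expressed in plain
C11 (just standard <stdatomic.h>, nothing QEMU-specific):

    #include <stdatomic.h>

    atomic_int flag;

    int reader(void)
    {
        /* (1) Standalone acquire fence: orders *all* preceding loads
         *     against *all* subsequent loads and stores, whatever
         *     locations they touch. */
        int a = atomic_load_explicit(&flag, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);

        /* (2) Load-acquire: the ordering guarantee is tied to this one
         *     access of 'flag'. */
        int b = atomic_load_explicit(&flag, memory_order_acquire);

        return a + b;
    }

Form (1) is strictly stronger than form (2), which is exactly why a
standalone barrier cannot be an efficient intermediate representation for
a load-acquire.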