On 21/06/16 17:52, Pranith Kumar wrote:
> Hi Sergey,
>
> On Mon, Jun 20, 2016 at 5:21 PM, Sergey Fedorov <serge.f...@gmail.com> wrote:
>> On 18/06/16 07:03, Pranith Kumar wrote:
>>> diff --git a/tcg/tcg.h b/tcg/tcg.h
>>> index db6a062..36feca9 100644
>>> --- a/tcg/tcg.h
>>> +++ b/tcg/tcg.h
>>> @@ -408,6 +408,20 @@ static inline intptr_t QEMU_ARTIFICIAL GET_TCGV_PTR(TCGv_ptr t)
>>>  #define TCG_CALL_DUMMY_TCGV MAKE_TCGV_I32(-1)
>>>  #define TCG_CALL_DUMMY_ARG  ((TCGArg)(-1))
>>>
>>> +typedef enum {
>>> +    TCG_MO_LD_LD = 1,
>>> +    TCG_MO_ST_LD = 2,
>>> +    TCG_MO_LD_ST = 4,
>>> +    TCG_MO_ST_ST = 8,
>>> +    TCG_MO_ALL   = 0xF, // OR of all above
>> So TCG_MO_ALL specifies a so-called "full" memory barrier?
> This enum just specifies what loads and stores need to be ordered.
>
> TCG_MO_ALL specifies that we need to order both previous loads and
> stores with later loads and stores. To get a full memory barrier you
> will need to pair it with BAR_SC:
>
>     TCG_MO_ALL | TCG_BAR_SC
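(In front-end terms, then, a guest full barrier would be emitted roughly
as below; I'm assuming here that the tcg_gen_mb() helper introduced by
this series takes this combined mask as its argument:

    /* sketch: full barrier = all four ordering bits plus TCG_BAR_SC */
    tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);

just so we're sure we are talking about the same thing.)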
If we define the semantics of the flags above as they are defined for the
corresponding Sparc MEMBAR instruction mmask bits (which I think really
makes sense), then a combination of all of these flags makes a full memory
barrier which guarantees transitivity, i.e. sequential consistency. Let me
just quote the [Sparc v9 manual] regarding the MEMBAR instruction mmask
encoding:

  #StoreStore   The effects of all stores appearing prior to the MEMBAR
                instruction must be visible *to all processors* before
                the effect of any stores following the MEMBAR.

  #LoadStore    All loads appearing prior to the MEMBAR instruction must
                have been performed before the effect of any stores
                following the MEMBAR is visible *to any other processor*.

  #StoreLoad    The effects of all stores appearing prior to the MEMBAR
                instruction must be visible *to all processors* before
                loads following the MEMBAR may be performed.

  #LoadLoad     All loads appearing prior to the MEMBAR instruction must
                have been performed before any loads following the MEMBAR
                may be performed.

I'm emphasising "to all processors" and "to any other processor" here
because these expressions suggest transitivity, if I understand it
correctly.

[Sparc v9 manual] http://sparc.org/wp-content/uploads/2014/01/SPARCV9.pdf.gz

>
>>> +} TCGOrder;
>>> +
>>> +typedef enum {
>>> +    TCG_BAR_ACQ = 32,
>>> +    TCG_BAR_REL = 64,
>> I'm convinced that the only practical way to represent a standalone
>> acquire memory barrier is to order all previous loads with all
>> subsequent loads and stores. Similarly, a standalone release memory
>> barrier would order all previous loads and stores with all subsequent
>> stores. [1]
> Yes, here acquire would be:
>
>     (TCG_MO_LD_ST | TCG_MO_LD_LD) | TCG_BAR_ACQ
>
> and release would be:
>
>     (TCG_MO_ST_ST | TCG_MO_LD_ST) | TCG_BAR_REL

Could you please explain the difference between:

    (TCG_MO_LD_ST | TCG_MO_LD_LD) | TCG_BAR_ACQ

and

    (TCG_MO_LD_ST | TCG_MO_LD_LD)

and

    TCG_BAR_ACQ

or, likewise, between:

    (TCG_MO_ST_ST | TCG_MO_LD_ST) | TCG_BAR_REL

and

    (TCG_MO_ST_ST | TCG_MO_LD_ST)

and

    TCG_BAR_REL ?

(Please first consider the comments below and above.)

>
>> On the other hand, acquire or release semantics associated with a memory
>> operation itself can be directly mapped onto e.g. AArch64's Load-Acquire
>> (LDAR) and Store-Release (STLR) instructions. A standalone barrier
>> adjacent to a memory operation shouldn't be mapped this way because it
>> should provide stricter guarantees than e.g. the AArch64 instructions
>> mentioned above.
> You are right. That is why the load-acquire operation generates the
> stronger barrier:
>
>     TCG_MO_ALL | TCG_BAR_ACQ
>
> and not the acquire barrier above. Similarly for store-release.

I meant that e.g. a Sparc TSO load could be efficiently mapped to an ARMv8
or Itanium load-acquire, and a Sparc TSO store to an ARMv8 or Itanium
store-release. But a Sparc TSO load + membar #LoadLoad | #LoadStore cannot
be mapped this way, because the ARMv8 or Itanium load-acquire semantics
are weaker than the semantics of the corresponding standalone barrier.

>
>> Therefore, I advocate for a clear distinction between standalone memory
>> barriers and the implicit memory ordering semantics associated with
>> memory operations themselves.
> Any suggestions on how to make the distinction clearer? I will add a
> detailed comment like the above but please let me know if you have
> anything in mind.

My suggestion is to separate standalone memory barriers from the implicit
ordering requirements of loads/stores.
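Something like the following sketch; the TCGOrder part is what this
series already has, while the TCG_MEMOP_* flags are hypothetical names
I'm making up purely for illustration:

    /* Standalone barriers: only the Sparc v9 MEMBAR-style mmask bits. */
    typedef enum {
        TCG_MO_LD_LD = 1,
        TCG_MO_ST_LD = 2,
        TCG_MO_LD_ST = 4,
        TCG_MO_ST_ST = 8,
        TCG_MO_ALL   = 0xF,
    } TCGOrder;

    void tcg_gen_mb(TCGOrder mask);   /* standalone barrier op */

    /* Implicit ordering: hypothetical flags attached to the guest
     * load/store operation itself, with C11 memory_order semantics. */
    #define TCG_MEMOP_ACQUIRE  (1 << 8)   /* loads:  load-acquire  */
    #define TCG_MEMOP_RELEASE  (1 << 9)   /* stores: store-release */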
Then a standalone guest memory barrier instruction will translate into a
standalone TCG memory barrier operation, which will translate into a
standalone host memory barrier instruction (possibly a no-op). We can take
the semantics of the Sparc v9 MEMBAR instruction mmask bits as the
semantics of our standalone TCG memory barrier operation, as described
above. There seems to be no special "release" or "acquire" memory barrier
as a standalone instruction in our guest/host architectures. I'm convinced
that the best way to represent a [C11] acquire fence is a combined memory
barrier for load-load and load-store; the best way to represent a [C11]
release fence is a combined barrier for store-store and load-store; and
the best way to represent a [C11] sequential consistency fence is a
combined barrier with all four flags.

Orthogonally, we can attribute each guest memory load/store TCG operation
with "acquire", "release" and "sequentially consistent" flags with the
semantics as defined for the [C11] 'memory_order' enum constants. I think
we can skip "consume" semantics since it only makes sense for the ancient
Alpha architecture, which we don't support as a QEMU host; let's just
require that each load is always a consume-load.

Then e.g. an x86 or Sparc TSO regular load instruction will translate into
a TCG guest memory load operation with the acquire flag set, which will
translate into e.g. an Itanium or ARMv8 load-acquire instruction; and e.g.
an x86 or Sparc TSO regular store instruction will translate into a TCG
guest memory store operation with the release flag set, which will
translate into e.g. an Itanium or ARMv8 store-release instruction. That's
supposed to be the most efficient way to support a strongly-ordered guest
on a weakly-ordered host. Even if we disable MTTCG for such guest-host
combinations, we may still like to support user-mode emulation for them,
which can happen to be multi-threaded.

The key difference between

  (1) a regular load followed by a combined memory barrier for load-load
      and load-store, versus a load-acquire; and

  (2) a combined barrier for store-store and load-store followed by a
      regular store, versus a store-release

is that a load-acquire/store-release is always associated with the
particular address loaded/stored, whereas a standalone barrier is
supposed to order *all* corresponding loads/stores across the barrier.
Thus a standalone barrier is stronger than a load-acquire/store-release
and cannot be an efficient intermediate representation for this
semantics. See also this email:

http://thread.gmane.org/gmane.comp.emulators.qemu/420223/focus=421309

I would suggest dealing only with explicit standalone memory barrier
instruction translation in this series. After that, we could start on the
second part, implicit memory ordering requirements, which is a more
complex topic anyway.

[C11] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf

>
>> [1] http://preshing.com/20130922/acquire-and-release-fences/
>>
>>> +    TCG_BAR_SC = 128,
>> How's that different from TCG_MO_ALL?
> TCG_BAR_* tells us what ordering is enforced. TCG_MO_* tells us on what
> operations the ordering is to be enforced.

It sounds like unnecessary duplication of the same information; see the
explanation above.

Kind regards,
Sergey
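P.S. For concreteness, here is the same distinction expressed in plain
C11 (just standard <stdatomic.h>, nothing QEMU-specific):

    #include <stdatomic.h>

    atomic_int flag;

    int reader(void)
    {
        /* (1) Standalone acquire fence: orders *all* preceding loads
         *     against *all* subsequent loads and stores, whatever
         *     locations they touch. */
        int a = atomic_load_explicit(&flag, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);

        /* (2) Load-acquire: the ordering guarantee is tied to this one
         *     access of 'flag'. */
        int b = atomic_load_explicit(&flag, memory_order_acquire);

        return a + b;
    }

Form (1) is strictly stronger than form (2), which is exactly why a
standalone barrier cannot be an efficient intermediate representation for
a load-acquire.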