riscv: introduce cmpxchg.h

Jan Beulich Thu, 07 Mar 2024 02:46:52 -0800

On 07.03.2024 11:35, Oleksii wrote:
> On Wed, 2024-03-06 at 15:56 +0100, Jan Beulich wrote:
>> On 26.02.2024 18:38, Oleksii Kurochko wrote:
>>> The header was taken from Linux kernl 6.4.0-rc1.
>>>
>>> Addionally, were updated:
>>> * add emulation of {cmp}xchg for 1/2 byte types using 32-bit atomic
>>>   access.
>>> * replace tabs with spaces
>>> * replace __* variale with *__
>>> * introduce generic version of xchg_* and cmpxchg_*.
>>>
>>> Implementation of 4- and 8-byte cases were left as it is done in
>>> Linux kernel as according to the RISC-V spec:
>>> ```
>>> Table A.5 ( only part of the table was copied here )
>>>
>>> Linux Construct       RVWMO Mapping
>>> atomic <op> relaxed    amo<op>.{w|d}
>>> atomic <op> acquire    amo<op>.{w|d}.aq
>>> atomic <op> release    amo<op>.{w|d}.rl
>>> atomic <op>            amo<op>.{w|d}.aqrl
>>>
>>> Linux Construct       RVWMO LR/SC Mapping
>>> atomic <op> relaxed    loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
>>> atomic <op> acquire    loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
>>> atomic <op> release    loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez
>>> loop OR
>>>                        fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ;
>>> bnez loop
>>> atomic <op>            loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez
>>> loop
>>>
>>> The Linux mappings for release operations may seem stronger than
>>> necessary,
>>> but these mappings are needed to cover some cases in which Linux
>>> requires
>>> stronger orderings than the more intuitive mappings would provide.
>>> In particular, as of the time this text is being written, Linux is
>>> actively
>>> debating whether to require load-load, load-store, and store-store
>>> orderings
>>> between accesses in one critical section and accesses in a
>>> subsequent critical
>>> section in the same hart and protected by the same synchronization
>>> object.
>>> Not all combinations of FENCE RW,W/FENCE R,RW mappings with aq/rl
>>> mappings
>>> combine to provide such orderings.
>>> There are a few ways around this problem, including:
>>> 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This
>>> suffices
>>>    but is undesirable, as it defeats the purpose of the aq/rl
>>> modifiers.
>>> 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does
>>> not
>>>    currently work due to the lack of load and store opcodes with aq
>>> and rl
>>>    modifiers.
>>
>> As before I don't understand this point. Can you give an example of
>> what
>> sort of opcode / instruction is missing?
> If I understand the spec correctly then l{b|h|w|d} and s{b|h|w|d}
> instructions don't have aq or rl annotation.


How would load insns other that LR and store insns other than SC come
into play here?

>>> 3. Strengthen the mappings of release operations such that they
>>> would
>>>    enforce sufficient orderings in the presence of either type of
>>> acquire mapping.
>>>    This is the currently-recommended solution, and the one shown in
>>> Table A.5.
>>> ```
>>>
>>> But in Linux kenrel atomics were strengthen with fences:
>>> ```
>>> Atomics present the same issue with locking: release and acquire
>>> variants need to be strengthened to meet the constraints defined
>>> by the Linux-kernel memory consistency model [1].
>>>
>>> Atomics present a further issue: implementations of atomics such
>>> as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs,
>>> which do not give full-ordering with .aqrl; for example, current
>>> implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test
>>> below to end up with the state indicated in the "exists" clause.
>>>
>>> In order to "synchronize" LKMM and RISC-V's implementation, this
>>> commit strengthens the implementations of the atomics operations
>>> by replacing .rl and .aq with the use of ("lightweigth") fences,
>>> and by replacing .aqrl LR/SC pairs in sequences such as:
>>>
>>> 0:      lr.w.aqrl  %0, %addr
>>>         bne        %0, %old, 1f
>>>         ...
>>>         sc.w.aqrl  %1, %new, %addr
>>>         bnez       %1, 0b
>>> 1:
>>>
>>> with sequences of the form:
>>>
>>> 0:      lr.w       %0, %addr
>>>         bne        %0, %old, 1f
>>>               ...
>>>         sc.w.rl    %1, %new, %addr   /* SC-release   */
>>>         bnez       %1, 0b
>>>         fence      rw, rw            /* "full" fence */
>>> 1:
>>>
>>> following Daniel's suggestion.
>>>
>>> These modifications were validated with simulation of the RISC-V
>>> memory consistency model.
>>>
>>> C lr-sc-aqrl-pair-vs-full-barrier
>>>
>>> {}
>>>
>>> P0(int *x, int *y, atomic_t *u)
>>> {
>>>         int r0;
>>>         int r1;
>>>
>>>         WRITE_ONCE(*x, 1);
>>>         r0 = atomic_cmpxchg(u, 0, 1);
>>>         r1 = READ_ONCE(*y);
>>> }
>>>
>>> P1(int *x, int *y, atomic_t *v)
>>> {
>>>         int r0;
>>>         int r1;
>>>
>>>         WRITE_ONCE(*y, 1);
>>>         r0 = atomic_cmpxchg(v, 0, 1);
>>>         r1 = READ_ONCE(*x);
>>> }
>>>
>>> exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0)
>>
>> While I'm entirely willing to trust this can happen, I can't bring
>> this
>> in line with the A extension spec.
>>
>> Additionally it's not clear to me in how far all of this applies when
>> you don't really use LR/SC in the 4- and 8-byte cases (and going
>> forward
>> likely also not in the 1- and 2-byte case, utilizing Zahba when
>> available).
> It just explain what combination of fences, lr/sc, amoswap, .aq and .rl
> annotation can be combined, and why combinations introduced in this
> patch are used.

Except that I don't understand that explanation, iow why said combination
of values could be observed even when using suffixes properly.

>>> +    uint8_t new_val_pos = ((unsigned long)(ptr) & (0x4 -
>>> sizeof(*ptr))) * BITS_PER_BYTE; \
>>
>> Why uint8_t?
> It is enough to cover possible start bit position of value that should
> be updated, so I decided to use uint8_t.

Please take a look at the "Types" section in ./CODING_STYLE.

>>> +    { \
>>> +    case 1: \
>>> +    case 2: \
>>> +        ret__ = emulate_xchg_1_2(ptr, new__, sfx, pre, post); \
>>> +        break; \
>>> +    case 4: \
>>> +        __amoswap_generic(ptr, new__, ret__,\
>>> +                          ".w" sfx,  pre, post); \
>>> +        break; \
>>> +    case 8: \
>>> +        __amoswap_generic(ptr, new__, ret__,\
>>> +                          ".d" sfx,  pre, post); \
>>> +        break; \
>>
>> In io.h you make sure to avoid rv64-only insns. Here you don't. The
>> build
>> would fail either way, but this still looks inconsistent.
>>
>> Also nit: Stray double blands (twice) ahead of "pre". Plus with this
>> style
>> of line continuation you want to consistently have exactly one blank
>> ahead
>> of each backslash.
>>
>>> +    default: \
>>> +        STATIC_ASSERT_UNREACHABLE(); \
>>> +    } \
>>> +    ret__; \
>>> +})
>>> +
>>> +#define xchg_relaxed(ptr, x) \
>>> +({ \
>>> +    __typeof__(*(ptr)) x_ = (x); \
>>
>> What is the purpose of this, when __xchg_generic() already does this
>> same
>> type conversion?
>>
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),
>>> "", "", ""); \
>>> +})
>>> +
>>> +#define xchg_acquire(ptr, x) \
>>> +({ \
>>> +    __typeof__(*(ptr)) x_ = (x); \
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \
>>> +                                       "", "",
>>> RISCV_ACQUIRE_BARRIER); \
>>> +})
>>> +
>>> +#define xchg_release(ptr, x) \
>>> +({ \
>>> +    __typeof__(*(ptr)) x_ = (x); \
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\
>>> +                                       "", RISCV_RELEASE_BARRIER,
>>> ""); \
>>> +})
>>
>> As asked before: Are there going to be any uses of these three?
>> Common
>> code doesn't require them. And not needing to provide them would
>> simplify things quite a bit, it seems.
> I checked my private branches and it looks to me that I introduced them
> only for the correspondent atomic operations ( which was copied from
> Linux Kernel ) which are not also used.
> 
> So we could definitely drop these macros for now, but should
> xchg_generic() be updated as well? If to look at:
>  #define xchg(ptr, x) __xchg_generic(ptr, (unsigned long)(x), sizeof(*
> (ptr)), \
>                                     ".aqrl", "", "")
> Last two arguments start to be unneeded, but I've wanted to leave them,
> in case someone will needed to back xchg_{release, acquire, ...}. Does
> it make any sense?

It all depends on how it's justified in the description.

>>> +#define xchg(ptr, x) __xchg_generic(ptr, (unsigned long)(x),
>>> sizeof(*(ptr)), \
>>> +                                    ".aqrl", "", "")
>>
>> According to the earlier comment (where I don't follow the example
>> given),
>> is .aqrl sufficient here? And even if it was for the 4- and 8-byte
>> cases,
>> is it sufficient in the 1- and 2-byte emulation case (where it then
>> is
>> appended to just the SC)?
> If I understand your question correctly then accroding to the spec.,
> .aqrl is enough for amo<op>.{w|d} instructions:
>    Linux Construct        RVWMO AMO Mapping
>    atomic <op> relaxed    amo<op>.{w|d}
>    atomic <op> acquire    amo<op>.{w|d}.aq
>    atomic <op> release    amo<op>.{w|d}.rl
>    atomic <op>            amo<op>.{w|d}.aqrl
> but in case of lr/sc you are right sc requires suffix too:
>    Linux Construct        RVWMO LR/SC Mapping
>    atomic <op> relaxed    loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
>    atomic <op> acquire    loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
>    atomic <op> release    loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez 
>    loop OR fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ; bnez loop
>    atomic <op>            loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez
>    loop
>    
> I will add sc_sfx to emulate_xchg_1_2(). The only question is left if
> __xchg_generic(ptr, new, size, sfx, pre, post) should be changed to:
> __xchg_generic(ptr, new, size, sfx1, sfx2, pre, post) to cover both
> cases amo<op>.{w|d}.sfx1 and lr.{w|d}.sfx1 ... sc.{w|d}.sfx2?

I expect that's going to be necessary. In the end you'll see what's needed
when making the code adjustment.

Jan

Re: [PATCH v5 11/23] xen/riscv: introduce cmpxchg.h

Reply via email to