> > +        : "=r" (ret), "+A" (*ptr) \
> > +        : "r" (new) \
> > +        : "memory" ); \
> > +})
> > +
> > +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, acquire_barrier) \
> > +({ \
> > +    uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned long)ptr, 4); \
> 
> You now appear to assume that this macro is only used with inputs not
> crossing word boundaries. That's okay as long as suitably guaranteed
> at the use sites, but imo wants saying in a comment.
> 
> > +    uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) * BITS_PER_BYTE; \
> 
> Why 0x8 (i.e. spanning 64 bits), not 4 (matching the uint32_t use
> above)?

The idea of reading 8 bytes was to deal with crossing a word boundary. If our address is, say, 0x3 and we have to xchg() 2 bytes, the access would cross a 4-byte boundary. Instead we align the address down, so 0x3 becomes 0x0, and then always work with 8 bytes.
> > +    unsigned long new_ = (unsigned long)(new) << mask_l; \
> > +    unsigned long ret_; \
> > +    unsigned long rc; \
> 
> Similarly, why unsigned long here?

sizeof(unsigned long) is 8 bytes, and unsigned long was chosen because we are working with lr.d/sc.d, which operate on 8 bytes.

> I also wonder about the mix of underscore suffixed (or not) variable
> names here.

If the question is about ret_: as before, the size of the macro's ret argument will be 1 or 2 bytes, but lr.d/sc.d expect to work with 8 bytes.

> > +    release_barrier \
> > +    "0: lr.d %0, %2\n" \
> 
> Even here it's an 8-byte access. Even if - didn't check - the insn was
> okay to use with just a 4-byte aligned pointer, wouldn't it make sense
> then to 8-byte align it, and be consistent throughout this macro wrt
> the base unit acted upon? Alternatively, why not use lr.w here, thus
> reducing possible collisions between multiple CPUs accessing the same
> cache line?

According to the docs:
    LR and SC operate on naturally-aligned 64-bit (RV64 only) or 32-bit
    words in memory. Misaligned addresses will generate misaligned
    address exceptions.
My intention was to deal with crossing a 4-byte boundary: if ptr is 4-byte aligned, then by reading 8 bytes we shouldn't need to care about boundary crossing, if I am not missing something. But your point about reducing collisions makes sense too...

> > +    " and %1, %0, %z4\n" \
> > +    " or  %1, %1, %z3\n" \
> > +    " sc.d %1, %1, %2\n" \
> > +    " bnez %1, 0b\n" \
> > +    acquire_barrier \
> > +    : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \
> > +    : "rJ" (new_), "rJ" (~mask) \
> 
> I think that as soon as there are more than 2 or maybe 3 operands,
> legibility is vastly improved by using named asm() operands.

Just to clarify: do you mean it would be better to use names instead of %0, %1, etc.?

> > +    : "memory"); \
> 
> Nit: Missing blank before closing parenthesis.
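For the named-operands point, I believe the quoted loop would look roughly like the fragment below with symbolic names instead of the positional %0..%4. The operand names are my own choice and this is only a sketch (shown without the macro's line-continuation backslashes, and not compile-tested):

```c
    asm volatile (
        release_barrier
        "0: lr.d %[ret], %[ptr]\n"
        "   and  %[rc], %[ret], %z[mask]\n"
        "   or   %[rc], %[rc], %z[new]\n"
        "   sc.d %[rc], %[rc], %[ptr]\n"
        "   bnez %[rc], 0b\n"
        acquire_barrier
        : [ret] "=&r" (ret_), [rc] "=&r" (rc),
          [ptr] "+A" (*ptr_32b_aligned)
        : [new] "rJ" (new_), [mask] "rJ" (~mask)
        : "memory" );
```

With names, each use in the template documents itself, and reordering the operand lists no longer silently breaks the template.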
> > +    \
> > +    ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \
> > +})
> 
> Why does "ret" need to be a macro argument? If you had only the
> expression here, not the assignment, ...
> 
> > +#define __xchg_generic(ptr, new, size, sfx, release_barrier, acquire_barrier) \
> > +({ \
> > +    __typeof__(ptr) ptr__ = (ptr); \
> 
> Is this local variable really needed? Can't you use "ptr" directly
> in the three macro invocations?
> 
> > +    __typeof__(*(ptr)) new__ = (new); \
> > +    __typeof__(*(ptr)) ret__; \
> > +    switch (size) \
> > +    { \
> > +    case 1: \
> > +    case 2: \
> > +        emulate_xchg_1_2(ptr__, new__, ret__, release_barrier, acquire_barrier); \
> 
> ... this would become
> 
>     ret__ = emulate_xchg_1_2(ptr__, new__, release_barrier, acquire_barrier); \
> 
> But, unlike assumed above, there's no enforcement here that a 2-byte
> quantity won't cross a word, double-word, cache line, or even page
> boundary. That might be okay if then the code would simply crash (like
> the AMO insns emitted further down would), but aiui silent misbehavior
> would result.

As I mentioned above, with 4-byte alignment and then reading and working with 8 bytes, crossing a word or double-word boundary shouldn't be an issue. I am not sure I know how to check whether we are crossing a cache line boundary. Regarding a page boundary: if the next page is mapped then everything should work fine; otherwise it will raise an exception.

> Also nit: The switch() higher up is (still/again) missing blanks.
> > +        break; \
> > +    case 4: \
> > +        __amoswap_generic(ptr__, new__, ret__, \
> > +                          ".w" sfx, release_barrier, acquire_barrier); \
> > +        break; \
> > +    case 8: \
> > +        __amoswap_generic(ptr__, new__, ret__, \
> > +                          ".d" sfx, release_barrier, acquire_barrier); \
> > +        break; \
> > +    default: \
> > +        STATIC_ASSERT_UNREACHABLE(); \
> > +    } \
> > +    ret__; \
> > +})
> > +
> > +#define xchg_relaxed(ptr, x) \
> > +({ \
> > +    __typeof__(*(ptr)) x_ = (x); \
> > +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), "", "", ""); \
> > +})
> > +
> > +#define xchg_acquire(ptr, x) \
> > +({ \
> > +    __typeof__(*(ptr)) x_ = (x); \
> > +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \
> > +                                       "", "", RISCV_ACQUIRE_BARRIER); \
> > +})
> > +
> > +#define xchg_release(ptr, x) \
> > +({ \
> > +    __typeof__(*(ptr)) x_ = (x); \
> > +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \
> > +                                       "", RISCV_RELEASE_BARRIER, ""); \
> > +})
> > +
> > +#define xchg(ptr,x) \
> > +({ \
> > +    __typeof__(*(ptr)) ret__; \
> > +    ret__ = (__typeof__(*(ptr))) \
> > +            __xchg_generic(ptr, (unsigned long)(x), sizeof(*(ptr)), \
> > +                           ".aqrl", "", ""); \
> 
> The .aqrl doesn't look to affect the (emulated) 1- and 2-byte cases.
> 
> Further, amoswap also exists in release-only and acquire-only forms.
> Why do you prefer explicit barrier insns over those? (Looks to
> similarly apply to the emulation path as well as to the cmpxchg
> machinery then, as both lr and sc also come in all four possible
> acquire/release forms. Perhaps for the emulation path using
> explicit barriers is better, in case the acquire/release forms of
> lr/sc - being used inside the loop - might perform worse.)

Since the 1- and 2-byte cases are emulated, I decided not to provide an sfx argument for the emulation macros: it would not have much effect on the emulated types and would just cost extra performance with the acquire/release versions of the lr/sc instructions.
> 
> No RISCV_..._BARRIER for use here and ...
> 
> > +    ret__; \
> > +})
> > +
> > +#define __cmpxchg(ptr, o, n, s) \
> > +({ \
> > +    __typeof__(*(ptr)) ret__; \
> > +    ret__ = (__typeof__(*(ptr))) \
> > +            __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \
> > +                              s, ".rl", "", " fence rw, rw\n"); \
> 
> ... here? And anyway, wouldn't it make sense to have
> 
> #define cmpxchg(ptr, o, n) __cmpxchg(ptr, o, n, sizeof(*(ptr)))
> 
> to limit redundancy?
> 
> Plus wouldn't
> 
> #define __cmpxchg(ptr, o, n, s) \
>     ((__typeof__(*(ptr))) \
>      __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \
>                        s, ".rl", "", " fence rw, rw\n"))
> 
> be shorter and thus easier to follow as well? As I notice only now,
> this would apparently apply further up as well.

I understand your point about "#define cmpxchg(ptr, o, n) __cmpxchg(...)", but I can't understand how the definition of __cmpxchg could be made shorter. Could you please clarify that?

~ Oleksii