On Fri, Apr 22, 2016 at 09:59:22AM +0800, Pan Xinhui wrote:
> On 2016-04-21 23:52, Boqun Feng wrote:
> > On Thu, Apr 21, 2016 at 11:35:07PM +0800, Pan Xinhui wrote:
> >> On 2016-04-20 22:24, Peter Zijlstra wrote:
> >>> On Wed, Apr 20, 2016 at 09:24:00PM +0800, Pan Xinhui wrote:
> >>>
> >>>> +#define __XCHG_GEN(cmp, type, sfx, skip, v)				\
> >>>> +static __always_inline unsigned long					\
> >>>> +__cmpxchg_u32##sfx(v unsigned int *p, unsigned long old,		\
> >>>> +						unsigned long new);	\
> >>>> +static __always_inline u32						\
> >>>> +__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new)		\
> >>>> +{									\
> >>>> +	int size = sizeof (type);					\
> >>>> +	int off = (unsigned long)ptr % sizeof(u32);			\
> >>>> +	volatile u32 *p = ptr - off;					\
> >>>> +	int bitoff = BITOFF_CAL(size, off);				\
> >>>> +	u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff;	\
> >>>> +	u32 oldv, newv, tmp;						\
> >>>> +	u32 ret;							\
> >>>> +	oldv = READ_ONCE(*p);						\
> >>>> +	do {								\
> >>>> +		ret = (oldv & bitmask) >> bitoff;			\
> >>>> +		if (skip && ret != old)					\
> >>>> +			break;						\
> >>>> +		newv = (oldv & ~bitmask) | (new << bitoff);		\
> >>>> +		tmp = oldv;						\
> >>>> +		oldv = __cmpxchg_u32##sfx((v u32*)p, oldv, newv);	\
> >>>> +	} while (tmp != oldv);						\
> >>>> +	return ret;							\
> >>>> +}
> >>>
> >>> So for an LL/SC based arch using cmpxchg() like that is sub-optimal.
> >>>
> >>> Why did you choose to write it entirely in C?
> >>>
> >> yes, you are right. more load/store will be done in C code.
> >> However such xchg_u8/u16 is just used by qspinlock now. and I did not see
> >> any performance regression.
> >> So just wrote in C, for simple. :)
> >>
> >> Of course I have done xchg tests.
> >> we run code just like xchg((u8*)&v, j++); in several threads.
> >> and the result is,
> >> [ 768.374264] use time[1550072]ns in xchg_u8_asm
> >
> > How was xchg_u8_asm() implemented, using lbarx or using a 32bit ll/sc
> > loop with shifting and masking in it?
> >
> yes, using 32bit ll/sc loops.
>
> looks like:
> __asm__ __volatile__(
> "1:	lwarx	%0,0,%3\n"
> "	and	%1,%0,%5\n"
> "	or	%1,%1,%4\n"
> 	PPC405_ERR77(0,%2)
> "	stwcx.	%1,0,%3\n"
> "	bne-	1b"
> 	: "=&r" (_oldv), "=&r" (tmp), "+m" (*(volatile unsigned int *)_p)
> 	: "r" (_p), "r" (_newv), "r" (_oldv_mask)
> 	: "cc", "memory");
>
Good, so this works for all ppc ISAs too. Given the performance benefit
(maybe caused by the reason Peter mentioned), I think we should use this
as the implementation of u8/u16 {cmp}xchg for now.

For Power7 and later, we can always switch to the lbarx/lharx version if
an observable performance benefit can be achieved. But the choice is left
to you. After all, as you said, qspinlock is the only user ;-)

Regards,
Boqun

> > Regards,
> > Boqun
> >
> >> [ 768.377102] use time[2826802]ns in xchg_u8_c
> >>
> >> I think this is because there is one more load in C.
> >> If possible, we can move such code in asm-generic/.
> >>
> >> thanks
> >> xinhui
> >>
>
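For reference, here is a rough sketch of what such an lbarx-based byte xchg
could look like. This is purely illustrative and untested, modelled on the
existing __xchg_u32_relaxed() pattern in arch/powerpc/include/asm/cmpxchg.h;
it relies on lbarx/stbcx., which are only available from ISA 2.06 (Power7)
onwards, and the function name here is hypothetical.

/*
 * Illustrative sketch only (not the patch under discussion, untested):
 * a relaxed byte xchg built directly on lbarx/stbcx., following the
 * pattern of __xchg_u32_relaxed().  Requires ISA 2.06 (Power7) or later.
 */
static __always_inline unsigned long
__xchg_u8_relaxed(u8 *p, unsigned long val)
{
	unsigned long prev;

	__asm__ __volatile__(
"1:	lbarx	%0,0,%2		# load-reserve the byte at *p\n"
"	stbcx.	%3,0,%2		# try to store the new byte\n"
"	bne-	1b"		/* lost the reservation, retry */
	: "=&r" (prev), "+m" (*p)
	: "r" (p), "r" (val)
	: "cc");

	return prev;
}

Compared with the 32-bit lwarx/stwcx. loop quoted above, this avoids the
mask-and-or on the containing word, but it limits the code to Power7 and
later, which is why keeping the 32-bit version as the default for now seems
reasonable.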