On 2016-04-22 00:13, Peter Zijlstra wrote:
> On Thu, Apr 21, 2016 at 11:35:07PM +0800, Pan Xinhui wrote:
>> Yes, you are right. More loads/stores will be done in the C code.
>> However, xchg_u8/u16 is only used by qspinlock now, and I did not see
>> any performance regression.
>> So I just wrote it in C, for simplicity. :)
>
> Which is fine; but worthy of a note in your Changelog.
>
Will do that.
>> Of course I have done xchg tests.
>> We ran code like xchg((u8*)&v, j++); in several threads,
>> and the result is:
>> [ 768.374264] use time[1550072]ns in xchg_u8_asm
>> [ 768.377102] use time[2826802]ns in xchg_u8_c
>>
>> I think this is because there is one more load in C.
>> If possible, we can move such code into asm-generic/.
>
> So I'm not actually _that_ familiar with the PPC LL/SC implementation;
> but there are things a CPU can do to optimize these loops.
>
> For example, a CPU might choose to not release the exclusive hold of the
> line for a number of cycles, except when it passes SC or an interrupt
> happens. This way there's a smaller chance the SC fails and inhibits
> forward progress.
I am not sure whether there is such a hardware optimization.
>
> By doing the modification outside of the LL/SC you lose such
> advantages.
>
> And yes, doing a !exclusive load prior to the exclusive load leads to an
> even bigger window where the data can get changed out from under you.
>
You are right.
We have observed the data changing between the two loads.
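
For readers following the thread, here is a minimal user-space sketch of the "C variant" pattern being discussed: a byte exchange built from a compare-and-swap on the containing 32-bit word. It is not the code from the patch. The name xchg_u8_c is borrowed from the benchmark output above, but the signature, the byte_idx parameter (counted from the least significant byte of the word's value, to sidestep endianness) and the use of C11 atomics are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical sketch: exchange one byte inside an aligned 32-bit word
 * using a compare-and-swap loop. byte_idx selects the byte by value
 * position (0 = least significant), which is an assumption of this sketch.
 */
static uint8_t xchg_u8_c(_Atomic uint32_t *word, unsigned int byte_idx,
                         uint8_t newval)
{
    assert(byte_idx < 4);

    unsigned int shift = byte_idx * 8;
    uint32_t mask = (uint32_t)0xff << shift;

    /* Plain, non-exclusive load: this is the "one more load" in C. */
    uint32_t old = atomic_load_explicit(word, memory_order_relaxed);
    uint32_t newword;

    do {
        /* The modification happens outside any LL/SC reservation. */
        newword = (old & ~mask) | ((uint32_t)newval << shift);
        /*
         * On failure, old is refreshed with the current value and the
         * loop retries; the word may have changed since the plain load.
         */
    } while (!atomic_compare_exchange_weak_explicit(word, &old, newword,
                                                    memory_order_seq_cst,
                                                    memory_order_relaxed));

    return (uint8_t)((old & mask) >> shift);
}
```

On an LL/SC architecture the compare-and-swap is itself typically a load-reserve/store-conditional loop, so the initial plain load is additional work and widens the window in which another CPU can change the word. An asm version can instead do the mask-and-merge between the LL and the SC.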