On Mon, 9 Jul 2007, Linus Torvalds wrote:

> On Mon, 9 Jul 2007, Davide Libenzi wrote:
> >
> > So in this box, and in this test, the double-short Z-lock seems faster
> > than a double-byte. I've no idea why, since it uses two ops more and an
> > extra register.
>
> At this kind of level, the exact instruction scheduling can make a big
> difference.
>
> The extra register usage won't matter if there is no register pressure,
> and any extra instructions can actually happen to *help*, if they end up
> aligning something just the right way.
>
> There can also be various random effects of prefixes: decoding x86
> instructions is basically a very uarch-specific issue, and for all we
> know the AMD setup may well end up behaving differently from most Intel
> chips (and within the Intel family, the netburst situation is likely
> different from the other P6-derived cores).
>
> For example, does a single prefix decode faster? It could be that the
> combination of "lock" _and_ "opsize" prefixes is problematic (as in a
> 16-bit locked "lock xaddw"), and causes a decode hiccup, but that "lock"
> and "opsize" on their own don't cause any decoder issues (ie doing the
> "lock" on the 32-bit xadd, and just the "opsize" prefix on the 16-bit
> decw, both are fast).
>
> But on another uarch it might work out the other way: if "lock" is
> always a complex op, then having an opsize prefix on that one might be
> "free", and then you're better off combining them for the locked 16-bit
> xadd, and having the releasing "decb" not have any prefix at all.
>
> And regardless of that, just a random "it happened to get aligned that
> way" effect (where "alignment" might be about hitting the cache-line
> just right, but might also be about just having the right instruction
> mix to get the intel decoders to run at their full 4-1-1-1 capacity)
> can cause the timing differences.
>
> So before taking these numbers as any kind of "real" values, I'd
> suggest:
>
>  - trying it out on at least a few different uarchs (Opteron, P4 and
>    Core 2 all have quite different restrictions on decoding)
>
>  - possibly trying it out with things in different order and different
>    compiler options (-O2 vs -Os), trying to cause different kinds of
>    alignment issues.
>
> Also, just a small nit: in the kernel, the locking would _not_ be
> inlined (but the unlocking would), so marking the lock functions
> "inline" is probably a bad idea. Without the inline, it's likely more
> realistic, and the effects of register pressure will be hidden. Because
> of the uninlined nature of locks, I think you can generally ignore the
> "one or two registers" issue - you'll have three caller-clobbered
> registers to play with regardless.
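For reference, the double-byte xadd variant under discussion looks
roughly like this (a sketch only, assuming x86 and GCC inline asm; the
function names are illustrative, and this shows the general shape
rather than the exact xadd-lock code behind the numbers below):

	/* 16-bit ticket lock: owner in the low byte, next ticket in
	 * the high byte.  "Q" so %b0/%h0 sub-registers are usable. */
	static void byte_xadd_lock(unsigned short *lock)
	{
		unsigned short inc = 0x0100;	/* grab the next ticket */

		asm volatile("lock; xaddw %w0, %1\n"	/* lock + opsize */
			     "1:\t"
			     "cmpb %h0, %b0\n\t"	/* owner == my ticket? */
			     "je 2f\n\t"
			     "rep; nop\n\t"		/* pause while spinning */
			     "movb %1, %b0\n\t"		/* reload owner byte */
			     "jmp 1b\n"
			     "2:"
			     : "+Q" (inc), "+m" (*lock)
			     : : "memory", "cc");
	}

	static void byte_xadd_unlock(unsigned short *lock)
	{
		asm volatile("incb %0"		/* no prefix at all */
			     : "+m" (*lock) : : "memory", "cc");
	}

Here the locked xadd carries both the "lock" and the "opsize" prefix,
while the releasing "incb" carries neither; that is exactly the prefix
combination being questioned above.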
Indeed, with no inline, on a P4 (with -O2), the numbers for xadd-lock
and zadd-lock get closer:

  inc-lock  in cache takes       35.15ns
  xadd-lock in cache takes       43.84ns
  vadd-lock in cache takes       53.51ns
  zadd-lock in cache takes       43.28ns

  inc-lock  out of cache takes  122.92ns
  xadd-lock out of cache takes  126.98ns
  vadd-lock out of cache takes  172.36ns
  zadd-lock out of cache takes  126.01ns

The always-lfence instruction in vadd-lock really is painful though.

If the numbers are this close, and given that the spinlock size should
not matter much once structure alignment is taken into account,
wouldn't it be better to use a double short and remove the 256-CPU cap?
(A rough sketch of the double-short layout is in the PS below.)


- Davide
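PS: for completeness, the double-short layout that would lift the
256-CPU cap could look something like this (again only a sketch,
assuming x86 and GCC inline asm; names are illustrative):

	/* 32-bit ticket lock: owner in the low short, next ticket in
	 * the high short, so 16-bit tickets allow 65536 CPUs. */
	static void short_xadd_lock(unsigned int *lock)
	{
		unsigned int inc = 0x00010000;	/* grab the next ticket */
		unsigned int tmp;

		asm volatile("lock; xaddl %0, %1\n\t"	/* lock, no opsize */
			     "movzwl %w0, %2\n\t"	/* tmp = owner */
			     "shrl $16, %0\n\t"		/* %0 = my ticket */
			     "1:\t"
			     "cmpl %0, %2\n\t"		/* owner == my ticket? */
			     "je 2f\n\t"
			     "rep; nop\n\t"		/* pause while spinning */
			     "movzwl %1, %2\n\t"	/* reload owner short */
			     "jmp 1b\n"
			     "2:"
			     : "+r" (inc), "+m" (*lock), "=&r" (tmp)
			     : : "memory", "cc");
	}

	static void short_xadd_unlock(unsigned int *lock)
	{
		asm volatile("incw %0"		/* opsize, no lock */
			     : "+m" (*lock) : : "memory", "cc");
	}

This is the other side of the prefix trade-off: the locked xadd needs
no "opsize" prefix, at the cost of the two extra ops and the extra
register needed to split owner and ticket, and the releasing "incw"
takes the "opsize" prefix on its own.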