On Mon, 9 Jul 2007, Davide Libenzi wrote:
>
> So in this box, and in this test, the double-short Z-lock seems faster
> than a double-byte. I've no idea why, since it uses two ops more and an
> extra register.
At this kind of level, the exact instruction scheduling can make a big
difference. The extra register usage won't matter if there is no register
pressure, and any extra instructions can actually happen to *help*, if they
end up aligning something just the right way.

There can also be various random effects of prefixes: decoding x86
instructions is basically a very uarch-specific issue, and for all we know
the AMD setup may well end up behaving differently from most Intel chips
(and within the Intel family, the Netburst situation is likely different
from the other, P6-derived cores).

For example, does a single prefix decode faster than two? It could be that
the combination of "lock" _and_ "opsize" prefixes is problematic (as in a
16-bit locked "lock xaddw"), and causes a decode hiccup, but that "lock" and
"opsize" on their own don't cause any decoder issues (i.e. doing the "lock"
on the 32-bit xadd, and just the "opsize" prefix on the 16-bit decw, are
both fast).

But on another uarch it might work out the other way: if "lock" is always a
complex op, then having an opsize prefix on that one might be "free", and
then you're better off combining them for the locked 16-bit xadd, and having
the releasing "decb" not carry any prefix at all.

And regardless of that, a plain "it happened to get aligned that way" effect
(where "alignment" might be about hitting the cache-line just right, but
might also be about having the right instruction mix to get the Intel
decoders to run at their full 4-1-1-1 capacity) could be what is causing the
timing differences.

So before taking these numbers as any kind of "real" values, I'd suggest:

 - trying it out on at least a few different uarchs (Opteron, P4 and Core 2
   all have quite different restrictions on decoding)

 - possibly trying it out with things in a different order and with
   different compiler options (-O2 vs -Os), to provoke different kinds of
   alignment issues.

Also, just a small nit: in the kernel, the locking would _not_ be inlined
(but the unlocking would), so marking the lock functions "inline" is
probably a bad idea. Without the inline, the test is likely more realistic,
and the effects of register pressure will be hidden. Because locks end up
uninlined, I think you can generally ignore the "one or two registers"
issue - you'll have three caller-clobbered registers to play with
regardless.

		Linus
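As a rough illustration of the prefix combinations being compared above -
this is not Davide's test code, and the helper names (xadd16, xadd32, dec16,
dec8) are invented for the example - the four instruction forms look like
this in GNU C inline assembly:

/*
 * Sketch only: these helpers just pin down which prefixes land on which
 * instruction.  They are not a complete lock implementation.
 */
#include <stdint.h>

/* "lock xaddw": both the lock (0xf0) and operand-size (0x66) prefixes
 * end up on a single instruction - the combination that might hiccup. */
static inline uint16_t xadd16(volatile uint16_t *p, uint16_t v)
{
	asm volatile("lock; xaddw %w0, %1"
		     : "+r" (v), "+m" (*p) : : "memory", "cc");
	return v;
}

/* "lock xaddl": only the lock prefix, no operand-size prefix. */
static inline uint32_t xadd32(volatile uint32_t *p, uint32_t v)
{
	asm volatile("lock; xaddl %0, %1"
		     : "+r" (v), "+m" (*p) : : "memory", "cc");
	return v;
}

/* "decw": operand-size prefix alone, no lock prefix. */
static inline void dec16(volatile uint16_t *p)
{
	asm volatile("decw %0" : "+m" (*p) : : "memory", "cc");
}

/* "decb": no prefix at all. */
static inline void dec8(volatile uint8_t *p)
{
	asm volatile("decb %0" : "+m" (*p) : : "memory", "cc");
}

Presumably the "double-byte" flavour pairs the prefix-heavy xadd16() acquire
with the prefix-free dec8() release, while the "double-short" flavour pairs
xadd32() (lock prefix only) with dec16() (opsize prefix only); which pairing
wins can then depend on how a given decoder handles the prefix combinations,
which is exactly the uarch question raised above. To mirror the kernel
convention mentioned at the end, the acquire side would additionally be
built as an out-of-line (non-inline) function, with only the release kept
inline.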