On 04/10/2015 02:08 PM, Ingo Molnar wrote:
>
> * Ingo Molnar <mi...@kernel.org> wrote:
>
>> So restructure the loop a bit, to get much tighter code:
>>
>> 0000000000000030 <mutex_spin_on_owner.isra.5>:
>>   30: 55                      push   %rbp
>>   31: 65 48 8b 14 25 00 00    mov    %gs:0x0,%rdx
>>   38: 00 00
>>   3a: 48 89 e5                mov    %rsp,%rbp
>>   3d: 48 39 37                cmp    %rsi,(%rdi)
>>   40: 75 1e                   jne    60 <mutex_spin_on_owner.isra.5+0x30>
>>   42: 8b 46 28                mov    0x28(%rsi),%eax
>>   45: 85 c0                   test   %eax,%eax
>>   47: 74 0d                   je     56 <mutex_spin_on_owner.isra.5+0x26>
>>   49: f3 90                   pause
>>   4b: 48 8b 82 10 c0 ff ff    mov    -0x3ff0(%rdx),%rax
>>   52: a8 08                   test   $0x8,%al
>>   54: 74 e7                   je     3d <mutex_spin_on_owner.isra.5+0xd>
>>   56: 31 c0                   xor    %eax,%eax
>>   58: 5d                      pop    %rbp
>>   59: c3                      retq
>>   5a: 66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
>>   60: b8 01 00 00 00          mov    $0x1,%eax
>>   65: 5d                      pop    %rbp
>>   66: c3                      retq
>
> Btw., totally off topic, the following NOP caught my attention:
>
>>   5a: 66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
>
> That's a dead NOP that bloats the function a bit, added for the 16-byte
> alignment of one of the jump targets.
>
> I realize that x86 CPU manufacturers recommend 16-byte jump target
> alignments (it's in the Intel optimization manual), but the cost of
> that is very significant:
>
>       text     data     bss       dec  filename
>   12566391  1617840  1089536  15273767  vmlinux.align.16-byte
>   12224951  1617840  1089536  14932327  vmlinux.align.1-byte
>
> By using 1-byte jump target alignment (i.e. no alignment at all) we
> get an almost 3% reduction in kernel size (!) - and probably a similar
> reduction in I$ footprint.
>
> So I'm wondering, is the 16-byte jump target optimization suggestion
> really worth this price? The patch below boots fine and I've not
> measured any noticeable slowdown, but I've not tried hard.
>
> Now, the usual justification for jump target alignment is the
> following: with 16-byte instruction-cache cacheline sizes, if a
> forward jump is aligned to a cacheline boundary then prefetches will
> start from a new cacheline.
>
> But I think that argument is flawed for typical optimized kernel code
> flows: forward jumps often go to 'cold' (uncommon) pieces of code, and
> aligning cold code to cache lines does not bring a lot of advantages
> (they are uncommon), while it causes collateral damage:
>
>  - their alignment 'spreads out' the cache footprint, it shifts
>    follow-up hot code further out
>
>  - plus it slows down even 'cold' code that immediately follows 'hot'
>    code (like in the above case), which could have benefited from the
>    partial cacheline that comes off the end of hot code.
>
> What do you guys think about this? I think we should seriously
> consider relaxing our alignment defaults.
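For reference, the quoted disassembly compiles from a loop of roughly the
following shape. This is a simplified userspace sketch, not the actual kernel
source: the struct layouts, the C11 atomics and the resched_flag stand-in for
the TIF_NEED_RESCHED test are illustrative assumptions only, and the real
function runs under the kernel's RCU and preemption rules.

	#include <stdatomic.h>
	#include <stdbool.h>

	struct task { _Atomic int on_cpu; };           /* stand-in for task_struct */
	struct lock { _Atomic(struct task *) owner; }; /* stand-in for struct mutex */

	static bool spin_on_owner(struct lock *lock, struct task *owner,
				  _Atomic int *resched_flag)
	{
		/* Loop top: the 'cmp %rsi,(%rdi)' at offset 3d. */
		while (atomic_load(&lock->owner) == owner) {
			/*
			 * Owner is no longer running: the on_cpu test at
			 * 42..47, reaching 'xor %eax,%eax' (return 0).
			 */
			if (!atomic_load(&owner->on_cpu))
				return false;

			__builtin_ia32_pause();  /* the 'pause' at offset 49 */

			/* We should reschedule: the flag test at 52. */
			if (atomic_load(resched_flag))
				return false;
		}
		/* Owner changed: the 'mov $0x1,%eax' tail (return 1). */
		return true;
	}

The tightness of the generated code comes from keeping all three exit
conditions inside one short loop body, so the hot path is the fall-through
path and only the owner-changed exit needs the out-of-line jump target -
which is exactly the target the dead NOP was padding.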
Looks like nobody objected. I think it's ok to submit this patch for real.

> +	# Align jump targets to 1 byte, not the default 16 bytes:
> +	KBUILD_CFLAGS += -falign-jumps=1
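For anyone who wants to observe the effect outside a full kernel build, a
hypothetical standalone test case (not from this thread) can be compiled both
ways and compared. Whether this particular function actually gets padded
depends on the compiler version and how it lays out the branch targets, so
treat it as a sketch of the measurement method, not a guaranteed reproducer.

	/*
	 * align_demo.c - hypothetical example, not part of the patch.
	 * Build both ways and compare text size and disassembly:
	 *
	 *   gcc -O2 -falign-jumps=16 -c align_demo.c; size align_demo.o
	 *   gcc -O2 -falign-jumps=1  -c align_demo.c; size align_demo.o
	 *   objdump -d align_demo.o | grep nop
	 */
	int branchy(const int *p, int n)
	{
		int sum = 0;

		for (int i = 0; i < n; i++) {
			if (p[i] & 1)	/* forward branch to a less-likely path */
				sum -= p[i];
			else
				sum += p[i];
		}
		return sum;
	}

With =16 the compiler is allowed to insert multi-byte NOPs (like the
'nopw 0x0(%rax,%rax,1)' above) in front of jump-only targets; with =1 that
padding disappears, which is the entire mechanism behind the ~3% vmlinux
size difference quoted earlier.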