On Tue, Mar 07, 2006 at 07:50:52PM +0100, Andi Kleen wrote:
>
> My vmlinux has
>
> ffffffff80278382 <pfn_to_page>:
> ffffffff80278382:   8b 0d 78 ea 41 00       mov    4319864(%rip),%ecx        # ffffffff80696e00 <memnode_shift>
> ffffffff80278388:   48 89 f8                mov    %rdi,%rax
> ffffffff8027838b:   48 c1 e0 0c             shl    $0xc,%rax
> ffffffff8027838f:   48 d3 e8                shr    %cl,%rax
> ffffffff80278392:   48 0f b6 80 00 5e 69    movzbq 0xffffffff80695e00(%rax),%rax
> ffffffff80278399:   80
> ffffffff8027839a:   48 8b 14 c5 40 93 71    mov    0xffffffff80719340(,%rax,8),%rdx
> ffffffff802783a1:   80
> ffffffff802783a2:   48 2b ba 40 36 00 00    sub    0x3640(%rdx),%rdi
> ffffffff802783a9:   48 6b c7 38             imul   $0x38,%rdi,%rax
> ffffffff802783ad:   48 03 82 30 36 00 00    add    0x3630(%rdx),%rax
> ffffffff802783b4:   c3                      retq
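Written out as C, that lookup is roughly the following.  This is a sketch
of the 2.6-era x86_64 DISCONTIGMEM path rather than the literal kernel
source; the structs are trimmed-down stand-ins and PAGE_SHIFT is
hard-coded to 12 to match the shl $0xc above:

	/* Trimmed stand-ins for the kernel types (struct page is really
	 * 0x38 bytes in that config, hence the imul $0x38). */
	struct page { unsigned long flags; };
	struct pglist_data {
		struct page *node_mem_map;	/* the add 0x3630(%rdx) */
		unsigned long node_start_pfn;	/* the sub 0x3640(%rdx) */
	};

	extern int memnode_shift;		/* <memnode_shift> in the asm */
	extern unsigned char memnodemap[];	/* physical address -> node id */
	extern struct pglist_data *node_data[];	/* per-node pg_data_t pointers */

	static inline struct page *pfn_to_page_sketch(unsigned long pfn)
	{
		/* load 1: memnodemap[], indexed by (pfn << PAGE_SHIFT) >> memnode_shift */
		int nid = memnodemap[(pfn << 12) >> memnode_shift];
		/* load 2: node_data[nid], can't issue until load 1 returns the node id */
		struct pglist_data *pgdat = node_data[nid];
		/* load 3: pgdat's fields, which in turn wait on load 2 */
		return pgdat->node_mem_map + (pfn - pgdat->node_start_pfn);
	}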
That's easily in the 90+ cycle range, as you've got 3 data-dependent loads
which will hit in the L2 but likely not in the L1, given that the workload
is manipulating lots of data.  And that's assuming the instruction
scheduler gets things right.

> ffffffff802783b5 <page_to_pfn>:
> ffffffff802783b5:   48 8b 07                mov    (%rdi),%rax
> ffffffff802783b8:   48 c1 e8 38             shr    $0x38,%rax
> ffffffff802783bc:   48 8b 14 c5 80 97 71    mov    0xffffffff80719780(,%rax,8),%rdx
> ffffffff802783c3:   80
> ffffffff802783c4:   48 b8 b7 6d db b6 6d    mov    $0x6db6db6db6db6db7,%rax
> ffffffff802783cb:   db b6 6d
> ffffffff802783ce:   48 2b ba 20 03 00 00    sub    0x320(%rdx),%rdi
> ffffffff802783d5:   48 c1 ff 03             sar    $0x3,%rdi
> ffffffff802783d9:   48 0f af f8             imul   %rax,%rdi
> ffffffff802783dd:   48 03 ba 28 03 00 00    add    0x328(%rdx),%rdi
> ffffffff802783e4:   48 89 f8                mov    %rdi,%rax
> ffffffff802783e7:   c3                      retq
>
> Both look quite optimized to me. I haven't timed them but it would
> surprise me if P4 needed more than 20 cycles to crunch through each
> of them.

It's more than that because of the data dependencies on the loads.  Yes,
imul is 10 cycles, but a shift is 1.  (The 0x6db6db6db6db6db7 multiply
above is just the compiler's divide-by-sizeof(struct page) trick; there's
a quick sketch of it at the end of this mail.)

> Where is that idiv exactly? I don't see it.

My memory seems to be failing me, I can't find it.  Whoops.

> Only in pathological workloads. Normally the working set is so large
> that the probability of two pages are near each other is very small.

It's hardly uncommon for struct pages to straddle cachelines, or for pages
to move around between CPUs with networking.  Remember that we're using
pages for the data buffers in networking, so you'll have pages getting
freed on the wrong CPU quite often.

Please name some benchmarks that back up your concerns about decreased
performance.  I've shown you one that gets improved, and I think keeping
struct pages from overlapping cachelines is only a good thing.  I know
these look like piddly little worthless optimizations, but they add up
big time.  Mea culpa for not having a 10Gbit nic to show more "real
world" applications.

		-ben
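P.S. The divide-by-sizeof(struct page) trick mentioned above:
0x6db6db6db6db6db7 is the multiplicative inverse of 7 mod 2^64, and
sizeof(struct page) is 0x38 = 56 = 8 * 7 in that build, so gcc compiles
the pointer subtraction (page minus the node's mem_map base) as "shift
right by 3, then multiply by the inverse of 7" instead of emitting a
divide.  A minimal userspace sketch of the arithmetic; the helper name is
made up, the constant is straight from the asm:

	#include <stdio.h>
	#include <stdint.h>

	#define INV7	0x6db6db6db6db6db7ULL	/* 7 * INV7 == 1 (mod 2^64) */

	/* Exact division of a byte offset by sizeof(struct page) == 56:
	 * divide by 8 with a shift, then by 7 with the inverse.  Only
	 * valid because the offset is known to be a multiple of 56. */
	static uint64_t page_index(uint64_t byte_offset)
	{
		return (byte_offset >> 3) * INV7;
	}

	int main(void)
	{
		printf("7 * INV7 = %llu\n", (unsigned long long)(7 * INV7));
		printf("index of the 1000th page: %llu\n",
		       (unsigned long long)page_index(1000 * 56));
		return 0;
	}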