On Tuesday 06 March 2007 18:28, Eric Dumazet wrote: > On Tuesday 06 March 2007 18:19, Linus Torvalds wrote: > > > > Using reciprocal divides permits to change each divide by two > > > multiplies, less expensive on current CPUS. > > > > Are you sure? > > I am going to test this, but at least on Opterons, the reciprocal divide I > added into mm/slab.c gave me a nice speedup. >
With attached test program (one million calls to pipe()), I got about 0.1 s of speedup on my 1.6 GHz Pentium(M) dell D610 machine (3.350 s instead of 3.450 s, on many runs) Thats about 100 ns saved per number() call But then I realized that on ia32, gcc compilers is not very smart : static inline u32 reciprocal_divide(u32 A, u32 R) { return (u32)(((u64)A * R) >> 32); } It generates two multiplies... arg... //begin of reciprocal_divide() 4b0: 8b 4c 24 28 mov 0x28(%esp),%ecx 4b4: 89 f0 mov %esi,%eax 4b6: f7 64 24 24 mull 0x24(%esp) 4ba: 0f af ce imul %esi,%ecx 4bd: 8d 14 11 lea (%ecx,%edx,1),%edx // end of reciprocal_divide() 4c0: 8b 8c 24 8c 00 00 00 mov 0x8c(%esp),%ecx 4c7: 89 d0 mov %edx,%eax 4c9: 0f af c8 imul %eax,%ecx 4cc: 29 ce sub %ecx,%esi 4ce: 8b 4c 24 1c mov 0x1c(%esp),%ecx 4d2: 0f b6 34 31 movzbl (%ecx,%esi,1),%esi 4d6: 89 f1 mov %esi,%ecx 4d8: 89 c6 mov %eax,%esi 4da: 88 0f mov %cl,(%edi) 4dc: 47 inc %edi 4dd: ff 44 24 20 incl 0x20(%esp) 4e1: 85 c0 test %eax,%eax 4e3: 75 cb jne 4b0 <number+0x160> So even with a total of 3 multiplies per digit, we win... Maybe some bit of x86 asm is needed to make gcc be smarter (using only one multiply for reciprocal_divide())
/* * micro benchmark to time calls to pipe()/close() */ main() { int fd[100*2]; unsigned int l, i; for (l = 0 ; l < 10000 ; l++) { for (i = 0 ; i < 100*2 ; i+=2) pipe(fd + i); for (i = 0 ; i < 100*2 ; i++) close(fd[i]); } }