> On Sat, Nov 2, 2013 at 6:48 PM, Steven Bosscher <stevenb....@gmail.com> wrote:
> > The failure of pr53199.c is because of different instruction selection
> > for bswap. Test case is reduced to just one function:
[snip]
> > Is this an improvement or a regression? If it's an improvement then
> > these two test cases should be adjusted :-)

As David said, going through memory is bad: we get a load-hit-store
flush.  Definitely a regression on power7.  Does anyone know why the
bswapdi2_64bit r,r alternative is disparaged?  Seems like it has been
that way since the original mainline commit.

int
main (void)
{
  int i;
  long ret = 0;
  long tmp1, tmp2, tmp3;

  for (i = 0; i < 1000000000; i++)
#if MEM == 1
    /* From pr53199.c reg_reverse, -mlra -mcpu=power6 -mtune=power7.  */
    __asm__ __volatile__ ("\
	addi %1,1,-16\n\
	srdi %3,%0,32\n\
	li %2,4\n\
	stwbrx %0,0,%1\n\
	stwbrx %3,%2,%1\n\
	ld %0,-16(1)"
	: "+r" (ret), "=&b" (tmp1), "=&r" (tmp2), "=&r" (tmp3));
#elif MEM == 2
    /* From pr53199.c reg_reverse, -mlra -mcpu=power6.  */
    __asm__ __volatile__ ("\
	addi %1,1,-16\n\
	srdi %3,%0,32\n\
	addi %2,%1,4\n\
	stwbrx %0,0,%1\n\
	stwbrx %3,0,%2\n\
	ld %0,-16(1)"
	: "+r" (ret), "=&b" (tmp1), "=&b" (tmp2), "=&r" (tmp3));
#elif MEM == 3
    /* From pr53199.c reg_reverse, -mlra -mcpu=power7.  */
    __asm__ __volatile__ ("\
	std %0,-16(1)\n\
	addi %1,1,-16\n\
	ldbrx %0,0,%1\n"
	: "+r" (ret), "=&b" (tmp1));
#else
    __asm__ __volatile__ ("\
	srdi %1,%0,32\n\
	rlwinm %2,%0,8,0xffffffff\n\
	rlwinm %3,%1,8,0xffffffff\n\
	rlwimi %2,%0,24,0,7\n\
	rlwimi %2,%0,24,16,23\n\
	rlwimi %3,%1,24,0,7\n\
	rlwimi %3,%1,24,16,23\n\
	sldi %2,%2,32\n\
	or %2,%2,%3\n\
	mr %0,%2"
	: "+r" (ret), "=&r" (tmp1), "=&r" (tmp2), "=&r" (tmp3));
#endif
  return ret;
}

/*
amodra@bns:~> gcc -O2 bswap_mem.c
amodra@bns:~> time ./a.out

real	0m3.096s
user	0m3.089s
sys	0m0.001s
amodra@bns:~> time ./a.out

real	0m3.096s
user	0m3.094s
sys	0m0.002s
amodra@bns:~> gcc -O2 -DMEM=1 bswap_mem.c
amodra@bns:~> time ./a.out

real	0m12.661s
user	0m12.657s
sys	0m0.003s
amodra@bns:~> time ./a.out

real	0m12.660s
user	0m12.657s
sys	0m0.003s
amodra@bns:~> gcc -O2 -DMEM=2 bswap_mem.c
amodra@bns:~> time ./a.out

real	0m12.660s
user	0m12.657s
sys	0m0.003s
amodra@bns:~> time ./a.out

real	0m12.660s
user	0m12.657s
sys	0m0.004s
amodra@bns:~> gcc -O2 -DMEM=3 bswap_mem.c
amodra@bns:~> time ./a.out

real	0m10.279s
user	0m10.276s
sys	0m0.003s
amodra@bns:~> time ./a.out

real	0m10.279s
user	0m10.276s
sys	0m0.003s

I also looked at the register version and the -DMEM=1 case with power7
simulators, finding that the register version had a delay of 12 cycles
from completion of the first instruction to completion of the last.
The -DMEM=1 case had a corresponding delay of 49 cycles, which matches
the loop timing above quite well.
*/

-- 
Alan Modra
Australia Development Lab, IBM
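
For readers who do not have the rlwinm/rlwimi encodings in their head,
the register-only sequence in the #else branch of the benchmark can be
modelled in portable C roughly as below.  This is only a sketch of what
the instructions compute, not what GCC emits; the names rotl32,
bswap32_rot, bswap64_reg and bswap64_reg_selfcheck are made up for
illustration and do not appear in the test case or in rs6000.md.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits.  Only called with n = 8 or 24.  */
static uint32_t
rotl32 (uint32_t x, unsigned n)
{
  return (x << n) | (x >> (32 - n));
}

/* Byte-reverse one 32-bit word with a rotate and two masked inserts,
   mirroring rlwinm r,x,8,0xffffffff followed by two rlwimi insns.  */
static uint32_t
bswap32_rot (uint32_t x)
{
  uint32_t r = rotl32 (x, 8);   /* rlwinm r,x,8,0xffffffff */
  uint32_t s = rotl32 (x, 24);  /* the rotate done inside each rlwimi */

  r = (r & ~0xff000000u) | (s & 0xff000000u);  /* rlwimi r,x,24,0,7 */
  r = (r & ~0x0000ff00u) | (s & 0x0000ff00u);  /* rlwimi r,x,24,16,23 */
  return r;
}

/* Byte-reverse a 64-bit value without going through memory: swap the
   bytes within each half, then exchange the halves.  */
static uint64_t
bswap64_reg (uint64_t x)
{
  uint32_t lo = (uint32_t) x;
  uint32_t hi = (uint32_t) (x >> 32);  /* srdi hi,x,32 */

  /* sldi + or: the reversed low word becomes the new high word.  */
  return ((uint64_t) bswap32_rot (lo) << 32) | bswap32_rot (hi);
}

/* Hypothetical sanity check, callable from any main.  */
void
bswap64_reg_selfcheck (void)
{
  assert (bswap64_reg (0x0102030405060708ULL) == 0x0807060504030201ULL);
}
```

Everything here stays in registers, which is the property the timings
above are measuring: there is no store followed by a dependent load of
the same location, so nothing that can cause a load-hit-store flush.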