On Thu, May 24, 2018 at 08:20:16AM +0200, Christophe LEROY wrote:
> On 23/05/2018 at 20:34, Segher Boessenkool wrote:
> >On Tue, May 22, 2018 at 08:57:01AM +0200, Christophe Leroy wrote:
> >>+_GLOBAL(csum_ipv6_magic)
> >>+	lwz	r8, 0(r3)
> >>+	lwz	r9, 4(r3)
> >>+	lwz	r10, 8(r3)
> >>+	lwz	r11, 12(r3)
> >>+	addc	r0, r5, r6
> >>+	adde	r0, r0, r7
> >>+	adde	r0, r0, r8
> >>+	adde	r0, r0, r9
> >>+	adde	r0, r0, r10
> >>+	adde	r0, r0, r11
> >>+	lwz	r8, 0(r4)
> >>+	lwz	r9, 4(r4)
> >>+	lwz	r10, 8(r4)
> >>+	lwz	r11, 12(r4)
> >>+	adde	r0, r0, r8
> >>+	adde	r0, r0, r9
> >>+	adde	r0, r0, r10
> >>+	adde	r0, r0, r11
> >>+	addze	r0, r0
> >>+	rotlwi	r3, r0, 16
> >>+	add	r3, r0, r3
> >>+	not	r3, r3
> >>+	rlwinm	r3, r3, 16, 16, 31
> >>+	blr
> >>+EXPORT_SYMBOL(csum_ipv6_magic)
> >
> >Clustering the loads and carry insns together is pretty much the worst you
> >can do on most 32-bit CPUs.
>
> Oh, really? __csum_partial is written that way too.

I thought I told you about this before?  Maybe not.

> Right, now I tried interleaving the lwz and adde. I get no improvement at
> all on a 885, but I get a 15% improvement on a 8321.

It likely won't help on single-issue cores (like the one the 885 has), yes.


Segher
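
For reference, here is a rough C model of what the quoted routine computes. It is only a sketch for checking results, not kernel code: the names csum_ipv6_magic_model and add_carry are made up for illustration, and the mapping of r5/r6/r7 to len/proto/csum is an assumption based on the usual argument order. The address words are passed as already-loaded 32-bit values (what lwz would put in the registers). Since lwz does not touch the carry bit, moving the loads between the adde instructions changes only the scheduling, so a model like this should give the same value for both the clustered and the interleaved version.

```c
#include <stdint.h>

/* Models one adde step: returns a + b + carry-in and updates *carry
 * with the carry-out. (Hypothetical helper, not a kernel function.) */
static uint32_t add_carry(uint32_t a, uint32_t b, unsigned *carry)
{
    uint64_t t = (uint64_t)a + b + *carry;

    *carry = (unsigned)(t >> 32);
    return (uint32_t)t;
}

/* Sketch of the arithmetic in the quoted assembly. saddr/daddr hold the
 * four 32-bit words that lwz would load from 0..12(r3) and 0..12(r4). */
static uint16_t csum_ipv6_magic_model(const uint32_t saddr[4],
                                      const uint32_t daddr[4],
                                      uint32_t len, uint32_t proto,
                                      uint32_t csum)
{
    unsigned carry = 0;
    uint32_t sum, folded;
    int i;

    sum = add_carry(len, proto, &carry);          /* addc r0, r5, r6 */
    sum = add_carry(sum, csum, &carry);           /* adde r0, r0, r7 */
    for (i = 0; i < 4; i++)
        sum = add_carry(sum, saddr[i], &carry);   /* adde with r8..r11 */
    for (i = 0; i < 4; i++)
        sum = add_carry(sum, daddr[i], &carry);   /* adde with r8..r11 */
    sum = add_carry(sum, 0, &carry);              /* addze r0, r0 */

    /* rotlwi + add: the upper half now holds the 16-bit folded sum */
    folded = sum + ((sum << 16) | (sum >> 16));

    /* not + rlwinm: complement, then take the upper 16 bits */
    return (uint16_t)(~folded >> 16);
}
```

With all inputs zero this returns 0xFFFF, and an address word of 0xFFFFFFFF behaves like zero because the carry wraps around, which is the property the addc/adde/addze chain is there to provide.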