The current 64-bit csum_partial_copy_generic function is based on the 32-bit version and was never optimized for 64 bit. This patch takes the 64-bit memcpy and adapts it to also compute the checksum. It has been tested on a variety of input sizes and alignments on Power5 and Power6 processors, and gives correct output for all cases tested. It also runs 20-55% faster than the implementation it replaces, depending on size, alignment, and processor.
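The core idea, copying and summing in a single pass over the data, can be sketched in C for the aligned case. This is a hypothetical illustration of the technique, not the kernel's actual csum_partial_copy_generic (which is PPC assembly and handles alignment, partial words, and faults); the helper name copy_and_sum64 is made up here. The carry check emulates what the carrying add (adde) does in the assembly loop:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch only: copy n 64-bit words from src to dst while accumulating
 * a one's-complement sum of the data, so the buffer is walked once
 * instead of twice (memcpy then csum_partial). */
static uint64_t copy_and_sum64(uint64_t *dst, const uint64_t *src, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t v = src[i];
        dst[i] = v;        /* the memcpy half */
        sum += v;          /* the checksum half */
        if (sum < v)       /* overflow: emulate adde's end-around carry */
            sum++;
    }
    return sum;            /* caller folds 64 -> 32 -> 16 as usual */
}
```

The single pass is where the speedup comes from: the loads feed both the store stream and the adder, instead of the data being read twice.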

I think there is still some room for improvement in the unaligned case, but given that it is much faster than what we have now, I figured I'd send it out.

Did you consider the other alternative?  If you work on 32-bit chunks
instead of 64-bit chunks (either load them with lwz, or split them
after loading with ld), you can add them up with a regular non-carrying
add, which isn't serialising like adde; this also allows unrolling the
loop (using several accumulators instead of just one).  Since your
registers are 64-bit, you can sum 16GB of data before ever getting a
carry out.
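That suggestion can be sketched in C (an illustration of the technique, not PPC assembly; the function name csum32_sketch is invented here). Each 32-bit word is added into a 64-bit accumulator with a plain add, so no carry flag is involved, and two independent accumulators let the adds overlap instead of serialising on adde's carry chain. Carries are folded out once at the end:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: checksum a 4-byte-aligned buffer of nwords 32-bit words.
 * Because the accumulators are 64-bit, each can absorb 2^32 maximal
 * 32-bit words before overflowing -- the "16GB before a carry" point. */
static uint16_t csum32_sketch(const uint32_t *p, size_t nwords)
{
    uint64_t s0 = 0, s1 = 0;
    size_t i = 0;

    /* Unrolled by two with independent accumulators: no carry
     * dependency between iterations, unlike an adde loop. */
    for (; i + 2 <= nwords; i += 2) {
        s0 += p[i];
        s1 += p[i + 1];
    }
    if (i < nwords)
        s0 += p[i];

    uint64_t sum = s0 + s1;
    /* Fold 64 -> 32 -> 16 with end-around carries (one's complement). */
    sum = (sum & 0xffffffffULL) + (sum >> 32);
    sum = (sum & 0xffffffffULL) + (sum >> 32);
    uint32_t s = (uint32_t)sum;
    s = (s & 0xffff) + (s >> 16);
    s = (s & 0xffff) + (s >> 16);
    return (uint16_t)s;
}
```

The deferred fold is valid because one's-complement addition is associative: folding all the carries once at the end gives the same 16-bit result as propagating them on every add.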

Or maybe the bottleneck here is purely the memory bandwidth?

Signed-off-by: Joel Schopp<[EMAIL PROTECTED]>

You missed a space there.


Segher

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev