The current 64-bit csum_partial_copy_generic function is based on
the 32-bit version and was never optimized for 64 bit. This patch
takes the 64-bit memcpy and adapts it to also compute the checksum.
It has been tested on a variety of input sizes and alignments on
Power5 and Power6 processors and gives correct output for all cases
tested. It also runs 20-55% faster than the implementation it
replaces, depending on size, alignment, and processor.
I think there is still some room for improvement in the unaligned
case, but given that it is much faster than what we have now I
figured I'd send it out.
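For readers unfamiliar with the approach, the idea of fusing the copy with the checksum can be sketched in C. This is an illustrative sketch only, not the patch's assembly: the function name `copy_and_csum64` is hypothetical, and the `if (sum < v)` test emulates the end-around carry that the PowerPC code would get from adde.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: copy 64-bit words while accumulating a
 * ones'-complement sum.  The real patch does this in assembly;
 * here the carry out of each 64-bit add is folded back in by
 * checking for wraparound. */
static uint64_t copy_and_csum64(uint64_t *dst, const uint64_t *src,
                                size_t nwords)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < nwords; i++) {
        uint64_t v = src[i];
        dst[i] = v;        /* the memcpy part */
        sum += v;
        if (sum < v)       /* wrapped: emulate adde's end-around carry */
            sum++;
    }
    return sum;
}
```

Fusing the two passes means the data is only pulled through the cache once, which is where most of the win over a separate memcpy plus csum_partial would come from.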
Did you consider the other alternative? If you work on 32-bit chunks
instead of 64-bit chunks (either load them with lwz, or split them
after loading with ld), you can add them up with a regular non-carrying
add, which isn't serialising like adde; this also allows unrolling the
loop (using several accumulators instead of just one). Since your
registers are 64-bit, you can sum 16GB of data before ever getting a
carry out.
Or maybe the bottleneck here is purely the memory bandwidth?
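The 32-bit-chunks idea above can be sketched in C. This is a minimal illustration, not the proposed kernel code: the function name `csum32_in_64` is made up, and alignment and tail handling are ignored for brevity. Each 32-bit word is added into a 64-bit accumulator with a plain add (no carry dependency between iterations), and the deferred carries are folded once at the end; two accumulators stand in for the unrolling Segher mentions.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch: sum 32-bit words in 64-bit accumulators with
 * ordinary non-carrying adds, folding the carries only at the end.
 * A 64-bit register can absorb ~4G worth of 32-bit words (16GB of
 * data) before it can overflow. */
static uint16_t csum32_in_64(const uint32_t *buf, size_t nwords)
{
    uint64_t s0 = 0, s1 = 0;   /* two accumulators = 2x unrolled loop */
    size_t i;

    for (i = 0; i + 1 < nwords; i += 2) {
        s0 += buf[i];          /* plain add: no serialising adde, so  */
        s1 += buf[i + 1];      /* the two chains run independently    */
    }
    if (i < nwords)
        s0 += buf[i];

    uint64_t sum = s0 + s1;
    /* fold 64 -> 32 -> 16, propagating all the deferred carries */
    sum = (sum & 0xffffffffULL) + (sum >> 32);
    sum = (sum & 0xffffffffULL) + (sum >> 32);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}
```

Since adde serialises on the carry bit, breaking that dependency chain is what lets the independent accumulators keep multiple adds in flight per cycle.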
Signed-off-by: Joel Schopp<[EMAIL PROTECTED]>
You missed a space there.
Segher
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev