The current 64-bit csum_partial_copy_generic function is based on the 32-bit version and was never optimized for 64 bit. This patch takes the 64-bit memcpy and adapts it to also compute the checksum. It has been tested on a variety of input sizes and alignments on Power5 and Power6 processors, and gives correct output for all cases tested. It also runs 20-55% faster than the implementation it replaces, depending on size, alignment, and processor.
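The core idea, copying and summing in a single pass over the data, can be sketched in C for the aligned case. This is a hypothetical illustration of the technique, not the kernel's actual csum_partial_copy_generic (which is PPC assembly and handles alignment, partial words, and faults); the helper name copy_and_sum64 is made up here. The carry check emulates what the carrying add (adde) does in the assembly loop:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch only: copy n 64-bit words from src to dst while accumulating
 * a one's-complement sum of the data, so the buffer is walked once
 * instead of twice (memcpy then csum_partial). */
static uint64_t copy_and_sum64(uint64_t *dst, const uint64_t *src, size_t n)
{
    uint64_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t v = src[i];
        dst[i] = v;        /* the memcpy half */
        sum += v;          /* the checksum half */
        if (sum < v)       /* overflow: emulate adde's end-around carry */
            sum++;
    }
    return sum;            /* caller folds 64 -> 32 -> 16 as usual */
}
```

The single pass is where the speedup comes from: the loads feed both the store stream and the adder, instead of the data being read twice.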

I think there is still some room for improvement in the unaligned case, but given that it is much faster than what we have now, I figured I'd send it out.

Did you consider the other alternative?  If you work on 32-bit chunks
instead of 64-bit chunks (either load them with lwz, or split them
after loading with ld), you can add them up with a regular non-carrying
add, which isn't serialising like adde; this also allows unrolling the
loop (using several accumulators instead of just one).  Since your
registers are 64-bit, you can sum 16GB of data before ever getting a
carry out.
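That suggestion can be sketched in C (an illustration of the technique, not PPC assembly; the function name csum32_sketch is invented here). Each 32-bit word is added into a 64-bit accumulator with a plain add, so no carry flag is involved, and two independent accumulators let the adds overlap instead of serialising on adde's carry chain. Carries are folded out once at the end:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: checksum a 4-byte-aligned buffer of nwords 32-bit words.
 * Because the accumulators are 64-bit, each can absorb 2^32 maximal
 * 32-bit words before overflowing -- the "16GB before a carry" point. */
static uint16_t csum32_sketch(const uint32_t *p, size_t nwords)
{
    uint64_t s0 = 0, s1 = 0;
    size_t i = 0;

    /* Unrolled by two with independent accumulators: no carry
     * dependency between iterations, unlike an adde loop. */
    for (; i + 2 <= nwords; i += 2) {
        s0 += p[i];
        s1 += p[i + 1];
    }
    if (i < nwords)
        s0 += p[i];

    uint64_t sum = s0 + s1;
    /* Fold 64 -> 32 -> 16 with end-around carries (one's complement). */
    sum = (sum & 0xffffffffULL) + (sum >> 32);
    sum = (sum & 0xffffffffULL) + (sum >> 32);
    uint32_t s = (uint32_t)sum;
    s = (s & 0xffff) + (s >> 16);
    s = (s & 0xffff) + (s >> 16);
    return (uint16_t)s;
}
```

The deferred fold is valid because one's-complement addition is associative: folding all the carries once at the end gives the same 16-bit result as propagating them on every add.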

Or maybe the bottleneck here is purely the memory bandwidth?

Signed-off-by: Joel Schopp<[EMAIL PROTECTED]>

You missed a space there.


Segher

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev