Re: [PATCH/RFC] 64 bit csum_partial_copy_generic

Joel Schopp Thu, 11 Sep 2008 10:45:15 -0700

Did you consider the other alternative?  If you work on 32-bit chunks
instead of 64-bit chunks (either load them with lwz, or split them
after loading with ld), you can add them up with a regular non-carrying
add, which isn't serialising like adde; this also allows unrolling the
loop (using several accumulators instead of just one).  Since your
registers are 64-bit, you can sum 16GB of data before ever getting a
carry out.
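
A rough C sketch of the idea quoted above, for illustration only: it is
not the kernel's PowerPC assembly, and the names sum32_two_acc, fold64
and csum16_ref are hypothetical. It assumes the length is a multiple of
8 bytes. Each 32-bit chunk is added into a 64-bit accumulator with a
plain add (no carry chain), two accumulators are used so the adds are
independent, and the result is folded to 16 bits only at the end.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 64-bit sum down to 16 bits with end-around carry. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* at most 33 bits */
    sum = (sum & 0xffffffffu) + (sum >> 32);  /* now fits in 32 bits */
    sum = (sum & 0xffffu) + (sum >> 16);      /* at most 17 bits */
    sum = (sum & 0xffffu) + (sum >> 16);      /* now fits in 16 bits */
    return (uint16_t)sum;
}

/*
 * Sum 32-bit chunks into two independent 64-bit accumulators.
 * Assumption: len is a multiple of 8. Each accumulator can absorb
 * 2^32 chunks (16GB of data) before it could overflow, so no carry
 * handling is needed inside the loop.
 */
static uint64_t sum32_two_acc(const unsigned char *buf, size_t len)
{
    uint64_t acc0 = 0, acc1 = 0;

    for (size_t i = 0; i + 8 <= len; i += 8) {
        uint32_t w0, w1;

        memcpy(&w0, buf + i, 4);      /* alignment-safe 32-bit loads */
        memcpy(&w1, buf + i + 4, 4);
        acc0 += w0;                   /* plain adds, no dependency */
        acc1 += w1;                   /* between the two chains    */
    }
    return acc0 + acc1;               /* fine for buffers far below 16GB */
}

/* Reference: 16-bit ones' complement sum, folding after every add. */
static uint16_t csum16_ref(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 2 <= len; i += 2) {
        uint16_t h;

        memcpy(&h, buf + i, 2);
        sum += h;
        sum = (sum & 0xffffu) + (sum >> 16);
    }
    return (uint16_t)((sum & 0xffffu) + (sum >> 16));
}
```

Because 2^16 is congruent to 1 modulo 2^16 - 1, fold64(sum32_two_acc(buf, len))
should match csum16_ref(buf, len) for the same buffer. Whether the second
accumulator actually helps is a separate question that only measurement
can answer.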


Or maybe the bottleneck here is purely the memory bandwidth?

I think the main bottleneck is the bandwidth/latency of memory.

When I sent the patch out I hadn't thought about eliminating the e from
the add with 32 bit chunks. So I went off and tried it today and
converting the existing function to use just add instead of adde (since
it was only doing 32 bits already) and got 1.5% - 15.7% faster on
Power5, which is nice, but was still way behind the new function in
every testcase. I then added 1 level of unrolling to that (using 2
accumulators) and got 59% slower to 10% faster on Power5 depending on
input. It seems quite a bit slower than I would have expected (I would
have expected basically even), but that's what got measured. The comment
in the existing function indicates unrolling the loop doesn't help
because the bdnz has zero overhead, so I guess the unrolling hurt more
than I expected.


In any case I have now thought about it and don't think it will work out.

Signed-off-by: Joel Schopp<[EMAIL PROTECTED]>


You missed a space there.

If at first you don't succeed...

Signed-off-by: Joel Schopp <[EMAIL PROTECTED]>
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
