Hi Segher,
> Not really. Do you know how many 16/32-bit words you can add before a
> 64-bit register can overflow? :-)
That's a very good point. I thought about using 32-bit adds when writing
the copy and checksum routine, but came to the conclusion that it wouldn't go
any faster than one using a
On both POWER6 and POWER7 this should be as fast as we can go since
we are limited by the latency of the adde instructions.
Not really. Do you know how many 16/32-bit words you can add before a
64-bit register can overflow? :-)
If you ever have to call this with more than 16 GB of data to sum, t
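For reference, the arithmetic behind that question can be sketched in C. Each 32-bit word is at most 2^32 - 1, so a 64-bit accumulator can absorb 2^32 of them (16 GB of data) with plain adds before it can possibly overflow. The names sum32_into_64 and fold64_to_16 are made up for illustration and do not come from the patch; this is a minimal sketch, not the routine under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum 32-bit words into a 64-bit accumulator using
 * plain adds, with no carry handling inside the loop. */
static uint64_t sum32_into_64(const uint32_t *words, size_t nwords)
{
    uint64_t acc = 0;

    /* 2^32 words of at most 2^32 - 1 each still sum to less than 2^64. */
    assert((uint64_t)nwords <= (UINT64_C(1) << 32));

    for (size_t i = 0; i < nwords; i++)
        acc += words[i];
    return acc;
}

/* Hypothetical helper: fold a 64-bit sum down to 16 bits, adding each
 * carry back into the low bits (ones' complement arithmetic). */
static uint16_t fold64_to_16(uint64_t acc)
{
    acc = (acc & 0xffffffff) + (acc >> 32);  /* at most 33 bits */
    acc = (acc & 0xffffffff) + (acc >> 32);  /* now fits in 32 bits */
    acc = (acc & 0xffff) + (acc >> 16);      /* at most 17 bits */
    acc = (acc & 0xffff) + (acc >> 16);      /* now fits in 16 bits */
    return (uint16_t)acc;
}
```

For example, summing the three words 0xffffffff, 0xffffffff and 2 gives 0x200000000 in the accumulator, which folds down to 2.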
The main loop of csum_partial runs very slowly on recent POWER CPUs. After some
analysis on both POWER6 and POWER7 I came up with the routine below. First we get
the source aligned to a double word, ignoring any odd alignment to keep things
simple. Then we do 64 bytes at a time, with an entry and exit
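A rough C sketch of the 64-bytes-at-a-time idea described above, not the actual routine: the routine under discussion relies on adde, whereas here a helper adds the carry back in by hand. The names add_carry64 and csum_sketch are hypothetical, and the sketch assumes memcpy loads in place of the alignment step and a simple zero-padded tail, both of which are simplifications of mine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit add that wraps the carry out back into bit 0; a C stand-in for
 * the carry chain that adde provides in hardware. */
static uint64_t add_carry64(uint64_t sum, uint64_t v)
{
    sum += v;
    return sum + (sum < v);     /* sum < v means the add wrapped */
}

/* Hypothetical sketch: 64-bit ones' complement sum over buf[0..len). */
static uint64_t csum_sketch(const unsigned char *buf, size_t len)
{
    uint64_t sum = 0;
    uint64_t w[8];

    assert(buf != NULL || len == 0);

    /* Main loop: 64 bytes (eight doublewords) per iteration. */
    while (len >= 64) {
        memcpy(w, buf, sizeof(w));      /* alignment-safe loads */
        for (int i = 0; i < 8; i++)
            sum = add_carry64(sum, w[i]);
        buf += 64;
        len -= 64;
    }

    /* Tail: remaining bytes, zero-padded up to a whole doubleword. */
    while (len > 0) {
        size_t n = len < 8 ? len : 8;
        uint64_t d = 0;

        memcpy(&d, buf, n);
        sum = add_carry64(sum, d);
        buf += n;
        len -= n;
    }
    return sum;
}
```

Every add in the loop depends on the carry from the previous one, which is why the latency of the add-with-carry step sets the pace of the whole loop.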