On Mon, Apr 15, 2019 at 07:18:22PM +0100, Robin Murphy wrote:
> On 12/04/2019 10:52, Will Deacon wrote:
> > I'm waiting for Robin to come back with numbers for a C implementation.
> > 
> > Robin -- did you get anywhere with that?
> 
> Still not what I would call finished, but where I've got so far (besides an
> increasingly elaborate test rig) is as below - it still wants some unrolling
> in the middle to really fly (and actual testing on BE), but the worst-case
> performance already equals or just beats this asm version on Cortex-A53 with
> GCC 7 (by virtue of being alignment-insensitive and branchless except for
> the loop). Unfortunately, the advantage of C code being instrumentable does
> also come around to bite me...

Is there any interest from anybody in spinning a proper patch out of this?
Shaokun?

Will

> /* Looks dumb, but generates nice-ish code */
> static u64 accumulate(u64 sum, u64 data)
> {
>       /* A 128-bit add makes the carry out of the 64-bit sum visible in bit 64 */
>       __uint128_t tmp = (__uint128_t)sum + data;
>       /* Fold that carry back in: the end-around carry of ones' complement addition */
>       return tmp + (tmp >> 64);
> }
> 
> unsigned int do_csum_c(const unsigned char *buff, int len)
> {
>       unsigned int offset, shift, sum, count;
>       const u64 *ptr;
>       u64 data, sum64 = 0;
> 
>       /* A zero or negative length must not touch the buffer at all */
>       if (unlikely(len <= 0))
>               return 0;
>
>       offset = (unsigned long)buff & 0x7;
>       /*
>        * This is to all intents and purposes safe, since rounding down cannot
>        * result in a different page or cache line being accessed, and @buff
>        * should absolutely not be pointing to anything read-sensitive.
>        * It does, however, piss off KASAN...
>        */
>       ptr = (const u64 *)(buff - offset);
>       shift = offset * 8;
> 
>       /*
>        * Head: zero out any excess leading bytes. Shifting back by the same
>        * amount should be at least as fast as any other way of handling the
>        * odd/even alignment, and means we can ignore it until the very end.
>        */
>       data = *ptr++;
> #ifdef __LITTLE_ENDIAN
>       data = (data >> shift) << shift;
> #else
>       data = (data << shift) >> shift;
> #endif
>       count = 8 - offset;
> 
>       /* Body: straightforward aligned loads from here on... */
>       /* TODO: fancy stuff with larger strides and uint128s? */
>       while (len > count) {
>               sum64 = accumulate(sum64, data);
>               data = *ptr++;
>               count += 8;
>       }
>       /*
>        * Tail: zero any over-read bytes similarly to the head, again
>        * preserving odd/even alignment.
>        */
>       shift = (count - len) * 8;
> #ifdef __LITTLE_ENDIAN
>       data = (data << shift) >> shift;
> #else
>       data = (data >> shift) << shift;
> #endif
>       sum64 = accumulate(sum64, data);
> 
>       /*
>        * Finally, fold to 16 bits; the rotate-and-add keeps each carry
>        * in the sum and leaves the folded result in the top half.
>        */
>       sum64 += (sum64 >> 32) | (sum64 << 32);
>       sum = sum64 >> 32;
>       sum += (sum >> 16) | (sum << 16);
>       /*
>        * An odd start address put each byte in the opposite lane of its
>        * 16-bit word; swab32() both extracts the top half and swaps its
>        * bytes back.
>        */
>       if (offset & 1)
>               return (u16)swab32(sum);
>
>       return sum >> 16;
> }
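
To see why accumulate() is correct, note that it adds modulo 2^64 - 1, i.e.
it performs a 64-bit ones' complement sum. A quick worked example: with
sum = 2^64 - 2 and data = 3, tmp = 2^64 + 1, so tmp >> 64 = 1 and the
truncated return value is 2, which is exactly (2^64 - 2 + 3) mod (2^64 - 1).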

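For anyone wanting to poke at the behaviour across alignments and lengths,
here is a minimal user-space sketch of a test rig (mine, not the one Robin
mentions): the type/byteswap shims and the csum_ref() reference below are
assumptions of this sketch, it expects do_csum_c() from above to be pasted
into the same file, and it assumes a little-endian host. It checks the
result against a byte-at-a-time, RFC 1071-style ones' complement sum for
every small offset/length combination:

#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the kernel definitions the snippet relies on */
typedef uint64_t u64;
typedef uint16_t u16;
#define __LITTLE_ENDIAN
#define swab32(x) __builtin_bswap32(x)
#define unlikely(x) __builtin_expect(!!(x), 0)

unsigned int do_csum_c(const unsigned char *buff, int len); /* paste from above */

/* Reference: sum little-endian 16-bit words with end-around carry */
static u16 csum_ref(const unsigned char *buf, int len)
{
        u64 sum = 0;
        int i;

        for (i = 0; i < len; i++)
                sum += (u64)buf[i] << (8 * (i & 1));
        while (sum >> 16)
                sum = (sum & 0xffff) + (sum >> 16);
        return sum;
}

int main(void)
{
        /* Aligned so do_csum_c()'s round-down never strays off the array */
        static unsigned char buf[512] __attribute__((aligned(8)));
        int off, len, i;

        for (i = 0; i < 512; i++)
                buf[i] = i * 31 + 7;    /* arbitrary non-trivial pattern */

        for (off = 0; off < 8; off++) {
                for (len = 1; len < 256; len++) {
                        u16 want = csum_ref(buf + off, len);
                        u16 got = (u16)do_csum_c(buf + off, len);

                        if (want != got) {
                                printf("FAIL off=%d len=%d want=%#x got=%#x\n",
                                       off, len, want, got);
                                return 1;
                        }
                }
        }
        printf("OK\n");
        return 0;
}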