On 12/04/2019 10:52, Will Deacon wrote:
On Fri, Apr 12, 2019 at 10:31:16AM +0800, Zhangshaokun wrote:
On 2019/2/19 7:08, Ard Biesheuvel wrote:
It turns out that the IP checksumming code is still exercised often,
even though one might expect that modern NICs with checksum offload
have no use for it. However, as Lingyan points out, there are
combinations of features where the network stack may still fall back
to software checksumming, and so it makes sense to provide an
optimized implementation in software as well.

So provide an implementation of do_csum() in scalar assembler, which,
unlike C, gives direct access to the carry flag, making the code run
substantially faster. The routine uses overlapping 64-byte loads for
all input sizes > 64 bytes, in order to reduce the number of branches
and improve performance on cores with deep pipelines.

On Cortex-A57, this implementation is on par with Lingyan's NEON
implementation, and roughly 7x as fast as the generic C code.
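
For reference, the accumulation step that the carry flag makes cheap can
be modelled in C along the lines below; csum_add64 is a hypothetical
name, not code from the patch, which presumably does the equivalent with
adds/adcs chains in assembly.

/*
 * Sketch only: recover the carry out of the 64-bit addition explicitly
 * and fold it back in (end-around carry), which is what the assembler
 * version reads directly from the flags.
 */
static inline unsigned long long csum_add64(unsigned long long sum,
                                            unsigned long long data)
{
        unsigned long long tmp;
        unsigned long long carry = __builtin_add_overflow(sum, data, &tmp);

        return tmp + carry;
}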

Cc: "huanglingyan (A)" <huanglingy...@huawei.com>
Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
Test code after the patch.

Hi maintainers and Ard,

Any update on it?

I'm waiting for Robin to come back with numbers for a C implementation.

Robin -- did you get anywhere with that?

Still not what I would call finished, but what I've got so far (besides
an increasingly elaborate test rig) is below. It still wants some
unrolling in the middle to really fly (and actual testing on BE), but
the worst-case performance already equals or just beats this asm version
on Cortex-A53 with GCC 7, by virtue of being alignment-insensitive and
branchless except for the loop. Unfortunately, the advantage of C code
being instrumentable does also come around to bite me...

Robin.

----->8-----

/*
 * Add with end-around carry: any carry out of the 64-bit addition is
 * folded straight back in, as ones'-complement arithmetic requires.
 * Looks dumb, but generates nice-ish code.
 */
static u64 accumulate(u64 sum, u64 data)
{
        __uint128_t tmp = (__uint128_t)sum + data;
        return tmp + (tmp >> 64);
}

unsigned int do_csum_c(const unsigned char *buff, int len)
{
        unsigned int offset, shift, sum, count;
        u64 data;
        const u64 *ptr;
        u64 sum64 = 0;

        offset = (unsigned long)buff & 0x7;
        /*
         * This is to all intents and purposes safe, since rounding down cannot
         * result in a different page or cache line being accessed, and @buff
         * should absolutely not be pointing to anything read-sensitive.
         * It does, however, piss off KASAN...
         */
        ptr = (const u64 *)(buff - offset);
        shift = offset * 8;

        /*
         * Head: zero out any excess leading bytes. Shifting back by the same
         * amount should be at least as fast as any other way of handling the
         * odd/even alignment, and means we can ignore it until the very end.
         */
        data = *ptr++;
#ifdef __LITTLE_ENDIAN
        data = (data >> shift) << shift;
#else
        data = (data << shift) >> shift;
#endif
        count = 8 - offset;

        /* Body: straightforward aligned loads from here on... */
        //TODO: fancy stuff with larger strides and uint128s?
        while (len > count) {
                sum64 = accumulate(sum64, data);
                data = *ptr++;
                count += 8;
        }
        /*
         * Tail: zero any over-read bytes similarly to the head, again
         * preserving odd/even alignment.
         */
        shift = (count - len) * 8;
#ifdef __LITTLE_ENDIAN
        data = (data << shift) >> shift;
#else
        data = (data >> shift) << shift;
#endif
        sum64 = accumulate(sum64, data);

        /* Finally, folding */
        sum64 += (sum64 >> 32) | (sum64 << 32);
        sum = sum64 >> 32;
        sum += (sum >> 16) | (sum << 16);
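        /*
         * An odd start address means the bytes were summed one position
         * out of phase within the 16-bit lanes; byte-swapping the folded
         * result puts them back in the right order.
         */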
        if (offset & 1)
                return (u16)swab32(sum);

        return sum >> 16;
}
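
The test rig mentioned above isn't included; a minimal userspace sketch
of the kind of check it presumably performs is below, comparing
do_csum_c() against a naive byte-pair reference over random lengths and
offsets. The shims, csum_ref() and the file name are made up for
illustration, a little-endian host is assumed, and the routine above is
expected to be pasted in after the shims (the PAD bytes keep its
deliberate under/over-reads inside the allocation). Build with something
like "gcc -O2 -o csum_test csum_test.c".

/* csum_test.c (hypothetical): userspace check for the do_csum_c() above */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* stand-ins for the kernel definitions the routine relies on */
typedef uint64_t u64;
typedef uint16_t u16;
#ifndef __LITTLE_ENDIAN
#define __LITTLE_ENDIAN 1
#endif
#define swab32(x) __builtin_bswap32(x)

/* the routine posted above: paste it here, or link it in separately */
unsigned int do_csum_c(const unsigned char *buff, int len);

/* naive reference: 16-bit ones'-complement sum of the buffer */
static unsigned int csum_ref(const unsigned char *buf, int len)
{
        uint32_t sum = 0;
        int i;

        for (i = 0; i + 1 < len; i += 2) {
                uint16_t w;

                memcpy(&w, buf + i, 2);         /* native-endian 16-bit word */
                sum += w;
        }
        if (len & 1)
                sum += buf[len - 1];            /* zero-pad the odd tail byte */
        while (sum >> 16)
                sum = (sum & 0xffff) + (sum >> 16);
        return sum;
}

int main(void)
{
        enum { SIZE = 4096, PAD = 8 };          /* PAD absorbs head/tail over-reads */
        unsigned char *buf = malloc(SIZE + 2 * PAD);
        int i;

        if (!buf)
                return 1;
        srand(1);
        for (i = 0; i < SIZE + 2 * PAD; i++)
                buf[i] = rand();

        for (i = 0; i < 100000; i++) {
                int len = 1 + rand() % SIZE;
                int off = PAD + rand() % (SIZE - len + 1);
                unsigned int want = csum_ref(buf + off, len);
                unsigned int got = do_csum_c(buf + off, len);

                if (got != want) {
                        printf("mismatch: off=%d len=%d got %#x want %#x\n",
                               off, len, got, want);
                        return 1;
                }
        }
        printf("all good\n");
        free(buf);
        return 0;
}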
