On 12/04/2019 10:52, Will Deacon wrote:
On Fri, Apr 12, 2019 at 10:31:16AM +0800, Zhangshaokun wrote:
On 2019/2/19 7:08, Ard Biesheuvel wrote:
It turns out that the IP checksumming code is still exercised often,
even though one might expect that modern NICs with checksum offload
have no use for it. However, as Lingyan points out, there are
combinations of features where the network stack may still fall back
to software checksumming, and so it makes sense to provide an
optimized implementation in software as well.
So provide an implementation of do_csum() in scalar assembler, which,
unlike C, gives direct access to the carry flag, making the code run
substantially faster. The routine uses overlapping 64 byte loads for
all input sizes > 64 bytes, in order to reduce the number of branches
and improve performance on cores with deep pipelines.
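
For illustration only (this is not the patch's actual asm routine, and
csum_fold_pair() is just a made-up name): a minimal sketch, assuming AArch64
and GCC/Clang extended inline asm, of how the carry flag lets scalar code
chain ones' complement additions with adds/adcs/adc, something plain C
cannot express directly:

static inline u64 csum_fold_pair(u64 sum, u64 a, u64 b)
{
	asm("adds	%0, %0, %1\n"	/* sum += a; carry out -> C flag */
	    "adcs	%0, %0, %2\n"	/* sum += b + C; carry out -> C  */
	    "adc	%0, %0, xzr"	/* fold the final carry back in  */
	    : "+r" (sum)
	    : "r" (a), "r" (b)
	    : "cc");
	return sum;
}
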
On Cortex-A57, this implementation is on par with Lingyan's NEON
implementation, and roughly 7x as fast as the generic C code.
Cc: "huanglingyan (A)" <huanglingy...@huawei.com>
Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
Test code after the patch.
Hi maintainers and Ard,
Any update on it?
I'm waiting for Robin to come back with numbers for a C implementation.
Robin -- did you get anywhere with that?
Still not what I would call finished, but what I've got so far (besides
an increasingly elaborate test rig) is below. It still wants some
unrolling in the middle to really fly, and actual testing on big-endian,
but the worst-case performance already equals or just beats this asm
version on Cortex-A53 with GCC 7, by virtue of being alignment-insensitive
and branchless except for the loop. Unfortunately, the advantage of C code
being instrumentable does also come around to bite me...
Robin.
----->8-----
/* Looks dumb, but generates nice-ish code */
static u64 accumulate(u64 sum, u64 data)
{
	/*
	 * The 128-bit add exposes the carry out of the 64-bit sum, and adding
	 * it straight back in (end-around carry) is exactly what the ones'
	 * complement checksum needs. That carry is at most 1, so the second
	 * addition cannot overflow the low 64 bits again.
	 */
	__uint128_t tmp = (__uint128_t)sum + data;
	return tmp + (tmp >> 64);
}
unsigned int do_csum_c(const unsigned char *buff, int len)
{
	unsigned int offset, shift, sum, count;
	u64 data, *ptr;
	u64 sum64 = 0;

	offset = (unsigned long)buff & 0x7;
	/*
	 * This is to all intents and purposes safe, since rounding down cannot
	 * result in a different page or cache line being accessed, and @buff
	 * should absolutely not be pointing to anything read-sensitive.
	 * It does, however, piss off KASAN...
	 */
	ptr = (u64 *)(buff - offset);
	shift = offset * 8;
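	/*
	 * Aside (not part of the snippet itself): one way to keep KASAN quiet
	 * about the loads below that deliberately stray outside the
	 * [buff, buff + len) window might be to do them via
	 * READ_ONCE_NOCHECK(), which skips instrumentation for just those
	 * accesses.
	 */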
	/*
	 * Head: zero out any excess leading bytes. Shifting back by the same
	 * amount should be at least as fast as any other way of handling the
	 * odd/even alignment, and means we can ignore it until the very end.
	 */
	data = *ptr++;
#ifdef __LITTLE_ENDIAN
	data = (data >> shift) << shift;
#else
	data = (data << shift) >> shift;
#endif
	count = 8 - offset;
	/* Body: straightforward aligned loads from here on... */
	/* TODO: fancy stuff with larger strides and uint128s? */
	while (len > count) {
		sum64 = accumulate(sum64, data);
		data = *ptr++;
		count += 8;
	}
	/*
	 * Tail: zero any over-read bytes similarly to the head, again
	 * preserving odd/even alignment.
	 */
	shift = (count - len) * 8;
#ifdef __LITTLE_ENDIAN
	data = (data << shift) >> shift;
#else
	data = (data >> shift) << shift;
#endif
	sum64 = accumulate(sum64, data);
	/*
	 * Finally, folding: adding the 32-bit-rotated value leaves the 32-bit
	 * ones' complement fold (end-around carry included) in the top half,
	 * and the same trick then folds that down to 16 bits. If the buffer
	 * started at an odd address, the accumulated bytes sit in swapped
	 * lanes, so swap them back before returning.
	 */
	sum64 += (sum64 >> 32) | (sum64 << 32);
	sum = sum64 >> 32;
	sum += (sum >> 16) | (sum << 16);
	if (offset & 1)
		return (u16)swab32(sum);
	return sum >> 16;
}
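
Not part of the patch, but for anyone wanting to poke at the above: a rough
sketch of a userspace harness that compares do_csum_c() against a naive
halfword-at-a-time reference at every starting alignment. Assumptions:
little-endian host, and u64/u16/swab32/__LITTLE_ENDIAN shimmed ahead of the
snippet for a userspace build (e.g. stdint typedefs, __builtin_bswap32 and
-D__LITTLE_ENDIAN); ref_csum() and the test pattern are made up here, not
Robin's actual test rig.

#include <stdint.h>
#include <stdio.h>

/* Naive reference: little-endian 16-bit words, zero-padded tail, folded */
static unsigned int ref_csum(const unsigned char *buf, int len)
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i + 1 < len; i += 2)
		sum += buf[i] | (buf[i + 1] << 8);
	if (len & 1)
		sum += buf[len - 1];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return sum;
}

int main(void)
{
	static unsigned char page[4096];
	int align, len, i;

	for (i = 0; i < 512; i++)
		page[256 + i] = i * 31 + 7;	/* arbitrary pattern */

	for (align = 0; align < 8; align++) {
		for (len = 1; len <= 256; len++) {
			unsigned int a = do_csum_c(page + 256 + align, len);
			unsigned int b = ref_csum(page + 256 + align, len);

			if (a != b)
				printf("mismatch: align %d len %d: %#x vs %#x\n",
				       align, len, a, b);
		}
	}
	return 0;
}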