On Fri, Feb 26, 2016 at 7:11 PM, Tom Herbert <t...@herbertland.com> wrote:
> On Fri, Feb 26, 2016 at 2:52 PM, Alexander Duyck <alexander.du...@gmail.com> wrote:
>> On Fri, Feb 26, 2016 at 12:03 PM, Tom Herbert <t...@herbertland.com> wrote:
>>> This patch implements a performant csum_partial for x86_64. The intent is to speed up checksum calculation, particularly for smaller lengths such as those present when doing skb_postpull_rcsum after getting CHECKSUM_COMPLETE from the device or after a CHECKSUM_UNNECESSARY conversion.
>>>
>>> - v4
>>>   - went back to C code with inline assembly for critical routines
>>>   - implemented suggestion from Linus to deal with lengths < 8
>>>
>>> Testing:
>>>
>>> Correctness:
>>>
>>> Verified correctness by testing arbitrary-length buffers filled with random data. For each buffer I compared the checksum computed by the new code against that of the original algorithm, for each possible alignment (0-7 bytes).
>>>
>>> Performance:
>>>
>>> Isolating the old and new implementations for some common cases:
>>>
>>>              Old      New       %
>>>  Len/Aln     nsecs    nsecs     Improv
>>> ---------+--------+--------+---------
>>>  1400/0      195.6    181.7      7%    (Big packet)
>>>  40/0         11.4      6.2     45%    (IPv6 hdr cmn case)
>>>  8/4           7.9      3.2     59%    (UDP, VXLAN in IPv4)
>>>  14/0          8.9      5.9     33%    (Eth hdr)
>>>  14/4          9.2      5.9     35%    (Eth hdr in IPv4)
>>>  14/3          9.6      5.9     38%    (Eth with odd align)
>>>  20/0          9.0      6.2     31%    (IP hdr without options)
>>>  7/1           8.9      4.2     52%    (buffer in one quad)
>>>  100/0        17.4     13.9     20%    (medium-sized pkt)
>>>  100/2        17.8     14.2     20%    (medium-sized pkt w/ alignment)
>>>
>>> Results from: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
>>>
>>> Also tested on these with similar results:
>>>
>>> Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
>>> Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>>>
>>> Branch prediction:
>>>
>>> To test the effects of poor branch prediction in the jump tables I measured checksum performance with runs over two combinations of length and alignment. As the baseline I performed half of the calls with the first combination, followed by the second combination for the second half. In the test case I interleave the two combinations so that the length and alignment differ on every call, to defeat branch prediction. Running several cases, I did not see any material performance difference between the two scenarios (perf stat output is below), nor does either case show a significant number of branch misses.
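(A hypothetical user-space sketch of the interleaved vs. sequential run pattern described above; the harness, function declaration, and length/alignment pairs are illustrative only and are not the quoted ./csum tool, whose actual perf output follows.)

/* Assumed user-space copy of the routine under test; signature taken from the patch. */
extern unsigned int do_csum(const void *buff, int len, unsigned int sum);

static unsigned int run_sequential(const unsigned char *buf, long iters)
{
        unsigned int sink = 0;
        long i;

        for (i = 0; i < iters / 2; i++)         /* first half: one combination */
                sink += do_csum(buf, 100, 0);
        for (i = 0; i < iters / 2; i++)         /* second half: the other */
                sink += do_csum(buf + 3, 40, 0);
        return sink;
}

static unsigned int run_interleaved(const unsigned char *buf, long iters)
{
        unsigned int sink = 0;
        long i;

        for (i = 0; i < iters / 2; i++) {       /* alternate on every call */
                sink += do_csum(buf, 100, 0);
                sink += do_csum(buf + 3, 40, 0);
        }
        return sink;
}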
>>>
>>> Interleave lengths case:
>>>
>>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>>     ./csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000
>>>
>>>  Performance counter stats for './csum -M new-thrash -I -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>>
>>>      9,556,693,202  instructions                                    ( +-  0.00% )
>>>      1,176,208,640  branches                                        ( +-  0.00% )
>>>             19,487  branch-misses    # 0.00% of all branches        ( +-  6.07% )
>>>
>>>        2.049732539 seconds time elapsed
>>>
>>> Non-interleave case:
>>>
>>> perf stat --repeat 10 -e '{instructions, branches, branch-misses}' \
>>>     ./csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000
>>>
>>>  Performance counter stats for './csum -M new-thrash -l 100 -S 24 -a 1 -c 100000000' (10 runs):
>>>
>>>      9,782,188,310  instructions                                    ( +-  0.00% )
>>>      1,251,286,958  branches                                        ( +-  0.01% )
>>>             18,950  branch-misses    # 0.00% of all branches        ( +- 12.74% )
>>>
>>>        2.271789046 seconds time elapsed
>>>
>>> Signed-off-by: Tom Herbert <t...@herbertland.com>
>>> ---
>>>  arch/x86/include/asm/checksum_64.h |  21 ++++
>>>  arch/x86/lib/csum-partial_64.c     | 225 ++++++++++++++++++++-----------------
>>>  2 files changed, 143 insertions(+), 103 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
>>> index cd00e17..e20c35b 100644
>>> --- a/arch/x86/include/asm/checksum_64.h
>>> +++ b/arch/x86/include/asm/checksum_64.h
>>> @@ -188,6 +188,27 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
>>>          return a;
>>>  }
>>>
>>> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
>>> +{
>>> +        asm("addq %2,%0\n\t"
>>> +            "adcq $0,%0"
>>> +            : "=r" (a)
>>> +            : "0" (a), "rm" (b));
>>> +        return a;
>>> +}
>>> +
>>> +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
>>> +                                             unsigned int c)
>>> +{
>>> +        asm("addl %2,%0\n\t"
>>> +            "adcl %3,%0\n\t"
>>> +            "adcl $0,%0"
>>> +            : "=r" (a)
>>> +            : "0" (a), "rm" (b), "rm" (c));
>>> +
>>> +        return a;
>>> +}
>>> +
>>>  #define HAVE_ARCH_CSUM_ADD
>>>  static inline __wsum csum_add(__wsum csum, __wsum addend)
>>>  {
>>> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
>>> index 9845371..df82c9b 100644
>>> --- a/arch/x86/lib/csum-partial_64.c
>>> +++ b/arch/x86/lib/csum-partial_64.c
>>> @@ -8,114 +8,66 @@
>>>  #include <linux/compiler.h>
>>>  #include <linux/module.h>
>>>  #include <asm/checksum.h>
>>> +#include <asm/word-at-a-time.h>
>>>
>>> -static inline unsigned short from32to16(unsigned a)
>>> +static inline unsigned long rotate_by8_if_odd(unsigned long sum, bool aligned)
>>>  {
>>> -        unsigned short b = a >> 16;
>>> -        asm("addw %w2,%w0\n\t"
>>> -            "adcw $0,%w0\n"
>>> -            : "=r" (b)
>>> -            : "0" (b), "r" (a));
>>> -        return b;
>>> +        if (unlikely(aligned))
>>> +                asm("rorq $0x8,%0"
>>> +                    : "=r" (sum)
>>> +                    : "0" (sum));
>>> +        return sum;
>>>  }
>>
>> Instead of making this conditional you might just want to perform the rotate always and just bump the aligned value up by 3 bits. That way you reduce the length of the dependency chain by 1, and rotates tend to only take a cycle on most x86_64 capable systems anyway.
>>
> I don't understand what you want. Can you provide more detail?
So right now you have aligned as a boolean value. The code path ends up being something like cmp, jmp, rorq. If you converted it to an integer value you could simply shift aligned by 3 and then perform the rotate in all cases, as a rotate by 0 has no effect. It basically just drops one uop from the path. (A rough C sketch of this idea follows the quoted hunk below.)

>>>
>>> -/*
>>> - * Do a 64-bit checksum on an arbitrary memory area.
>>> - * Returns a 32bit checksum.
>>> - *
>>> - * This isn't as time critical as it used to be because many NICs
>>> - * do hardware checksumming these days.
>>> - *
>>> - * Things tried and found to not make it faster:
>>> - *        Manual Prefetching
>>> - *        Unrolling to an 128 bytes inner loop.
>>> - *        Using interleaving with more registers to break the carry chains.
>>> - */
>>> -static unsigned do_csum(const unsigned char *buff, unsigned len)
>>> +/* Extract eight high order (nbo) bytes of quad "val". */
>>> +static inline unsigned long csum_partial_lt8_head(unsigned long val, int len)
>>>  {
>>> -        unsigned odd, count;
>>> -        unsigned long result = 0;
>>> +        return val & ((1ul << len * 8) - 1);
>>> +}
>>>
>>> -        if (unlikely(len == 0))
>>> -                return result;
>>> -        odd = 1 & (unsigned long) buff;
>>> -        if (unlikely(odd)) {
>>> -                result = *buff << 8;
>>> -                len--;
>>> -                buff++;
>>> -        }
>>> -        count = len >> 1;                /* nr of 16-bit words.. */
>>> -        if (count) {
>>> -                if (2 & (unsigned long) buff) {
>>> -                        result += *(unsigned short *)buff;
>>> -                        count--;
>>> -                        len -= 2;
>>> -                        buff += 2;
>>> -                }
>>> -                count >>= 1;                /* nr of 32-bit words.. */
>>> -                if (count) {
>>> -                        unsigned long zero;
>>> -                        unsigned count64;
>>> -                        if (4 & (unsigned long) buff) {
>>> -                                result += *(unsigned int *) buff;
>>> -                                count--;
>>> -                                len -= 4;
>>> -                                buff += 4;
>>> -                        }
>>> -                        count >>= 1;        /* nr of 64-bit words.. */
>>> -
>>> -                        /* main loop using 64byte blocks */
>>> -                        zero = 0;
>>> -                        count64 = count >> 3;
>>> -                        while (count64) {
>>> -                                asm("addq 0*8(%[src]),%[res]\n\t"
>>> -                                    "adcq 1*8(%[src]),%[res]\n\t"
>>> -                                    "adcq 2*8(%[src]),%[res]\n\t"
>>> -                                    "adcq 3*8(%[src]),%[res]\n\t"
>>> -                                    "adcq 4*8(%[src]),%[res]\n\t"
>>> -                                    "adcq 5*8(%[src]),%[res]\n\t"
>>> -                                    "adcq 6*8(%[src]),%[res]\n\t"
>>> -                                    "adcq 7*8(%[src]),%[res]\n\t"
>>> -                                    "adcq %[zero],%[res]"
>>> -                                    : [res] "=r" (result)
>>> -                                    : [src] "r" (buff), [zero] "r" (zero),
>>> -                                    "[res]" (result));
>>> -                                buff += 64;
>>> -                                count64--;
>>> -                        }
>>> -
>>> -                        /* last up to 7 8byte blocks */
>>> -                        count %= 8;
>>> -                        while (count) {
>>> -                                asm("addq %1,%0\n\t"
>>> -                                    "adcq %2,%0\n"
>>> -                                    : "=r" (result)
>>> -                                    : "m" (*(unsigned long *)buff),
>>> -                                    "r" (zero), "0" (result));
>>> -                                --count;
>>> -                                buff += 8;
>>> -                        }
>>> -                        result = add32_with_carry(result>>32,
>>> -                                                  result&0xffffffff);
>>> -
>>> -                        if (len & 4) {
>>> -                                result += *(unsigned int *) buff;
>>> -                                buff += 4;
>>> -                        }
>>> -                }
>>> -                if (len & 2) {
>>> -                        result += *(unsigned short *) buff;
>>> -                        buff += 2;
>>> -                }
>>> -        }
>>> -        if (len & 1)
>>> -                result += *buff;
>>> -        result = add32_with_carry(result>>32, result & 0xffffffff);
>>> -        if (unlikely(odd)) {
>>> -                result = from32to16(result);
>>> -                result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
>>> +/* Extract eight low order (nbo) bytes of quad in "val". */
>>> +static inline unsigned long csum_partial_lt8_tail(unsigned long val, int len)
>>> +{
>>> +        return val >> (8 * (8 - len));
>>> +}
>>
>> From what I can tell these functions only get used in one or two spots. It may be worth it to just place the code directly in the function rather than obscuring it by placing it in a function.
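(As an aside, a minimal C sketch of the unconditional-rotate idea mentioned at the top of this reply; the helper names are hypothetical and this is only an illustration of the suggestion, not the proposed kernel code. A compiler is expected to turn the rotate idiom into a single ror instruction.)

#include <stdint.h>

/* Branch-free right rotate; "-n & 63" avoids an undefined shift by 64 when n == 0. */
static inline uint64_t ror64(uint64_t v, unsigned int n)
{
        return (v >> n) | (v << (-n & 63));
}

/*
 * Instead of "if (odd) ror 8", derive the rotate count (0 or 8 bits) from
 * the low address bit and rotate unconditionally; a rotate by 0 is a no-op,
 * so the cmp/jmp pair drops out of the dependency chain.
 */
static inline uint64_t rotate_if_odd_start(uint64_t sum, uintptr_t addr)
{
        return ror64(sum, (unsigned int)(addr & 1) << 3);
}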
>>
>>> +/* Sum over buffer up to 64 bytes. */
>>> +static unsigned long csum_partial_le64(const void *buff, unsigned int len,
>>> +                                       unsigned long sum)
>>> +{
>>> +        asm("lea 40f(, %[slen], 4), %%r11\n\t"
>>> +            "clc\n\t"
>>> +            "jmpq *%%r11\n\t"
>>> +            "adcq 7*8(%[src]),%[res]\n\t"
>>
>> Technically this first adcq could just be an addq. You don't have anything to carry anyway.
>>
> To what end?

It basically allows us to get back to what I was proposing when we were at netconf, where we looked at merging the main loop and your jump table path so that you can put together a Duff's device and save yourself some code overhead.

>>> +            "adcq 6*8(%[src]),%[res]\n\t"
>>> +            "adcq 5*8(%[src]),%[res]\n\t"
>>> +            "adcq 4*8(%[src]),%[res]\n\t"
>>> +            "adcq 3*8(%[src]),%[res]\n\t"
>>> +            "adcq 2*8(%[src]),%[res]\n\t"
>>> +            "adcq 1*8(%[src]),%[res]\n\t"
>>> +            "adcq 0*8(%[src]),%[res]\n\t"
>>> +            "nop\n\t"
>>> +            "40: adcq $0,%[res]"
>>> +            : [res] "=r" (sum)
>>> +            : [src] "r" (buff),
>>> +              [slen] "r" (-((unsigned long)(len >> 3))), "[res]" (sum)
>>> +            : "r11");
>>> +
>>
>> It would probably be useful to add a comment explaining this block. I had to compile it to verify my understanding is correct: you are using the label 40 as a jump target and simply subtracting the length in longs from that jump target. If nothing else you might want to rename the label from 40 to something else just to improve the readability. It is possible that the code could get shifted around or inlined, and at that point the 40 becomes completely meaningless.
>>
>> Also it wouldn't hurt to explain the nop: the reason for the 4 in the lea instruction is that adcq is 4 bytes per instruction with a memory offset, and the last one has an offset of 0, so we need the nop to keep everything 4-byte aligned from the end.
>>
>>> +        if (len & 0x7) {
>>> +                unsigned long val;
>>> +                /*
>>> +                 * Since "len" is > 8 here we backtrack in the buffer to load
>>> +                 * the outstanding bytes into the low order bytes of a quad and
>>> +                 * sum those in csum_partial_lt8_tail. By doing this we avoid
>>> +                 * additional calls to load_unaligned_zeropad.
>>> +                 */
>>> +
>>> +                val = csum_partial_lt8_tail(*(unsigned long *)(buff + len - 8),
>>> +                                            len & 0x7);
>>> +                sum = add64_with_carry(val, sum);
>>>          }
>>> -        return result;
>>> +
>>> +        return sum;
>>>  }
>>>
>>>  /*
>>> @@ -130,10 +82,77 @@ static unsigned do_csum(const unsigned char *buff, unsigned len)
>>>   *
>>>   * it's best to have buff aligned on a 64-bit boundary
>>>   */
>>> +static unsigned int do_csum(const void *buff, int len, unsigned int sum)
>>> +{
>>> +        bool aligned = false;
>>> +        unsigned long result = 0;
>>> +
>>> +        /*
>>> +         * For "small" lengths don't bother with alignment, x86 handles
>>> +         * this pretty well.
>>> +         */
>>> +        if (len <= 8) {
>>> +                unsigned long val;
>>> +
>>> +                /* Optimize trivial cases */
>>> +                if (len == 8)
>>> +                        return add32_with_carry3(sum,
>>> +                                                 *(unsigned int *)buff,
>>> +                                                 *(unsigned int *)(buff + 4));
>>
>> Doesn't this still run the risk of triggering a fault if the buffer crosses pages on a non-4-byte-aligned boundary?
>>
> It's safe because we know the length of the buffer is eight bytes. There is only one instance where we need load_unaligned_zeropad to deal with potentially crossing over to a bad page.

I think I see what you are getting at, but to me it just seems like there are some unnecessary optimizations going on. For example, I don't really see the point in optimizing for the 0 case below.
If someone requested a sum over 0 bytes I believe the lt8_head function will return 0 anyway. You might also be better off just converting this to a (len < 7) check and letting the code fall through into the less-than-64 section as well.

>>> +                else if (len == 0)
>>> +                        return sum;
>>> +
>>> +                val = load_unaligned_zeropad(buff);
>>> +                result = csum_partial_lt8_head(val, len);
>>> +
>>> +                return add32_with_carry3(sum, result >> 32,
>>> +                                         result & 0xffffffff);
>>> +        } else if (len <= 64) {
>>> +                result = csum_partial_le64(buff, len, result);
>>> +
>>> +                return add32_with_carry3(sum, result >> 32,
>>> +                                         result & 0xffffffff);
>>> +        }
>>> +
>>
>> One side effect I see from the fact that you are calling csum_partial_le64 in two spots is that you end up having to make a call to it instead of having it inlined directly into your do_csum function. What you may want to do is look at altering your code flow so that if (len > 64) you do the bits below that occur before you call into csum_partial_le64; that way you reduce the number of callers to csum_partial_le64 to one, which should get the compiler to inline it into your function.
>>
> I see that gcc is inlining both calls to csum_partial_le64.

Maybe your compiler is smarter than mine, or I have some other build option enabled that is preventing it.

>>> +        /*
>>> +         * Length is greater than 64. Sum to eight byte alignment before
>>> +         * proceeding with main loop.
>>> +         */
>>> +        aligned = !!((unsigned long)buff & 0x1);
>>> +        if (aligned) {
>>> +                unsigned int align = 7 & -(unsigned long)buff;
>>> +
>>> +                result = csum_partial_lt8_head(*(unsigned long *)buff, align);
>>> +                buff += align;
>>> +                len -= align;
>>> +                result = rotate_by8_if_odd(result, align);
>>> +        }
>>
>> I'm still not a fan of the unaligned reads. They may be okay, but it just seems like we are going to run into corner cases all over the place where this ends up biting us.
>>
> They are already all over the place in the stack. For instance, any time we process an IP header within VXLAN we're likely doing several unaligned reads to load IP addresses etc. These are a big problem for architectures like Sparc, but do not appear to be an issue for x86.

Yes, but in our case we can align ourselves with minimal overhead, which is why I figure it might be worth the effort to do so.

>>> +        /* main loop using 64byte blocks */
>>> +        for (; len >= 64; len -= 64, buff += 64) {
>>> +                asm("addq 0*8(%[src]),%[res]\n\t"
>>> +                    "adcq 1*8(%[src]),%[res]\n\t"
>>> +                    "adcq 2*8(%[src]),%[res]\n\t"
>>> +                    "adcq 3*8(%[src]),%[res]\n\t"
>>> +                    "adcq 4*8(%[src]),%[res]\n\t"
>>> +                    "adcq 5*8(%[src]),%[res]\n\t"
>>> +                    "adcq 6*8(%[src]),%[res]\n\t"
>>> +                    "adcq 7*8(%[src]),%[res]\n\t"
>>> +                    "adcq $0,%[res]"
>>> +                    : [res] "=r" (result)
>>> +                    : [src] "r" (buff),
>>> +                      "[res]" (result));
>>> +        }
>>> +
>>> +        result = csum_partial_le64(buff, len, result);
>>> +        result = rotate_by8_if_odd(result, aligned);
>>> +
>>> +        return add32_with_carry3(sum, result >> 32, result & 0xffffffff);
>>> +}
>>> +

I'll play around with this code a bit over the weekend. I have a few ideas I want to play with. I figure we should be able to get to the point of having this code running pretty lean. For example, I was looking at the Duff's device idea and I think we can probably get this code quite a bit smaller and faster.

As far as the patch itself I have no objections. It is better than what we have now.
I think I am just starting to realize we might be able to squeeze some more speed out of it and might even be able to reduce the code size in the process.

- Alex
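(For illustration only: a rough, C-only sketch of the Duff's-device-style structure discussed above, entering a fallthrough chain at a point chosen by the number of remaining 8-byte quads, much as the quoted asm computes a jump into its adcq chain. The names are hypothetical, carries are folded with the usual end-around-carry trick rather than adcq, and this is not the kernel implementation.)

#include <stdint.h>
#include <string.h>

/* Add with end-around carry: if the 64-bit add wraps, fold the carry back in. */
static inline uint64_t add64_carry(uint64_t sum, uint64_t v)
{
        sum += v;
        return sum + (sum < v);
}

/*
 * Sum 0-7 remaining 8-byte quads by jumping into a fallthrough switch,
 * mirroring the computed jump into the adcq chain: entering at case n
 * sums offsets (n - 1) * 8 down to 0.
 */
static uint64_t csum_tail_quads(const unsigned char *buf, unsigned int nquads,
                                uint64_t sum)
{
        uint64_t q;

        switch (nquads) {
        case 7: memcpy(&q, buf + 6 * 8, 8); sum = add64_carry(sum, q);
                /* fall through */
        case 6: memcpy(&q, buf + 5 * 8, 8); sum = add64_carry(sum, q);
                /* fall through */
        case 5: memcpy(&q, buf + 4 * 8, 8); sum = add64_carry(sum, q);
                /* fall through */
        case 4: memcpy(&q, buf + 3 * 8, 8); sum = add64_carry(sum, q);
                /* fall through */
        case 3: memcpy(&q, buf + 2 * 8, 8); sum = add64_carry(sum, q);
                /* fall through */
        case 2: memcpy(&q, buf + 1 * 8, 8); sum = add64_carry(sum, q);
                /* fall through */
        case 1: memcpy(&q, buf + 0 * 8, 8); sum = add64_carry(sum, q);
                /* fall through */
        case 0: break;
        }
        return sum;
}

With this shape the 64-byte main loop and the up-to-64-byte tail could share one body, which is roughly the code-size saving that merging the main loop with the jump table is expected to buy.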