Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-28 Thread Maciej W. Rozycki
On Sun, 28 Feb 2016, Alexander Duyck wrote:

> I actually found the root cause. The problem is in add32_with_carry3.
>
> > +static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
> > +                                             unsigned int c)
> > +{
> > +        asm("ad…
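
For reference, here is a minimal sketch of what a three-input add with full carry propagation typically looks like in this family of helpers, with the output register tied back to the first input through the "0" matching constraint. This is an illustrative reconstruction, not necessarily the patch's exact code nor the specific bug found:

    static inline unsigned int add32_with_carry3(unsigned int a, unsigned int b,
                                                 unsigned int c)
    {
            asm("addl %2,%0\n\t"    /* a += b, setting the carry flag */
                "adcl %3,%0\n\t"    /* a += c + carry */
                "adcl $0,%0"        /* fold the final carry back into a */
                : "=r" (a)
                : "0" (a), "rm" (b), "rm" (c));
            return a;
    }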

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-28 Thread Alexander Duyck
On Sun, Feb 28, 2016 at 11:15 AM, Tom Herbert wrote:
> On Sun, Feb 28, 2016 at 10:56 AM, Alexander Duyck wrote:
>> On Sat, Feb 27, 2016 at 12:30 AM, Alexander Duyck wrote:
>>>> +{
>>>> +        asm("lea 40f(, %[slen], 4), %%r11\n\t"
>>>> +            "clc\n\t"
>>>> +            "jmpq *%%r11\n\…

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-28 Thread George Spelvin
I was just noticing that these two:

> +static inline unsigned long add64_with_carry(unsigned long a, unsigned long b)
> +{
> +        asm("addq %2,%0\n\t"
> +            "adcq $0,%0"
> +            : "=r" (a)
> +            : "0" (a), "rm" (b));
> +        return a;
> +}
> +
> +static inline unsigne…
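
For comparison, the long-standing 32-bit helper in arch/x86/include/asm/checksum_64.h follows the same pattern that the quoted add64_with_carry widens to quadwords (quoted from memory of trees of that era, so treat as indicative rather than authoritative):

    static inline unsigned add32_with_carry(unsigned a, unsigned b)
    {
            asm("addl %2,%0\n\t"    /* a += b */
                "adcl $0,%0"        /* end-around carry */
                : "=r" (a)
                : "0" (a), "rm" (b));
            return a;
    }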

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-28 Thread Eric Dumazet
On Fri, 2016-02-26 at 12:03 -0800, Tom Herbert wrote:
> +
> +	/*
> +	 * Length is greater than 64. Sum to eight byte alignment before
> +	 * proceeding with main loop.
> +	 */
> +	aligned = !!((unsigned long)buff & 0x1);
> +	if (aligned) {
> +		unsigned int align…
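
A standalone model of the head-alignment computation under discussion; the name is mine and this is not the patch's code. Note, as a matter of arithmetic, that masking with 0x1 only detects odd addresses, while the distance to an 8-byte boundary involves all three low address bits:

    #include <stdint.h>
    #include <stddef.h>

    /* Bytes that must be consumed before 'buff' reaches an 8-byte
     * boundary: 0 if already aligned, otherwise 1..7. */
    static size_t bytes_to_8byte_boundary(const void *buff)
    {
            return (size_t)(-(uintptr_t)buff & 0x7);
    }

With that count in hand, the head bytes can be summed separately before the main loop takes over with aligned 8-byte loads.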

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-28 Thread Tom Herbert
On Sun, Feb 28, 2016 at 10:56 AM, Alexander Duyck wrote:
> On Sat, Feb 27, 2016 at 12:30 AM, Alexander Duyck wrote:
>>> +{
>>> +        asm("lea 40f(, %[slen], 4), %%r11\n\t"
>>> +            "clc\n\t"
>>> +            "jmpq *%%r11\n\t"
>>> +            "adcq 7*8(%[src]),%[res]\n\t"
>>> +…

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-28 Thread Alexander Duyck
On Sat, Feb 27, 2016 at 12:30 AM, Alexander Duyck wrote:
>> +{
>> +        asm("lea 40f(, %[slen], 4), %%r11\n\t"
>> +            "clc\n\t"
>> +            "jmpq *%%r11\n\t"
>> +            "adcq 7*8(%[src]),%[res]\n\t"
>> +            "adcq 6*8(%[src]),%[res]\n\t"
>> +            "adcq 5*8(%[src]),%[re…

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-27 Thread Alexander Duyck
> +{
> +        asm("lea 40f(, %[slen], 4), %%r11\n\t"
> +            "clc\n\t"
> +            "jmpq *%%r11\n\t"
> +            "adcq 7*8(%[src]),%[res]\n\t"
> +            "adcq 6*8(%[src]),%[res]\n\t"
> +            "adcq 5*8(%[src]),%[res]\n\t"
> +            "adcq 4*8(%[src]),%[res]\n\t"
> +…
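
For readers following along: the quoted asm clears the carry flag and computes a jump into the middle of an unrolled adcq chain, so that exactly slen of the eight quadwords are added, each adcq pulling in the previous one's carry. Below is a rough portable C model of that control flow, with switch fall-through standing in for the computed jump and a 128-bit accumulator standing in for the hardware carry chain. It is a sketch of the technique, not the patch's code:

    #include <stdint.h>

    /* Sum exactly n (0..8) quadwords starting at src[0] into sum. */
    static uint64_t sum_quadwords(const uint64_t *src, unsigned int n,
                                  uint64_t sum)
    {
            unsigned __int128 acc = sum;

            switch (n) {
            case 8: acc += src[7];  /* fall through */
            case 7: acc += src[6];  /* fall through */
            case 6: acc += src[5];  /* fall through */
            case 5: acc += src[4];  /* fall through */
            case 4: acc += src[3];  /* fall through */
            case 3: acc += src[2];  /* fall through */
            case 2: acc += src[1];  /* fall through */
            case 1: acc += src[0];  /* fall through */
            case 0: break;
            }

            /* End-around fold of the accumulated carries, playing the
             * role of the trailing adcq $0. */
            uint64_t lo = (uint64_t)acc, hi = (uint64_t)(acc >> 64);
            lo += hi;
            return lo + (lo < hi);
    }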

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Linus Torvalds
On Fri, Feb 26, 2016 at 2:52 PM, Alexander Duyck wrote:
>
> I'm still not a fan of the unaligned reads. They may be okay but it
> just seems like we are going to run into corner cases all over the
> place where this ends up biting us.

No. Unaligned reads are not just "ok". The fact is, not doing…

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Alexander Duyck
On Fri, Feb 26, 2016 at 7:11 PM, Tom Herbert wrote:
> On Fri, Feb 26, 2016 at 2:52 PM, Alexander Duyck wrote:
>> On Fri, Feb 26, 2016 at 12:03 PM, Tom Herbert wrote:
>>> This patch implements performant csum_partial for x86_64. The intent is
>>> to speed up checksum calculation, particularly f…

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Tom Herbert
On Fri, Feb 26, 2016 at 2:52 PM, Alexander Duyck wrote:
> On Fri, Feb 26, 2016 at 12:03 PM, Tom Herbert wrote:
>> This patch implements performant csum_partial for x86_64. The intent is
>> to speed up checksum calculation, particularly for smaller lengths such
>> as those that are present when do…

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Alexander Duyck
On Fri, Feb 26, 2016 at 12:03 PM, Tom Herbert wrote:
> This patch implements performant csum_partial for x86_64. The intent is
> to speed up checksum calculation, particularly for smaller lengths such
> as those that are present when doing skb_postpull_rcsum when getting
> CHECKSUM_COMPLETE from d…

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Tom Herbert
On Fri, Feb 26, 2016 at 12:29 PM, Linus Torvalds wrote:
> Looks ok to me.
>
> I am left wondering if the code should just do that
>
>         add32_with_carry3(sum, result >> 32, result);
>
> in the caller instead - right now pretty much every return point in
> do_csum() effectively does that, with t…

Re: [PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Linus Torvalds
Looks ok to me.

I am left wondering if the code should just do that

        add32_with_carry3(sum, result >> 32, result);

in the caller instead - right now pretty much every return point in
do_csum() effectively does that, with the exception of

 - the 0-length case, which is presumably not really…
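
A sketch of the refactoring being suggested: the inner routine returns the raw 64-bit sum and the single 64-to-32 fold happens once in the caller. do_csum64() below is a deliberately naive stand-in for the patch's optimized loop, and both names are mine for illustration:

    #include <stdint.h>

    /* Three-input fold as quoted upthread (same sketch as earlier,
     * repeated here so the example is self-contained). */
    static inline uint32_t add32_with_carry3(uint32_t a, uint32_t b,
                                             uint32_t c)
    {
            asm("addl %2,%0\n\t"
                "adcl %3,%0\n\t"
                "adcl $0,%0"
                : "=r" (a)
                : "0" (a), "rm" (b), "rm" (c));
            return a;
    }

    /* Naive stand-in for the optimized inner loop: raw 64-bit sum of
     * 16-bit words, with no final folding. */
    static uint64_t do_csum64(const unsigned char *p, int len)
    {
            uint64_t acc = 0;

            while (len >= 2) {
                    acc += (uint32_t)(p[0] | (p[1] << 8));
                    p += 2;
                    len -= 2;
            }
            if (len)
                    acc += p[0];
            return acc;
    }

    /* The single fold point: every return path feeds through one
     * add32_with_carry3() in the caller. */
    uint32_t csum_partial_sketch(const unsigned char *buff, int len,
                                 uint32_t sum)
    {
            uint64_t result = do_csum64(buff, len);

            return add32_with_carry3(sum, (uint32_t)(result >> 32),
                                     (uint32_t)result);
    }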

[PATCH v4 net-next] net: Implement fast csum_partial for x86_64

2016-02-26 Thread Tom Herbert
This patch implements performant csum_partial for x86_64. The intent is
to speed up checksum calculation, particularly for smaller lengths such
as those that are present when doing skb_postpull_rcsum when getting
CHECKSUM_COMPLETE from device or after CHECKSUM_UNNECESSARY conversion.

- v4
   - wen…
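
To illustrate why short lengths dominate here, a standalone model of the skb_postpull_rcsum() arithmetic the commit message refers to: when header bytes are pulled from the front of a CHECKSUM_COMPLETE packet, their partial sum has to be subtracted from the stored checksum, one csum_partial call per pulled header. All names and code below are simplified stand-ins, not the kernel's implementations:

    #include <stdint.h>
    #include <stddef.h>

    /* Reference (unoptimized) partial checksum over len bytes. */
    static uint32_t csum_partial_ref(const unsigned char *p, size_t len,
                                     uint32_t sum)
    {
            uint64_t acc = sum;

            while (len >= 2) {              /* 16-bit little-endian words */
                    acc += (uint32_t)(p[0] | (p[1] << 8));
                    p += 2;
                    len -= 2;
            }
            if (len)                        /* trailing odd byte */
                    acc += p[0];
            while (acc >> 32)               /* fold carries back in */
                    acc = (acc & 0xffffffffu) + (acc >> 32);
            return (uint32_t)acc;
    }

    /* Ones-complement add and subtract, mirroring the kernel's
     * csum_add()/csum_sub() formulation. */
    static uint32_t csum_add_ref(uint32_t csum, uint32_t addend)
    {
            uint32_t res = csum + addend;

            return res + (res < addend);    /* end-around carry */
    }

    static uint32_t csum_sub_ref(uint32_t csum, uint32_t addend)
    {
            return csum_add_ref(csum, ~addend);
    }

    /* The postpull adjustment: subtract the sum of the pulled bytes. */
    static uint32_t postpull_rcsum(uint32_t skb_csum,
                                   const unsigned char *head, size_t pulled)
    {
            return csum_sub_ref(skb_csum, csum_partial_ref(head, pulled, 0));
    }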