Re: [PATCH v2 2/2] powerpc32: optimise csum_partial() loop

2015-08-17 Thread leroy christophe
On 17/08/2015 13:00, leroy christophe wrote: On 17/08/2015 12:56, leroy christophe wrote: On 07/08/2015 01:25, Segher Boessenkool wrote: On Thu, Aug 06, 2015 at 05:45:45PM -0500, Scott Wood wrote: If this makes performance non-negligibly worse on other 32-bit chips, and is an im

Re: [PATCH v2 2/2] powerpc32: optimise csum_partial() loop

2015-08-17 Thread leroy christophe
On 17/08/2015 12:56, leroy christophe wrote: On 07/08/2015 01:25, Segher Boessenkool wrote: On Thu, Aug 06, 2015 at 05:45:45PM -0500, Scott Wood wrote: If this makes performance non-negligibly worse on other 32-bit chips, and is an important improvement on 8xx, then we can use an ifde

Re: [PATCH v2 2/2] powerpc32: optimise csum_partial() loop

2015-08-17 Thread leroy christophe
On 07/08/2015 01:25, Segher Boessenkool wrote: On Thu, Aug 06, 2015 at 05:45:45PM -0500, Scott Wood wrote: If this makes performance non-negligibly worse on other 32-bit chips, and is an important improvement on 8xx, then we can use an ifdef since 8xx already requires its own kernel build.

Re: [PATCH v2 2/2] powerpc32: optimise csum_partial() loop

2015-08-06 Thread Segher Boessenkool
On Thu, Aug 06, 2015 at 05:45:45PM -0500, Scott Wood wrote: > > The original loop was already optimal, as the comment said. > > The comment says that bdnz has zero overhead. That doesn't mean the adde > won't stall waiting for the load result. adde is execution serialising on those cores; it *a

Re: [PATCH v2 2/2] powerpc32: optimise csum_partial() loop

2015-08-06 Thread Scott Wood
On Wed, 2015-08-05 at 23:39 -0500, Segher Boessenkool wrote: > On Wed, Aug 05, 2015 at 09:31:41PM -0500, Scott Wood wrote: > > On Wed, 2015-08-05 at 19:30 -0500, Segher Boessenkool wrote: > > > On Wed, Aug 05, 2015 at 03:29:35PM +0200, Christophe Leroy wrote: > > > > On the 8xx, load latency is 2 c

Re: [PATCH v2 2/2] powerpc32: optimise csum_partial() loop

2015-08-05 Thread Segher Boessenkool
On Wed, Aug 05, 2015 at 09:31:41PM -0500, Scott Wood wrote: > On Wed, 2015-08-05 at 19:30 -0500, Segher Boessenkool wrote: > > On Wed, Aug 05, 2015 at 03:29:35PM +0200, Christophe Leroy wrote: > > > On the 8xx, load latency is 2 cycles and taking branches also takes > > > 2 cycles. So let's unroll

Re: [PATCH v2 2/2] powerpc32: optimise csum_partial() loop

2015-08-05 Thread Scott Wood
On Wed, 2015-08-05 at 19:30 -0500, Segher Boessenkool wrote: > On Wed, Aug 05, 2015 at 03:29:35PM +0200, Christophe Leroy wrote: > > On the 8xx, load latency is 2 cycles and taking branches also takes > > 2 cycles. So let's unroll the loop. > > This is not true for most other 32-bit PowerPC; this

Re: [PATCH v2 2/2] powerpc32: optimise csum_partial() loop

2015-08-05 Thread Segher Boessenkool
On Wed, Aug 05, 2015 at 03:29:35PM +0200, Christophe Leroy wrote: > On the 8xx, load latency is 2 cycles and taking branches also takes > 2 cycles. So let's unroll the loop. This is not true for most other 32-bit PowerPC; this patch makes performance worse on e.g. 6xx/7xx/7xxx. Let's not! Seghe

[PATCH v2 2/2] powerpc32: optimise csum_partial() loop

2015-08-05 Thread Christophe Leroy
On the 8xx, load latency is 2 cycles and taking branches also takes 2 cycles. So let's unroll the loop. Signed-off-by: Christophe Leroy --- v2: Only use lwzu for the last load as lwzu has undocumented additional latency arch/powerpc/lib/checksum_32.S | 16 +++- 1 file chang
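
For readers unfamiliar with the technique under discussion, here is a rough C sketch of an unrolled ones-complement word sum. It is not the assembly from the patch: the names csum_words_unrolled and add_carry are hypothetical, and the carry is folded after every add instead of being chained through adde as the real code does. It only illustrates the idea of issuing several loads before the dependent adds and taking one branch per four words.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 32-bit add with end-around carry. */
static uint32_t add_carry(uint32_t sum, uint32_t w)
{
    uint64_t t = (uint64_t)sum + w;
    return (uint32_t)t + (uint32_t)(t >> 32);
}

/* Hypothetical sketch: sum nwords 32-bit words into sum, four per iteration. */
uint32_t csum_words_unrolled(const uint32_t *p, size_t nwords, uint32_t sum)
{
    while (nwords >= 4) {
        /* Issue all four loads first so the adds need not wait on each one. */
        uint32_t a = p[0], b = p[1], c = p[2], d = p[3];

        sum = add_carry(sum, a);
        sum = add_carry(sum, b);
        sum = add_carry(sum, c);
        sum = add_carry(sum, d);
        p += 4;
        nwords -= 4;
    }
    /* Leftover words, one at a time. */
    while (nwords--)
        sum = add_carry(sum, *p++);
    return sum;
}
```

Whether this ordering helps depends on the core: as the replies above note, the trade-off that pays off on the 8xx may cost performance on other 32-bit PowerPC parts.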