From: Robin Murphy
> Sent: 15 May 2019 11:58
> To: David Laight; 'Will Deacon'
> Cc: Zhangshaokun; Ard Biesheuvel; linux-arm-ker...@lists.infradead.org;
> netdev@vger.kernel.org; ilias.apalodi...@linaro.org; huanglingyan (A);
> steve.cap...@arm.com
> Subject: Re: [PATCH] arm64: do_csum: implement accelerated scalar version
>
> On 15/05/2019 11:15, David Laight wrote:
> > ...
> >>>	ptr = (u64 *)(buff - offset);
> >>>	shift = offset * 8;
> >>>
> >>>	/*
> >>>	 * Head: zero out any excess leading bytes. Shifting back by the same
> >>>	 * amount should be at least as fast as any other way of handling the
> >>>	 * odd/even alignment, and means we can ignore it until the very end.
> >>>	 */
> >>>	data = *ptr++;
> >>> #ifdef __LITTLE_ENDIAN
> >>>	data = (data >> shift) << shift;
> >>> #else
> >>>	data = (data << shift) >> shift;
> >>> #endif
> >
> > I suspect that
> > #ifdef __LITTLE_ENDIAN
> >	data &= ~0ull << shift;
> > #else
> >	data &= ~0ull >> shift;
> > #endif
> > is likely to be better.
>
> Out of interest, better in which respects? For the A64 ISA at least,
> that would take 3 instructions plus an additional scratch register, e.g.:
>
>	MOV	x2, #~0
>	LSL	x2, x2, x1
>	AND	x0, x0, x2
>
> (alternatively "AND x0, x0, x2, LSL x1" to save 4 bytes of code, but that
> will typically take as many cycles if not more than just pipelining the
> two 'simple' ALU instructions)
>
> Whereas the original is just two shift instructions in-place.
>
>	LSR	x0, x0, x1
>	LSL	x0, x0, x1
>
> If the operation were repeated, the constant generation could certainly
> be amortised over multiple subsequent ANDs for a net win, but that isn't
> the case here.
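For reference, the two C variants under discussion reduce to the following stand-alone sketch. The helper names are illustrative (not from the patch), and the #ifdef follows the kernel convention used in the quoted hunks, where only one of __LITTLE_ENDIAN/__BIG_ENDIAN is defined; a userspace build via endian.h would need a different check.

	#include <stdint.h>

	/*
	 * Variant from the patch: shift the unwanted leading bytes out and
	 * then back in. Both shifts operate on the loaded data itself.
	 */
	static inline uint64_t head_shift(uint64_t data, unsigned int shift)
	{
	#ifdef __LITTLE_ENDIAN
		return (data >> shift) << shift;
	#else
		return (data << shift) >> shift;
	#endif
	}

	/*
	 * Suggested variant: AND with a mask built from ~0ull. The mask
	 * depends only on 'shift', not on the loaded data.
	 */
	static inline uint64_t head_mask(uint64_t data, unsigned int shift)
	{
	#ifdef __LITTLE_ENDIAN
		return data & (~0ull << shift);
	#else
		return data & (~0ull >> shift);
	#endif
	}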
On a superscalar processor you reduce the register dependency chain by one instruction. The original code is pretty much a single dependency chain, so you are likely to be able to generate the mask 'for free'.

	David
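As a rough sanity check (not from the thread), a small user-space harness along these lines could be used to compare the two forms. Numbers at this scale are very noisy and depend on the core, the compiler flags and the surrounding code, so treat the result as indicative at best; the constants and accumulator trick here are illustrative only, and only the little-endian expressions are exercised for brevity.

	#define _POSIX_C_SOURCE 199309L	/* for clock_gettime() */
	#include <stdint.h>
	#include <stdio.h>
	#include <time.h>

	#define ITERS	(1u << 26)

	static double now(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec + ts.tv_nsec / 1e9;
	}

	int main(void)
	{
		uint64_t data = 0x0123456789abcdefull;
		uint64_t acc1 = 0, acc2 = 0;
		double t0, t1, t2;
		unsigned int i;

		t0 = now();
		for (i = 0; i < ITERS; i++) {
			unsigned int shift = (i & 7) * 8;	/* offset * 8, as in the patch */

			/* Shift-back variant; feed acc back in to keep a real dependency chain. */
			acc1 += ((data ^ acc1) >> shift) << shift;
		}
		t1 = now();
		for (i = 0; i < ITERS; i++) {
			unsigned int shift = (i & 7) * 8;

			/* Mask variant: the mask depends only on 'shift'. */
			acc2 += (data ^ acc2) & (~0ull << shift);
		}
		t2 = now();

		printf("shift-back: %.3fs (%#llx)\nmask:       %.3fs (%#llx)\n",
		       t1 - t0, (unsigned long long)acc1,
		       t2 - t1, (unsigned long long)acc2);
		return 0;
	}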