On Mon, Mar 7, 2016 at 5:56 AM, David Laight <david.lai...@aculab.com> wrote:
> From: Alexander Duyck
> ...
>> Actually probably the easiest way to go on x86 is to just replace the
>> use of len with (len >> 6) and use decl or incl instead of addl or
>> subl, and lea instead of addq for the buff address.  None of those
>> instructions affect the carry flag as this is how such loops were
>> intended to be implemented.
>>
>> I've been doing a bit of testing and that seems to work without
>> needing the adcq until after you exit the loop, but doesn't give that
>> much of a gain in speed for dropping the instruction from the
>> hot-path.  I suspect we are probably memory bottle-necked already in
>> the loop so dropping an instruction or two doesn't gain you much.
>
> Right, any superscalar architecture gives you some instructions
> 'for free' if they can execute at the same time as those on the
> critical path (in this case the memory reads and the adc).
> This is why loop unrolling can be pointless.
>
> So the loop:
> 10:     addc %rax,(%rdx,%rcx,8)
>         inc %rcx
>         jnz 10b
> could easily be as fast as anything that doesn't use the 'new'
> instructions that use the overflow flag.
> That loop might be measurably faster for aligned buffers.

Tested by replacing the unrolled loop in my patch with just:

if (len >= 8) {
        asm("clc\n\t"
            "0: adcq (%[src],%%rcx,8),%[res]\n\t"
            "decl %%ecx\n\t"
            "jge 0b\n\t"
            "adcq $0, %[res]\n\t"
            : [res] "=r" (result)
            : [src] "r" (buff), "[res]" (result), "c" ((len >> 3) - 1));
}

This seems to be significantly slower:

1400 bytes: 797 nsecs vs. 202 nsecs
40 bytes: 6.5 nsecs vs. 26.8 nsecs

Tom

>
>         David
>
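
For anyone comparing against the asm above, here is a minimal plain-C sketch of what that single-adcq loop computes. The name csum_words_ref and its parameters are made up for illustration and are not part of the patch. It assumes buff points at nwords 64-bit words and makes no attempt to be fast.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical reference helper (not from the patch).
 *
 * Adds 64-bit words from the highest index down to 0, feeding the
 * carry out of each add into the next one (what adcq does), then
 * adds the last carry back in (the trailing "adcq $0").
 * 'result' is the incoming partial sum, as in the asm version.
 */
static uint64_t csum_words_ref(const uint64_t *buff, size_t nwords,
                               uint64_t result)
{
        unsigned int carry = 0;         /* emulates CF, cleared like "clc" */
        size_t i;

        for (i = nwords; i-- > 0; ) {
                uint64_t t = result + buff[i];
                unsigned int c1 = t < result;   /* carry out of the word add */
                uint64_t t2 = t + carry;
                unsigned int c2 = t2 < t;       /* carry out of adding old CF */

                result = t2;
                carry = c1 | c2;                /* at most one of them is set */
        }

        /* Final "adcq $0": fold the remaining carry back in. */
        return result + carry;
}
```

Feeding the same buffer to this and to the asm variant (with nwords = len >> 3) should give the same 64-bit sum, which is a cheap sanity check before timing anything.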