From: Tom Herbert
> Sent: 03 February 2016 19:19
...
> +	/* Main loop */
> +50:	adcq	0*8(%rdi),%rax
> +	adcq	1*8(%rdi),%rax
> +	adcq	2*8(%rdi),%rax
> +	adcq	3*8(%rdi),%rax
> +	adcq	4*8(%rdi),%rax
> +	adcq	5*8(%rdi),%rax
> +	adcq	6*8(%rdi),%rax
> +	adcq	7*8(%rdi),%rax
> +	adcq	8*8(%rdi),%rax
> +	adcq	9*8(%rdi),%rax
> +	adcq	10*8(%rdi),%rax
> +	adcq	11*8(%rdi),%rax
> +	adcq	12*8(%rdi),%rax
> +	adcq	13*8(%rdi),%rax
> +	adcq	14*8(%rdi),%rax
> +	adcq	15*8(%rdi),%rax
> +	lea	128(%rdi), %rdi
> +	loop	50b

I'd need convincing that unrolling the loop like that gives any
significant gain.
You have a dependency chain on the carry flag, so there are delays between
the 'adcq' instructions (these may be more significant than the memory
reads from the L1 cache).

I also don't remember (I might be wrong) the 'loop' instruction being
executed quickly.
If 'loop' is fast then you will probably find that:

10:	adcq	0(%rdi),%rax
	lea	8(%rdi),%rdi
	loop	10b

is just as fast, since the three instructions could all be executed in
parallel.
But I suspect that 'dec %cx; jnz 10b' is actually better (and might
execute as a single micro-op).
IIRC 'adc' and 'dec' both have dependencies on the flags register,
so they cannot execute together (which is a shame here).

It is also possible that breaking the carry-chain dependency by doing
32bit adds (possibly after 64bit reads) can be made faster.

	David
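
To make the last idea concrete, here is a rough C sketch (not the patch's code, and not measured for speed). It reads 64-bit words but adds the two 32-bit halves into separate 64-bit accumulators, so no add depends on the carry flag of the previous one. The names csum_split32, csum_carry_chain and fold64 are made up for illustration, and the buffer is assumed to hold a whole number of 8-byte words with nwords below 2^31. The carry-chain version is only there as a reference to compare results against.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fold a 64-bit partial sum into 16 bits,
 * adding the carries back in (one's-complement style). */
static uint16_t fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffffffffu) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Reference: 64-bit adds with end-around carry. Every add depends on
 * the carry out of the previous one, like the 'adcq' chain. */
static uint16_t csum_carry_chain(const unsigned char *buf, size_t nwords)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < nwords; i++) {
		uint64_t w;

		memcpy(&w, buf + 8 * i, sizeof(w));
		sum += w;
		sum += (sum < w);	/* add the carry back in */
	}
	return fold64(sum);
}

/* Sketch: 64-bit reads, 32-bit halves summed into two independent
 * 64-bit accumulators. Assumes nwords < 2^31 so that neither the
 * accumulators nor lo + hi can overflow. */
static uint16_t csum_split32(const unsigned char *buf, size_t nwords)
{
	uint64_t lo = 0, hi = 0;

	for (size_t i = 0; i < nwords; i++) {
		uint64_t w;

		memcpy(&w, buf + 8 * i, sizeof(w));
		lo += w & 0xffffffffu;	/* low 32 bits */
		hi += w >> 32;		/* high 32 bits */
	}
	return fold64(lo + hi);
}
```

Both functions should return the same 16-bit value for the same buffer. Whether the split version is actually quicker depends on the CPU and on what the compiler generates, so it would need timing against the 'adcq' loop.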