From: George Spelvin
> Sent: 10 February 2016 14:44
...
> > I think the fastest loop is:
> > 10:	adcq	0(%rdi,%rcx,8),%rax
> > 	inc	%rcx
> > 	jnz	10b
> > That loop looks like it will have no overhead on recent cpu.
>
> Well, it should execute at 1 instruction/cycle.

I presume you do mean 1 adc/cycle.
If it doesn't, unrolling once might help.

> (No, a scaled offset doesn't take extra time.)

Maybe I'm remembering the 386 book.

> To break that requires ADCX/ADOX:
>
> 10:	adcxq	0(%rdi,%rcx),%rax
> 	adoxq	8(%rdi,%rcx),%rdx
> 	leaq	16(%rcx),%rcx
> 	jrcxz	11f
> 	j	10b
> 11:

Getting 2 adc/cycle probably does require a little unrolling.
With luck the adcxq, adoxq and leaq will execute together.
The jrcxz is two clocks - so the loop definitely needs a second adoxq/adcxq pair.

Experiments would be needed to confirm these guesses though.

	David
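
For reference, here is a rough C model of what the two quoted loops compute. It is only a sketch of the arithmetic and says nothing about timing. The names add_carry, sum_one_chain and sum_two_chains are made up for illustration, and it assumes the buffer is a whole number of 64-bit words (an even number for the two-chain version). The negative index in the first function mirrors my reading of the adcq loop, where %rcx appears to count up to zero with %rdi pointing at the end of the buffer. Unlike the asm, the C loops also accept an empty buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper modelling one adc: acc + word + carry-in.
 * Returns the low 64 bits and stores the carry-out in *carry. */
static inline uint64_t add_carry(uint64_t acc, uint64_t word, unsigned *carry)
{
	uint64_t t = acc + word;
	unsigned c1 = t < acc;		/* carry out of acc + word */
	uint64_t s = t + *carry;
	unsigned c2 = s < t;		/* carry out of adding the carry-in */

	*carry = c1 | c2;		/* both cannot be set at once */
	return s;
}

/* Model of the single adcq loop: one carry chain, index runs -nwords..0. */
uint64_t sum_one_chain(const uint64_t *buf, size_t nwords)
{
	const uint64_t *end = buf + nwords;
	uint64_t sum = 0;
	unsigned carry = 0;

	for (ptrdiff_t i = -(ptrdiff_t)nwords; i != 0; i++)
		sum = add_carry(sum, end[i], &carry);

	/* Fold the last carry back in, as an 'adc $0' after the loop would. */
	return add_carry(sum, 0, &carry);
}

/* Model of the adcx/adox loop: two independent carry chains. */
uint64_t sum_two_chains(const uint64_t *buf, size_t nwords)
{
	uint64_t a = 0, b = 0;
	unsigned ca = 0, cb = 0, c = 0;

	assert(nwords % 2 == 0);	/* assumption: pairs of words only */

	for (size_t i = 0; i < nwords; i += 2) {
		a = add_carry(a, buf[i], &ca);		/* adcx chain (CF) */
		b = add_carry(b, buf[i + 1], &cb);	/* adox chain (OF) */
	}

	a = add_carry(a, 0, &ca);	/* fold each chain's pending carry */
	b = add_carry(b, 0, &cb);
	a = add_carry(a, b, &c);	/* merge the two partial sums */
	return add_carry(a, 0, &c);	/* end-around carry from the merge */
}
```

The two functions keep their carries separate in the same way the two flags do. As far as I know that is why the second loop uses leaq and jrcxz for the loop control: neither instruction writes the flags, so both chains survive the iteration.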