Hi Arch, all,

Thanks for looking into this; it's amazing to have experts here who understand the depths of compilers.

I'm stubbornly having difficulty reproducing your timings, even though I see the same assembly generated by clang. I've tried on an i5-3320M and on an E5-2650, and on both, Julia is faster. How were you measuring the times in nanoseconds?
Miles

On Thursday, January 29, 2015 at 12:20:46 PM UTC-5, Arch Robison wrote:
>
> I can't replicate the 2x difference. The C is faster for me. But I have
> gcc 4.8.2, not gcc 4.9.1. Nonetheless, the experiment points out where
> Julia is missing a loop optimization that Clang and gcc get. Here is a
> summary of combinations that I tried on an i7-4770 @ 3.40 GHz.
>
> - Julia 0.3.5: *70* nsec. Inner loop is:
>
>     L82: vmulsd   XMM3, XMM1, XMM2
>          vmulsd   XMM4, XMM1, XMM1
>          vsubsd   XMM4, XMM4, XMM0
>          vdivsd   XMM3, XMM4, XMM3
>          vaddsd   XMM1, XMM1, XMM3
>          vmulsd   XMM3, XMM1, XMM1
>          vsubsd   XMM3, XMM3, XMM0
>          vmovq    RDX, XMM3
>          and      RDX, RAX
>          vmovq    XMM3, RDX
>          vucomisd XMM3, QWORD PTR [RCX]
>          ja       L82
>
> - Julia trunk from around Jan 19 + LLVM 3.5: *61* nsec. Inner loop is:
>
>     L80: vmulsd   xmm4, xmm1, xmm1
>          vsubsd   xmm4, xmm4, xmm0
>          vmulsd   xmm5, xmm1, xmm2
>          vdivsd   xmm4, xmm4, xmm5
>          vaddsd   xmm1, xmm1, xmm4
>          vmulsd   xmm4, xmm1, xmm1
>          vsubsd   xmm4, xmm4, xmm0
>          vandpd   xmm4, xmm4, xmm3
>          vucomisd xmm4, qword ptr [rax]
>          ja       L80
>
>   The abs is done more efficiently than in Julia 0.3.5 because of PR
>   #8364. LLVM missed a CSE opportunity here because of loop rotation:
>   the last vmulsd of each iteration computes the same thing as the
>   first vmulsd of the next iteration.
>
> - C code compiled with gcc 4.8.2, using gcc -O2 -std=c99 -march=native
>   -mno-fma: *46* nsec
>
>     .L11: vaddsd   %xmm1, %xmm1, %xmm3
>           vdivsd   %xmm3, %xmm2, %xmm2
>           vsubsd   %xmm2, %xmm1, %xmm1
>           vmulsd   %xmm1, %xmm1, %xmm2
>           vsubsd   %xmm0, %xmm2, %xmm2
>           vmovapd  %xmm2, %xmm3
>           vandpd   %xmm5, %xmm3, %xmm3
>           vucomisd %xmm4, %xmm3
>           ja       .L11
>
>   Multiply by 2 (5 clock latency) has been replaced by add-to-self
>   (3 clock latency). It picked up the CSE opportunity. Only 1 vmulsd
>   per iteration!
>
> - C code compiled with clang 3.5.0, using clang -O2 -march=native:
>   *46* nsec
>
>     .LBB1_3: # %while.body
>              # =>This Inner Loop Header: Depth=1
>         vmulsd   %xmm3, %xmm1, %xmm5
>         vdivsd   %xmm5, %xmm2, %xmm2
>         vaddsd   %xmm2, %xmm1, %xmm1
>         vmulsd   %xmm1, %xmm1, %xmm2
>         vsubsd   %xmm0, %xmm2, %xmm2
>         vandpd   %xmm4, %xmm2, %xmm5
>         vucomisd .LCPI1_1(%rip), %xmm5
>         ja       .LBB1_3
>
>   Clang picks up the CSE opportunity but misses the add-to-self
>   opportunity (xmm3 = -2.0). It's also using LLVM.
>
> We should check why Julia is missing the CSE opportunity. Maybe Clang is
> running a pass that handles CSE for a rotated loop? Though looking at the
> Julia pass list, it looks like CSE runs before loop rotation. Needs more
> investigation.
>
> - Arch
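P.S. For anyone following along: from the assembly, the kernel looks like Newton's method for square roots. A source-level sketch of the CSE in question (my reconstruction, not the original benchmark source): the residual `x*x - a` is needed both for the update and for the loop test, and the rotated loop computes it twice unless the compiler reuses it across iterations.

```c
#include <math.h>

/* Straightforward version: x*x - a is computed twice per iteration,
 * once for the update and once for the termination test. */
double newton_sqrt(double a, double tol) {
    double x = a;
    while (fabs(x * x - a) > tol)
        x = x - (x * x - a) / (2.0 * x);
    return x;
}

/* After the CSE gcc and clang apply: the residual from the end of one
 * iteration is reused as the numerator of the next, leaving one vmulsd
 * per iteration.  gcc additionally strength-reduces 2.0 * x to x + x
 * (add-to-self), which has lower latency than a multiply on Haswell. */
double newton_sqrt_cse(double a, double tol) {
    double x = a;
    double r = x * x - a;              /* residual, shared with the loop test */
    while (fabs(r) > tol) {
        x = x - r / (x + x);           /* 2.0*x replaced by x + x */
        r = x * x - a;                 /* the single multiply per iteration */
    }
    return x;
}
```

Both versions compute the same result; the second simply makes explicit the value reuse that the compilers discover (or, in Julia's case, currently miss) after loop rotation.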
