Hi Arch, all,

Thanks for looking into this; it's amazing to have experts here who 
understand the depths of compilers. I'm stubbornly having difficulty 
reproducing your timings, even though I see the same assembly generated 
for clang. I've tried on an i5-3320M and an E5-2650, and on both Julia is 
faster. How were you measuring the times in nanoseconds?

Miles

On Thursday, January 29, 2015 at 12:20:46 PM UTC-5, Arch Robison wrote:
>
> I can't replicate the 2x difference.  The C is faster for me.  But I 
> have gcc 4.8.2, not gcc 4.9.1.  Nonetheless, the experiment points out 
> where Julia is missing a loop optimization that Clang and gcc get.  Here is 
> a summary of combinations that I tried on an i7-4770 @ 3.40 GHz.
>
>    - Julia 0.3.5: *70*  nsec.  Inner loop is:
>
> L82:    vmulsd  XMM3, XMM1, XMM2
>         vmulsd  XMM4, XMM1, XMM1
>         vsubsd  XMM4, XMM4, XMM0
>         vdivsd  XMM3, XMM4, XMM3
>         vaddsd  XMM1, XMM1, XMM3
>         vmulsd  XMM3, XMM1, XMM1
>         vsubsd  XMM3, XMM3, XMM0
>         vmovq   RDX, XMM3
>         and     RDX, RAX
>         vmovq   XMM3, RDX
>         vucomisd XMM3, QWORD PTR [RCX]
>         ja      L82
>
>
>    - Julia trunk from around Jan 19 + LLVM 3.5: *61* nsec.  Inner loop is:
>
> L80:    vmulsd  xmm4, xmm1, xmm1
>         vsubsd  xmm4, xmm4, xmm0
>         vmulsd  xmm5, xmm1, xmm2
>         vdivsd  xmm4, xmm4, xmm5
>         vaddsd  xmm1, xmm1, xmm4
>         vmulsd  xmm4, xmm1, xmm1
>         vsubsd  xmm4, xmm4, xmm0
>         vandpd  xmm4, xmm4, xmm3
>         vucomisd xmm4, qword ptr [rax]
>         ja      L80
>
>  
>
> The abs is done more efficiently than in Julia 0.3.5 because of PR #8364. 
> LLVM missed a CSE opportunity here because of loop rotation: the last 
> vmulsd of each iteration computes the same thing as the first vmulsd of the 
> next iteration.  
>
>
>    - C code compiled with gcc 4.8.2, using gcc -O2 -std=c99 
>    -march=native -mno-fma: *46* nsec
>
> .L11:
>         vaddsd  %xmm1, %xmm1, %xmm3
>         vdivsd  %xmm3, %xmm2, %xmm2
>         vsubsd  %xmm2, %xmm1, %xmm1
>         vmulsd  %xmm1, %xmm1, %xmm2
>         vsubsd  %xmm0, %xmm2, %xmm2
>         vmovapd %xmm2, %xmm3
>         vandpd  %xmm5, %xmm3, %xmm3
>         vucomisd        %xmm4, %xmm3
>         ja      .L11
>
>
> Multiply by 2 (5 clock latency) has been replaced by add-to-self (3 clock 
> latency).  It picked up the CSE opportunity.  Only 1 vmulsd per iteration!
>
>
>    - C code compiled with clang 3.5.0, using clang -O2 -march=native: *46* 
>    nsec
>
> .LBB1_3:                                # %while.body
>                                         # =>This Inner Loop Header: Depth=1
>         vmulsd  %xmm3, %xmm1, %xmm5
>         vdivsd  %xmm5, %xmm2, %xmm2
>         vaddsd  %xmm2, %xmm1, %xmm1
>         vmulsd  %xmm1, %xmm1, %xmm2
>         vsubsd  %xmm0, %xmm2, %xmm2
>         vandpd  %xmm4, %xmm2, %xmm5
>         vucomisd        .LCPI1_1(%rip), %xmm5
>         ja      .LBB1_3
>
>
> Clang picks up the CSE opportunity but misses the add-to-self opportunity 
> (xmm3=-2.0).  Clang also uses LLVM, so we should check why Julia is 
> missing the CSE opportunity.  Maybe Clang is running a pass that handles 
> CSE for a rotated loop?  Though looking at the Julia pass list, it looks 
> like CSE runs before loop rotation.  Needs more investigation.
>
>
> - Arch 
>  
>
