Re: [julia-users] benchmark with Julia 2x faster than C

Erik Schnetter Tue, 27 Jan 2015 18:23:04 -0800

The C version looks better than the Julia version...

My first guess would be a problem with the timing code.


Another option (but I'm out of my depth here and only wildly guessing) is a 
partial register stall in the C version. It first performs operations on the 
xmm registers as scalars, and then one vector operation (vandpd). The Julia 
code does not do this -- it performs the "and" operation (i.e. fabs) in an 
integer register instead. I don't know whether the conditions for a partial 
register stall are met, but if this is the case, it could cause problems with 
out-of-order execution.

-erik

> On Jan 27, 2015, at 19:30 , Miles Lubin <[email protected]> wrote:
> 
> That was with gcc 4.9.1 -O2 -march=native, by the way. Here's clang 3.5.0 -O2 
> -march=native:
> 
> 0000000000400570 <squareroot>:
>   400570: c5 fb 59 c8           vmulsd %xmm0,%xmm0,%xmm1
>   400574: c5 f3 5c d0           vsubsd %xmm0,%xmm1,%xmm2
>   400578: c5 e9 54 0d 80 04 00 vandpd 0x480(%rip),%xmm2,%xmm1        # 400a00 
> <_IO_stdin_used+0x60>
>   40057f: 00
>   400580: c5 f9 2e 0d 28 04 00 vucomisd 0x428(%rip),%xmm1        # 4009b0 
> <_IO_stdin_used+0x10>
>   400587: 00
>   400588: 76 3d                 jbe    4005c7 <squareroot+0x57>
>   40058a: c5 fb 10 1d 26 04 00 vmovsd 0x426(%rip),%xmm3        # 4009b8 
> <_IO_stdin_used+0x18>
>   400591: 00
>   400592: c5 fb 10 25 66 04 00 vmovsd 0x466(%rip),%xmm4        # 400a00 
> <_IO_stdin_used+0x60>
>   400599: 00
>   40059a: c5 f8 28 c8           vmovaps %xmm0,%xmm1
>   40059e: 66 90                 xchg   %ax,%ax
>   4005a0: c5 f3 59 eb           vmulsd %xmm3,%xmm1,%xmm5
>   4005a4: c5 eb 5e d5           vdivsd %xmm5,%xmm2,%xmm2
>   4005a8: c5 f3 58 ca           vaddsd %xmm2,%xmm1,%xmm1
>   4005ac: c5 f3 59 d1           vmulsd %xmm1,%xmm1,%xmm2
>   4005b0: c5 eb 5c d0           vsubsd %xmm0,%xmm2,%xmm2
>   4005b4: c5 e9 54 ec           vandpd %xmm4,%xmm2,%xmm5
>   4005b8: c5 f9 2e 2d f0 03 00 vucomisd 0x3f0(%rip),%xmm5        # 4009b0 
> <_IO_stdin_used+0x10>
>   4005bf: 00
>   4005c0: 77 de                 ja     4005a0 <squareroot+0x30>
>   4005c2: c5 f8 28 c1           vmovaps %xmm1,%xmm0
>   4005c6: c3                   retq
>   4005c7: c3                   retq
>   4005c8: 0f 1f 84 00 00 00 00 nopl   0x0(%rax,%rax,1)
>   4005cf: 00
> 
> 
> On Tuesday, January 27, 2015 at 7:17:58 PM UTC-5, Miles Lubin wrote:
> Here's the Julia assembly:
> 
> function squareroot(x)
>     it = x
>     err = it*it-x # line 5
>     while abs(err) > 1e-13
>         it = it - (err)/(2it) # line 7
>         err = it*it-x
>     end
>     return it
> end
> Source line: 5
>  push RBP
>  mov RBP, RSP
> Source line: 5
>  vmulsd XMM1, XMM0, XMM0
>  vsubsd XMM2, XMM1, XMM0
> Source line: 6
>  vmovq RCX, XMM2
>  movabs RAX, 9223372036854775807
>  and RCX, RAX
>  vmovq XMM1, RCX
>  movabs RCX, 140067103257920
>  vucomisd XMM1, QWORD PTR [RCX]
>  ja 9
>  vmovaps XMM1, XMM0
>  jmpq 61
>  movabs RDX, 140067103257928
>  vmovsd XMM3, QWORD PTR [RDX]
>  vmovaps XMM1, XMM0
> Source line: 7
>  vmulsd XMM4, XMM1, XMM3
>  vdivsd XMM2, XMM2, XMM4
>  vaddsd XMM1, XMM1, XMM2
> Source line: 8
>  vmulsd XMM2, XMM1, XMM1
>  vsubsd XMM2, XMM2, XMM0
>  vmovq RDX, XMM2
>  and RDX, RAX
>  vmovq XMM4, RDX
>  vucomisd XMM4, QWORD PTR [RCX]
>  ja -43
> Source line: 10
>  vmovaps XMM0, XMM1
>  pop RBP
>  ret
> 
> And for C:
> double squareroot(double x)
> {
>     double it = x;
>     double err = it*it-x;
>     while (fabs(err) > 1e-13) {
>         it = it - (err)/(2*it);
>         err = it*it-x;
>     }
>     return it;
> }
> 00000000004007d0 <squareroot>:
>   4007d0: c5 fb 59 c8           vmulsd %xmm0,%xmm0,%xmm1
>   4007d4: c5 f9 28 d0           vmovapd %xmm0,%xmm2
>   4007d8: c5 fb 10 2d e0 01 00 vmovsd 0x1e0(%rip),%xmm5        # 4009c0 
> <_IO_stdin_used+0x70>
>   4007df: 00
>   4007e0: c5 fb 10 25 a8 01 00 vmovsd 0x1a8(%rip),%xmm4        # 400990 
> <_IO_stdin_used+0x40>
>   4007e7: 00
>   4007e8: c5 f3 5c c8           vsubsd %xmm0,%xmm1,%xmm1
>   4007ec: c5 f9 28 d9           vmovapd %xmm1,%xmm3
>   4007f0: c5 e1 54 dd           vandpd %xmm5,%xmm3,%xmm3
>   4007f4: c5 f9 2e dc           vucomisd %xmm4,%xmm3
>   4007f8: 76 28                 jbe    400822 <squareroot+0x52>
>   4007fa: 66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
>   400800: c5 eb 58 da           vaddsd %xmm2,%xmm2,%xmm3
>   400804: c5 f3 5e cb           vdivsd %xmm3,%xmm1,%xmm1
>   400808: c5 eb 5c d1           vsubsd %xmm1,%xmm2,%xmm2
>   40080c: c5 eb 59 ca           vmulsd %xmm2,%xmm2,%xmm1
>   400810: c5 f3 5c c8           vsubsd %xmm0,%xmm1,%xmm1
>   400814: c5 f9 28 d9           vmovapd %xmm1,%xmm3
>   400818: c5 e1 54 dd           vandpd %xmm5,%xmm3,%xmm3
>   40081c: c5 f9 2e dc           vucomisd %xmm4,%xmm3
>   400820: 77 de                 ja     400800 <squareroot+0x30>
>   400822: c5 f9 28 c2           vmovapd %xmm2,%xmm0
>   400826: c3                   retq
>   400827: 66 0f 1f 84 00 00 00 nopw   0x0(%rax,%rax,1)
>   40082e: 00 00
> 
> I'm definitely not trying to making the fastest possible sqrt implementation. 
> This is just for benchmarking purposes.
> 
> Thanks,
> Miles
> 
> On Tuesday, January 27, 2015 at 7:01:29 PM UTC-5, Erik Schnetter wrote:
> I would begin by looking at the generated assembler code. You would use 
> "objdump -d" for C, and the "code_native" function for Julia. Can you post 
> the results?
> 
> The discussion below is beside your points, but nevertheless:
> 
> Your sqrt implementation is very slow for two reasons:
> 
> (1) You are using a while loop to check for termination. If you use exponent 
> manipulation, then your starting guess can be within sqrt(2) of the result, 
> and then you can determine the necessary number of iterations ahead of time. 
> You probably need 5 or 6 iterations for the accuracy you chose.
> 
> (2) You are using a division in your algorithm, where divisions are 
> expensive. If you calculated 1/sqrt(x) instead of sqrt(x), then you don't 
> need a division. Later you can calculate sqrt(x) = x * (1/sqrt(x)).
> 
> A good sqrt routine should take less than 10^-8 seconds.
> 
> -erik

--
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/

My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from https://sks-keyservers.net.

signature.asc
Description: Message signed with OpenPGP using GPGMail

Re: [julia-users] benchmark with Julia 2x faster than C

Reply via email to