Here's the Julia assembly:
function squareroot(x)
it = x
err = it*it-x # line 5
while abs(err) > 1e-13
it = it - (err)/(2it) # line 7
err = it*it-x
end
return it
end
Source line: 5
push RBP
mov RBP, RSP
Source line: 5
vmulsd XMM1, XMM0, XMM0
vsubsd XMM2, XMM1, XMM0
Source line: 6
vmovq RCX, XMM2
movabs RAX, 9223372036854775807
and RCX, RAX
vmovq XMM1, RCX
movabs RCX, 140067103257920
vucomisd XMM1, QWORD PTR [RCX]
ja 9
vmovaps XMM1, XMM0
jmpq 61
movabs RDX, 140067103257928
vmovsd XMM3, QWORD PTR [RDX]
vmovaps XMM1, XMM0
Source line: 7
vmulsd XMM4, XMM1, XMM3
vdivsd XMM2, XMM2, XMM4
vaddsd XMM1, XMM1, XMM2
Source line: 8
vmulsd XMM2, XMM1, XMM1
vsubsd XMM2, XMM2, XMM0
vmovq RDX, XMM2
and RDX, RAX
vmovq XMM4, RDX
vucomisd XMM4, QWORD PTR [RCX]
ja -43
Source line: 10
vmovaps XMM0, XMM1
pop RBP
ret
And for C:
double squareroot(double x)
{
double it = x;
double err = it*it-x;
while (fabs(err) > 1e-13) {
it = it - (err)/(2*it);
err = it*it-x;
}
return it;
}
00000000004007d0 <squareroot>:
4007d0: c5 fb 59 c8 vmulsd %xmm0,%xmm0,%xmm1
4007d4: c5 f9 28 d0 vmovapd %xmm0,%xmm2
4007d8: c5 fb 10 2d e0 01 00 vmovsd 0x1e0(%rip),%xmm5 # 4009c0
<_IO_stdin_used+0x70>
4007df: 00
4007e0: c5 fb 10 25 a8 01 00 vmovsd 0x1a8(%rip),%xmm4 # 400990
<_IO_stdin_used+0x40>
4007e7: 00
4007e8: c5 f3 5c c8 vsubsd %xmm0,%xmm1,%xmm1
4007ec: c5 f9 28 d9 vmovapd %xmm1,%xmm3
4007f0: c5 e1 54 dd vandpd %xmm5,%xmm3,%xmm3
4007f4: c5 f9 2e dc vucomisd %xmm4,%xmm3
4007f8: 76 28 jbe 400822 <squareroot+0x52>
4007fa: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
400800: c5 eb 58 da vaddsd %xmm2,%xmm2,%xmm3
400804: c5 f3 5e cb vdivsd %xmm3,%xmm1,%xmm1
400808: c5 eb 5c d1 vsubsd %xmm1,%xmm2,%xmm2
40080c: c5 eb 59 ca vmulsd %xmm2,%xmm2,%xmm1
400810: c5 f3 5c c8 vsubsd %xmm0,%xmm1,%xmm1
400814: c5 f9 28 d9 vmovapd %xmm1,%xmm3
400818: c5 e1 54 dd vandpd %xmm5,%xmm3,%xmm3
40081c: c5 f9 2e dc vucomisd %xmm4,%xmm3
400820: 77 de ja 400800 <squareroot+0x30>
400822: c5 f9 28 c2 vmovapd %xmm2,%xmm0
400826: c3 retq
400827: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
40082e: 00 00
I'm definitely not trying to making the fastest possible sqrt
implementation. This is just for benchmarking purposes.
Thanks,
Miles
On Tuesday, January 27, 2015 at 7:01:29 PM UTC-5, Erik Schnetter wrote:
>
> I would begin by looking at the generated assembler code. You would use
> "objdump -d" for C, and the "code_native" function for Julia. Can you post
> the results?
>
> The discussion below is beside your points, but nevertheless:
>
> Your sqrt implementation is very slow for two reasons:
>
> (1) You are using a while loop to check for termination. If you use
> exponent manipulation, then your starting guess can be within sqrt(2) of
> the result, and then you can determine the necessary number of iterations
> ahead of time. You probably need 5 or 6 iterations for the accuracy you
> chose.
>
> (2) You are using a division in your algorithm, where divisions are
> expensive. If you calculated 1/sqrt(x) instead of sqrt(x), then you don't
> need a division. Later you can calculate sqrt(x) = x * (1/sqrt(x)).
>
> A good sqrt routine should take less than 10^-8 seconds.
>
> -erik
>