On Sunday, 7 March 2021 at 22:54:32 UTC, tsbockman wrote:
import std.meta : Repeat;
void euclideanDistanceFixedSizeArray(V)(ref Repeat!(3, const(V)) a, ref Repeat!(3, const(V)) b, out V result)
    if(is(V : __vector(float[length]), size_t length))
...
Resulting asm with is(V == __vector(float[16])):
.LCPI1_0:
.long 0x7fc00000
pure nothrow @nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistanceFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), out __vector(float[16])):
mov rax, qword ptr [rsp + 8]
vbroadcastss zmm0, dword ptr [rip + .LCPI1_0]
...
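Apparently the optimizer is too stupid to skip the redundant
float.nan broadcast when result is an `out` parameter. D resets
`out` parameters to their .init value on function entry, and
float.init is NaN, which is the 0x7fc00000 constant above. So just
make it `ref V result` instead for better code gen: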
pure nothrow @nogc void app.euclideanDistanceFixedSizeArray!(__vector(float[16])).euclideanDistanceFixedSizeArray(ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref const(__vector(float[16])), ref __vector(float[16])):
mov rax, qword ptr [rsp + 8]
vmovaps zmm0, zmmword ptr [rax]
vmovaps zmm1, zmmword ptr [r9]
vmovaps zmm2, zmmword ptr [r8]
vsubps zmm0, zmm0, zmmword ptr [rcx]
vmulps zmm0, zmm0, zmm0
vsubps zmm1, zmm1, zmmword ptr [rdx]
vsubps zmm2, zmm2, zmmword ptr [rsi]
vaddps zmm0, zmm0, zmm0
vfmadd231ps zmm0, zmm1, zmm1
vfmadd231ps zmm0, zmm2, zmm2
vmovaps zmmword ptr [rdi], zmm0
vsqrtps zmm0, zmm0
vmovaps zmmword ptr [rdi], zmm0
vzeroupper
ret
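
For anyone wondering where the NaN broadcast comes from in the first place, here is a minimal scalar sketch of the `out` vs `ref` difference. The helper names (viaOut, viaRef, afterOut, afterRef) are made up for illustration and are not part of the code above; it uses plain float instead of __vector so it should build on any target.

```d
import std.math : isNaN;

// Hypothetical helper: with `out`, `result` is reset to float.init (NaN)
// on entry, whether or not the body ever assigns to it.
void viaOut(out float result, bool write)
{
    if (write)
        result = 1.0f;
}

// Hypothetical helper: with `ref`, the caller's value is left alone
// unless the body assigns to it.
void viaRef(ref float result, bool write)
{
    if (write)
        result = 1.0f;
}

// Small wrappers so the effect on the caller's variable is easy to check.
float afterOut(float start, bool write)
{
    float x = start;
    viaOut(x, write);
    return x;
}

float afterRef(float start, bool write)
{
    float x = start;
    viaRef(x, write);
    return x;
}

unittest
{
    assert(isNaN(afterOut(42.0f, false))); // old value replaced by float.init
    assert(afterRef(42.0f, false) == 42.0f); // old value untouched
    assert(afterOut(42.0f, true) == 1.0f);
    assert(afterRef(42.0f, true) == 1.0f);
}
```

This should run with something like `ldc2 -unittest -main -run file.d`. The trade-off is that `ref` drops the guarantee that the result starts from a known value, which is harmless here because the function always overwrites result before returning.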