i was playing with SIMD last year,

the approach i took was to try to minimise the machine code (M/C), so:

no attempt to support general formulas; let people combine pre-made 
versions of the most common/expensive functions, as the SIMD designers did, 
only raising the complexity of the formulas supported and making it 
cross-platform.
make each call work on just one SIMD-register-sized array, so there is no 
looping or branching in the M/C.
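
a rough sketch of how that split could look from the Go side, with plain Go stand-ins where the assembly kernels would go. the names Vec4, Kernel, addOne4, sqrt4 and Apply are made up for illustration, and the slice-to-array-pointer conversion assumes Go 1.17 or later:

```go
package main

import (
	"fmt"
	"math"
)

// Vec4 is one SIMD-register-sized chunk: 4 x float32 = 128 bits.
type Vec4 = [4]float32

// Kernel transforms one chunk in place. In the scheme described above each
// kernel would be a short assembly routine; these are pure-Go stand-ins.
type Kernel func(v *Vec4)

// addOne4 is a hypothetical kernel: adds 1 to every lane.
func addOne4(v *Vec4) {
	for i := range v {
		v[i]++
	}
}

// sqrt4 is a hypothetical kernel: square root of every lane.
func sqrt4(v *Vec4) {
	for i := range v {
		v[i] = float32(math.Sqrt(float64(v[i])))
	}
}

// Apply runs the kernels in order over data, one chunk at a time.
// The looping and the tail handling live here, not in the kernels.
func Apply(data []float32, kernels ...Kernel) {
	n := len(data) / 4 * 4
	for i := 0; i < n; i += 4 {
		chunk := (*Vec4)(data[i : i+4])
		for _, k := range kernels {
			k(chunk)
		}
	}
	if n < len(data) {
		// Copy the leftover elements into a zero-padded chunk,
		// run the kernels, then copy the valid lanes back.
		var tail Vec4
		copy(tail[:], data[n:])
		for _, k := range kernels {
			k(&tail)
		}
		copy(data[n:], tail[:])
	}
}

func main() {
	data := []float32{0, 3, 8, 15, 24}
	Apply(data, addOne4, sqrt4)
	fmt.Println(data) // [1 2 3 4 5]
}
```

running the kernels one after another like this costs a load and a store per kernel per chunk, which is one reason to fuse common combinations into a single routine.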

i only tried 4-way 32-bit float x86 SIMD (SSE); performance was as you 
might expect: ~5ns for 4 x Sqrt(n+1).

i wanted to put up some code, but only once NEON was working as well. i 
could go back to this, since i now have the h/w to try it on.

example of 4 x Sqrt(n+1) using address of array.

// func f40pca(i *[4]float32)
TEXT ·f40pca+0(SB),$16-8
MOVQ    i+0(FP),AX   // get 64 bit address from first parameter
MOVAPS  (AX),X0      // load 128 bits, 4 x float32, from memory (must be 16-byte aligned)
MOVSS   $(1.0),X1    // load single precision constant 1.0 into lower 32 bits
SHUFPS  $0x00,X1,X1  // duplicate it 4 times across the register
ADDPS   X1,X0        // parallel add 1
SQRTPS  X0,X0        // parallel sqrt in-place
MOVAPS  X0,(AX)      // put 128 bits back to same address
RET
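
for checking a routine like this, a pure-Go equivalent is handy. the sketch below is an assumption about how one might write it, not part of the original code; f40pcaRef is a made-up name. the assembly itself would also need a body-less Go declaration in the same package, as noted in the comment.

```go
package main

import (
	"fmt"
	"math"
)

// The assembly routine needs a matching declaration without a body in a
// Go file of the same package:
//
//	func f40pca(i *[4]float32)
//
// f40pcaRef is a hypothetical pure-Go reference that does the same job:
// it replaces each element n with Sqrt(n+1), in place.
func f40pcaRef(v *[4]float32) {
	for i, n := range v {
		v[i] = float32(math.Sqrt(float64(n + 1)))
	}
}

func main() {
	v := [4]float32{0, 3, 8, 15}
	f40pcaRef(&v)
	fmt.Println(v) // [1 2 3 4]
}
```

a test could then run both versions over the same inputs and compare the results lane by lane.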


ideally you might be able to use 'go generate' to support a more expansive 
range of functions, essentially making an extremely simple go compiler in 
go. i have the feeling you could get a large proportion of the possible 
performance increase with only a simple template implementation.
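
a minimal sketch of what such a template generator might look like, using text/template from the standard library. everything here (kernelTmpl, the ops table, Generate) is hypothetical; the emitted assembly just mirrors the hand-written example and has not been assembled or run.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// kernelTmpl is a hypothetical template for a one-chunk amd64 kernel:
// load 4 x float32 into X0, apply the ops, store X0 back.
const kernelTmpl = `// func {{.Name}}(i *[4]float32)
TEXT ·{{.Name}}+0(SB),$16-8
MOVQ    i+0(FP),AX
MOVAPS  (AX),X0
{{range .Ops}}{{.}}
{{end}}MOVAPS  X0,(AX)
RET
`

// ops maps a (made-up) operation name to instructions that act on X0.
var ops = map[string][]string{
	"add1": {"MOVSS   $(1.0),X1", "SHUFPS  $0x00,X1,X1", "ADDPS   X1,X0"},
	"sqrt": {"SQRTPS  X0,X0"},
}

type kernel struct {
	Name string
	Ops  []string
}

// Generate returns assembly source for a kernel that applies the named
// operations in order, or an error if an operation is not in the table.
func Generate(name string, formula ...string) (string, error) {
	k := kernel{Name: name}
	for _, f := range formula {
		ins, ok := ops[f]
		if !ok {
			return "", fmt.Errorf("unsupported op %q", f)
		}
		k.Ops = append(k.Ops, ins...)
	}
	t, err := template.New("kernel").Parse(kernelTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, k); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	src, err := Generate("f40pca", "add1", "sqrt")
	if err != nil {
		panic(err)
	}
	fmt.Print(src)
}
```

a program like this could be hooked up with a //go:generate comment so the .s files are rebuilt whenever the list of formulas changes.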

also, support for SIMD is not signalled by architecture alone; you need to 
check with a CPU instruction (CPUID on x86) to find out what's supported. 
see "math/floor_amd64.s" and "math/floor_asm.go" to see this happen in the 
std lib.
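
the usual shape of this is to check once at start-up and pick an implementation. a hedged sketch follows; hasSIMD, sqrtPlusOne4Generic, sqrtPlusOne4SIMD and SqrtPlusOne4 are all made-up names, and the flag is hard-coded so the example runs anywhere. real code would set it from a CPUID query, done in assembly or, for example, via the golang.org/x/sys/cpu package.

```go
package main

import (
	"fmt"
	"math"
)

// hasSIMD is a hypothetical feature flag. Real code would fill it in once
// from a CPUID query; it is hard-coded here (assumption) to stay portable.
var hasSIMD = false

// sqrtPlusOne4Generic is the portable fallback: Sqrt(n+1) on each lane.
func sqrtPlusOne4Generic(v *[4]float32) {
	for i, n := range v {
		v[i] = float32(math.Sqrt(float64(n + 1)))
	}
}

// sqrtPlusOne4SIMD stands in for the assembly routine. In this sketch it
// simply points at the generic version.
var sqrtPlusOne4SIMD = sqrtPlusOne4Generic

// SqrtPlusOne4 is selected once, during package initialisation.
var SqrtPlusOne4 = pick()

func pick() func(*[4]float32) {
	if hasSIMD {
		return sqrtPlusOne4SIMD
	}
	return sqrtPlusOne4Generic
}

func main() {
	v := [4]float32{0, 3, 8, 15}
	SqrtPlusOne4(&v)
	fmt.Println(v) // [1 2 3 4]
}
```

calling through a function variable adds an indirect call, which matters when the kernel itself only takes a few nanoseconds, so the check is sometimes done per batch rather than per chunk.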
