i was playing with SIMD last year. the approach i took was to try to minimise the machine code (M/C), so:
- no attempt to support general formulas. let people combine pre-made versions of the most common/expensive functions, like the SIMD designers did, only with more complex formulas supported and made cross-platform.
- make each call work on just one SIMD-register-sized array, so there is no looping and no conditionals in the machine code.

i only tried 4-way 32-bit float x86 SIMD. performance was as you might expect: ~5ns for 4 x Sqrt(n+1).

i wanted to put up some code, but only once i had NEON working as well. i could go back to this, since i now have the hardware to try it on.

example of 4 x Sqrt(n+1), taking the address of the array:

    // func f40pca(i *[4]float32)
    TEXT ·f40pca+0(SB),$16-8
        MOVQ    i+0(FP),AX     // get 64-bit address from first parameter
        MOVAPS  (AX),X0        // load 128 bits, 4 x float32, from memory (address must be 16-byte aligned)
        MOVSS   $(1.0),X1      // load single-precision constant 1.0 into the lower 32 bits
        SHUFPS  $0x00,X1,X1    // duplicate it 4 times across the register
        ADDPS   X1,X0          // parallel add 1
        SQRTPS  X0,X0          // parallel sqrt, in place
        MOVAPS  X0,(AX)        // put 128 bits back at the same address
        RET

ideally you might be able to use 'generate' to support a more expansive range of functions, essentially making an extremely simple go compiler in go. i have the feeling you could get a large proportion of the possible performance increase with only a simple template implementation.

also, support for SIMD is not signalled by architecture alone. you need to query the CPU with an instruction (CPUID on x86) to find out what's supported. see "math/floor_amd64.s" and "math/floor_asm.go" to see this happen in the std lib.
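to make the "architecture alone is not enough" point concrete, here is a minimal sketch in plain go of how a portable fallback and a runtime choice of implementation could fit together. the names sqrtPlusOne4Generic and selectImpl are made up for illustration and are not from my original code; the haveSIMD argument stands in for the result of a real CPU feature check, which this sketch does not perform.

```go
package main

import (
	"fmt"
	"math"
)

// sqrtPlusOne4Generic is a portable reference for the 4 x Sqrt(n+1)
// operation: it overwrites each element of v with sqrt(v[i] + 1).
// (Hypothetical name, used only for this sketch.)
func sqrtPlusOne4Generic(v *[4]float32) {
	for i, x := range v {
		v[i] = float32(math.Sqrt(float64(x + 1)))
	}
}

// selectImpl picks the implementation to call. haveSIMD is assumed to come
// from a CPU feature check done elsewhere; simd would be the assembly
// routine. When either is missing, the portable version is returned.
func selectImpl(haveSIMD bool, simd func(*[4]float32)) func(*[4]float32) {
	if haveSIMD && simd != nil {
		return simd
	}
	return sqrtPlusOne4Generic
}

func main() {
	// No assembly routine is linked into this sketch, so the fallback is used.
	f := selectImpl(false, nil)
	v := [4]float32{0, 3, 8, 15}
	f(&v)
	fmt.Println(v) // prints [1 2 3 4]
}
```

the choice is made once, outside the hot path, so each call still has no branches of its own. the portable version is also handy as a reference to test the assembly against.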