On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
Is there any way to guarantee that "packed" versions of SIMD instructions will be used? (e.g. vmulps, vsqrtps, etc...) To give some context, this is a sample of one of the functions that could benefit from better SIMD usage:
float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {

You need to use __vector(float[4]) instead of float[3] to tell the compiler to pack the elements into a SIMD register. A plain float[3] has neither the width nor the guaranteed alignment that packed SIMD loads and stores expect.
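
For instance, here is a minimal sketch of the same function using core.simd (this assumes an x86-64 target built with LDC or GDC, and that the caller keeps the unused fourth lane zeroed):

    import core.simd : float4;
    import std.math : sqrt;

    // Padding the 3-element points out to float4 lets the subtraction and
    // multiplication compile to packed instructions (subps/mulps).
    float euclideanDistance(float4 a, float4 b)
    {
        float4 diff = a - b;     // one packed subtract
        float4 sq = diff * diff; // one packed multiply
        // Horizontal sum of the first three lanes; the fourth lane is padding.
        return sqrt(sq.array[0] + sq.array[1] + sq.array[2]);
    }

The horizontal sum at the end is still scalar; the real win comes from processing many points per call, which the structure-of-arrays layout discussed below makes easy.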

Beyond that, SIMD code is rather difficult to optimize. Code written in ignorance or in a rush is unlikely to be meaningfully faster than ordinary scalar code, unless the data flow is very simple. You will probably get a bigger speedup for less effort and pain by first minimizing heap allocations, maximizing locality of reference, minimizing indirections, and minimizing memory use. (And, of course, it should go without saying that choosing an asymptotically efficient high-level algorithm is more important than any micro-optimization for large data sets.) Nevertheless, if you are up to the challenge, SIMD can sometimes provide a final 2-3x speed boost.

Your algorithms will need to be designed to minimize mixing of data between SIMD lanes, as this forces the generation of lots of extra instructions to swizzle the data or, worse, to unpack and repack it. Something like a Cartesian dot product or cross product will benefit much less from SIMD than vector addition, for example. Sometimes the amount of swizzling can be greatly reduced with a little algebra; other times you might need to refactor an array of structures into a structure of arrays, as sketched below.
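
As a hypothetical illustration of that refactor (the struct names and the translate operation are invented for the example):

    // Array of structures: x, y, z interleaved in memory, so a packed load
    // pulls in a mix of components that must be shuffled before use.
    struct PointsAoS
    {
        float[3][] points;
    }

    // Structure of arrays: each component is contiguous, so a packed load
    // grabs four x values (or y, or z) with no swizzling at all.
    struct PointsSoA
    {
        float[] x;
        float[] y;
        float[] z;
    }

    // Translating every point becomes three independent loops over contiguous
    // memory, which an auto-vectorizer handles easily.
    void translate(ref PointsSoA p, float dx, float dy, float dz)
    {
        foreach (ref v; p.x) v += dx;
        foreach (ref v; p.y) v += dy;
        foreach (ref v; p.z) v += dz;
    }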

Per-element conditional branches are very bad, and often completely defeat the benefits of SIMD. For very short segments of code (like conditional assignment), replace them with a SIMD conditional move (vcmp and vblend). Bit-twiddling is your friend.
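
For instance, a clamp loop written as a select instead of an if (the function below is invented for illustration) gives the optimizer a straight shot at a packed compare-and-blend or a packed max:

    // Clamp every element to a minimum value without branching per element.
    // Written this way, LLVM and GCC will usually vectorize the loop into
    // vcmpps + vblendps (or vmaxps) rather than per-element jumps.
    void clampMin(float[] data, float floorValue)
    {
        foreach (ref v; data)
            v = (v < floorValue) ? floorValue : v;
    }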

Finally, do not trust the compiler or the optimizer. People love to make the claim that "The Compiler" is always better than humans at micro-optimizations, but this is not at all the case for SIMD code with current systems. I have found even LLVM to produce quite bad SIMD code for complex algorithms, unless I carefully structure my code to make it as easy as possible for the optimizer to get to the final assembly I want. A sprinkling of manual assembly code (directly, or via a library) is also necessary to fill in certain instructions that the compiler doesn't know when to use at all.
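
As a deliberately tiny, untested sketch of what that can look like with DMD-style inline assembly on x86-64 (also accepted by LDC on x86), forcing a packed square root regardless of what the optimizer would have picked:

    // Take the square roots of four packed floats in place with sqrtps.
    // The pointer must reference at least four valid floats.
    void packedSqrt(float* p)
    {
        asm
        {
            mov RAX, p;         // load the pointer argument
            movups XMM0, [RAX]; // unaligned packed load of 4 floats
            sqrtps XMM0, XMM0;  // packed square root
            movups [RAX], XMM0; // store the results back
        }
    }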

Resources I have found very helpful:

Matt Godbolt's Compiler Explorer online visual disassembler (supports D):
    https://godbolt.org/

Felix Cloutier's x86 and amd64 instruction reference:
    https://www.felixcloutier.com/x86/

Agner Fog's optimization guide (especially the instruction tables):
    https://agner.org/optimize/
