On Thursday, 25 February 2021 at 11:28:14 UTC, z wrote:
Is there any way to guarantee that "packed" versions of SIMD instructions will be used? (e.g. vmulps, vsqrtps, etc...) To give some context, this is a sample of one of the functions that could benefit from better SIMD usage:
float euclideanDistanceFixedSizeArray(float[3] a, float[3] b) {

You need to use __vector(float[4]) instead of float[3] to tell the compiler to pack the elements into a SIMD register. A plain float[3] has neither the width nor the guaranteed alignment that packed SIMD loads and stores expect.
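
For instance, here is a minimal sketch of the same function using core.simd (this assumes an x86-64 target built with LDC or GDC, and that the caller keeps the unused fourth lane zeroed):

    import core.simd : float4;
    import std.math : sqrt;

    // Padding the 3-element points out to float4 lets the subtraction and
    // multiplication compile to packed instructions (subps/mulps).
    float euclideanDistance(float4 a, float4 b)
    {
        float4 diff = a - b;     // one packed subtract
        float4 sq = diff * diff; // one packed multiply
        // Horizontal sum of the first three lanes; the fourth lane is padding.
        return sqrt(sq.array[0] + sq.array[1] + sq.array[2]);
    }

The horizontal sum at the end is still scalar; the real win comes from processing many points per call, which the structure-of-arrays layout discussed below makes easy.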

Beyond that, SIMD code is rather difficult to optimize. Code written in ignorance or in a rush is unlikely to be meaningfully faster than ordinary scalar code, unless the data flow is very simple. You will probably get a bigger speedup for less effort and pain by first minimizing heap allocations, maximizing locality of reference, minimizing indirections, and minimizing memory use. (And, of course, it should go without saying that choosing an asymptotically efficient high-level algorithm is more important than any micro-optimization for large data sets.) Nevertheless, if you are up to the challenge, SIMD can sometimes provide a final 2-3x speed boost.

Your algorithms will need to be designed to minimize mixing of data between SIMD lanes, as this forces the generation of lots of extra instructions to swizzle the data or, worse, to unpack and repack it. Something like a Cartesian dot product or cross product will benefit much less from SIMD than vector addition, for example. Sometimes the amount of swizzling can be greatly reduced with a little algebra; other times you might need to refactor an array of structures into a structure of arrays, as sketched below.
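
As a hypothetical illustration of that refactor (the struct names and the translate operation are invented for the example):

    // Array of structures: x, y, z interleaved in memory, so a packed load
    // pulls in a mix of components that must be shuffled before use.
    struct PointsAoS
    {
        float[3][] points;
    }

    // Structure of arrays: each component is contiguous, so a packed load
    // grabs four x values (or y, or z) with no swizzling at all.
    struct PointsSoA
    {
        float[] x;
        float[] y;
        float[] z;
    }

    // Translating every point becomes three independent loops over contiguous
    // memory, which an auto-vectorizer handles easily.
    void translate(ref PointsSoA p, float dx, float dy, float dz)
    {
        foreach (ref v; p.x) v += dx;
        foreach (ref v; p.y) v += dy;
        foreach (ref v; p.z) v += dz;
    }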

Per-element conditional branches are very bad, and often completely defeat the benefits of SIMD. For very short segments of code (like conditional assignment), replace them with a SIMD conditional move (vcmp and vblend). Bit-twiddling is your friend.
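
For instance, a clamp loop written as a select instead of an if (the function below is invented for illustration) gives the optimizer a straight shot at a packed compare-and-blend or a packed max:

    // Clamp every element to a minimum value without branching per element.
    // Written this way, LLVM and GCC will usually vectorize the loop into
    // vcmpps + vblendps (or vmaxps) rather than per-element jumps.
    void clampMin(float[] data, float floorValue)
    {
        foreach (ref v; data)
            v = (v < floorValue) ? floorValue : v;
    }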

Finally, do not trust the compiler or the optimizer. People love to make the claim that "The Compiler" is always better than humans at micro-optimizations, but this is not at all the case for SIMD code with current systems. I have found even LLVM to produce quite bad SIMD code for complex algorithms, unless I carefully structure my code to make it as easy as possible for the optimizer to get to the final assembly I want. A sprinkling of manual assembly code (directly, or via a library) is also necessary to fill in certain instructions that the compiler doesn't know when to use at all.
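
As a deliberately tiny, untested sketch of what that can look like with DMD-style inline assembly on x86-64 (also accepted by LDC on x86), forcing a packed square root regardless of what the optimizer would have picked:

    // Take the square roots of four packed floats in place with sqrtps.
    // The pointer must reference at least four valid floats.
    void packedSqrt(float* p)
    {
        asm
        {
            mov RAX, p;         // load the pointer argument
            movups XMM0, [RAX]; // unaligned packed load of 4 floats
            sqrtps XMM0, XMM0;  // packed square root
            movups [RAX], XMM0; // store the results back
        }
    }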

Resources I have found very helpful:

Matt Godbolt's Compiler Explorer online visual disassembler (supports D):
    https://godbolt.org/

Felix Cloutier's x86 and amd64 instruction reference:
    https://www.felixcloutier.com/x86/

Agner Fog's optimization guide (especially the instruction tables):
    https://agner.org/optimize/
