https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116800
            Bug ID: 116800
           Summary: std::simd: poor code generation of AVX512 fused multiply-add
           Product: gcc
           Version: 14.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pieter.p.dev at outlook dot com
  Target Milestone: ---

The AVX512 implementation of the std::experimental::fma function appears to be falling back on the AVX2 version, which results in very poor performance.

For example, compare the two functions below:

The 'fma_1' function uses standard arithmetic operators, and the 'fma_2' function calls the std::experimental::fma function. The 'fma_1' function compiles down to a single AVX512 vfmadd instruction (which is the desired behavior), whereas the code for 'fma_2' moves the upper half of the 512-bit arguments into separate 256-bit registers, then performs two 256-bit vfmadd instructions, and merges the results back into a single 512-bit result.

#include <experimental/simd>

using simd = std::experimental::native_simd<double>;

simd fma_1(simd x, simd y, simd z) {
    return x * y + z;
}

simd fma_2(simd x, simd y, simd z) {
    return fma(x, y, z);
}

Compiler explorer: https://godbolt.org/z/hMM83jh4o

In GCC 11 and 12, the situation is even worse: there the 'fma_2' code actually performs eight individual FMAs, along with over 30 extraction/insertion instructions! GCC 13 performs four 128-bit FMAs, and GCC 14 performs two 256-bit FMAs.
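
For anyone reproducing this locally, the following is a minimal sketch of a harness around the two functions, showing that they differ only in generated code and not in results. The helpers max_abs_difference and check_results_match are hypothetical names that are not part of the report. The inputs are small integers, so x * y + z is exact whether or not the multiply and add are fused. The report does not list the compiler flags used; to inspect the assembly, something like -O3 -march=skylake-avx512 should enable 512-bit vectors, but that is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;
using simd = stdx::native_simd<double>;

// The two functions from the report, unchanged.
simd fma_1(simd x, simd y, simd z) { return x * y + z; }
simd fma_2(simd x, simd y, simd z) { return fma(x, y, z); }

// Hypothetical helper (not from the report): largest lane-wise difference
// between fma_1 and fma_2. The inputs are small integers, so the result is
// exactly representable with or without fusing the multiply and the add.
double max_abs_difference() {
    const simd x([](auto i) { return 1.0 + static_cast<double>(i); });
    const simd y(3.0);
    const simd z(-2.0);
    const simd a = fma_1(x, y, z);
    const simd b = fma_2(x, y, z);
    double worst = 0.0;
    for (std::size_t i = 0; i < simd::size(); ++i) {
        worst = std::max(worst, std::abs(a[i] - b[i]));
    }
    return worst;
}

// Hypothetical helper: aborts if the two variants disagree in any lane.
void check_results_match() {
    assert(max_abs_difference() == 0.0);
}
```

Calling check_results_match() from a test or from main should pass on any target, since native_simd adapts its width to the enabled instruction set. Only the instruction sequences for fma_1 and fma_2 are expected to differ.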