[Bug libstdc++/116800] New: std::simd: poor code generation of AVX512 fused multiply-add

pieter.p.dev at outlook dot com via Gcc-bugs Sat, 21 Sep 2024 07:39:51 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116800


            Bug ID: 116800
           Summary: std::simd: poor code generation of AVX512 fused
                    multiply-add
           Product: gcc
           Version: 14.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pieter.p.dev at outlook dot com
  Target Milestone: ---

The AVX512 implementation of the std::experimental::fma function appears to be
falling back on the AVX2 version, which results in very poor performance.

For example, compare the two functions below: 'fma_1' uses the standard
arithmetic operators, whereas 'fma_2' calls the std::experimental::fma
function.

The 'fma_1' function compiles down to a single AVX512 vfmadd instruction (which
is the desired behavior), whereas the code for 'fma_2' moves the upper half of
the 512-bit arguments into separate 256-bit registers, then performs two
256-bit vfmadd instructions, and merges the results back into a single 512-bit
result.

    #include <experimental/simd>
    using simd = std::experimental::native_simd<double>;

    simd fma_1(simd x, simd y, simd z) {
        return x * y + z;
    }

    simd fma_2(simd x, simd y, simd z) {
        return fma(x, y, z);
    }

Compiler explorer: https://godbolt.org/z/hMM83jh4o

In GCC 11 and 12, the situation is even worse: there the 'fma_2' code actually
performs eight individual FMAs, along with over 30 extraction/insertion
instructions! GCC 13 performs four 128-bit FMAs, and GCC 14 performs two
256-bit FMAs.
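
The two functions are meant to compute the same value per lane, so the
difference described above is one of code generation only. As a rough sanity
check (not part of the original report), the sketch below compares 'fma_2'
against 'fma_1' and against scalar std::fma. The helper names
'fma_1_matches_fma_2' and 'fma_2_matches_scalar' are hypothetical, and the
inputs are small integers chosen so that every product and sum is exactly
representable, which keeps the comparison independent of whether the compiler
contracts x * y + z into a fused operation. It assumes libstdc++ from GCC 11 or
later, compiled with -std=c++17 or newer.

```cpp
#include <cmath>
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;
using simd = stdx::native_simd<double>;

// The two functions from the report.
simd fma_1(simd x, simd y, simd z) {
    return x * y + z;
}

simd fma_2(simd x, simd y, simd z) {
    return fma(x, y, z);  // std::experimental::fma, found via ADL
}

// Hypothetical helper: builds exactly representable test inputs.
// Lane i holds (i + 1), (i + 2) and 3.0 respectively.
struct inputs {
    simd x{[](auto i) { return static_cast<double>(i) + 1.0; }};
    simd y{[](auto i) { return static_cast<double>(i) + 2.0; }};
    simd z{3.0};  // broadcast to all lanes
};

// Hypothetical helper: true if both formulations agree in every lane.
bool fma_1_matches_fma_2() {
    inputs in;
    return stdx::all_of(fma_1(in.x, in.y, in.z) == fma_2(in.x, in.y, in.z));
}

// Hypothetical helper: true if every lane of fma_2 equals scalar std::fma.
bool fma_2_matches_scalar() {
    inputs in;
    simd r = fma_2(in.x, in.y, in.z);
    for (std::size_t i = 0; i < simd::size(); ++i) {
        double xi = in.x[i], yi = in.y[i], zi = in.z[i], ri = r[i];
        if (ri != std::fma(xi, yi, zi)) return false;
    }
    return true;
}
```

Both helpers only confirm that the results agree; they say nothing about the
instructions emitted, which still has to be inspected in the assembly output
(for example with -S or on Compiler Explorer).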
