https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124271

            Bug ID: 124271
           Summary: x86/AVX2: missed simplification — low32×low32→u64
                    vectorized multiply expands to generic u64-mul
                    sequence instead of single vpmuludq
           Product: gcc
           Version: 15.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: adamant.pwn at gmail dot com
  Target Milestone: ---

Created attachment 63790
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=63790&action=edit
Preprocessed source

On x86-64 with -std=c++23 -O3 -mavx2, GCC vectorizes the loop below (32-byte
vectors), but the generated AVX2 loop computes the multiply via a generic
packed-uint64_t multiply expansion: it masks each input with 0xffffffff and
then performs cross-term work (vpsrlq + 3×vpmuludq + adds/shifts).

After the mask, the upper 32 bits of each 64-bit element are known zero, so the
cross terms are provably zero and the operation can be implemented directly
with AVX2 vpmuludq (which multiplies even dword lanes 0,2,4,6 → 4×u64 results),
i.e., one vpmuludq per 4 elements.

Clang trunk emits the direct vpmuludq idiom for the same source (see Godbolt
links below).

Testcase (also attached as preprocessed t.ii):
  #include <cstdint>

  static inline std::uint64_t mul32(std::uint64_t a, std::uint64_t b) {
      return std::uint64_t(std::uint32_t(a)) * std::uint64_t(std::uint32_t(b));
  }

  void many_mul3(std::uint64_t* __restrict a,
                 const std::uint64_t* __restrict b) {
      for (int i = 0; i < 1024; i++)
          a[i] = mul32(a[i], b[i]);
  }

Assembly:
  g++ -std=c++23 -O3 -mavx2 -S -masm=intel t.cpp -o t.s

Vectorizer diagnostics (-fopt-info-vec-all):
  t.cpp:11:23: optimized: loop vectorized using 32 byte vectors
  t.cpp:8:6: note: vectorized 1 loops in function.
  t.cpp:13:1: note: ***** Analysis failed with vector mode VOID

Actual generated inner loop (GCC 15.2.1 20260209, -O3 -mavx2):
  vpand    ymm4, ymm5, YMMWORD PTR [rdi+rax]
  vpand    ymm3, ymm5, YMMWORD PTR [rsi+rax]
  vpsrlq   ymm2, ymm4, 32
  vpsrlq   ymm0, ymm3, 32
  vpmuludq ymm0, ymm0, ymm4
  vpmuludq ymm2, ymm2, ymm3
  vpmuludq ymm1, ymm3, ymm4
  vpaddq   ymm0, ymm0, ymm2
  vpsllq   ymm0, ymm0, 32
  vpaddq   ymm0, ymm1, ymm0
  vmovdqu  YMMWORD PTR [rdi+rax], ymm0

Expected:

Since the semantics are uint64_t(uint32_t(a[i])) * uint64_t(uint32_t(b[i])),
the low 32-bit halves of each 64-bit element are the only inputs. On AVX2,
vpmuludq multiplies the even dword lanes (0,2,4,6), which correspond exactly to
the low 32 bits of each uint64_t lane, producing 4×u64 results. Therefore,
after masking, the cross-term work in the generic u64 multiplication expansion
is provably dead, and the whole multiply could be lowered to the direct
vpmuludq idiom (one instruction per 4 elements), with no shifts, adds, or
cross-term multiplies.

Toolchain / environment
  Target: x86_64-pc-linux-gnu
  gcc version 15.2.1 20260209 (GCC)
  Arch Linux build

Godbolt:
  GCC: https://godbolt.org/z/oYWGW3zKf
  Clang: https://godbolt.org/z/PfjPrPr4o
