Issue 162749
Summary [X64] Floating-point multiplication can get "optimized" into integer multiplication even though it's inefficient
Labels new issue
Assignees
Reporter zeux
Given code like this (extracted out of a larger example with similar flow):

```c++
#include <emmintrin.h> // SSE2 intrinsics

__m128 square(__m128i data) {
    __m128i y = _mm_srai_epi32(data, 16);
    __m128i x = _mm_or_si128(y, _mm_set1_epi32(3));
    __m128 v = _mm_cvtepi32_ps(x);
    return _mm_mul_ps(v, v);
}
```

And targeting SSE2, I would expect a more or less straightforward 1:1 lowering into SSE2 instructions, modulo `_mm_set1_epi32`, which has a couple of different options. Indeed, GCC generates this:

```asm
        pcmpeqd xmm1, xmm1
        psrad   xmm0, 16
        psrld   xmm1, 30
        por     xmm0, xmm1
        cvtdq2ps        xmm0, xmm0
        mulps   xmm0, xmm0
```

and MSVC generates this, opting to load `3` from memory:

```asm
        movdqu  xmm0, XMMWORD PTR [rcx]
        psrad   xmm0, 16
        orps    xmm0, XMMWORD PTR __xmm@00000003000000030000000300000003
        cvtdq2ps xmm0, xmm0
        mulps   xmm0, xmm0
```

clang, however, generates this, which is basically never a good idea:

```asm
        psrld   xmm0, 16
        por     xmm0, xmmword ptr [rip + .LCPI0_0]
        movdqa  xmm1, xmm0
        pmulhw  xmm1, xmm0
        pshuflw xmm1, xmm1, 232
        pshufhw xmm1, xmm1, 232
        pshufd  xmm1, xmm1, 232
        pmullw  xmm0, xmm0
        pshuflw xmm0, xmm0, 232
        pshufhw xmm0, xmm0, 232
        pshufd  xmm0, xmm0, 232
        punpcklwd       xmm0, xmm1
        cvtdq2ps xmm0, xmm0
```

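It looks like it decides that it would be a great idea to multiply the integers instead of multiplying the floating-point values, as it knows the range of the integer is small enough. SSE2 has no packed 32-bit integer multiply (`pmulld` is SSE4.1), so the product is stitched together from `pmulhw`/`pmullw` plus shuffles: ten instructions in place of a single `mulps`, which degrades performance.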
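To illustrate why the rewrite is legal for this input range (though not why it is profitable), here is a minimal scalar sketch. It assumes the lanes hold a sign-extended 16-bit value ORed with 3, as in `square` above. The helper names `square_as_float`, `square_as_int` and `count_mismatches` are hypothetical and not part of the original report. Both orders round the exact product to `float` exactly once, so they should agree for every lane value.

```cpp
#include <cstdint>

// Convert first, then multiply in floating point (what the source code asks for).
float square_as_float(std::int32_t x) {
    float v = static_cast<float>(x);  // exact, since |x| < 2^24
    return v * v;                     // exact x*x, rounded once to float
}

// Multiply as integers first, then convert (the order clang's output uses).
float square_as_int(std::int32_t x) {
    std::int32_t p = x * x;           // |x| <= 32767 here, so x*x < 2^30: no overflow
    return static_cast<float>(p);     // exact x*x, rounded once to float
}

// Hypothetical helper: counts lane values for which the two orders disagree.
int count_mismatches() {
    int mismatches = 0;
    for (std::int32_t h = -32768; h <= 32767; ++h) {
        std::int32_t x = h | 3;       // same value range as (data >> 16) | 3
        if (square_as_float(x) != square_as_int(x)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Calling `count_mismatches()` is expected to return 0, which is consistent with the transformation being correct; the complaint here is only about its cost on SSE2.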
_______________________________________________
llvm-bugs mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs
