| Issue | 162749 |
|---|---|
| Summary | [X64] Floating-point multiplication can get "optimized" into integer multiplication even though it's inefficient |
| Labels | new issue |
| Assignees | |
| Reporter | zeux |

Given code like this (extracted out of a larger example with similar flow):
```c++
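#include <emmintrin.h>  // SSE2 intrinsics

__m128 square(__m128i data) {
    // Arithmetic shift: each 32-bit lane now holds a sign-extended 16-bit value.
    __m128i y = _mm_srai_epi32(data, 16);
    __m128i x = _mm_or_si128(y, _mm_set1_epi32(3));
    // Convert to float first, then square in floating point.
    __m128 v = _mm_cvtepi32_ps(x);
    return _mm_mul_ps(v, v);
}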
```
When targeting SSE2, I would expect a more or less straightforward 1:1 lowering into SSE2 instructions, modulo `_mm_set1_epi32`, which has a couple of different options. Indeed, GCC generates this:
```asm
pcmpeqd xmm1, xmm1
psrad xmm0, 16
psrld xmm1, 30
por xmm0, xmm1
cvtdq2ps xmm0, xmm0
mulps xmm0, xmm0
```
and MSVC generates this, opting to load `3` from memory:
```asm
movdqu xmm0, XMMWORD PTR [rcx]
psrad xmm0, 16
orps xmm0, XMMWORD PTR __xmm@00000003000000030000000300000003
cvtdq2ps xmm0, xmm0
mulps xmm0, xmm0
```
clang, however, generates this, which is basically never a good idea:
```asm
psrld xmm0, 16
por xmm0, xmmword ptr [rip + .LCPI0_0]
movdqa xmm1, xmm0
pmulhw xmm1, xmm0
pshuflw xmm1, xmm1, 232
pshufhw xmm1, xmm1, 232
pshufd xmm1, xmm1, 232
pmullw xmm0, xmm0
pshuflw xmm0, xmm0, 232
pshufhw xmm0, xmm0, 232
pshufd xmm0, xmm0, 232
punpcklwd xmm0, xmm1
cvtdq2ps xmm0, xmm0
```
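It looks like clang decides that it would be a great idea to multiply the integers instead of the converted floating-point values, as it knows the range of the integer is small enough for the result to be the same. SSE2 has no packed 32-bit integer multiply that produces 32-bit lanes (`pmulld` only arrives with SSE4.1), so the integer product has to be assembled from `pmullw`/`pmulhw` plus a chain of shuffles, replacing a single `mulps`. This results in degraded performance.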
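To illustrate why the transform is legal even though it is unprofitable, here is a minimal scalar sketch. It assumes IEEE-754 single precision with round-to-nearest; the helper names `lane_after_shift_and_or` and `count_mismatches` are hypothetical and not part of the original reproducer. It walks every value a lane can hold after the shift and OR, and checks that squaring in float matches squaring in `int32_t` and converting afterwards:

```cpp
#include <cstdint>

// Hypothetical helper (not from the report): models one 32-bit lane after
// _mm_srai_epi32(data, 16) followed by the OR with 3. After the arithmetic
// shift, a lane can hold exactly the values representable in int16_t.
static int32_t lane_after_shift_and_or(int16_t shifted) {
    return static_cast<int32_t>(shifted) | 3;
}

// Counts lane values for which "convert, then multiply as float" differs
// from "multiply as int32, then convert".
int count_mismatches() {
    int mismatches = 0;
    for (int32_t s = INT16_MIN; s <= INT16_MAX; ++s) {
        const int32_t x = lane_after_shift_and_or(static_cast<int16_t>(s));
        const float v = static_cast<float>(x);             // exact: |x| < 2^16
        const float float_path = v * v;                    // product rounded once
        const float int_path = static_cast<float>(x * x);  // x * x < 2^30, no overflow
        if (float_path != int_path) ++mismatches;
    }
    return mismatches;
}
```

Under those assumptions `count_mismatches()` should find no differing lane: both paths round the same exact product once. That equivalence is presumably what lets the compiler pick either form, so the problem here is purely the cost of the integer sequence on SSE2.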
_______________________________________________
llvm-bugs mailing list
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs