| Field | Value |
| --- | --- |
| Issue | 144266 |
| Summary | [X86] Delaying widening results in an unnecessary `vpmovsxwd` copy |
| Labels | new issue |
| Assignees | |
| Reporter | dzaima |
This C code:
```c
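#include <immintrin.h>
#include <stdint.h>

__m256i iter(int16_t* src1p) {
  __m256i ten = _mm256_set1_epi32(10);
  // Load eight int16_t values and sign-extend them to eight int32_t lanes.
  __m256i wload = _mm256_cvtepi16_epi32(_mm_loadu_si128((void*)src1p));
  // Lanes greater than 10 become -1 (all ones); all others become 0.
  __m256i mask = _mm256_cmpgt_epi32(wload, ten);
  return _mm256_add_epi32(wload, mask);
}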
```
compiled with `-O3 -march=haswell`, results in:
```asm
iter:
vmovdqu xmm0, xmmword ptr [rdi]
vpmovsxwd ymm1, xmm0
vpcmpgtw xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpmovsxwd ymm0, xmm0
vpaddd ymm0, ymm0, ymm1
ret
```
but it could be
```asm
iter:
vpmovsxwd ymm0, xmmword ptr [rdi]
vpbroadcastd ymm1, dword ptr [rip + .LCPI0_0]
vpcmpgtd ymm1, ymm0, ymm1
vpaddd ymm0, ymm1, ymm0
ret
```
which avoids the second `vpmovsxwd` and lets the remaining one take the load as an inline memory operand.
https://godbolt.org/z/Ezrf9YbYn
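
For reference, here is a scalar sketch of why both sequences compute the same result per lane. It is only a model of one lane, not what LLVM does internally; the function names are hypothetical, and the mapping of each line to an instruction is my reading of the two listings above.

```c
#include <stdint.h>

/* One lane of the suggested sequence: widen first, compare at 32 bits. */
int32_t lane_widen_then_compare(int16_t x) {
    int32_t wide = (int32_t)x;                /* models vpmovsxwd */
    int32_t mask = (wide > 10) ? -1 : 0;      /* models vpcmpgtd  */
    return wide + mask;                       /* models vpaddd    */
}

/* One lane of the emitted sequence: compare at 16 bits, widen both values. */
int32_t lane_compare_then_widen(int16_t x) {
    int16_t narrow_mask = (x > 10) ? -1 : 0;  /* models vpcmpgtw         */
    int32_t wide = (int32_t)x;                /* models first vpmovsxwd  */
    int32_t mask = (int32_t)narrow_mask;      /* models second vpmovsxwd */
    return wide + mask;                       /* models vpaddd           */
}

/* Counts the int16_t inputs for which the two orderings disagree. */
int count_lane_mismatches(void) {
    int mismatches = 0;
    for (int32_t v = INT16_MIN; v <= INT16_MAX; ++v) {
        int16_t x = (int16_t)v;
        if (lane_widen_then_compare(x) != lane_compare_then_widen(x)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Calling `count_lane_mismatches()` walks every possible `int16_t` input, so the narrow compare is a valid transform here; the complaint is only that it costs an extra widening instruction and blocks folding the load.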
_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs