Issue: 137700
Summary: [AVX-512] `vpsubd a, b, vpmovm2d` can be done via a masked `vpsubd`
Labels: new issue
Assignees:
Reporter: dzaima

This code, compiled via `-O3 -march=znver5`:
```c
#include <stdint.h>
#include <immintrin.h>
__m512i count_gt100(uint32_t* in, size_t count) {
  __m512i neg_one = _mm512_set1_epi32(-1);
  __m512i acc = _mm512_set1_epi32(0);
  #pragma clang loop unroll(disable) // just to reduce noise
  for (size_t i = 0; i < count; i++) {
    __m512i val = _mm512_loadu_si512(in);
    __mmask16 mask = _mm512_cmpgt_epi32_mask(val, _mm512_set1_epi32(100));
    acc = _mm512_mask_sub_epi32(acc, mask, acc, neg_one);
    in += 16;
  }
  return acc;
}
```
contains:
```asm
vpmovm2d zmm2, k0
vpsubd zmm0, zmm0, zmm2
```
whereas the code that the intrinsics directly imply, which is also what GCC generates, needs only one instruction for this: a masked `vpsubd`.
https://godbolt.org/z/MGjfGq96h
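
To make the equivalence explicit, here is a minimal scalar sketch of the two sequences in portable C (no AVX-512 required). The names `sub_via_expanded_mask`, `sub_masked`, `models_agree`, and `LANES` are hypothetical and exist only for illustration. The sketch assumes `vpmovm2d` sets a lane to all ones where the mask bit is set and to zero otherwise, and that the masked `vpsubd` uses merge masking, as `_mm512_mask_sub_epi32` does in the code above.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* 512 bits / 32-bit lanes */

/* Scalar model of `vpmovm2d` followed by an unmasked `vpsubd`:
   expand each mask bit to 0 or 0xFFFFFFFF, then subtract in every lane. */
static void sub_via_expanded_mask(uint32_t acc[LANES], uint16_t mask) {
    for (size_t i = 0; i < LANES; i++) {
        uint32_t expanded = ((mask >> i) & 1u) ? UINT32_MAX : 0u;
        acc[i] -= expanded; /* wraps modulo 2^32, like the vector op */
    }
}

/* Scalar model of a merge-masked `vpsubd` with an all-ones operand:
   lanes whose mask bit is clear keep their old value. */
static void sub_masked(uint32_t acc[LANES], uint16_t mask) {
    const uint32_t neg_one = UINT32_MAX; /* bit pattern of -1 */
    for (size_t i = 0; i < LANES; i++) {
        if ((mask >> i) & 1u) {
            acc[i] -= neg_one;
        }
    }
}

/* Returns 1 if both models give the same lanes for the given mask. */
static int models_agree(uint16_t mask) {
    uint32_t a[LANES], b[LANES];
    for (size_t i = 0; i < LANES; i++) {
        a[i] = b[i] = (uint32_t)(i * 7u); /* arbitrary starting counts */
    }
    sub_via_expanded_mask(a, mask);
    sub_masked(b, mask);
    for (size_t i = 0; i < LANES; i++) {
        if (a[i] != b[i]) {
            return 0;
        }
    }
    return 1;
}
```

In both models each lane with a set mask bit is incremented by one and every other lane is unchanged, so the expand-then-subtract pair can in principle be folded into the single masked subtract.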