https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82460
Bug ID: 82460
Summary: AVX512: choose between vpermi2d and vpermt2d to save
mov instructions. Also, fails to optimize away shifts
before shuffle
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
#include <immintrin.h>
// gcc -O3 -march=skylake-avx512 -mavx512vbmi 8.0.0 20171004
// https://godbolt.org/g/fVt4Kb
__m512i vpermi2d(__m512i t1, __m512i control, char *src) {
    return _mm512_permutex2var_epi32(control, t1, _mm512_loadu_si512(src));
}
gcc emits:
        vpermt2d        (%rdi), %zmm0, %zmm1
        vmovdqa64       %zmm1, %zmm0
        ret

clang emits:
        vpermi2d        (%rdi), %zmm1, %zmm0
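Aside, as an illustration (function names below are mine, not from the source): the
unmasked _mm512_permutex2var_epi32 intrinsic doesn't say which input gets clobbered,
so the compiler is free to emit either vpermt2d (which overwrites the first data
operand) or vpermi2d (which overwrites the index), and should pick whichever form
clobbers a dead value.  Only the masked variants pin the instruction form down:

__m512i must_be_vpermt2d(__m512i a, __m512i idx, __m512i b, __mmask16 k) {
    return _mm512_mask_permutex2var_epi32(a, k, idx, b);   // merge destination is a  -> vpermt2d
}
__m512i must_be_vpermi2d(__m512i a, __m512i idx, __m512i b, __mmask16 k) {
    return _mm512_mask2_permutex2var_epi32(a, idx, k, b);  // merge destination is idx -> vpermi2d
}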
__m512i vpermi2b(__m512i t1, __m512i a, __m512i b) {
    return _mm512_permutex2var_epi8(a, t1, b);
}
gcc emits:
        vpermt2b        %zmm2, %zmm0, %zmm1
        vmovdqa64       %zmm1, %zmm0
        ret

clang emits:
        vpermi2b        %zmm2, %zmm1, %zmm0
This one compiles ok, though:
__m512i vpermt2d(__m512i t1, __m512i control, char *src) {
    return _mm512_permutex2var_epi32(t1, control, _mm512_loadu_si512(src));
}

gcc emits:
        vpermt2d        (%rdi), %zmm1, %zmm0
---
But when auto-vectorizing the following function with AVX512VBMI (see bug 82459 for
the AVX512BW missed optimizations), gcc uses vpermi2b when vpermt2b would be better:
#include <stdint.h>
#include <stddef.h>

void pack_high8_baseline(uint8_t *__restrict__ dst,
                         const uint16_t *__restrict__ src, size_t bytes) {
    uint8_t *end_dst = dst + bytes;
    do {
        *dst++ = *src++ >> 8;
    } while (dst < end_dst);
}
.L9:
        vmovdqa64       (%rsi,%rax,2), %zmm0
        vmovdqa64       64(%rsi,%rax,2), %zmm1
        vmovdqa64       %zmm2, %zmm3            # copy the index
        vpsrlw          $8, %zmm0, %zmm0
        vpsrlw          $8, %zmm1, %zmm1
        vpermi2b        %zmm1, %zmm0, %zmm3     # then destroy it
        vmovdqu8        %zmm3, (%rcx,%rax)      # extra uop according to Intel: bug 82459
        addq            $64, %rax
        cmpq            %rax, %rdi
        jne             .L9
Of course, the shifts are redundant when we have a full byte shuffle that doesn't do
any saturating.  And with vpermt2b the clobbered operand is the freshly loaded data
rather than the loop-invariant index, so the per-iteration vmovdqa64 copy of the
index goes away too:
        # different shuffle control in zmm1
.L9:
        vmovdqa64       (%rsi,%rax,2), %zmm0
        vpermt2b        64(%rsi,%rax,2), %zmm1, %zmm0
        vmovdqu64       %zmm0, (%rcx,%rax)
        addq            $64, %rax
        cmpq            %rax, %rdi
        jne             .L9
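For reference, here is a hand-vectorized sketch of that ideal loop with intrinsics
(my code, not gcc output; the function name is made up, and it assumes bytes is a
non-zero multiple of 64).  The shuffle control is simply byte i = 2*i + 1, i.e. take
the high byte of word i of the 128-byte concatenation of the two source vectors:

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

void pack_high8_vbmi(uint8_t *__restrict__ dst,
                     const uint16_t *__restrict__ src, size_t bytes)
{
    // index byte i = 2*i+1; bit 6 of the index selects the second source
    // vector, which happens automatically for i >= 32 because 2*i+1 >= 65.
    char idx[64];
    for (int j = 0; j < 64; j++)
        idx[j] = (char)(2*j + 1);
    __m512i control = _mm512_loadu_si512(idx);

    uint8_t *end_dst = dst + bytes;
    do {
        __m512i lo = _mm512_loadu_si512(src);       // words 0..31
        __m512i hi = _mm512_loadu_si512(src + 32);  // words 32..63
        // lo is dead after this, so the compiler should pick vpermt2b
        // (clobbering lo) rather than vpermi2b plus a copy of control.
        __m512i packed = _mm512_permutex2var_epi8(lo, control, hi);
        _mm512_storeu_si512(dst, packed);
        src += 64;
        dst += 64;
    } while (dst < end_dst);
}

Ideally this compiles to the loop above: one load + one vpermt2b + one store per 64
output bytes, plus loop overhead.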
If unrolling, use pointer increments instead of indexed addressing modes so the
shuffle's memory operand can maybe stay micro-fused (avoiding un-lamination),
although some multi-uop instructions don't micro-fuse in the first place.
vpermt2w is 3 uops on Skylake-AVX512 (p0 + 2p5), so we should expect vpermt2b to be
at least that slow on the first CPUs that support it.  On a CPU where vpermt2b is
p0 + 2p5, this loop will run at about one store per 2 clocks (the two port-5 uops
per shuffle are the bottleneck), the same as what you can achieve with 2x shift +
vpackuswb + vpermq (bug 82459).  But this version has one fewer p0 uop.
With indexing from the end of the arrays to save the CMP, this could also be 7
fused-domain uops for the front-end, assuming the vpermt2b's memory operand doesn't
micro-fuse but the store does.
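At the C level, the index-from-the-end idea looks something like this (a hypothetical
variant of the sketch above, same #includes and same multiple-of-64 assumption); the
counter runs from -bytes up to 0, so the loop overhead is just add $64 / jne with no
separate cmp:

void pack_high8_vbmi_endidx(uint8_t *__restrict__ dst,
                            const uint16_t *__restrict__ src, size_t bytes)
{
    char idx[64];
    for (int j = 0; j < 64; j++)
        idx[j] = (char)(2*j + 1);          // same odd-byte shuffle control
    __m512i control = _mm512_loadu_si512(idx);

    uint8_t *end_dst = dst + bytes;        // dst moves 1 byte per output byte
    const uint16_t *end_src = src + bytes; // src moves 1 word per output byte
    ptrdiff_t i = -(ptrdiff_t)bytes;       // negative index counting up toward 0
    do {
        __m512i lo = _mm512_loadu_si512(end_src + i);
        __m512i hi = _mm512_loadu_si512(end_src + i + 32);
        __m512i packed = _mm512_permutex2var_epi8(lo, control, hi);
        _mm512_storeu_si512(end_dst + i, packed);
        i += 64;                           // one add feeds both addressing modes
    } while (i != 0);
}

The addressing modes match what gcc already uses above ((%rsi,%rax,2) and
(%rcx,%rax)); biasing the base registers to the ends of the arrays just lets the
index count up toward zero so the cmpq disappears.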