On Fri, Sep 25, 2020 at 08:58:35AM +0200, Richard Biener wrote:
> On Thu, Sep 24, 2020 at 9:38 PM Segher Boessenkool
> <seg...@kernel.crashing.org> wrote:
> > after which I get (-march=znver2)
> >
> > setg:
> >         vmovd   %edi, %xmm1
> >         vmovd   %esi, %xmm2
> >         vpbroadcastd    %xmm1, %ymm1
> >         vpbroadcastd    %xmm2, %ymm2
> >         vpcmpeqd        .LC0(%rip), %ymm1, %ymm1
> >         vpandn  %ymm0, %ymm1, %ymm0
> >         vpand   %ymm2, %ymm1, %ymm1
> >         vpor    %ymm0, %ymm1, %ymm0
> >         ret
>
> I get with -march=znver2 -O2
>
>         vmovd   %edi, %xmm1
>         vmovd   %esi, %xmm2
>         vpbroadcastd    %xmm1, %ymm1
>         vpbroadcastd    %xmm2, %ymm2
>         vpcmpeqd        .LC0(%rip), %ymm1, %ymm1
>         vpblendvb       %ymm1, %ymm2, %ymm0, %ymm0

Ah, maybe my x86 compiler is too old...
  x86_64-linux-gcc (GCC) 10.0.0 20190919 (experimental)
not exactly old, huh.  I wonder what I am doing wrong then.

> Now, with SSE4.2 the 16byte case compiles to
>
> setg:
> .LFB0:
>         .cfi_startproc
>         movd    %edi, %xmm3
>         movdqa  %xmm0, %xmm1
>         movd    %esi, %xmm4
>         pshufd  $0, %xmm3, %xmm0
>         pcmpeqd .LC0(%rip), %xmm0
>         movdqa  %xmm0, %xmm2
>         pandn   %xmm1, %xmm2
>         pshufd  $0, %xmm4, %xmm1
>         pand    %xmm1, %xmm0
>         por     %xmm2, %xmm0
>         ret
>
> since there's no blend with a variable mask IIRC.

PowerPC got at least *that* right since time immemorial :-)

> with aarch64 and SVE it doesn't handle the 32byte case at all,
> the 16byte case compiles to
>
> setg:
> .LFB0:
>         .cfi_startproc
>         adrp    x2, .LC0
>         dup     v1.4s, w0
>         dup     v2.4s, w1
>         ldr     q3, [x2, #:lo12:.LC0]
>         cmeq    v1.4s, v1.4s, v3.4s
>         bit     v0.16b, v2.16b, v1.16b
>
> which looks equivalent to the AVX2 code.

Yes, and we can do pretty much the same on Power, too.

> For all of those varying the vector element type may also
> cause "issues" I guess.

For us, as long as it stays 16B vectors, all should be fine.  There may
be issues in the compiler, but at least the hardware has no problem with
it ;-)

> > and for powerpc (changing it to 16B vectors, -mcpu=power9) it is
> >
> > setg:
> >         addis 9,2,.LC0@toc@ha
> >         mtvsrws 32,5
> >         mtvsrws 33,6
> >         addi 9,9,.LC0@toc@l
> >         lxv 45,0(9)
> >         vcmpequw 0,0,13
> >         xxsel 34,34,33,32
> >         blr

The -mcpu=power10 code right now is just

        plxv 45,.LC0@pcrel
        mtvsrws 32,5
        mtvsrws 33,6
        vcmpequw 0,0,13
        xxsel 34,34,33,32
        blr

(exactly the same, but less memory address setup cost), so doing
something like this as a generic version would work quite well pretty
much everywhere I think!


Segher
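
For reference, the broadcast / compare / select sequence that all of the
listings above share can be written by hand with GCC's generic vector
extensions.  This is only a sketch: the test source is not quoted in this
message, so the type name v4si, the argument order and the function body
are assumptions, and the constant below merely plays the role of .LC0.

```c
/* Hypothetical 16-byte vector of four ints (GCC/Clang vector extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Return v with element n replaced by x, without going through memory.
   If n is out of range, no lane matches and v is returned unchanged. */
v4si setg(v4si v, int n, int x)
{
    const v4si lanes = {0, 1, 2, 3};   /* lane indices, like .LC0 */
    v4si idx = {n, n, n, n};           /* broadcast the index */
    v4si val = {x, x, x, x};           /* broadcast the new value */
    v4si mask = (idx == lanes);        /* -1 in the matching lane, else 0 */

    /* and/andn/or form; a target with a variable blend or select
       (vpblendvb, bit, xxsel) can do this in one instruction. */
    return (val & mask) | (v & ~mask);
}
```

With v = {10, 20, 30, 40}, setg(v, 2, 99) gives {10, 20, 99, 40}.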