On Fri, Sep 25, 2020 at 08:58:35AM +0200, Richard Biener wrote:
> On Thu, Sep 24, 2020 at 9:38 PM Segher Boessenkool
> <seg...@kernel.crashing.org> wrote:
> > after which I get (-march=znver2)
> >
> > setg:
> >         vmovd   %edi, %xmm1
> >         vmovd   %esi, %xmm2
> >         vpbroadcastd    %xmm1, %ymm1
> >         vpbroadcastd    %xmm2, %ymm2
> >         vpcmpeqd        .LC0(%rip), %ymm1, %ymm1
> >         vpandn  %ymm0, %ymm1, %ymm0
> >         vpand   %ymm2, %ymm1, %ymm1
> >         vpor    %ymm0, %ymm1, %ymm0
> >         ret
> 
> I get with -march=znver2 -O2
> 
>         vmovd   %edi, %xmm1
>         vmovd   %esi, %xmm2
>         vpbroadcastd    %xmm1, %ymm1
>         vpbroadcastd    %xmm2, %ymm2
>         vpcmpeqd        .LC0(%rip), %ymm1, %ymm1
>         vpblendvb       %ymm1, %ymm2, %ymm0, %ymm0

Ah, maybe my x86 compiler is too old...
  x86_64-linux-gcc (GCC) 10.0.0 20190919 (experimental)
not exactly old, huh.  I wonder what I am doing wrong then.

> Now, with SSE4.2 the 16byte case compiles to
> 
> setg:
> .LFB0:
>         .cfi_startproc
>         movd    %edi, %xmm3
>         movdqa  %xmm0, %xmm1
>         movd    %esi, %xmm4
>         pshufd  $0, %xmm3, %xmm0
>         pcmpeqd .LC0(%rip), %xmm0
>         movdqa  %xmm0, %xmm2
>         pandn   %xmm1, %xmm2
>         pshufd  $0, %xmm4, %xmm1
>         pand    %xmm1, %xmm0
>         por     %xmm2, %xmm0
>         ret
> 
> since there's no blend with a variable mask IIRC.

PowerPC got at least *that* right since time immemorial :-)

> with aarch64 and SVE it doesn't handle the 32byte case at all,
> the 16byte case compiles to
> 
> setg:
> .LFB0:
>         .cfi_startproc
>         adrp    x2, .LC0
>         dup     v1.4s, w0
>         dup     v2.4s, w1
>         ldr     q3, [x2, #:lo12:.LC0]
>         cmeq    v1.4s, v1.4s, v3.4s
>         bit     v0.16b, v2.16b, v1.16b
> 
> which looks equivalent to the AVX2 code.

Yes, and we can do pretty much the same on Power, too.

> For all of those varying the vector element type may also
> cause "issues" I guess.

For us, as long as it stays 16B vectors, all should be fine.  There may
be issues in the compiler, but at least the hardware has no problem with
it ;-)

> > and for powerpc (changing it to 16B vectors, -mcpu=power9) it is
> >
> > setg:
> >         addis 9,2,.LC0@toc@ha
> >         mtvsrws 32,5
> >         mtvsrws 33,6
> >         addi 9,9,.LC0@toc@l
> >         lxv 45,0(9)
> >         vcmpequw 0,0,13
> >         xxsel 34,34,33,32
> >         blr

The -mcpu=power10 code right now is just

        plxv 45,.LC0@pcrel
        mtvsrws 32,5
        mtvsrws 33,6
        vcmpequw 0,0,13
        xxsel 34,34,33,32
        blr

(exactly the same, but less memory address setup cost), so doing
something like this as a generic version would work quite well pretty
much everywhere I think!
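
For reference, the sequence all of the listings above share (splat the
index, compare it against a constant vector of lane numbers, then select
between the old vector and the splatted value) can be sketched in C with
GCC's generic vector extensions.  This is only an illustration and not
the actual expander code: the name set_elem and the v4si typedef are
made up for the example, and it assumes 16B vectors of int.

```c
/* 16-byte vector of four ints, using GCC's generic vector extension. */
typedef int v4si __attribute__ ((vector_size (16)));

/* Hypothetical helper (not from the thread): return V with lane IDX
   replaced by VAL, without going through memory.  */
static v4si
set_elem (v4si v, int idx, int val)
{
  /* Lane numbers; this plays the role of the .LC0 constant.  */
  const v4si lanes = { 0, 1, 2, 3 };

  /* Splat the index and the new value across all lanes.  */
  v4si idxv = { idx, idx, idx, idx };
  v4si valv = { val, val, val, val };

  /* All-ones in the lane whose number equals IDX, zero elsewhere.  */
  v4si mask = (idxv == lanes);

  /* Bitwise select: VAL where the mask is set, the old V otherwise.  */
  return (valv & mask) | (v & ~mask);
}
```

With an out-of-range index no lane matches, so this sketch returns the
vector unchanged.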


Segher
