On Mon, Mar 2, 2026 at 5:36 PM Roger Sayle <[email protected]> wrote:
>
>
> Hi Hongtao,
> Many thanks for reviewing the x86_64 pieces.
>
> >    if (negate)
> > -    cmp = ix86_expand_int_sse_cmp (operands[0], EQ, cmp,
> > -                                CONST0_RTX (GET_MODE (cmp)),
> > -                                NULL, NULL, &negate);
> > -
> > -  gcc_assert (!negate);
> > +    {
> > +      if (TARGET_AVX512F && GET_MODE_SIZE (GET_MODE (cmp)) >= 16)
> > +	cmp = gen_rtx_XOR (GET_MODE (cmp), cmp,
> > +			   CONSTM1_RTX (GET_MODE (cmp)));
> > +      else
> > +     {
> > +       cmp = ix86_expand_int_sse_cmp (operands[0], EQ, cmp,
> > +                                      CONST0_RTX (GET_MODE (cmp)),
> > +                                      NULL, NULL, &negate);
> > +       gcc_assert (!negate);
> > +     }
> > +    }
> >
> > Technically it's correct; however, in actual scenarios avx512
> > (x86-64-v4) will enter ix86_expand_mask_vec_cmp, so this optimization
> > appears to only target the scenario of avx512f + no-avx512vl +
> > VL == 16/32, which doesn't sound particularly useful.
>
> The mistake in this reasoning is the assumption that this function
> is never entered in actual scenarios.
>
> Consider:
>
> typedef char v16qi __attribute__((vector_size(16)));
> v16qi x, y, m;
> void foo() { m = x != y; }
>
> which when compiled with -O2 -mavx512vl on mainline currently generates:
>
> foo:    vmovdqa x(%rip), %xmm0
>         vpxor   %xmm1, %xmm1, %xmm1
>         vpcmpeqb        y(%rip), %xmm0, %xmm0
>         vpcmpeqb        %xmm1, %xmm0, %xmm0
>         vmovdqa %xmm0, m(%rip)
>         ret
>
> which uses vpxor and vpcmpeqb to invert the mask.
> With the proposed chunk above, we instead generate:
>
> foo:    vmovdqa x(%rip), %xmm0
>         vpcmpeqb        y(%rip), %xmm0, %xmm0
>         vpternlogd      $0x55, %xmm0, %xmm0, %xmm0
>         vmovdqa %xmm0, m(%rip)
>         ret
>
> Not only is this one instruction fewer and shorter in bytes,
> but the not/xor/ternlog can be fused by combine with any
> following binary logic, whereas the vpcmpeqb against zero
> unfortunately can't (easily) be.
>
> The Bugzilla PR concerns x86_64 using vpcmpeqb to
> negate masks where it shouldn't; the example above
> is exactly the sort of case it was complaining about.
> I was hoping the above not/xor/ternlog and a following
> blend or pand-pandn-por could eventually be fused into
> a single ternlog instruction, i.e. with ternlog the RTL
> optimizers (combine) can potentially swap operands of
> VCOND_MASK without requiring the middle-end's help.

I see, thanks for the explanation.

>
> Thanks (again) in advance,
> Roger
> --
>
>


-- 
BR,
Hongtao
