Hi Hongtao,
Many thanks for reviewing the x86_64 pieces.

>    if (negate)
> -    cmp = ix86_expand_int_sse_cmp (operands[0], EQ, cmp,
> -                                CONST0_RTX (GET_MODE (cmp)),
> -                                NULL, NULL, &negate);
> -
> -  gcc_assert (!negate);
> +    {
> +      if (TARGET_AVX512F && GET_MODE_SIZE (GET_MODE (cmp)) >= 16)
> +     cmp = gen_rtx_XOR (GET_MODE (cmp), cmp, CONSTM1_RTX
> (GET_MODE (cmp)));
> +      else
> +     {
> +       cmp = ix86_expand_int_sse_cmp (operands[0], EQ, cmp,
> +                                      CONST0_RTX (GET_MODE (cmp)),
> +                                      NULL, NULL, &negate);
> +       gcc_assert (!negate);
> +     }
> +    }
> 
> Technically it's correct, however, in actual scenarios, avx512 (x86-64-v4)
> will enter ix86_expand_mask_vec_cmp, so this optimization appears to only
> target the scenario of avx512f + no-avx512vl + VL == 16/32, which doesn't
> sound particularly useful.

The flaw in this reasoning is that this function is, in fact, entered in
real-world scenarios.

Consider:

typedef char v16qi __attribute__((vector_size(16)));
v16qi x, y, m;
void foo() { m = x != y; }

which when compiled with -O2 -mavx512vl on mainline currently generates:

foo:    vmovdqa x(%rip), %xmm0
        vpxor   %xmm1, %xmm1, %xmm1
        vpcmpeqb        y(%rip), %xmm0, %xmm0
        vpcmpeqb        %xmm1, %xmm0, %xmm0
        vmovdqa %xmm0, m(%rip)
        ret

which uses vpxor and vpcmpeqb to invert the mask.
With the proposed chunk above, we instead generate:

foo:    vmovdqa x(%rip), %xmm0
        vpcmpeqb        y(%rip), %xmm0, %xmm0
        vpternlogd      $0x55, %xmm0, %xmm0, %xmm0
        vmovdqa %xmm0, m(%rip)
        ret

Not only is this one instruction fewer (and shorter in bytes),
but the not/xor/ternlog can be fused by combine with any
following binary logic, whereas the vpcmpeqb against zero
can't (easily) be.

The Bugzilla PR concerns x86_64 using vpcmpeqb to
negate masks when it shouldn't; the example above
is exactly the sort of case it complains about.
I was hoping the above not/xor/ternlog and a following
blend or pand-pandn-por could eventually be fused into
a single ternlog instruction, i.e. with ternlog the RTL
optimizers (combine) can potentially swap operands of
VCOND_MASK without requiring the middle-end's help.

Thanks (again) in advance,
Roger