Re: [PATCH 2/2][RFC] Add loop masking support for x86

Hongtao Liu via Gcc-patches Wed, 21 Jul 2021 02:33:31 -0700

On Wed, Jul 21, 2021 at 4:16 PM Richard Biener <rguent...@suse.de> wrote:
>
> On Wed, 21 Jul 2021, Hongtao Liu wrote:
>
> > On Tue, Jul 20, 2021 at 3:38 PM Richard Biener <rguent...@suse.de> wrote:
> > >
> > > On Tue, 20 Jul 2021, Hongtao Liu wrote:
> > >
> > > > On Fri, Jul 16, 2021 at 5:11 PM Richard Biener <rguent...@suse.de> wrote:
> > > > >
> > > > > On Thu, 15 Jul 2021, Richard Biener wrote:
> > > > >
> > > > > > On Thu, 15 Jul 2021, Richard Biener wrote:
> > > > > >
> > > > > > > OK, guess I was more looking at
> > > > > > >
> > > > > > > #define N 32
> > > > > > > int foo (unsigned long *a, unsigned long * __restrict b,
> > > > > > >          unsigned int *c, unsigned int * __restrict d,
> > > > > > >          int n)
> > > > > > > {
> > > > > > >   unsigned sum = 1;
> > > > > > >   for (int i = 0; i < n; ++i)
> > > > > > >     {
> > > > > > >       b[i] += a[i];
> > > > > > >       d[i] += c[i];
> > > > > > >     }
> > > > > > >   return sum;
> > > > > > > }
> > > > > > >
> > > > > > > where we on x86 AVX512 vectorize with V8DI and V16SI and we
> > > > > > > generate two masks for the two copies of V8DI (VF is 16) and one
> > > > > > > mask for V16SI.  With SVE I see
> > > > > > >
> > > > > > >         punpklo p1.h, p0.b
> > > > > > >         punpkhi p2.h, p0.b
> > > > > > >
> > > > > > > that's sth I expected to see for AVX512 as well, using the V16SI
> > > > > > > mask and unpacking that to two V8DI ones.  But I see
> > > > > > >
> > > > > > >         vpbroadcastd    %eax, %ymm0
> > > > > > >         vpaddd  %ymm12, %ymm0, %ymm0
> > > > > > >         vpcmpud $6, %ymm0, %ymm11, %k3
> > > > > > >         vpbroadcastd    %eax, %xmm0
> > > > > > >         vpaddd  %xmm10, %xmm0, %xmm0
> > > > > > >         vpcmpud $1, %xmm7, %xmm0, %k1
> > > > > > >         vpcmpud $6, %xmm0, %xmm8, %k2
> > > > > > >         kortestb        %k1, %k1
> > > > > > >         jne     .L3
> > > > > > >
> > > > > > > so three %k masks generated by vpcmpud.  I'll have to look what's
> > > > > > > the magic for SVE and why that doesn't trigger for x86 here.
> > > > > >
> > > > > > So answer myself, vect_maybe_permute_loop_masks looks for
> > > > > > vec_unpacku_hi/lo_optab, but with AVX512 the vector bools have
> > > > > > QImode so that doesn't play well here.  Not sure if there
> > > > > > are proper mask instructions to use (I guess there's a shift
> > > > > > and lopart is free).  This is QI:8 to two QI:4 (bits) mask
> > > > Yes, for 16-bit masks and wider we have KUNPCKBW/WD/DQ, but for 8-bit
> > > > unpack_lo/hi there is only a shift.
> > > > > > conversion.  Not sure how to better ask the target here - again
> > > > > > VnBImode might have been easier here.
> > > > >
> > > > > So I've managed to "emulate" the unpack_lo/hi for the case of
> > > > > !VECTOR_MODE_P masks by using sub-vector select (we're asking
> > > > > to turn vector(8) <signed-boolean:1> into two
> > > > > vector(4) <signed-boolean:1>) via BIT_FIELD_REF.  That then
> > > > > produces the desired single mask producer and
> > > > >
> > > > >   loop_mask_38 = VIEW_CONVERT_EXPR<vector(4) <signed-boolean:1>>(loop_mask_54);
> > > > >   loop_mask_37 = BIT_FIELD_REF <loop_mask_54, 4, 4>;
> > > > >
> > > > > note for the lowpart we can just view-convert away the excess bits,
> > > > > fully re-using the mask.  We generate surprisingly "good" code:
> > > > >
> > > > >         kmovb   %k1, %edi
> > > > >         shrb    $4, %dil
> > > > >         kmovb   %edi, %k2
> > > > >
> > > > > besides the lack of using kshiftrb.  I guess we're just lacking
> > > > > a mask register alternative for
> > > > Yes, we can do it similarly to kor/kand/kxor.
> > > > >
> > > > > (insn 22 20 25 4 (parallel [
> > > > >             (set (reg:QI 94 [ loop_mask_37 ])
> > > > >                 (lshiftrt:QI (reg:QI 98 [ loop_mask_54 ])
> > > > >                     (const_int 4 [0x4])))
> > > > >             (clobber (reg:CC 17 flags))
> > > > >         ]) 724 {*lshrqi3_1}
> > > > >      (expr_list:REG_UNUSED (reg:CC 17 flags)
> > > > >         (nil)))
> > > > >
> > > > > and so we reload.  For the above cited loop the AVX512 vectorization
> > > > > with --param vect-partial-vector-usage=1 does look quite sensible
> > > > > to me.  Instead of a SSE vectorized epilogue plus a scalar
> > > > > epilogue we get a single fully masked AVX512 "iteration" for both.
> > > > > I suppose it's still mostly a code-size optimization (384 bytes
> > > > > with the masked epilogue vs. 474 bytes with trunk) since it will
> > > > > be likely slower for very low iteration counts but it's good
> > > > > for icache usage then and good for less branch predictor usage.
> > > > >
> > > > > That said, I have to set up SPEC on an AVX512 machine to do
> > > > Has the patch landed in trunk already? I can test it on CLX.
> > >
> > > I'm still experimenting a bit right now but hope to get something
> > > trunk ready at the end of this or beginning next week.  Since it's
> > > disabled by default we can work on improving it during stage1 then.
> > >
> > > I'm mostly struggling with the GIMPLE IL to be used for the
> > > mask unpacking since we currently reject both the BIT_FIELD_REF
> > > and the VIEW_CONVERT we generate (why do AVX512 masks not all have
> > > SImode but sometimes QImode and sometimes HImode ...).  Unfortunately
> > > we've dropped whole-vector shifts in favor of VEC_PERM but that
> > > doesn't work well either for integer mode vectors.  So I'm still
> > > playing with my options here and looking for something that doesn't
> > > require too much surgery on the RTL side to recover good mask
> > > register code ...
> > >
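A side note on the mask unpacking discussed above. As a minimal sketch in
plain C (not GCC code; the function names are made up for illustration),
splitting an 8-lane mask held in one byte into two 4-lane masks is just
keeping the low bits plus a logical right shift. That is what the
BIT_FIELD_REF <loop_mask_54, 4, 4> and the shrb $4 sequence amount to,
assuming lane 0 lives in bit 0:

```c
#include <stdint.h>

/* Low half of an 8-lane mask: lanes 0..3.  In the GIMPLE above this is
   the VIEW_CONVERT_EXPR; here the excess bits are cleared explicitly so
   the result is a clean 4-lane mask.  */
static uint8_t mask_unpack_lo(uint8_t m8)
{
  return (uint8_t)(m8 & 0x0f);
}

/* High half of an 8-lane mask: lanes 4..7, moved down to bits 0..3.
   This models BIT_FIELD_REF <mask, 4, 4>, i.e. a logical shift right
   by 4 (what kshiftrb $4 would do in a mask register).  */
static uint8_t mask_unpack_hi(uint8_t m8)
{
  return (uint8_t)(m8 >> 4);
}
```

For example, the mask 0xA5 (lanes 0, 2, 5 and 7 active) splits into 0x05
for the low half and 0x0A for the high half.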
> > > Another part missing is expanders for the various cond_* patterns
> > >
> > > OPTAB_D (cond_add_optab, "cond_add$a")
> > > OPTAB_D (cond_sub_optab, "cond_sub$a")
> > > OPTAB_D (cond_smul_optab, "cond_mul$a")
> > > OPTAB_D (cond_sdiv_optab, "cond_div$a")
> > > OPTAB_D (cond_smod_optab, "cond_mod$a")
> > > OPTAB_D (cond_udiv_optab, "cond_udiv$a")
> > > OPTAB_D (cond_umod_optab, "cond_umod$a")
> > > OPTAB_D (cond_and_optab, "cond_and$a")
> > > OPTAB_D (cond_ior_optab, "cond_ior$a")
> > > OPTAB_D (cond_xor_optab, "cond_xor$a")
> > > OPTAB_D (cond_ashl_optab, "cond_ashl$a")
> > > OPTAB_D (cond_ashr_optab, "cond_ashr$a")
> > > OPTAB_D (cond_lshr_optab, "cond_lshr$a")
> > > OPTAB_D (cond_smin_optab, "cond_smin$a")
> > > OPTAB_D (cond_smax_optab, "cond_smax$a")
> > > OPTAB_D (cond_umin_optab, "cond_umin$a")
> > > OPTAB_D (cond_umax_optab, "cond_umax$a")
> > > OPTAB_D (cond_fma_optab, "cond_fma$a")
> > > OPTAB_D (cond_fms_optab, "cond_fms$a")
> > > OPTAB_D (cond_fnma_optab, "cond_fnma$a")
> > > OPTAB_D (cond_fnms_optab, "cond_fnms$a")
> > >
> > > I think the most useful are those for possibly trapping ops
> > > (will be used by if-conversion) and those for reduction operations
> > > (add,min,max) which would enable a masked reduction epilogue.
> > I've added cond_add/sub/max/min/smax/smin with my local patch, but I
> > can't figure out testcases to validate them.
> > Any ideas?
>
> For example
>
> double a[1024], b[1024];
>
> void foo ()
> {
>   for (int i = 0; i < 1024; ++i)
>     if (b[i] < 3.)
>       a[i] = b[i] + 3.;
> }
>
Oh, thanks.
The loop is successfully vectorized with the cond_add expanders:


        .cfi_startproc
        vbroadcastsd        .LC1(%rip), %ymm1
        xorl        %eax, %eax
        jmp        .L3
        .p2align 4,,10
        .p2align 3
.L2:
        addq        $32, %rax
        cmpq        $8192, %rax
        je        .L9
.L3:
        vmovapd        b(%rax), %ymm0
        vcmppd        $1, %ymm1, %ymm0, %k1
        kortestb        %k1, %k1
        je        .L2
        vaddpd        %ymm1, %ymm0, %ymm2{%k1}{z}
        vmovapd        %ymm2, a(%rax){%k1}
        addq        $32, %rax
        cmpq        $8192, %rax
        jne        .L3
.L9:
        vzeroupper

Here's the dump:

  vector(4) double * vectp_a.10;
  vector(4) double * vectp_a.9;
  vector(4) double vect__2.8;
  vector(4) <signed-boolean:1> mask__22.7;
  vector(4) double vect__1.6;
  vector(4) double * vectp_b.5;
  vector(4) double * vectp_b.4;
  int i;
  double _1;
  double _2;
  unsigned int ivtmp_3;
  unsigned int ivtmp_5;
  _Bool _11;
  _Bool _22;
  double * _23;
  vector(4) double vect_cst__28;
  vector(4) double vect_cst__30;
  vector(4) double vect_cst__31;
  unsigned int ivtmp_36;
  unsigned int ivtmp_37;

  <bb 2> [local count: 10737416]:
  _11 = 1;
  vect_cst__28 = { 3.0e+0, 3.0e+0, 3.0e+0, 3.0e+0 };
  vect_cst__30 = { 3.0e+0, 3.0e+0, 3.0e+0, 3.0e+0 };
  vect_cst__31 = { 0.0, 0.0, 0.0, 0.0 };

  <bb 3> [local count: 268435396]:
  # i_12 = PHI <i_8(7), 0(2)>
  # ivtmp_5 = PHI <ivtmp_3(7), 1024(2)>
  # vectp_b.4_25 = PHI <vectp_b.4_26(7), &b(2)>
  # vectp_a.9_33 = PHI <vectp_a.9_34(7), &a(2)>
  # ivtmp_36 = PHI <ivtmp_37(7), 0(2)>
  vect__1.6_27 = MEM <vector(4) double> [(double *)vectp_b.4_25];
  _1 = b[i_12];
  mask__22.7_29 = vect__1.6_27 < vect_cst__28;
  if (mask__22.7_29 == { 0, 0, 0, 0 })
    goto <bb 20>; [100.00%]
  else
    goto <bb 21>; [20.00%]

  <bb 21> [local count: 53687078]:
  vect__2.8_32 = .COND_ADD (mask__22.7_29, vect__1.6_27, vect_cst__30, vect_cst__31);  <--- Here.
  .MASK_STORE (vectp_a.9_33, 256B, mask__22.7_29, vect__2.8_32);

  <bb 20> [local count: 268435396]:
  i_8 = i_12 + 1;
  ivtmp_3 = ivtmp_5 - 1;
  vectp_b.4_26 = vectp_b.4_25 + 32;
  vectp_a.9_34 = vectp_a.9_33 + 32;
  ivtmp_37 = ivtmp_36 + 1;
  if (ivtmp_37 < 256)
    goto <bb 7>; [96.00%]
  else
    goto <bb 17>; [4.00%]

  <bb 7> [local count: 257697980]:
  goto <bb 3>; [100.00%]

  <bb 17> [local count: 10737416]:
  return;
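
To make the dump easier to read, here is a rough scalar model in C of what
bb 3 and bb 21 compute for each 4-lane vector. It is only a sketch: the
helper names (cond_add_model, mask_store_model, foo_model, foo_ref) are
invented for illustration and are not GCC internals, and the model assumes
n is a multiple of 4. The point is that .COND_ADD yields a + b only in the
active lanes and the "else" value (zero here) elsewhere, so no add is
performed for lanes where b[i] < 3. is false.

```c
#include <stdbool.h>
#include <stddef.h>

#define LANES 4  /* vector(4) double, as in the dump */

/* Model of .COND_ADD (mask, a, b, else_val):
   per lane, out = mask ? a + b : else_val.  */
static void cond_add_model(const bool mask[LANES], const double a[LANES],
                           const double b[LANES],
                           const double else_val[LANES], double out[LANES])
{
  for (int l = 0; l < LANES; ++l)
    out[l] = mask[l] ? a[l] + b[l] : else_val[l];
}

/* Model of .MASK_STORE: only the active lanes are written.  */
static void mask_store_model(double *dst, const bool mask[LANES],
                             const double src[LANES])
{
  for (int l = 0; l < LANES; ++l)
    if (mask[l])
      dst[l] = src[l];
}

/* Model of the vectorized loop; n must be a multiple of LANES.  */
static void foo_model(double *a, const double *b, size_t n)
{
  const double three[LANES] = { 3.0, 3.0, 3.0, 3.0 };
  const double zero[LANES] = { 0.0, 0.0, 0.0, 0.0 };

  for (size_t i = 0; i < n; i += LANES)
    {
      bool mask[LANES];
      bool any = false;
      for (int l = 0; l < LANES; ++l)
        {
          mask[l] = b[i + l] < 3.0;   /* the vcmppd producing %k1 */
          any = any || mask[l];
        }
      if (!any)
        continue;                     /* the kortestb / je .L2 early-out */

      double sum[LANES];
      cond_add_model(mask, &b[i], three, zero, sum);
      mask_store_model(&a[i], mask, sum);
    }
}

/* The original scalar loop, for comparison.  */
static void foo_ref(double *a, const double *b, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    if (b[i] < 3.0)
      a[i] = b[i] + 3.0;
}
```

Running foo_model and foo_ref on the same input should leave identical
contents in a, with elements whose b[i] is not below 3. left untouched.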


> cannot be if-converted with -O3 due to -ftrapping-math and the
> add possibly trapping.  But with cond_add it should be if-converted
> and thus vectorized by making the add masked (in addition to the
> masked store).
>
> Richard.



-- 
BR,
Hongtao
