On Wed, 21 Jul 2021, Hongtao Liu wrote:

> On Wed, Jul 21, 2021 at 4:16 PM Richard Biener <rguent...@suse.de> wrote:
> >
> > On Wed, 21 Jul 2021, Hongtao Liu wrote:
> >
> > > On Tue, Jul 20, 2021 at 3:38 PM Richard Biener <rguent...@suse.de> wrote:
> > > >
> > > > On Tue, 20 Jul 2021, Hongtao Liu wrote:
> > > >
> > > > > On Fri, Jul 16, 2021 at 5:11 PM Richard Biener <rguent...@suse.de>
> > > > > wrote:
> > > > > >
> > > > > > On Thu, 15 Jul 2021, Richard Biener wrote:
> > > > > >
> > > > > > > On Thu, 15 Jul 2021, Richard Biener wrote:
> > > > > > >
> > > > > > > > OK, guess I was more looking at
> > > > > > > >
> > > > > > > > #define N 32
> > > > > > > > int foo (unsigned long *a, unsigned long * __restrict b,
> > > > > > > >          unsigned int *c, unsigned int * __restrict d,
> > > > > > > >          int n)
> > > > > > > > {
> > > > > > > >   unsigned sum = 1;
> > > > > > > >   for (int i = 0; i < n; ++i)
> > > > > > > >     {
> > > > > > > >       b[i] += a[i];
> > > > > > > >       d[i] += c[i];
> > > > > > > >     }
> > > > > > > >   return sum;
> > > > > > > > }
> > > > > > > >
> > > > > > > > where we on x86 AVX512 vectorize with V8DI and V16SI and we
> > > > > > > > generate two masks for the two copies of V8DI (VF is 16) and one
> > > > > > > > mask for V16SI. With SVE I see
> > > > > > > >
> > > > > > > >         punpklo p1.h, p0.b
> > > > > > > >         punpkhi p2.h, p0.b
> > > > > > > >
> > > > > > > > that's sth I expected to see for AVX512 as well, using the V16SI
> > > > > > > > mask and unpacking that to two V8DI ones. But I see
> > > > > > > >
> > > > > > > >         vpbroadcastd %eax, %ymm0
> > > > > > > >         vpaddd %ymm12, %ymm0, %ymm0
> > > > > > > >         vpcmpud $6, %ymm0, %ymm11, %k3
> > > > > > > >         vpbroadcastd %eax, %xmm0
> > > > > > > >         vpaddd %xmm10, %xmm0, %xmm0
> > > > > > > >         vpcmpud $1, %xmm7, %xmm0, %k1
> > > > > > > >         vpcmpud $6, %xmm0, %xmm8, %k2
> > > > > > > >         kortestb %k1, %k1
> > > > > > > >         jne .L3
> > > > > > > >
> > > > > > > > so three %k masks generated by vpcmpud.
> > > > > > > > I'll have to look what's
> > > > > > > > the magic for SVE and why that doesn't trigger for x86 here.
> > > > > > >
> > > > > > > So answer myself, vect_maybe_permute_loop_masks looks for
> > > > > > > vec_unpacku_hi/lo_optab, but with AVX512 the vector bools have
> > > > > > > QImode so that doesn't play well here. Not sure if there
> > > > > > > are proper mask instructions to use (I guess there's a shift
> > > > > > > and lopart is free). This is QI:8 to two QI:4 (bits) mask
> > > > > Yes, for 16bit and more, we have KUNPCKBW/D/Q. but for 8bit
> > > > > unpack_lo/hi, only shift.
> > > > > > > conversion. Not sure how to better ask the target here - again
> > > > > > > VnBImode might have been easier here.
> > > > > >
> > > > > > So I've managed to "emulate" the unpack_lo/hi for the case of
> > > > > > !VECTOR_MODE_P masks by using sub-vector select (we're asking
> > > > > > to turn vector(8) <signed-boolean:1> into two
> > > > > > vector(4) <signed-boolean:1>) via BIT_FIELD_REF. That then
> > > > > > produces the desired single mask producer and
> > > > > >
> > > > > >   loop_mask_38 = VIEW_CONVERT_EXPR<vector(4) <signed-boolean:1>>(loop_mask_54);
> > > > > >   loop_mask_37 = BIT_FIELD_REF <loop_mask_54, 4, 4>;
> > > > > >
> > > > > > note for the lowpart we can just view-convert away the excess bits,
> > > > > > fully re-using the mask. We generate surprisingly "good" code:
> > > > > >
> > > > > >         kmovb %k1, %edi
> > > > > >         shrb $4, %dil
> > > > > >         kmovb %edi, %k2
> > > > > >
> > > > > > besides the lack of using kshiftrb. I guess we're just lacking
> > > > > > a mask register alternative for
> > > > > Yes, we can do it similar as kor/kand/kxor.
> > > > > >
> > > > > > (insn 22 20 25 4 (parallel [
> > > > > >             (set (reg:QI 94 [ loop_mask_37 ])
> > > > > >                 (lshiftrt:QI (reg:QI 98 [ loop_mask_54 ])
> > > > > >                     (const_int 4 [0x4])))
> > > > > >             (clobber (reg:CC 17 flags))
> > > > > >         ]) 724 {*lshrqi3_1}
> > > > > >      (expr_list:REG_UNUSED (reg:CC 17 flags)
> > > > > >         (nil)))
> > > > > >
> > > > > > and so we reload. For the above cited loop the AVX512 vectorization
> > > > > > with --param vect-partial-vector-usage=1 does look quite sensible
> > > > > > to me. Instead of an SSE vectorized epilogue plus a scalar
> > > > > > epilogue we get a single fully masked AVX512 "iteration" for both.
> > > > > > I suppose it's still mostly a code-size optimization (384 bytes
> > > > > > with the masked epilogue vs. 474 bytes with trunk) since it will
> > > > > > be likely slower for very low iteration counts but it's good
> > > > > > for icache usage then and good for less branch predictor usage.
> > > > > >
> > > > > > That said, I have to set up SPEC on an AVX512 machine to do
> > > > > Does patch land in trunk already, I can have a test on CLX.
> > > >
> > > > I'm still experimenting a bit right now but hope to get something
> > > > trunk ready at the end of this or beginning next week. Since it's
> > > > disabled by default we can work on improving it during stage1 then.
> > > >
> > > > I'm mostly struggling with the GIMPLE IL to be used for the
> > > > mask unpacking since we currently reject both the BIT_FIELD_REF
> > > > and the VIEW_CONVERT we generate (why do AVX512 masks not all have
> > > > SImode but sometimes QImode and sometimes HImode ...). Unfortunately
> > > > we've dropped whole-vector shifts in favor of VEC_PERM but that
> > > > doesn't work well either for integer mode vectors. So I'm still
> > > > playing with my options here and looking for something that doesn't
> > > > require too much surgery on the RTL side to recover good mask
> > > > register code ...
> > > >
> > > > Another part missing is expanders for the various cond_* patterns
> > > >
> > > > OPTAB_D (cond_add_optab, "cond_add$a")
> > > > OPTAB_D (cond_sub_optab, "cond_sub$a")
> > > > OPTAB_D (cond_smul_optab, "cond_mul$a")
> > > > OPTAB_D (cond_sdiv_optab, "cond_div$a")
> > > > OPTAB_D (cond_smod_optab, "cond_mod$a")
> > > > OPTAB_D (cond_udiv_optab, "cond_udiv$a")
> > > > OPTAB_D (cond_umod_optab, "cond_umod$a")
> > > > OPTAB_D (cond_and_optab, "cond_and$a")
> > > > OPTAB_D (cond_ior_optab, "cond_ior$a")
> > > > OPTAB_D (cond_xor_optab, "cond_xor$a")
> > > > OPTAB_D (cond_ashl_optab, "cond_ashl$a")
> > > > OPTAB_D (cond_ashr_optab, "cond_ashr$a")
> > > > OPTAB_D (cond_lshr_optab, "cond_lshr$a")
> > > > OPTAB_D (cond_smin_optab, "cond_smin$a")
> > > > OPTAB_D (cond_smax_optab, "cond_smax$a")
> > > > OPTAB_D (cond_umin_optab, "cond_umin$a")
> > > > OPTAB_D (cond_umax_optab, "cond_umax$a")
> > > > OPTAB_D (cond_fma_optab, "cond_fma$a")
> > > > OPTAB_D (cond_fms_optab, "cond_fms$a")
> > > > OPTAB_D (cond_fnma_optab, "cond_fnma$a")
> > > > OPTAB_D (cond_fnms_optab, "cond_fnms$a")
> > > >
> > > > I think the most useful are those for possibly trapping ops
> > > > (will be used by if-conversion) and those for reduction operations
> > > > (add,min,max) which would enable a masked reduction epilogue.
> > > I've added cond_add/sub/max/min/smax/smin with my local patch, but I
> > > can't figure out testcases to validate them.
> > > Any ideas?
> >
> > For example
> >
> > double a[1024], b[1024];
> >
> > void foo ()
> > {
> >   for (int i = 0; i < 1024; ++i)
> >     if (b[i] < 3.)
> >       a[i] = b[i] + 3.;
> > }
> >
> Oh, thanks
> the loop is successfully vectorized w/ cond_add expanders
>
>         .cfi_startproc
>         vbroadcastsd .LC1(%rip), %ymm1
>         xorl %eax, %eax
>         jmp .L3
>         .p2align 4,,10
>         .p2align 3
> .L2:
>         addq $32, %rax
>         cmpq $8192, %rax
>         je .L9
> .L3:
>         vmovapd b(%rax), %ymm0
>         vcmppd $1, %ymm1, %ymm0, %k1
>         kortestb %k1, %k1
>         je .L2
>         vaddpd %ymm1, %ymm0, %ymm2{%k1}{z}
>         vmovapd %ymm2, a(%rax){%k1}
>         addq $32, %rax
>         cmpq $8192, %rax
>         jne .L3
> .L9:
>         vzeroupper
>
> Here's dump
>
>   vector(4) double * vectp_a.10;
>   vector(4) double * vectp_a.9;
>   vector(4) double vect__2.8;
>   vector(4) <signed-boolean:1> mask__22.7;
>   vector(4) double vect__1.6;
>   vector(4) double * vectp_b.5;
>   vector(4) double * vectp_b.4;
>   int i;
>   double _1;
>   double _2;
>   unsigned int ivtmp_3;
>   unsigned int ivtmp_5;
>   _Bool _11;
>   _Bool _22;
>   double * _23;
>   vector(4) double vect_cst__28;
>   vector(4) double vect_cst__30;
>   vector(4) double vect_cst__31;
>   unsigned int ivtmp_36;
>   unsigned int ivtmp_37;
>
>   <bb 2> [local count: 10737416]:
>   _11 = 1;
>   vect_cst__28 = { 3.0e+0, 3.0e+0, 3.0e+0, 3.0e+0 };
>   vect_cst__30 = { 3.0e+0, 3.0e+0, 3.0e+0, 3.0e+0 };
>   vect_cst__31 = { 0.0, 0.0, 0.0, 0.0 };
>
>   <bb 3> [local count: 268435396]:
>   # i_12 = PHI <i_8(7), 0(2)>
>   # ivtmp_5 = PHI <ivtmp_3(7), 1024(2)>
>   # vectp_b.4_25 = PHI <vectp_b.4_26(7), &b(2)>
>   # vectp_a.9_33 = PHI <vectp_a.9_34(7), &a(2)>
>   # ivtmp_36 = PHI <ivtmp_37(7), 0(2)>
>   vect__1.6_27 = MEM <vector(4) double> [(double *)vectp_b.4_25];
>   _1 = b[i_12];
>   mask__22.7_29 = vect__1.6_27 < vect_cst__28;
>   if (mask__22.7_29 == { 0, 0, 0, 0 })
>     goto <bb 20>; [100.00%]
>   else
>     goto <bb 21>; [20.00%]
>
>   <bb 21> [local count: 53687078]:
>   vect__2.8_32 = .COND_ADD (mask__22.7_29, vect__1.6_27, vect_cst__30,
> vect_cst__31); <--- Here.
>   .MASK_STORE (vectp_a.9_33, 256B, mask__22.7_29, vect__2.8_32);
>
>   <bb 20> [local count: 268435396]:
>   i_8 = i_12 + 1;
>   ivtmp_3 = ivtmp_5 - 1;
>   vectp_b.4_26 = vectp_b.4_25 + 32;
>   vectp_a.9_34 = vectp_a.9_33 + 32;
>   ivtmp_37 = ivtmp_36 + 1;
>   if (ivtmp_37 < 256)
>     goto <bb 7>; [96.00%]
>   else
>     goto <bb 17>; [4.00%]
>
>   <bb 7> [local count: 257697980]:
>   goto <bb 3>; [100.00%]
>
>   <bb 17> [local count: 10737416]:
>   return;
Looks great!

Thanks,
Richard.
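
For readers following the quoted dump, here is a minimal scalar sketch of
what the .COND_ADD / .MASK_STORE pair computes per lane for the
"if (b[i] < 3.) a[i] = b[i] + 3.;" testcase. It is an illustration only and
not GCC code: the helper names cond_add, mask_store, masked_loop and
count_mismatches are made up for this example, the mask is modelled as the
low four bits of an unsigned (one bit per lane of a vector(4) double), and
the loop assumes the trip count is a multiple of four.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* vector(4) double, as in the quoted dump */

/* Hypothetical per-lane model of .COND_ADD (mask, x, y, else_val):
   active lanes get x + y, inactive lanes get else_val.  */
static void cond_add(unsigned mask, const double *x, const double *y,
                     const double *else_val, double *out)
{
  for (int l = 0; l < LANES; ++l)
    out[l] = ((mask >> l) & 1) ? x[l] + y[l] : else_val[l];
}

/* Hypothetical per-lane model of .MASK_STORE: only active lanes are
   written, inactive lanes of dst keep their old value.  */
static void mask_store(double *dst, unsigned mask, const double *val)
{
  for (int l = 0; l < LANES; ++l)
    if ((mask >> l) & 1)
      dst[l] = val[l];
}

/* Scalar reference: the testcase from the thread with a variable bound.  */
void scalar_loop(double *a, const double *b, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    if (b[i] < 3.)
      a[i] = b[i] + 3.;
}

/* Masked version mirroring the structure of the dump: compare, skip the
   block when no lane is active, else conditional add and masked store.  */
void masked_loop(double *a, const double *b, size_t n)
{
  const double three[LANES] = { 3., 3., 3., 3. };
  const double zero[LANES] = { 0., 0., 0., 0. };

  assert(n % LANES == 0); /* assumption of this sketch: no epilogue */
  for (size_t i = 0; i < n; i += LANES)
    {
      unsigned mask = 0;
      for (int l = 0; l < LANES; ++l)
        mask |= (unsigned) (b[i + l] < 3.) << l;
      if (mask == 0)
        continue; /* all lanes inactive, like the branch around bb 21 */

      double sum[LANES];
      cond_add(mask, &b[i], three, zero, sum);
      mask_store(&a[i], mask, sum);
    }
}

/* Runs both loops on a small input and counts differing elements.  */
int count_mismatches(void)
{
  const double b[8] = { 1., 5., 2.5, 3., -4., 7., 0., 2. };
  double a_ref[8], a_vec[8];
  int bad = 0;

  for (int i = 0; i < 8; ++i)
    a_ref[i] = a_vec[i] = 100. + i;
  scalar_loop(a_ref, b, 8);
  masked_loop(a_vec, b, 8);
  for (int i = 0; i < 8; ++i)
    if (a_ref[i] != a_vec[i])
      ++bad;
  return bad;
}
```

Calling count_mismatches() from a test driver is one way to check the model
against the scalar loop. The zero "else" operand never reaches memory in this
sketch because the masked store drops the inactive lanes, which is consistent
with the {z} zero-masking on vaddpd followed by the masked vmovapd in the
assembly quoted above.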