Hi, Richi.

  /* Keep track of the VF for each mode.  Initialize all to 0 which
     indicates a mode has not been analyzed.  */
  auto_vec<poly_uint64, 8> cached_vf_per_mode;
  for (unsigned i = 0; i < vector_modes.length (); ++i)
    cached_vf_per_mode.safe_push (0);

I saw this code: 'cached_vf_per_mode' is allocated with size 8.

But for RVV, I will need to push the following modes:

RVVM8QI, RVVM4QI, RVVM2QI, RVVM1QI, V128QI, V64QI, V32QI, V16QI, V8QI, V4QI, V2QI

That is 11 modes.
Should I increase the number from 8 to 11?

Thanks.


juzhe.zh...@rivai.ai

From: Richard Biener
Date: 2023-08-31 19:29
To: juzhe.zh...@rivai.ai
CC: gcc; richard.sandiford
Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:

> Hi. Thanks Richard and Richi.
>
> Now I have figured out how to choose a smaller LMUL.
>
> void
> costs::finish_cost (const vector_costs *scalar_costs)
> {
>   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
>   if (loop_vinfo)
>     {
>       if (loop_vinfo->vector_mode == RVVM8SImode
>           || riscv_v_ext_vls_mode_p (loop_vinfo->vector_mode))
>         {
>           m_costs[vect_prologue] = 8;
>           m_costs[vect_body] = 8;
>           m_costs[vect_epilogue] = 8;
>         }
>       else
>         {
>           m_costs[vect_prologue] = 1;
>           m_costs[vect_body] = 1;
>           m_costs[vect_epilogue] = 1;
>         }
>     }
>   // m_suggested_unroll_factor = 2;
>   vector_costs::finish_cost (scalar_costs);
> }

I don't think that's "good" use of the API.

> The previous odd code was caused by the VLS modes.
>
> Now, I can get LMUL = 4 by adjusting the cost.
>
>         vsetvli a5,a2,e32,m4,ta,ma
>         vle32.v v8,0(a0)
>         vle32.v v4,0(a1)
>         vsetvli a6,zero,e32,m4,ta,ma
>         slli    a3,a5,2
>         vadd.vv v4,v4,v8
>         sub     a2,a2,a5
>         vsetvli zero,a5,e32,m4,ta,ma
>         vse32.v v4,0(a4)
>         add     a0,a0,a3
>         add     a1,a1,a3
>         add     a4,a4,a3
>         bne     a2,zero,.L3
>
> Fantastic architecture of GCC Vector Cost model!
>
> Thanks a lot.
>
>
> juzhe.zh...@rivai.ai
>
> From: Richard Biener
> Date: 2023-08-31 19:20
> To: juzhe.zh...@rivai.ai
> CC: gcc; richard.sandiford
> Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
> On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
>
> > Thanks Richi.
> >
> > I am trying to figure out how to adjust finish_cost to lower the LMUL.
> >
> > For example:
> >
> > void
> > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > {
> >   for (int i = 0; i < n; i++)
> >     a[i] = a[i] + b[i];
> > }
> >
> > preferred_simd_mode picks LMUL = 8 (RVVM8SImode).
> >
> > Is it possible that we can adjust the cost in finish_cost to make the
> > loop vectorizer pick LMUL = 4?
>
> I see you have an autovectorize_vector_modes hook and you use
> VECT_COMPARE_COSTS.  So the appropriate place would be to
> amend your vector_costs::better_main_loop_than_p.
>
> > I am experimenting with the following cost:
> >
> >   if (loop_vinfo)
> >     {
> >       if (loop_vinfo->vector_mode == RVVM8SImode)
> >         {
> >           m_costs[vect_prologue] = 2;
> >           m_costs[vect_body] = 20;
> >           m_costs[vect_epilogue] = 2;
> >         }
> >       else
> >         {
> >           m_costs[vect_prologue] = 1;
> >           m_costs[vect_body] = 1;
> >           m_costs[vect_epilogue] = 1;
> >         }
> >     }
> >
> > I increased the LMUL = 8 cost. The codegen is odd:
> >
> > foo:
> >         ble     a2,zero,.L12
> >         addiw   a5,a2,-1
> >         li      a4,30
> >         sext.w  t1,a2
> >         bleu    a5,a4,.L7
> >         srliw   a7,t1,5
> >         slli    a7,a7,7
> >         li      a4,32
> >         add     a7,a7,a0
> >         mv      a5,a0
> >         mv      a3,a1
> >         vsetvli zero,a4,e32,m8,ta,ma
> > .L4:
> >         vle32.v v8,0(a5)
> >         vle32.v v16,0(a3)
> >         vadd.vv v8,v8,v16
> >         vse32.v v8,0(a5)
> >         addi    a5,a5,128
> >         addi    a3,a3,128
> >         bne     a5,a7,.L4
> >         andi    a2,a2,-32
> >         beq     t1,a2,.L14
> > .L3:
> >         slli    a4,a2,32
> >         subw    a5,t1,a2
> >         srli    a4,a4,32
> >         slli    a5,a5,32
> >         slli    a4,a4,2
> >         srli    a5,a5,32
> >         add     a0,a0,a4
> >         add     a1,a1,a4
> >         vsetvli a4,a5,e8,m1,ta,ma
> >         vle32.v v8,0(a0)
> >         vle32.v v4,0(a1)
> >         vsetvli a2,zero,e32,m4,ta,ma
> >         vadd.vv v4,v4,v8
> >         vsetvli zero,a5,e32,m4,ta,ma
> >         vse32.v v4,0(a0)
> >         sub     a3,a5,a4
> >         beq     a5,a4,.L12
> >         slli    a4,a4,2
> >         vsetvli zero,a3,e8,m1,ta,ma
> >         add     a0,a0,a4
> >         add     a1,a1,a4
> >         vle32.v v4,0(a0)
> >         vle32.v v8,0(a1)
> >         vsetvli a2,zero,e32,m4,ta,ma
> >         vadd.vv v4,v4,v8
> >         vsetvli zero,a3,e32,m4,ta,ma
> >         vse32.v v4,0(a0)
> > .L12:
> >         ret
> > .L7:
> >         li      a2,0
> >         j       .L3
> > .L14:
> >         ret
> >
> > I hope it can generate code like this:
> >
> > foo:
> >         ble     a2,zero,.L5
> >         mv      a4,a0
> > .L3:
> >         vsetvli a5,a2,e32,m4,ta,ma
> >         vle32.v v8,0(a0)
> >         vle32.v v4,0(a1)
> >         vsetvli a6,zero,e32,m4,ta,ma
> >         slli    a3,a5,2
> >         vadd.vv v4,v4,v8
> >         sub     a2,a2,a5
> >         vsetvli zero,a5,e32,m4,ta,ma
> >         vse32.v v4,0(a4)
> >         add     a0,a0,a3
> >         add     a1,a1,a3
> >         add     a4,a4,a3
> >         bne     a2,zero,.L3
> > .L5:
> >         ret
> >
> > I am experimenting with whether we can adjust the cost statically to
> > make the loop vectorizer use LMUL = 4 even though preferred_simd_mode
> > returns LMUL = 8.
> > If we can do that, I think we can apply analysis and then adjust the
> > cost according to the analysis.
> >
> > Thanks.
> >
> >
> > juzhe.zh...@rivai.ai
> >
> > From: Richard Biener
> > Date: 2023-08-31 15:38
> > To: juzhe.zh...@rivai.ai
> > CC: gcc; richard.sandiford
> > Subject: Re: Question about dynamic choosing vectorization factor for RVV
> > On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
> >
> > > Hi, Richard and Richi.
> > >
> > > Currently, we are statically returning the vectorization factor in
> > > 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> > > according to the compile option.
> > >
> > > For example:
> > > void
> > > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > > {
> > >   for (int i = 0; i < n; i++)
> > >     a[i] = a[i] + b[i];
> > > }
> > >
> > > with --param=riscv-autovec-lmul=m1:
> > >
> > >         vsetvli a5,a2,e32,m1,ta,ma
> > >         vle32.v v2,0(a0)
> > >         vle32.v v1,0(a1)
> > >         vsetvli a6,zero,e32,m1,ta,ma
> > >         slli    a3,a5,2
> > >         vadd.vv v1,v1,v2
> > >         sub     a2,a2,a5
> > >         vsetvli zero,a5,e32,m1,ta,ma
> > >         vse32.v v1,0(a4)
> > >         add     a0,a0,a3
> > >         add     a1,a1,a3
> > >         add     a4,a4,a3
> > >         bne     a2,zero,.L3
> > >
> > > The 'vadd.vv' is only performing operations on a single register.
> > >
> > > with --param=riscv-autovec-lmul=m8:
> > >
> > >         vsetvli a5,a2,e8,m2,ta,ma
> > >         vle32.v v16,0(a0)
> > >         vle32.v v8,0(a1)
> > >         vsetvli a6,zero,e32,m8,ta,ma
> > >         slli    a3,a5,2
> > >         vadd.vv v8,v8,v16
> > >         vsetvli zero,a2,e32,m8,ta,ma
> > >         sub     a2,a2,a5
> > >         vse32.v v8,0(a4)
> > >         add     a0,a0,a3
> > >         add     a1,a1,a3
> > >         add     a4,a4,a3
> > >         bne     a2,zero,.L3
> > >
> > > The 'vadd.vv' here is performing operations on 8 consecutive registers:
> > >
> > > vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
> > >
> > > Having users statically set the vectorization factor is not ideal.
> > >
> > > We want GCC to dynamically choose the vectorization factor for
> > > auto-vectorization according to loop analysis.
> > >
> > > Currently, I have implemented a simplistic loop analysis that analyzes
> > > the live range of each local decl of the current function.
> > >
> > > Here is the analysis: we have 32 vector registers for RVV.
> > > So we calculate the live ranges of the current function's local decls:
> > >
> > > the number of decls live at the same time * LMUL <= 32.
> > > According to this analysis, I set the vectorization factor in
> > > TARGET_VECTORIZE_PREFERRED_SIMD_MODE.
> > >
> > > This simplistic algorithm (implemented in the RISC-V backend) works
> > > well for the testcases I produced.
> > >
> > > However, I can only choose the optimal vectorization factor for the
> > > whole function; it fails for a specific loop.
> > >
> > > Here is the example:
> > >
> > > void foo2 (int32_t *__restrict a,
> > >            int32_t *__restrict b,
> > >            int32_t *__restrict c,
> > >            int32_t *__restrict a2,
> > >            int32_t *__restrict b2,
> > >            int32_t *__restrict c2,
> > >            int32_t *__restrict a3,
> > >            int32_t *__restrict b3,
> > >            int32_t *__restrict c3,
> > >            int32_t *__restrict a4,
> > >            int32_t *__restrict b4,
> > >            int32_t *__restrict c4,
> > >            int32_t *__restrict a5,
> > >            int32_t *__restrict b5,
> > >            int32_t *__restrict c5,
> > >            int n)
> > > {
> > >   // Loop 1
> > >   for (int i = 0; i < n; i++)
> > >     a[i] = a[i] + b[i];
> > >   // Loop 2
> > >   for (int i = 0; i < n; i++){
> > >     a[i] = b[i] + c[i];
> > >     a2[i] = b2[i] + c2[i];
> > >     a3[i] = b3[i] + c3[i];
> > >     a4[i] = b4[i] + c4[i];
> > >     a5[i] = a[i] + a4[i];
> > >     a[i] = a3[i] + a2[i]+ a5[i];
> > >   }
> > > }
> > >
> > > For Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should
> > > choose LMUL = 4 (since LMUL = 8 will cause vector register spilling).
> > >
> > > If I split loop 1 and loop 2 into 2 separate functions, my algorithm
> > > works well.
> > >
> > > However, if we put these 2 loops in the same function, I end up
> > > picking LMUL = 4 for both loop 1 and loop 2 since, as I said above,
> > > I do the analysis based on the function, not on the loop.
> > >
> > > I am struggling to find a good solution for this issue. Can we
> > > pass loop_vec_info through
> > > to the 'preferred_simd_mode' target hook?
> >
> > That's not how it's currently designed to work - there's
> > the autovectorize_vector_modes hook where you should provide a vector
> > of modes the vectorizer iterates over and return VECT_COMPARE_COSTS
> > if you want to evaluate costs between choices.  Your analysis should
> > then happen in the finish_cost method.
> >
> > That's how it's currently designed.  It might not be optimal for
> > compile-time reasons when there are many modes, giving the target
> > more control (and context) might be possible.
> >
> > Richard.
> >
> > > Thanks.
> > >
> > >
> > > juzhe.zh...@rivai.ai
> > >
> >
>

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
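
For reference, the register-pressure rule quoted in the original question (the number of decls live at the same time * LMUL <= 32) can be sketched as a small standalone function. This is only an illustration under stated assumptions, not code from the RISC-V backend: the name pick_lmul, the candidate list {8, 4, 2, 1}, and the fallback to LMUL = 1 are all hypothetical, and the live-value count is assumed to come from a per-loop analysis that the thread does not show.

```cpp
#include <array>

// RVV provides 32 vector registers (as stated in the thread).
constexpr unsigned kNumVectorRegs = 32;

// Hypothetical helper, not a GCC API: return the largest LMUL in
// {8, 4, 2, 1} satisfying  max_live_vectors * LMUL <= 32.
// If even LMUL = 1 does not fit, fall back to 1 (an assumption made
// here for illustration; spilling would then be unavoidable).
constexpr unsigned pick_lmul (unsigned max_live_vectors)
{
  constexpr std::array<unsigned, 4> candidates = {8, 4, 2, 1};
  for (unsigned lmul : candidates)
    if (max_live_vectors * lmul <= kNumVectorRegs)
      return lmul;
  return 1;
}
```

Applied per loop rather than per function, a loop with 2 simultaneously live vector values (such as Loop 1, which holds a[i] and b[i]) would get LMUL = 8, while a loop with, say, 6 live values (a made-up count standing in for Loop 2) would get LMUL = 4, because 6 * 8 exceeds 32 but 6 * 4 does not.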