On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:

> Thanks Richi.
>
> I am trying to figure out how to adjust finish_cost to lower the LMUL.
>
> For example:
>
> void
> foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> {
>   for (int i = 0; i < n; i++)
>     a[i] = a[i] + b[i];
> }
>
> preferred_simd_mode picks LMUL = 8 (RVVM8SImode).
>
> Is it possible to adjust the cost in finish_cost to make the loop
> vectorizer pick LMUL = 4?

I see you have an autovectorize_vector_modes hook and you use
VECT_COMPARE_COSTS.  So the appropriate place would be to amend
your vector_costs::better_main_loop_than_p.

> I am experimenting with the following cost:
>
>   if (loop_vinfo)
>     {
>       if (loop_vinfo->vector_mode == RVVM8SImode)
>         {
>           m_costs[vect_prologue] = 2;
>           m_costs[vect_body] = 20;
>           m_costs[vect_epilogue] = 2;
>         }
>       else
>         {
>           m_costs[vect_prologue] = 1;
>           m_costs[vect_body] = 1;
>           m_costs[vect_epilogue] = 1;
>         }
>     }
>
> I increased the LMUL = 8 cost. The codegen is odd:
>
> foo:
>         ble a2,zero,.L12
>         addiw a5,a2,-1
>         li a4,30
>         sext.w t1,a2
>         bleu a5,a4,.L7
>         srliw a7,t1,5
>         slli a7,a7,7
>         li a4,32
>         add a7,a7,a0
>         mv a5,a0
>         mv a3,a1
>         vsetvli zero,a4,e32,m8,ta,ma
> .L4:
>         vle32.v v8,0(a5)
>         vle32.v v16,0(a3)
>         vadd.vv v8,v8,v16
>         vse32.v v8,0(a5)
>         addi a5,a5,128
>         addi a3,a3,128
>         bne a5,a7,.L4
>         andi a2,a2,-32
>         beq t1,a2,.L14
> .L3:
>         slli a4,a2,32
>         subw a5,t1,a2
>         srli a4,a4,32
>         slli a5,a5,32
>         slli a4,a4,2
>         srli a5,a5,32
>         add a0,a0,a4
>         add a1,a1,a4
>         vsetvli a4,a5,e8,m1,ta,ma
>         vle32.v v8,0(a0)
>         vle32.v v4,0(a1)
>         vsetvli a2,zero,e32,m4,ta,ma
>         vadd.vv v4,v4,v8
>         vsetvli zero,a5,e32,m4,ta,ma
>         vse32.v v4,0(a0)
>         sub a3,a5,a4
>         beq a5,a4,.L12
>         slli a4,a4,2
>         vsetvli zero,a3,e8,m1,ta,ma
>         add a0,a0,a4
>         add a1,a1,a4
>         vle32.v v4,0(a0)
>         vle32.v v8,0(a1)
>         vsetvli a2,zero,e32,m4,ta,ma
>         vadd.vv v4,v4,v8
>         vsetvli zero,a3,e32,m4,ta,ma
>         vse32.v v4,0(a0)
> .L12:
>         ret
> .L7:
>         li a2,0
>         j .L3
> .L14:
>         ret
>
> I hope it can generate code like this:
>
> foo:
>         ble a2,zero,.L5
>         mv a4,a0
> .L3:
>         vsetvli a5,a2,e32,m4,ta,ma
>         vle32.v v8,0(a0)
>         vle32.v v4,0(a1)
>         vsetvli a6,zero,e32,m4,ta,ma
>         slli a3,a5,2
>         vadd.vv v4,v4,v8
>         sub a2,a2,a5
>         vsetvli zero,a5,e32,m4,ta,ma
>         vse32.v v4,0(a4)
>         add a0,a0,a3
>         add a1,a1,a3
>         add a4,a4,a3
>         bne a2,zero,.L3
> .L5:
>         ret
>
> I am experimenting with whether we can adjust the cost statically to make
> the loop vectorizer use LMUL = 4 even though preferred_simd_mode
> returns LMUL = 8.
> If we can do that, I think we can apply analysis and then adjust the
> cost according to the analysis.
>
> Thanks.
>
>
> juzhe.zh...@rivai.ai
>
> From: Richard Biener
> Date: 2023-08-31 15:38
> To: juzhe.zh...@rivai.ai
> CC: gcc; richard.sandiford
> Subject: Re: Question about dynamic choosing vectorization factor for RVV
> On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
>
> > Hi, Richard and Richi.
> >
> > Currently, we are statically returning the vectorization factor in
> > 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> > according to the compile option.
> >
> > For example:
> > void
> > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > {
> >   for (int i = 0; i < n; i++)
> >     a[i] = a[i] + b[i];
> > }
> >
> > with --param=riscv-autovec-lmul=m1:
> >
> >         vsetvli a5,a2,e32,m1,ta,ma
> >         vle32.v v2,0(a0)
> >         vle32.v v1,0(a1)
> >         vsetvli a6,zero,e32,m1,ta,ma
> >         slli a3,a5,2
> >         vadd.vv v1,v1,v2
> >         sub a2,a2,a5
> >         vsetvli zero,a5,e32,m1,ta,ma
> >         vse32.v v1,0(a4)
> >         add a0,a0,a3
> >         add a1,a1,a3
> >         add a4,a4,a3
> >         bne a2,zero,.L3
> >
> > The 'vadd.vv' is only performing operations on a single register.
> >
> > with --param=riscv-autovec-lmul=m8:
> >
> >         vsetvli a5,a2,e8,m2,ta,ma
> >         vle32.v v16,0(a0)
> >         vle32.v v8,0(a1)
> >         vsetvli a6,zero,e32,m8,ta,ma
> >         slli a3,a5,2
> >         vadd.vv v8,v8,v16
> >         vsetvli zero,a2,e32,m8,ta,ma
> >         sub a2,a2,a5
> >         vse32.v v8,0(a4)
> >         add a0,a0,a3
> >         add a1,a1,a3
> >         add a4,a4,a3
> >         bne a2,zero,.L3
> >
> > The 'vadd.vv' here is performing operations on 8 consecutive registers:
> >
> >         vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
> >
> > Having users set the vectorization factor statically is not ideal.
> >
> > We want GCC to dynamically choose the vectorization factor for
> > auto-vectorization according to loop analysis.
> >
> > Currently, I have implemented a simplistic loop analysis that analyzes
> > the live range of each local decl of the current function.
> >
> > Here is the analysis: we have 32 vector registers for RVV.
> > So we calculate the live ranges of the current function's local decls:
> >
> > the number of decls live at the same time * LMUL <= 32.
> > According to this analysis, I set the vectorization factor in
> > TARGET_VECTORIZE_PREFERRED_SIMD_MODE
> >
> > This simplistic algorithm (implemented in the RISC-V backend) works well
> > for the testcases I produced.
> >
> > However, I can only choose the optimal vectorization factor for a whole
> > function, not for a specific loop.
> >
> > Here is the example:
> >
> > void foo2 (int32_t *__restrict a,
> >            int32_t *__restrict b,
> >            int32_t *__restrict c,
> >            int32_t *__restrict a2,
> >            int32_t *__restrict b2,
> >            int32_t *__restrict c2,
> >            int32_t *__restrict a3,
> >            int32_t *__restrict b3,
> >            int32_t *__restrict c3,
> >            int32_t *__restrict a4,
> >            int32_t *__restrict b4,
> >            int32_t *__restrict c4,
> >            int32_t *__restrict a5,
> >            int32_t *__restrict b5,
> >            int32_t *__restrict c5,
> >            int n)
> > {
> >   // Loop 1
> >   for (int i = 0; i < n; i++)
> >     a[i] = a[i] + b[i];
> >   // Loop 2
> >   for (int i = 0; i < n; i++){
> >     a[i] = b[i] + c[i];
> >     a2[i] = b2[i] + c2[i];
> >     a3[i] = b3[i] + c3[i];
> >     a4[i] = b4[i] + c4[i];
> >     a5[i] = a[i] + a4[i];
> >     a[i] = a3[i] + a2[i]+ a5[i];
> >   }
> > }
> >
> > For Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should choose
> > LMUL = 4 (since LMUL = 8 will cause vector register spilling).
> >
> > If I split loop 1 and loop 2 into 2 separate functions, my algorithm works
> > well.
> >
> > However, if we put these 2 loops in the same function, I end up picking
> > LMUL = 4 for both loop 1 and loop 2 since, as I said above, I do the
> > analysis based on the function, not on the loop.
> >
> > I am struggling to find a good solution for this issue. Can we
> > pass loop_vec_info through
> > to the 'preferred_simd_mode' target hook?
>
> That's not how it's currently designed to work - there's
> the autovectorize_vector_modes hook where you should provide a vector
> of modes the vectorizer iterates over and return VECT_COMPARE_COSTS
> if you want to evaluate costs between choices.  Your analysis should
> then happen in the finish_cost method.
>
> That's how it's currently designed.  It might not be optimal for
> compile-time reasons when there are many modes; giving the target
> more control (and context) might be possible.
>
> Richard.
>
> > Thanks.
> >
> >
> > juzhe.zh...@rivai.ai
> >
> >

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
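
The register-pressure rule quoted in the thread (the number of decls live
at the same time * LMUL <= 32) can be sketched as a small standalone C++
model.  This is only an illustration under stated assumptions, not GCC
code: pick_lmul, pick_lmul_for_function and pick_lmul_per_loop are
hypothetical names, and the peak live counts passed in are assumed inputs
rather than anything the vectorizer computes this way.  It contrasts a
function-wide decision with the per-loop decision that comparing costs
per candidate mode would allow.

```cpp
#include <algorithm>
#include <cassert>
#include <initializer_list>
#include <vector>

// RVV provides 32 vector registers (as stated in the thread).
constexpr int kNumVectorRegs = 32;

// Hypothetical helper, not a GCC API: return the largest LMUL in
// {8, 4, 2, 1} satisfying  peak_live * LMUL <= 32.  Falls back to 1
// when even LMUL = 1 does not fit (spilling is unavoidable in this model).
int pick_lmul(int peak_live)
{
  assert(peak_live >= 0);
  for (int lmul : {8, 4, 2, 1})
    if (peak_live * lmul <= kNumVectorRegs)
      return lmul;
  return 1;
}

// Function-wide choice: one LMUL for every loop, driven by the loop
// with the highest peak number of live vector values.
int pick_lmul_for_function(const std::vector<int> &peak_live_per_loop)
{
  int worst = 0;
  for (int live : peak_live_per_loop)
    worst = std::max(worst, live);
  return pick_lmul(worst);
}

// Per-loop choice: each loop gets the largest LMUL that fits its own peak.
std::vector<int> pick_lmul_per_loop(const std::vector<int> &peak_live_per_loop)
{
  std::vector<int> result;
  result.reserve(peak_live_per_loop.size());
  for (int live : peak_live_per_loop)
    result.push_back(pick_lmul(live));
  return result;
}
```

With assumed peaks of 2 live values for a simple loop and 6 for a heavier
one (illustrative numbers, not measured from foo2), the function-wide
choice is LMUL = 4 for both loops, while the per-loop choice gives
LMUL = 8 and LMUL = 4 respectively.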