On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:

> Hi, Richard and Richi.
>
> Currently, we statically return the vectorization factor in
> 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> according to a compile option.
>
> For example:
> void
> foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> {
>   for (int i = 0; i < n; i++)
>     a[i] = a[i] + b[i];
> }
>
> with --param=riscv-autovec-lmul=m1:
>
>         vsetvli a5,a2,e32,m1,ta,ma
>         vle32.v v2,0(a0)
>         vle32.v v1,0(a1)
>         vsetvli a6,zero,e32,m1,ta,ma
>         slli    a3,a5,2
>         vadd.vv v1,v1,v2
>         sub     a2,a2,a5
>         vsetvli zero,a5,e32,m1,ta,ma
>         vse32.v v1,0(a4)
>         add     a0,a0,a3
>         add     a1,a1,a3
>         add     a4,a4,a3
>         bne     a2,zero,.L3
>
> The 'vadd.vv' is only performing operations on a single register.
>
> with --param=riscv-autovec-lmul=m8:
>
>         vsetvli a5,a2,e8,m2,ta,ma
>         vle32.v v16,0(a0)
>         vle32.v v8,0(a1)
>         vsetvli a6,zero,e32,m8,ta,ma
>         slli    a3,a5,2
>         vadd.vv v8,v8,v16
>         vsetvli zero,a2,e32,m8,ta,ma
>         sub     a2,a2,a5
>         vse32.v v8,0(a4)
>         add     a0,a0,a3
>         add     a1,a1,a3
>         add     a4,a4,a3
>         bne     a2,zero,.L3
>
> The 'vadd.vv' here is performing operations on 8 consecutive registers:
>
> vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
>
> Having users statically set the vectorization factor is not ideal.
>
> We want GCC to dynamically choose the vectorization factor for
> auto-vectorization according to loop analysis.
>
> Currently, I have implemented a simplistic loop analysis that analyzes the
> live range of each local decl of the current function.
>
> Here is the analysis: we have 32 vector registers for RVV.
> So we calculate the live ranges of the current function's local decls:
>
>     the number of decls live at the same time * LMUL <= 32.
>
> According to this analysis, I set the vectorization factor in
> TARGET_VECTORIZE_PREFERRED_SIMD_MODE.
>
> This simplistic algorithm (implemented in the RISC-V backend) works well for
> the testcases I produced.
>
> However, I can only choose the optimal vectorization factor for the whole
> function, not for a specific loop.
>
> Here is the example:
>
> void foo2 (int32_t *__restrict a,
>           int32_t *__restrict b,
>           int32_t *__restrict c,
>           int32_t *__restrict a2,
>           int32_t *__restrict b2,
>           int32_t *__restrict c2,
>           int32_t *__restrict a3,
>           int32_t *__restrict b3,
>           int32_t *__restrict c3,
>           int32_t *__restrict a4,
>           int32_t *__restrict b4,
>           int32_t *__restrict c4,
>           int32_t *__restrict a5,
>           int32_t *__restrict b5,
>           int32_t *__restrict c5,
>           int n)
> {
>   // Loop 1
>   for (int i = 0; i < n; i++)
>     a[i] = a[i] + b[i];
>   // Loop 2
>   for (int i = 0; i < n; i++){
>     a[i] = b[i] + c[i];
>     a2[i] = b2[i] + c2[i];
>     a3[i] = b3[i] + c3[i];
>     a4[i] = b4[i] + c4[i];
>     a5[i] = a[i] + a4[i];
>     a[i] = a3[i] + a2[i] + a5[i];
>   }
> }
>
> For Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should choose
> LMUL = 4 (since LMUL = 8 will cause vector register spills).
>
> If I split loop 1 and loop 2 into 2 separate functions, my algorithm works
> well.
>
> However, if we put these 2 loops in the same function, I end up picking
> LMUL = 4 for both loop 1 and loop 2 since, as I said above, I do the analysis
> based on the function, not on the loop.
>
> I am struggling to find a good solution for this issue. Can we pass
> loop_vec_info through to the 'preferred_simd_mode' target hook?
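
For reference, the register-pressure rule quoted above (live decls * LMUL
<= 32) can be sketched as a small standalone C function.  This is only an
illustration under stated assumptions, not the backend code: pick_lmul and
NUM_VECTOR_REGS are hypothetical names, the candidate set {8, 4, 2, 1}
ignores fractional LMUL, and the caller is assumed to have already computed
the maximum number of simultaneously live vector values for one loop.

```c
#include <stddef.h>

/* RVV provides 32 architectural vector registers.  */
enum { NUM_VECTOR_REGS = 32 };

/* Hypothetical helper: return the largest LMUL in {8, 4, 2, 1} for which
   max_live * LMUL <= NUM_VECTOR_REGS.  If even LMUL = 1 does not fit,
   fall back to 1, since spills are unavoidable in that case.  */
static int
pick_lmul (int max_live)
{
  static const int candidates[] = { 8, 4, 2, 1 };

  for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++)
    if (max_live * candidates[i] <= NUM_VECTOR_REGS)
      return candidates[i];

  return 1;
}
```

Applied per loop rather than per function, a loop with at most 4 live
values would get LMUL = 8, while one with 5 to 8 live values would get
LMUL = 4.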

That's not how it's currently designed to work - there's
the autovectorize_vector_modes hook where you should provide a vector
of modes the vectorizer iterates over and return VECT_COMPARE_COSTS
if you want to evaluate costs between choices.  Your analysis should
then happen in the finish_cost method.

That's how it's currently designed.  It might not be optimal for
compile-time reasons when there are many modes; giving the target
more control (and context) might be possible.

Richard.

> Thanks.
>
>
> juzhe.zh...@rivai.ai
>

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
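
To illustrate the cost-comparison approach described in the reply (try
several candidate modes per loop and keep the cheapest), here is a toy
standalone model in C.  It does not use GCC's real interfaces: struct
loop_model, loop_cost, choose_lmul, RVV_REGS and SPILL_PENALTY are all
hypothetical, and the cost formula and the live-value counts used in the
usage note are assumptions made up for the sketch.

```c
#include <limits.h>
#include <stddef.h>

enum { RVV_REGS = 32, SPILL_PENALTY = 1000 };

/* Hypothetical per-loop summary a target cost model might collect.  */
struct loop_model
{
  int max_live;    /* vector values live at the same time */
  int body_insns;  /* vector instructions in the loop body */
};

/* Toy stand-in for a per-candidate cost evaluation: the body cost is
   amortized over LMUL registers, and every register needed beyond the
   32 available is charged a large spill penalty.  */
static int
loop_cost (const struct loop_model *loop, int lmul)
{
  int cost = loop->body_insns * 8 / lmul;
  int needed = loop->max_live * lmul;

  if (needed > RVV_REGS)
    cost += (needed - RVV_REGS) * SPILL_PENALTY;
  return cost;
}

/* Iterate over the candidates, as the vectorizer iterates over modes,
   and keep the cheapest one for this particular loop.  */
static int
choose_lmul (const struct loop_model *loop)
{
  static const int candidates[] = { 8, 4, 2, 1 };
  int best = 1, best_cost = INT_MAX;

  for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++)
    {
      int cost = loop_cost (loop, candidates[i]);
      if (cost < best_cost)
        {
          best_cost = cost;
          best = candidates[i];
        }
    }
  return best;
}
```

Because the choice is made per loop, a loop modelled with 2 live values
selects LMUL = 8 while one modelled with 6 live values selects LMUL = 4,
even when both sit in the same function.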