Can I push code to GCC?
Hi, I am a compiler engineer working on GCC in China. Recently I developed a new pattern for vector widening multiply-accumulate (I call it "widen fma") because the ISA in my project has such an instruction (vwmacc). I have developed a lot of code in GCC, especially in the front end (GIMPLE and GENERIC). This is useful for my project, but I am not sure whether it is useful for GCC overall. Can I submit the code? Maybe you could then review it and help refine it to better quality. Thank you! juzhe.zh...@rivai.ai
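A minimal C sketch (an illustration added here, not part of the original mail; the function name is invented) of the loop shape such a widening multiply-accumulate ("widen fma" / vwmacc-style) pattern is typically meant to recognize: two narrow operands are multiplied and the widened product is accumulated into a wider destination.

#include <stdint.h>

/* Illustrative only: multiply two 16-bit elements and accumulate the
   widened 32-bit product, the statement shape a vwmacc-style widening
   multiply-accumulate pattern targets.  */
void
widen_mac (int32_t *__restrict acc, const int16_t *__restrict a,
           const int16_t *__restrict b, int n)
{
  for (int i = 0; i < n; i++)
    acc[i] += (int32_t) a[i] * (int32_t) b[i];
}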
Re: Ju-Zhe Zhong and Robin Dapp as RISC-V reviewers
Thanks Jeff. I will wait until Robin has updated his MAINTAINERS entry (since I don't know what information I need to put in). juzhe.zh...@rivai.ai From: Jeff Law Date: 2023-07-18 00:54 To: GCC Development CC: juzhe.zh...@rivai.ai; Robin Dapp Subject: Ju-Zhe Zhong and Robin Dapp as RISC-V reviewers I am pleased to announce that the GCC Steering Committee has appointed Ju-Zhe Zhong and Robin Dapp as reviewers for the RISC-V port. Ju-Zhe and Robin, can you both update your MAINTAINERS entries appropriately? Thanks, Jeff
Question about wrapv-vect-reduc-dot-s8b.c
Hi, I am starting to enable the "vect" testsuite for RISC-V. I have a question when analyzing 'wrapv-vect-reduc-dot-s8b.c'. It failed at: FAIL: gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c scan-tree-dump-times vect "vect_recog_dot_prod_pattern: detected" 1 FAIL: gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c scan-tree-dump-times vect "vect_recog_widen_mult_pattern: detected" 1 They are each found "2" times, because the first attempt failed at the vectorization of the conversion: wrapv-vect-reduc-dot-s8b.c:29:14: missed: conversion not supported by target. wrapv-vect-reduc-dot-s8b.c:29:14: note: vect_is_simple_use: operand X[i_14], type of def: internal wrapv-vect-reduc-dot-s8b.c:29:14: note: vect_is_simple_use: vectype vector([16,16]) signed char wrapv-vect-reduc-dot-s8b.c:29:14: note: vect_is_simple_use: operand X[i_14], type of def: internal wrapv-vect-reduc-dot-s8b.c:29:14: note: vect_is_simple_use: vectype vector([16,16]) signed char wrapv-vect-reduc-dot-s8b.c:30:17: missed: not vectorized: relevant stmt not supported: _2 = (short int) _1; wrapv-vect-reduc-dot-s8b.c:29:14: missed: bad operation or unsupported loop bound. Here the loop vectorizer is trying to do the conversion from char -> short with both vector types having the same nunits. But we don't support the 'vec_unpack' patterns in the RISC-V backend, since I don't see a case where vec_unpack improves the codegen of auto-vectorization for RVV. To fix it, is it necessary to support 'vec_unpack'? Thanks. juzhe.zh...@rivai.ai
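A reduced C sketch of the statement in question (an illustration, not the actual testcase; the function name is invented): a signed char value is widened to short before the multiply, and the vectorizer has to implement that widening either as a plain sign-extension between vector types with matching element counts or via vec_unpacks_{lo,hi}.

#include <stdint.h>

/* Sketch only: the widening conversion _2 = (short int) _1 feeding a
   dot-product style reduction.  */
int16_t
dot_like (const int8_t *X, const int8_t *Y, int n)
{
  int16_t sum = 0;
  for (int i = 0; i < n; i++)
    sum += (int16_t) X[i] * (int16_t) Y[i];
  return sum;
}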
Re: Re: Question about wrapv-vect-reduc-dot-s8b.c
Thanks Richi. >> both same units would be sext, not vec_unpacks_{lo,hi} - the vectorizer Sorry, I made a mistake here. They are not the same nunits. wrapv-vect-reduc-dot-s8b.c:29:14: note: get vectype for scalar type: short int wrapv-vect-reduc-dot-s8b.c:29:14: note: vectype: vector([8,8]) short int wrapv-vect-reduc-dot-s8b.c:29:14: note: nunits = [8,8] wrapv-vect-reduc-dot-s8b.c:29:14: note: ==> examining statement: _1 = X[i_14]; wrapv-vect-reduc-dot-s8b.c:29:14: note: precomputed vectype: vector([16,16]) signed char wrapv-vect-reduc-dot-s8b.c:29:14: note: nunits = [16,16] wrapv-vect-reduc-dot-s8b.c:29:14: note: ==> examining statement: _2 = (short int) _1; wrapv-vect-reduc-dot-s8b.c:29:14: note: get vectype for scalar type: short int wrapv-vect-reduc-dot-s8b.c:29:14: note: vectype: vector([8,8]) short int wrapv-vect-reduc-dot-s8b.c:29:14: note: get vectype for smallest scalar type: signed char wrapv-vect-reduc-dot-s8b.c:29:14: note: nunits vectype: vector([16,16]) signed char wrapv-vect-reduc-dot-s8b.c:29:14: note: nunits = [16,16] It turns out that in the first-round analysis it picks vector([8,8]) short int for _2 and vector([16,16]) signed char for _1. It seems the first-round analysis failed because we don't support vec_unpacks? Then we end up hitting these 2 checks "2" times: > FAIL: gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c scan-tree-dump-times vect > "vect_recog_dot_prod_pattern: detected" 1 > FAIL: gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c scan-tree-dump-times vect > "vect_recog_widen_mult_pattern: detected" 1 juzhe.zh...@rivai.ai From: Richard Biener Date: 2023-08-30 15:45 To: juzhe.zh...@rivai.ai CC: gcc; Robin Dapp Subject: Re: Question about wrapv-vect-reduc-dot-s8b.c On Wed, 30 Aug 2023, juzhe.zh...@rivai.ai wrote: > Hi, I start to enable "vect" testsuite for RISC-V. > > I have a question when analyzing the 'wrapv-vect-reduc-dot-s8b.c' > It failed at: > FAIL: gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c scan-tree-dump-times vect > "vect_recog_dot_prod_pattern: detected" 1 > FAIL: gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c scan-tree-dump-times vect > "vect_recog_widen_mult_pattern: detected" 1 > > They are found "2" times. > > Since at the first time, it failed at the vectorization of conversion: > > wrapv-vect-reduc-dot-s8b.c:29:14: missed: conversion not supported by > target. > wrapv-vect-reduc-dot-s8b.c:29:14: note: vect_is_simple_use: operand > X[i_14], type of def: internal > wrapv-vect-reduc-dot-s8b.c:29:14: note: vect_is_simple_use: vectype > vector([16,16]) signed char > wrapv-vect-reduc-dot-s8b.c:29:14: note: vect_is_simple_use: operand > X[i_14], type of def: internal > wrapv-vect-reduc-dot-s8b.c:29:14: note: vect_is_simple_use: vectype > vector([16,16]) signed char > wrapv-vect-reduc-dot-s8b.c:30:17: missed: not vectorized: relevant stmt not > supported: _2 = (short int) _1; > wrapv-vect-reduc-dot-s8b.c:29:14: missed: bad operation or unsupported loop > bound. > > Here loop vectorizer is trying to do the conversion from char -> short with > both same nunits. > But we don't support 'vec_unpack' stuff in RISC-V backend since I don't see > the case that vec_unpack can optimize the codegen of autovectorizatio for RVV. > > To fix it, is it necessary to support 'vec_unpack' ? both same units would be sext, not vec_unpacks_{lo,hi} - the vectorizer ties its hands by choosing vector types early and based on the number of incoming/outgoing vectors it chooses one or the other method. More precise dumping would probably help here but somewhere earlier you should be able to see the vector type used for _2. Richard.
Re: Re: Question about wrapv-vect-reduc-dot-s8b.c
I am wondering whether there are situations where the vec_pack/vec_unpack/vec_widen_xxx/dot_prod patterns can be beneficial for RVV? I have met a situation where vec_unpack was beneficial when working on SELECT_VL, but I don't remember which case.... juzhe.zh...@rivai.ai From: Robin Dapp Date: 2023-08-30 16:06 To: Richard Biener; juzhe.zh...@rivai.ai CC: rdapp.gcc; gcc Subject: Re: Question about wrapv-vect-reduc-dot-s8b.c >> To fix it, is it necessary to support 'vec_unpack' ? > > both same units would be sext, not vec_unpacks_{lo,hi} - the vectorizer > ties its hands by choosing vector types early and based on the number > of incoming/outgoing vectors it chooses one or the other method. > > More precise dumping would probably help here but somewhere earlier you > should be able to see the vector type used for _2 We usually try with a "normal" mode like VNx4SI (RVVM1SI or so) and then switch to VNx4QI (i.e. a mode that only determines the number of units/elements) and have vectorize_related_mode return modes with the same number of units. This will then result in the sext/zext patterns matching. The first round where we try the normal mode will not match those because the related mode has a different number of units. So it's somewhat expected that the first try fails. My dump shows that we vectorize, so IMHO no problem. I can take a look at this but it doesn't look like a case for pack/unpack. Regards Robin
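A small illustrative loop (added example, not from the thread; the function name is invented) for the same point: when the source and destination vector types are chosen with the same number of elements, this widening is a plain vector sign-extension; vec_unpacks_{lo,hi} is only needed when one full-width vector of chars has to be split into two vectors of shorts.

#include <stdint.h>

/* Illustrative widening conversion: char elements sign-extended to short.
   With matching element counts per vector this maps to a sign-extend
   rather than an unpack-lo/unpack-hi pair.  */
void
widen_copy (int16_t *__restrict dst, const int8_t *__restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i];
}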
Question about dynamic choosing vectorization factor for RVV
Hi, Richard and Richi. Currently, we statically return the vectorization factor in 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE' according to the compile option. For example: void foo (int32_t *__restrict a, int32_t *__restrict b, int n) { for (int i = 0; i < n; i++) a[i] = a[i] + b[i]; } with --param=riscv-autovec-lmul=m1: vsetvli a5,a2,e32,m1,ta,ma vle32.v v2,0(a0) vle32.v v1,0(a1) vsetvli a6,zero,e32,m1,ta,ma slli a3,a5,2 vadd.vv v1,v1,v2 sub a2,a2,a5 vsetvli zero,a5,e32,m1,ta,ma vse32.v v1,0(a4) add a0,a0,a3 add a1,a1,a3 add a4,a4,a3 bne a2,zero,.L3 The 'vadd.vv' is only operating on a single register. with --param=riscv-autovec-lmul=m8: vsetvli a5,a2,e8,m2,ta,ma vle32.v v16,0(a0) vle32.v v8,0(a1) vsetvli a6,zero,e32,m8,ta,ma slli a3,a5,2 vadd.vv v8,v8,v16 vsetvli zero,a2,e32,m8,ta,ma sub a2,a2,a5 vse32.v v8,0(a4) add a0,a0,a3 add a1,a1,a3 add a4,a4,a3 bne a2,zero,.L3 The 'vadd.vv' here is operating on 8 consecutive registers: vadd.vv [v8 - v15], [v8 - v15], [v16 - v23] Having users statically set the vectorization factor is not ideal. We want GCC to dynamically choose the vectorization factor for auto-vectorization according to loop analysis. Currently, I have implemented a simplistic loop analysis that computes the live range of each local decl of the current function. Here is the analysis: we have 32 vector registers for RVV, so for the current function's local decls we require: the number of decls live at the same time * LMUL <= 32. According to this analysis, I set the vectorization factor in TARGET_VECTORIZE_PREFERRED_SIMD_MODE. This simplistic algorithm (implemented in the RISC-V backend) works well for the testcases I produced. However, I can only choose the optimal vectorization for the whole function, not for a specific loop. Here is the example: void foo2 (int32_t *__restrict a, int32_t *__restrict b, int32_t *__restrict c, int32_t *__restrict a2, int32_t *__restrict b2, int32_t *__restrict c2, int32_t *__restrict a3, int32_t *__restrict b3, int32_t *__restrict c3, int32_t *__restrict a4, int32_t *__restrict b4, int32_t *__restrict c4, int32_t *__restrict a5, int32_t *__restrict b5, int32_t *__restrict c5, int n) { // Loop 1 for (int i = 0; i < n; i++) a[i] = a[i] + b[i]; // Loop 2 for (int i = 0; i < n; i++){ a[i] = b[i] + c[i]; a2[i] = b2[i] + c2[i]; a3[i] = b3[i] + c3[i]; a4[i] = b4[i] + c4[i]; a5[i] = a[i] + a4[i]; a[i] = a3[i] + a2[i] + a5[i]; } } For loop 1 we can aggressively choose LMUL = 8, but loop 2 should choose LMUL = 4 (since LMUL = 8 will cause vector register spills). If I split loop 1 and loop 2 into 2 separate functions, my algorithm works well. However, if we put these 2 loops in the same function, I end up picking LMUL = 4 for both loop 1 and loop 2 since, as said above, the analysis is based on the function, not on the loop. I am wondering whether there is a good approach to this issue. Can we pass the loop_vec_info through to the 'preferred_simd_mode' target hook? Thanks. juzhe.zh...@rivai.ai
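A hedged sketch of the register-pressure rule described above (purely illustrative; the helper name and its structure are assumptions, not the actual RISC-V backend code): with 32 vector registers, if a certain number of vector values are live at the same point, the largest LMUL that keeps "live count * LMUL <= 32" is roughly 32 divided by that count, clamped to the power-of-two group sizes the ISA allows.

/* Sketch of the "decls live at the same time * LMUL <= 32" heuristic.
   Assumed helper for illustration only.  */
static int
max_lmul_for_live_count (int live)
{
  int budget = live > 0 ? 32 / live : 32; /* registers available per live value */
  int lmul = 1;
  while (lmul * 2 <= budget && lmul * 2 <= 8) /* LMUL can be 1, 2, 4 or 8 */
    lmul *= 2;
  return lmul;
}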
Re: Re: Question about dynamic choosing vectorization factor for RVV
Thanks Richi. I am trying to figure out how to adjust finish_cost to lower the LMUL. For example: void foo (int32_t *__restrict a, int32_t *__restrict b, int n) { for (int i = 0; i < n; i++) a[i] = a[i] + b[i]; } preferred_simd_mode picks LMUL = 8 (RVVM8SImode). Is it possible to adjust the cost in finish_cost to make the loop vectorizer pick LMUL = 4? I am experimenting with the following cost: if (loop_vinfo) { if (loop_vinfo->vector_mode == RVVM8SImode) { m_costs[vect_prologue] = 2; m_costs[vect_body] = 20; m_costs[vect_epilogue] = 2; } else { m_costs[vect_prologue] = 1; m_costs[vect_body] = 1; m_costs[vect_epilogue] = 1; } } I increased the LMUL = 8 cost. The codegen is odd: foo: ble a2,zero,.L12 addiw a5,a2,-1 li a4,30 sext.w t1,a2 bleu a5,a4,.L7 srliw a7,t1,5 slli a7,a7,7 li a4,32 add a7,a7,a0 mv a5,a0 mv a3,a1 vsetvli zero,a4,e32,m8,ta,ma .L4: vle32.v v8,0(a5) vle32.v v16,0(a3) vadd.vv v8,v8,v16 vse32.v v8,0(a5) addi a5,a5,128 addi a3,a3,128 bne a5,a7,.L4 andi a2,a2,-32 beq t1,a2,.L14 .L3: slli a4,a2,32 subw a5,t1,a2 srli a4,a4,32 slli a5,a5,32 slli a4,a4,2 srli a5,a5,32 add a0,a0,a4 add a1,a1,a4 vsetvli a4,a5,e8,m1,ta,ma vle32.v v8,0(a0) vle32.v v4,0(a1) vsetvli a2,zero,e32,m4,ta,ma vadd.vv v4,v4,v8 vsetvli zero,a5,e32,m4,ta,ma vse32.v v4,0(a0) sub a3,a5,a4 beq a5,a4,.L12 slli a4,a4,2 vsetvli zero,a3,e8,m1,ta,ma add a0,a0,a4 add a1,a1,a4 vle32.v v4,0(a0) vle32.v v8,0(a1) vsetvli a2,zero,e32,m4,ta,ma vadd.vv v4,v4,v8 vsetvli zero,a3,e32,m4,ta,ma vse32.v v4,0(a0) .L12: ret .L7: li a2,0 j .L3 .L14: ret I hope it can generate code like this: foo: ble a2,zero,.L5 mv a4,a0 .L3: vsetvli a5,a2,e32,m4,ta,ma vle32.v v8,0(a0) vle32.v v4,0(a1) vsetvli a6,zero,e32,m4,ta,ma slli a3,a5,2 vadd.vv v4,v4,v8 sub a2,a2,a5 vsetvli zero,a5,e32,m4,ta,ma vse32.v v4,0(a4) add a0,a0,a3 add a1,a1,a3 add a4,a4,a3 bne a2,zero,.L3 .L5: ret I am experimenting with whether we can adjust the cost statically to make the loop vectorizer use LMUL = 4 even though preferred_simd_mode returns LMUL = 8. If we can do that, I think we can run the analysis and then adjust the cost according to it. Thanks. juzhe.zh...@rivai.ai From: Richard Biener Date: 2023-08-31 15:38 To: juzhe.zh...@rivai.ai CC: gcc; richard.sandiford Subject: Re: Question about dynamic choosing vectorization factor for RVV On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote: > Hi, Richard and Richi. > > Currently, we are statically returning vectorization factor in > 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE' > according to compile option. > > For example: > void > foo (int32_t *__restrict a, int32_t *__restrict b, int n) > { > for (int i = 0; i < n; i++) > a[i] = a[i] + b[i]; > } > > with --param=riscv-autovec-lmul = m1: > > vsetvli a5,a2,e32,m1,ta,ma > vle32.v v2,0(a0) > vle32.v v1,0(a1) > vsetvli a6,zero,e32,m1,ta,ma > slli a3,a5,2 > vadd.vv v1,v1,v2 > sub a2,a2,a5 > vsetvli zero,a5,e32,m1,ta,ma > vse32.v v1,0(a4) > add a0,a0,a3 > add a1,a1,a3 > add a4,a4,a3 > bne a2,zero,.L3 > > The 'vadd.vv' is only performing operations on a single register. > > with --param=riscv-autovec-lmul=m8: > > vsetvli a5,a2,e8,m2,ta,ma > vle32.v v16,0(a0) > vle32.v v8,0(a1) > vsetvli a6,zero,e32,m8,ta,ma > slli a3,a5,2 > vadd.vv v8,v8,v16 > vsetvli zero,a2,e32,m8,ta,ma > sub a2,a2,a5 > vse32.v v8,0(a4) > add a0,a0,a3 > add a1,a1,a3 > add a4,a4,a3 > bne a2,zero,.L3 > > The 'vadd.vv' here is performing operations on 8 consecutive registers: > > vadd.vv [v8 - v15], [v8 - v15], [v16 - v23] > > Users statically set the vectorization factor is not ideal. 
> > We want GCC to dynamic choose vectorization factor to do the > auto-vectorization according to loop analysis. > > Currently, I have implement simplistic loop analysis like analyze live range > of each local decl of current function. > > Here is the analysis, we have 32 vector registers for RVV. > So we calculate the live range of current function local decl: > > the number of decls live at the same time * LMUL <= 32. > According to this analysis, I set the vectorization factor in > TARGET_VECTORIZE_PREFERRED_SIMD_MODE > > Then this simplistic algorithm (implemented in RISC-V backend) work well for > the testcases I produces. > > However, I can only choose optimal vectorization for whole function but > failed to specific loop. > > Here is the example: > > void foo2 (int32_t *__restrict a, > int32_t *__restrict b, > int32_t *__restrict c, > int32_t *__restrict a2, > int32_t *__restrict b2, > int32_t *__restrict c2, > int32_t
Re: Re: Question about dynamic choosing vectorization factor for RVV
Hi. Thanks Richard and Richi. Now I have figured out how to choose a smaller LMUL. void costs::finish_cost (const vector_costs *scalar_costs) { loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); if (loop_vinfo) { if (loop_vinfo->vector_mode == RVVM8SImode || riscv_v_ext_vls_mode_p (loop_vinfo->vector_mode)) { m_costs[vect_prologue] = 8; m_costs[vect_body] = 8; m_costs[vect_epilogue] = 8; } else { m_costs[vect_prologue] = 1; m_costs[vect_body] = 1; m_costs[vect_epilogue] = 1; } } // m_suggested_unroll_factor = 2; vector_costs::finish_cost (scalar_costs); } The previous odd code was because of VLS modes. Now I can get LMUL = 4 by adjusting the cost: vsetvli a5,a2,e32,m4,ta,ma vle32.v v8,0(a0) vle32.v v4,0(a1) vsetvli a6,zero,e32,m4,ta,ma slli a3,a5,2 vadd.vv v4,v4,v8 sub a2,a2,a5 vsetvli zero,a5,e32,m4,ta,ma vse32.v v4,0(a4) add a0,a0,a3 add a1,a1,a3 add a4,a4,a3 bne a2,zero,.L3 Fantastic architecture of the GCC vector cost model! Thanks a lot. juzhe.zh...@rivai.ai From: Richard Biener Date: 2023-08-31 19:20 To: juzhe.zh...@rivai.ai CC: gcc; richard.sandiford Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote: > Thanks Richi. > > I am trying to figure out how to adjust finish_cost to lower the LMUL > > For example: > > void > foo (int32_t *__restrict a, int32_t *__restrict b, int n) > { > for (int i = 0; i < n; i++) > a[i] = a[i] + b[i]; > } > > preferred_simd_mode pick LMUL = 8 (RVVM8SImode) > > Is is possible that we can adjust the COST in finish cost make Loop > vectorizer pick LMUL = 4? I see you have a autovectorize_vector_modes hook and you use VECT_COMPARE_COSTS. So the appropriate place would be to amend your vector_costs::better_main_loop_than_p. > I am experimenting with this following cost: > > if (loop_vinfo) > { > if (loop_vinfo->vector_mode == RVVM8SImode) > { > m_costs[vect_prologue] = 2; > m_costs[vect_body] = 20; > m_costs[vect_epilogue] = 2; > } > else > { > m_costs[vect_prologue] = 1; > m_costs[vect_body] = 1; > m_costs[vect_epilogue] = 1; > } > } > > I increase LMUL = 8 cost. 
The codegen is odd: > > foo: > ble a2,zero,.L12 > addiw a5,a2,-1 > li a4,30 > sext.w t1,a2 > bleu a5,a4,.L7 > srliw a7,t1,5 > slli a7,a7,7 > li a4,32 > add a7,a7,a0 > mv a5,a0 > mv a3,a1 > vsetvli zero,a4,e32,m8,ta,ma > .L4: > vle32.v v8,0(a5) > vle32.v v16,0(a3) > vadd.vv v8,v8,v16 > vse32.v v8,0(a5) > addi a5,a5,128 > addi a3,a3,128 > bne a5,a7,.L4 > andi a2,a2,-32 > beq t1,a2,.L14 > .L3: > slli a4,a2,32 > subw a5,t1,a2 > srli a4,a4,32 > slli a5,a5,32 > slli a4,a4,2 > srli a5,a5,32 > add a0,a0,a4 > add a1,a1,a4 > vsetvli a4,a5,e8,m1,ta,ma > vle32.v v8,0(a0) > vle32.v v4,0(a1) > vsetvli a2,zero,e32,m4,ta,ma > vadd.vv v4,v4,v8 > vsetvli zero,a5,e32,m4,ta,ma > vse32.v v4,0(a0) > sub a3,a5,a4 > beq a5,a4,.L12 > slli a4,a4,2 > vsetvli zero,a3,e8,m1,ta,ma > add a0,a0,a4 > add a1,a1,a4 > vle32.v v4,0(a0) > vle32.v v8,0(a1) > vsetvli a2,zero,e32,m4,ta,ma > vadd.vv v4,v4,v8 > vsetvli zero,a3,e32,m4,ta,ma > vse32.v v4,0(a0) > .L12: > ret > .L7: > li a2,0 > j .L3 > .L14: > ret > > I hope it can generate the code like this: > > foo: > ble a2,zero,.L5 > mv a4,a0 > .L3: > vsetvli a5,a2,e32,m4,ta,ma > vle32.v v8,0(a0) > vle32.v v4,0(a1) > vsetvli a6,zero,e32,m4,ta,ma > slli a3,a5,2 > vadd.vv v4,v4,v8 > sub a2,a2,a5 > vsetvli zero,a5,e32,m4,ta,ma > vse32.v v4,0(a4) > add a0,a0,a3 > add a1,a1,a3 > add a4,a4,a3 > bne a2,zero,.L3 > .L5: > ret > > I am experimenting whether we can adjust cost statically to make loop > vectorizer use LMUL = 4 even though preferred_simd_mode return LMUL = 8. > If we can do that, I think we can apply analysis and then adjust the > cost according to analysis. > > Thanks. > > > juzhe.zh...@rivai.ai > > From: Richard Biener > Date: 2023-08-31 15:38 > To: juzhe.zh...@rivai.ai > CC: gcc; richard.sandiford > Subject: Re: Question about dynamic choosing vectorization factor for RVV > On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote: > > > Hi, Richard and Richi. > > > > Currently, we are statically returning vectorization factor in > > 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE' > > according to compile option. > > > > For example: > > void > > foo (int32_t *__restrict a, int32_t *__restrict b, in
Re: Re: Question about dynamic choosing vectorization factor for RVV
Hi, Richi. >> I don't think that's "good" use of the API. You mean I should use 'better_main_loop_than_p‘ ? Yes. I plan to use it. Thanks. juzhe.zh...@rivai.ai From: Richard Biener Date: 2023-08-31 19:29 To: juzhe.zh...@rivai.ai CC: gcc; richard.sandiford Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote: > Hi. Thanks Richard and Richi. > > Now, I figure out how to choose smaller LMUL now. > > void > costs::finish_cost (const vector_costs *scalar_costs) > { > loop_vec_info loop_vinfo = dyn_cast (m_vinfo); > if (loop_vinfo) > { > if (loop_vinfo->vector_mode == RVVM8SImode > || riscv_v_ext_vls_mode_p (loop_vinfo->vector_mode)) > { > m_costs[vect_prologue] = 8; > m_costs[vect_body] = 8; > m_costs[vect_epilogue] = 8; > } > else > { > m_costs[vect_prologue] = 1; > m_costs[vect_body] = 1; > m_costs[vect_epilogue] = 1; > } > } >// m_suggested_unroll_factor = 2; > vector_costs::finish_cost (scalar_costs); > } I don't think that's "good" use of the API. > Previous odd codes are because of VLS modes > > Now, I can get the LMUL = 4 by adjusting cost. > vsetvli a5,a2,e32,m4,ta,ma > vle32.v v8,0(a0) > vle32.v v4,0(a1) > vsetvli a6,zero,e32,m4,ta,ma > slli a3,a5,2 > vadd.vv v4,v4,v8 > sub a2,a2,a5 > vsetvli zero,a5,e32,m4,ta,ma > vse32.v v4,0(a4) > add a0,a0,a3 > add a1,a1,a3 > add a4,a4,a3 > bne a2,zero,.L3 > > Fantastic architecture of GCC Vector Cost model! > > Thanks a lot. > > > juzhe.zh...@rivai.ai > > From: Richard Biener > Date: 2023-08-31 19:20 > To: juzhe.zh...@rivai.ai > CC: gcc; richard.sandiford > Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV > On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote: > > > Thanks Richi. > > > > I am trying to figure out how to adjust finish_cost to lower the LMUL > > > > For example: > > > > void > > foo (int32_t *__restrict a, int32_t *__restrict b, int n) > > { > > for (int i = 0; i < n; i++) > > a[i] = a[i] + b[i]; > > } > > > > preferred_simd_mode pick LMUL = 8 (RVVM8SImode) > > > > Is is possible that we can adjust the COST in finish cost make Loop > > vectorizer pick LMUL = 4? > > I see you have a autovectorize_vector_modes hook and you use > VECT_COMPARE_COSTS. So the appropriate place would be to > amend your vector_costs::better_main_loop_than_p. > > > I am experimenting with this following cost: > > > > if (loop_vinfo) > > { > > if (loop_vinfo->vector_mode == RVVM8SImode) > > { > > m_costs[vect_prologue] = 2; > > m_costs[vect_body] = 20; > > m_costs[vect_epilogue] = 2; > > } > > else > > { > > m_costs[vect_prologue] = 1; > > m_costs[vect_body] = 1; > > m_costs[vect_epilogue] = 1; > > } > > } > > > > I increase LMUL = 8 cost. 
The codegen is odd: > > > > foo: > > ble a2,zero,.L12 > > addiw a5,a2,-1 > > li a4,30 > > sext.w t1,a2 > > bleu a5,a4,.L7 > > srliw a7,t1,5 > > slli a7,a7,7 > > li a4,32 > > add a7,a7,a0 > > mv a5,a0 > > mv a3,a1 > > vsetvli zero,a4,e32,m8,ta,ma > > .L4: > > vle32.v v8,0(a5) > > vle32.v v16,0(a3) > > vadd.vv v8,v8,v16 > > vse32.v v8,0(a5) > > addi a5,a5,128 > > addi a3,a3,128 > > bne a5,a7,.L4 > > andi a2,a2,-32 > > beq t1,a2,.L14 > > .L3: > > slli a4,a2,32 > > subw a5,t1,a2 > > srli a4,a4,32 > > slli a5,a5,32 > > slli a4,a4,2 > > srli a5,a5,32 > > add a0,a0,a4 > > add a1,a1,a4 > > vsetvli a4,a5,e8,m1,ta,ma > > vle32.v v8,0(a0) > > vle32.v v4,0(a1) > > vsetvli a2,zero,e32,m4,ta,ma > > vadd.vv v4,v4,v8 > > vsetvli zero,a5,e32,m4,ta,ma > > vse32.v v4,0(a0) > > sub a3,a5,a4 > > beq a5,a4,.L12 > > slli a4,a4,2 > > vsetvli zero,a3,e8,m1,ta,ma > > add a0,a0,a4 > > add a1,a1,a4 > > vle32.v v4,0(a0) > > vle32.v v8,0(a1) > > vsetvli a2,zero,e32,m4,ta,ma > > vadd.vv v4,v4,v8 > > vsetvli zero,a3,e32,m4,ta,ma > > vse32.v v4,0(a0) > > .L12: > > ret > > .L7: > > li a2,0 > > j .L3 > > .L14: > > ret > > > > I hope it can generate the code like this: > &
Re: Re: Question about dynamic choosing vectorization factor for RVV
Hi, Richi. /* Keep track of the VF for each mode. Initialize all to 0 which indicates a mode has not been analyzed. */ auto_vec<poly_uint64, 8> cached_vf_per_mode; for (unsigned i = 0; i < vector_modes.length (); ++i) cached_vf_per_mode.safe_push (0); I saw this code: 'cached_vf_per_mode' is allocated with size 8. But for RVV, I will need to push the following modes: RVVM8QI, RVVM4QI, RVVM2QI, RVVM1QI, V128QI, V64QI, V32QI, V16QI, V8QI, V4QI, V2QI. There are 11 modes. Should I increase the number from 8 to 11? Thanks. juzhe.zh...@rivai.ai From: Richard Biener Date: 2023-08-31 19:29 To: juzhe.zh...@rivai.ai CC: gcc; richard.sandiford Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote: > Hi. Thanks Richard and Richi. > > Now, I figure out how to choose smaller LMUL now. > > void > costs::finish_cost (const vector_costs *scalar_costs) > { > loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); > if (loop_vinfo) > { > if (loop_vinfo->vector_mode == RVVM8SImode > || riscv_v_ext_vls_mode_p (loop_vinfo->vector_mode)) > { > m_costs[vect_prologue] = 8; > m_costs[vect_body] = 8; > m_costs[vect_epilogue] = 8; > } > else > { > m_costs[vect_prologue] = 1; > m_costs[vect_body] = 1; > m_costs[vect_epilogue] = 1; > } > } >// m_suggested_unroll_factor = 2; > vector_costs::finish_cost (scalar_costs); > } I don't think that's "good" use of the API. > Previous odd codes are because of VLS modes > > Now, I can get the LMUL = 4 by adjusting cost. > vsetvli a5,a2,e32,m4,ta,ma > vle32.v v8,0(a0) > vle32.v v4,0(a1) > vsetvli a6,zero,e32,m4,ta,ma > slli a3,a5,2 > vadd.vv v4,v4,v8 > sub a2,a2,a5 > vsetvli zero,a5,e32,m4,ta,ma > vse32.v v4,0(a4) > add a0,a0,a3 > add a1,a1,a3 > add a4,a4,a3 > bne a2,zero,.L3 > > Fantastic architecture of GCC Vector Cost model! > > Thanks a lot. > > > juzhe.zh...@rivai.ai > > From: Richard Biener > Date: 2023-08-31 19:20 > To: juzhe.zh...@rivai.ai > CC: gcc; richard.sandiford > Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV > On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote: > > > Thanks Richi. > > > > I am trying to figure out how to adjust finish_cost to lower the LMUL > > > > For example: > > > > void > > foo (int32_t *__restrict a, int32_t *__restrict b, int n) > > { > > for (int i = 0; i < n; i++) > > a[i] = a[i] + b[i]; > > } > > > > preferred_simd_mode pick LMUL = 8 (RVVM8SImode) > > > > Is is possible that we can adjust the COST in finish cost make Loop > > vectorizer pick LMUL = 4? > > I see you have a autovectorize_vector_modes hook and you use > VECT_COMPARE_COSTS. So the appropriate place would be to > amend your vector_costs::better_main_loop_than_p. > > > I am experimenting with this following cost: > > > > if (loop_vinfo) > > { > > if (loop_vinfo->vector_mode == RVVM8SImode) > > { > > m_costs[vect_prologue] = 2; > > m_costs[vect_body] = 20; > > m_costs[vect_epilogue] = 2; > > } > > else > > { > > m_costs[vect_prologue] = 1; > > m_costs[vect_body] = 1; > > m_costs[vect_epilogue] = 1; > > } > > } > > > > I increase LMUL = 8 cost. 
The codegen is odd: > > > > foo: > > ble a2,zero,.L12 > > addiw a5,a2,-1 > > li a4,30 > > sext.w t1,a2 > > bleu a5,a4,.L7 > > srliw a7,t1,5 > > slli a7,a7,7 > > li a4,32 > > add a7,a7,a0 > > mv a5,a0 > > mv a3,a1 > > vsetvli zero,a4,e32,m8,ta,ma > > .L4: > > vle32.v v8,0(a5) > > vle32.v v16,0(a3) > > vadd.vv v8,v8,v16 > > vse32.v v8,0(a5) > > addi a5,a5,128 > > addi a3,a3,128 > > bne a5,a7,.L4 > > andi a2,a2,-32 > > beq t1,a2,.L14 > > .L3: > > slli a4,a2,32 > > subw a5,t1,a2 > > srli a4,a4,32 > > slli a5,a5,32 > > slli a4,a4,2 > > srli a5,a5,32 > > add a0,a0,a4 > > add a1,a1,a4 > > vsetvli a4,a5,e8,m1,ta,ma > > vle32.v v8,0(a0) > > vle32.v v4,0(a1) > > vsetvli a2,zero,e32,m4,ta,ma > > vadd.vv v4,v4,v8 > > vsetvli zero,a5,e32,m4,ta,ma > > vse32.v v4,0(a0) > > sub a3,a5,a4 > > beq a5,a4,.L12 > > slli a4,a4,2 > > vsetvli zero,a3,e8,m1,ta,ma > > add a0
Re: Re: Lots of FAILs in gcc.target/riscv/rvv/autovec/*
I am sure that master GCC has a much better VSETVL strategy than GCC-13. However, a recent evaluation on our internal hardware shows that master GCC is overall worse than the previous RVV GCC I open-sourced at: https://github.com/riscv-collab/riscv-gcc/tree/riscv-gcc-rvv-next (rvv-next) That is odd, since I think I have supported all the middle-end features of rvv-next. We are analyzing it and trying to figure out why. We must recover the performance in GCC-14. juzhe.zh...@rivai.ai From: Maxim Blinov Date: 2023-11-08 12:31 To: Jeff Law CC: gcc; kito.cheng; juzhe.zhong Subject: Re: Lots of FAILs in gcc.target/riscv/rvv/autovec/* I see, thanks for clarifying, that makes sense. In that case, what about doing the inverse? I mean, are there unique patches in the vendor branch, and would it be useful to try to upstream them into master? My motivation is to get the best autovectorized code for RISC-V. I had a go at building the TSVC benchmark (in the llvm-test-suite[1] repository) both with the master and vendor branch gcc, and noticed that the vendor branch gcc generally beats master in generating more vector instructions. If I simply count the number of instances of each vector instruction, the average across all 36 test cases of vendor vs master gcc shows the following most prominent differences: - vmv.x.s: 48 vs 0 (+ 48) - vle32.v: 150 vs 50 (+ 100) - vrgather.vv: 61 vs 0 (+ 61) - vslidedown.vi: 61 vs 0 (+ 61) - vse32.v: 472 vs 213 (+ 259) - vmsgtu.vi: 30 vs 0 (+ 30) - vadd.vi: 80 vs 30 (+ 50) - vlm.v: 18 vs 0 (+ 18) - vsm.v: 16 vs 0 (+ 16) - vmv4r.v: 21 vs 7 (+ 14) (For reference, the benchmarks are all between 20k-30k in code size. Built with `-march=rv64imafdcv -O3`.) Of course that doesn't say anything about performance, but would it be possible/fair to say that the vendor branch may still be better than master for generating vectorized code for RISC-V? What's interesting is that there's very little "regression" - I saw only very few cases where the vendor branch removed a vector instruction as compared to master gcc (the most often removed instruction by the vendor branch, as compared to master, is vsetvl/vsetvli.) BR, Maxim [1]: https://github.com/llvm/llvm-test-suite/tree/main/MultiSource/Benchmarks/TSVC On Tue, 7 Nov 2023 at 15:53, Jeff Law wrote: > > > > On 11/7/23 05:50, Maxim Blinov wrote: > > Hi all, > > > > I can see about 500 failing tests on the > > vendors/riscv/gcc-13-with-riscv-opts, a mostly-full list at the bottom > > of this email. It's mostly test cases scraping for vector > > instructions. > Correct. There are generic vectorizer changes that would need to be > ported over to that branch to make those tests pass. I looked at this a > few times and ultimately gave up in the rats nest of inter-dependent > patches in the vectorizer. > > > Given the lifetime of that branch is likely nearing its end, I don't > think there's much value left in trying to port those changes over. Any > such effort would likely be better spent nailing down issues on the trunk. > > jeff
Re: Re: Loop vectorizer optimization questions
I see. Thanks Tamar. I am willing to investigate Arm's initial patch to see what else we need in that patch. Since min/max reduction with index can improve SPEC performance, I will take a look at it in GCC-15. Thanks a lot! juzhe.zh...@rivai.ai From: Tamar Christina Date: 2024-01-09 16:59 To: 钟居哲 CC: richard.guenther; rdapp.gcc; gcc Subject: Re: RE: Loop vectorizer optimization questions Hi, The 01/08/2024 22:46, 钟居哲 wrote: > Oh. It's nice to see you have support min/max index reduction. > > I knew your patch can handle this following: > > > int idx = ii; > int max = mm; > for (int i = 0; i < n; ++i) { > int x = a[i]; > if (max < x) { > max = x; > idx = i; > } > } > > But I wonder whether your patch can handle this: > > int idx = ii; > int max = mm; > for (int i = 0; i < n; ++i) { > int x = a[i]; > if (max <= x) { > max = x; > idx = i; > } > } > The last version of the patch we sent handled all conditionals: https://inbox.sourceware.org/gcc-patches/db9pr08mb6603dccb35007d83c6736167f5...@db9pr08mb6603.eurprd08.prod.outlook.com/ There are some additional testcases in the patch for all these as well. > Will you continue to work on min/max with index ? I don't know if I'll have the free time to do so; that's the reason I haven't resent the new one. The engineer who started it no longer works for Arm. > Or you want me to continue this work base on your patch ? > > I have an initial patch which roughly implemented LLVM's approach but turns > out Richi doesn't want me to apply LLVM's approach so your patch may be more > reasonable than LLVM's approach. > When Richi reviewed it he wasn't against the approach in the patch https://inbox.sourceware.org/gcc-patches/nycvar.yfh.7.76.2105071320170.9...@zhemvz.fhfr.qr/ but he wanted the concept of a dependent reduction to be handled more generically, so we could extend it in the future. I think, from looking at Richi's feedback, that he wants vect_recog_minmax_index_pattern to be more general. We've basically hardcoded the reduction type, but it could just be a property on STMT_VINFO. Unless I'm mistaken, the patch already relies on first finding both reductions, but we immediately try to resolve the relationship using vect_recog_minmax_index_pattern. Instead I think what Richi wanted was for us to keep track of reductions that operate on the same induction variable and, after we finish analysing all reductions, try to see if any reductions we kept track of can be combined. Basically just separate out the discovery and tying of the reductions. Am I right here Richi? I think the codegen part can mostly be used as is, though we might be able to do better for VLA. So it should be fairly straightforward to go from that final patch to what Richi wants, but... I just lack time. If you want to tackle it that would be great :) Thanks, Tamar > Thanks. > > juzhe.zh...@rivai.ai > > From: Tamar Christina <tamar.christ...@arm.com> > Date: 2024-01-09 01:50 > To: 钟居哲 <juzhe.zh...@rivai.ai>; gcc <gcc@gcc.gnu.org> > CC: rdapp.gcc <rdapp@gmail.com>; > richard.guenther <richard.guent...@gmail.com> > Subject: RE: Loop vectorizer optimization questions > > > > Also, another question is that I am working on min/max reduction with > > index, I > > believe it should be in GCC-15, but I wonder > > whether I can pre-post for review in stage 4, or I should post patch > > (min/max > > reduction with index) when GCC-15 is open. 
> > FWIW, We tried to implement this 5 years ago > https://gcc.gnu.org/pipermail/gcc-patches/2019-November/534518.html > and you'll likely get the same feedback if you aren't already doing so. > > I think Richard would prefer to have a general framework for these kinds of > operations. We never got around to doing so > and it's still on my list but if you're taking care of it > > Just thought I'd point out the previous feedback. > > Cheers, > Tamar > > > Thanks. > > > > > > juzhe.zh...@rivai.ai --
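For readers following the min/max-with-index discussion above, a self-contained C version of the pattern in question (an added illustration, the function name is invented; the snippets in the thread are fragments): a loop carrying two tied reductions, the running maximum and the index where it was found.

/* Max-with-index: two reductions over the same loop ('max' and 'idx')
   that have to be vectorized together.  Illustrative example only.  */
int
max_index (const int *a, int n, int mm, int ii)
{
  int max = mm;
  int idx = ii;
  for (int i = 0; i < n; i++)
    if (a[i] > max)
      {
        max = a[i];
        idx = i;
      }
  return idx;
}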