On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:

> Hi, Richi.
> 
>   /* Keep track of the VF for each mode.  Initialize all to 0 which indicates
>      a mode has not been analyzed.  */
>   auto_vec<poly_uint64, 8> cached_vf_per_mode;
>   for (unsigned i = 0; i < vector_modes.length (); ++i)
>     cached_vf_per_mode.safe_push (0);
> 
> I saw the code above:
> 'cached_vf_per_mode' is allocated with size 8.
> 
> But for RVV, I will need to push the following modes:
> 
> RVVM8QI, RVVM4QI, RVVM2QI, RVVM1QI, V128QI, V64QI, V32QI, V16QI, V8QI, V4QI, 
> V2QI
> 
> There are 11 modes.
> Should I increase the number from 8 to 11?

No need to adjust.  The 8 is only the auto_vec's inline capacity; pushing
more elements just falls back to dynamic allocation.
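
To illustrate the behaviour, here is a minimal standalone sketch.  It is
not GCC's vec.h: the class name small_vec and all of its members are made
up for this example, and the growth policy (doubling) is an assumption.
It only shows the general idea of a vector with N inline slots that
moves to the heap once more than N elements are pushed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a vector with inline storage; this is an
// illustration only, not the real auto_vec implementation.
template <typename T, std::size_t N>
class small_vec
{
  static_assert (N > 0, "this sketch needs at least one inline slot");

public:
  small_vec () : m_data (m_inline), m_len (0), m_cap (N) {}
  ~small_vec () { if (on_heap ()) delete[] m_data; }
  small_vec (const small_vec &) = delete;
  small_vec &operator= (const small_vec &) = delete;

  // Append V, growing onto the heap when the current storage is full.
  void safe_push (const T &v)
  {
    if (m_len == m_cap)
      grow ();
    m_data[m_len++] = v;
  }

  std::size_t length () const { return m_len; }
  bool on_heap () const { return m_data != m_inline; }

  const T &operator[] (std::size_t i) const
  {
    assert (i < m_len);
    return m_data[i];
  }

private:
  // Double the capacity (assumed policy) and copy the elements over.
  void grow ()
  {
    std::size_t new_cap = m_cap * 2;
    T *p = new T[new_cap];
    for (std::size_t i = 0; i < m_len; ++i)
      p[i] = m_data[i];
    if (on_heap ())
      delete[] m_data;
    m_data = p;
    m_cap = new_cap;
  }

  T m_inline[N];      // Inline slots, used until they run out.
  T *m_data;          // Points at m_inline or at heap storage.
  std::size_t m_len;
  std::size_t m_cap;
};
```

With small_vec<unsigned, 8>, the first 8 pushes stay in the inline
buffer and the 9th to 11th simply trigger a heap allocation, so 11 modes
work without changing the 8.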

> Thanks.
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Biener
> Date: 2023-08-31 19:29
> To: juzhe.zh...@rivai.ai
> CC: gcc; richard.sandiford
> Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
> On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
>  
> > Hi. Thanks Richard and Richi.
> > 
> > I have now figured out how to choose a smaller LMUL.
> > 
> > void
> > costs::finish_cost (const vector_costs *scalar_costs)
> > {
> >   loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo);
> >   if (loop_vinfo)
> >     {
> >       if (loop_vinfo->vector_mode == RVVM8SImode
> >           || riscv_v_ext_vls_mode_p (loop_vinfo->vector_mode))
> >         {
> >           m_costs[vect_prologue] = 8;
> >           m_costs[vect_body] = 8;
> >           m_costs[vect_epilogue] = 8;
> >         }
> >       else
> >         {
> >           m_costs[vect_prologue] = 1;
> >           m_costs[vect_body] = 1;
> >           m_costs[vect_epilogue] = 1;
> >         }
> >     }
> >    // m_suggested_unroll_factor = 2;
> >   vector_costs::finish_cost (scalar_costs);
> > }
>  
> I don't think that's a "good" use of the API.
>  
> > The previous odd code was caused by the VLS modes.
> > 
> > Now I can get LMUL = 4 by adjusting the cost:
> > vsetvli a5,a2,e32,m4,ta,ma
> > vle32.v v8,0(a0)
> > vle32.v v4,0(a1)
> > vsetvli a6,zero,e32,m4,ta,ma
> > slli a3,a5,2
> > vadd.vv v4,v4,v8
> > sub a2,a2,a5
> > vsetvli zero,a5,e32,m4,ta,ma
> > vse32.v v4,0(a4)
> > add a0,a0,a3
> > add a1,a1,a3
> > add a4,a4,a3
> > bne a2,zero,.L3
> > 
> > Fantastic architecture of GCC Vector Cost model!
> > 
> > Thanks a lot.
> > 
> > 
> > juzhe.zh...@rivai.ai
> >  
> > From: Richard Biener
> > Date: 2023-08-31 19:20
> > To: juzhe.zh...@rivai.ai
> > CC: gcc; richard.sandiford
> > Subject: Re: Re: Question about dynamic choosing vectorization factor for 
> > RVV
> > On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
> >  
> > > Thanks Richi.
> > > 
> > > I am trying to figure out how to adjust finish_cost to lower the LMUL.
> > > 
> > > For example:
> > > 
> > > void
> > > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > > {
> > >   for (int i = 0; i < n; i++)
> > >     a[i] = a[i] + b[i];
> > > }
> > > 
> > > preferred_simd_mode picks LMUL = 8 (RVVM8SImode).
> > > 
> > > Is it possible to adjust the cost in finish_cost to make the loop 
> > > vectorizer pick LMUL = 4?
> >  
> > I see you have an autovectorize_vector_modes hook and you use
> > VECT_COMPARE_COSTS.  So the appropriate place would be to
> > amend your vector_costs::better_main_loop_than_p.
> >  
> > > I am experimenting with the following cost:
> > > 
> > >   if (loop_vinfo)
> > >     {
> > >       if (loop_vinfo->vector_mode == RVVM8SImode)
> > >         {
> > >           m_costs[vect_prologue] = 2;
> > >           m_costs[vect_body] = 20;
> > >           m_costs[vect_epilogue] = 2;
> > >         }
> > >       else
> > >         {
> > >           m_costs[vect_prologue] = 1;
> > >           m_costs[vect_body] = 1;
> > >           m_costs[vect_epilogue] = 1;
> > >         }
> > >     }
> > > 
> > > I increased the cost for LMUL = 8. The codegen is odd:
> > > 
> > > foo:
> > > ble a2,zero,.L12
> > > addiw a5,a2,-1
> > > li a4,30
> > > sext.w t1,a2
> > > bleu a5,a4,.L7
> > > srliw a7,t1,5
> > > slli a7,a7,7
> > > li a4,32
> > > add a7,a7,a0
> > > mv a5,a0
> > > mv a3,a1
> > > vsetvli zero,a4,e32,m8,ta,ma
> > > .L4:
> > > vle32.v v8,0(a5)
> > > vle32.v v16,0(a3)
> > > vadd.vv v8,v8,v16
> > > vse32.v v8,0(a5)
> > > addi a5,a5,128
> > > addi a3,a3,128
> > > bne a5,a7,.L4
> > > andi a2,a2,-32
> > > beq t1,a2,.L14
> > > .L3:
> > > slli a4,a2,32
> > > subw a5,t1,a2
> > > srli a4,a4,32
> > > slli a5,a5,32
> > > slli a4,a4,2
> > > srli a5,a5,32
> > > add a0,a0,a4
> > > add a1,a1,a4
> > > vsetvli a4,a5,e8,m1,ta,ma
> > > vle32.v v8,0(a0)
> > > vle32.v v4,0(a1)
> > > vsetvli a2,zero,e32,m4,ta,ma
> > > vadd.vv v4,v4,v8
> > > vsetvli zero,a5,e32,m4,ta,ma
> > > vse32.v v4,0(a0)
> > > sub a3,a5,a4
> > > beq a5,a4,.L12
> > > slli a4,a4,2
> > > vsetvli zero,a3,e8,m1,ta,ma
> > > add a0,a0,a4
> > > add a1,a1,a4
> > > vle32.v v4,0(a0)
> > > vle32.v v8,0(a1)
> > > vsetvli a2,zero,e32,m4,ta,ma
> > > vadd.vv v4,v4,v8
> > > vsetvli zero,a3,e32,m4,ta,ma
> > > vse32.v v4,0(a0)
> > > .L12:
> > > ret
> > > .L7:
> > > li a2,0
> > > j .L3
> > > .L14:
> > > ret
> > > 
> > > I was hoping it would generate code like this:
> > > 
> > > foo:
> > > ble a2,zero,.L5
> > > mv a4,a0
> > > .L3:
> > > vsetvli a5,a2,e32,m4,ta,ma
> > > vle32.v v8,0(a0)
> > > vle32.v v4,0(a1)
> > > vsetvli a6,zero,e32,m4,ta,ma
> > > slli a3,a5,2
> > > vadd.vv v4,v4,v8
> > > sub a2,a2,a5
> > > vsetvli zero,a5,e32,m4,ta,ma
> > > vse32.v v4,0(a4)
> > > add a0,a0,a3
> > > add a1,a1,a3
> > > add a4,a4,a3
> > > bne a2,zero,.L3
> > > .L5:
> > > ret
> > > 
> > > I am experimenting to see whether we can adjust the cost statically to 
> > > make the loop vectorizer use LMUL = 4 even though preferred_simd_mode 
> > > returns LMUL = 8. If we can do that, I think we can run the analysis 
> > > and then adjust the cost according to its result.
> > >
> > > Thanks.
> > > 
> > > 
> > > juzhe.zh...@rivai.ai
> > >  
> > > From: Richard Biener
> > > Date: 2023-08-31 15:38
> > > To: juzhe.zh...@rivai.ai
> > > CC: gcc; richard.sandiford
> > > Subject: Re: Question about dynamic choosing vectorization factor for RVV
> > > On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
> > >  
> > > > Hi, Richard and Richi.
> > > > 
> > > > Currently, we statically return the vectorization factor in 
> > > > 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> > > > according to a compile option.
> > > > 
> > > > For example:
> > > > void
> > > > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > > > {
> > > >   for (int i = 0; i < n; i++)
> > > >     a[i] = a[i] + b[i];
> > > > }
> > > > 
> > > > with --param=riscv-autovec-lmul=m1:
> > > > 
> > > > vsetvli a5,a2,e32,m1,ta,ma
> > > > vle32.v v2,0(a0)
> > > > vle32.v v1,0(a1)
> > > > vsetvli a6,zero,e32,m1,ta,ma
> > > > slli a3,a5,2
> > > > vadd.vv v1,v1,v2
> > > > sub a2,a2,a5
> > > > vsetvli zero,a5,e32,m1,ta,ma
> > > > vse32.v v1,0(a4)
> > > > add a0,a0,a3
> > > > add a1,a1,a3
> > > > add a4,a4,a3
> > > > bne a2,zero,.L3
> > > > 
> > > > The 'vadd.vv' is only performing operations on a single register.
> > > > 
> > > > with --param=riscv-autovec-lmul=m8:
> > > > 
> > > >   vsetvli a5,a2,e8,m2,ta,ma
> > > >   vle32.v v16,0(a0)
> > > >   vle32.v v8,0(a1)
> > > >   vsetvli a6,zero,e32,m8,ta,ma
> > > >   slli a3,a5,2
> > > >   vadd.vv v8,v8,v16
> > > >   vsetvli zero,a2,e32,m8,ta,ma
> > > >   sub a2,a2,a5
> > > >   vse32.v v8,0(a4)
> > > >   add a0,a0,a3
> > > >   add a1,a1,a3
> > > >   add a4,a4,a3
> > > >   bne a2,zero,.L3
> > > > 
> > > > The 'vadd.vv' here is performing operations on 8 consecutive registers:
> > > > 
> > > > vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
> > > > 
> > > > Having users set the vectorization factor statically is not ideal.
> > > > 
> > > > We want GCC to dynamically choose the vectorization factor for 
> > > > auto-vectorization based on loop analysis.
> > > > 
> > > > Currently, I have implemented a simplistic analysis that computes the 
> > > > live range of each local decl in the current function.
> > > > 
> > > > The analysis works as follows. RVV has 32 vector registers, so from 
> > > > the live ranges of the current function's local decls we require:
> > > > 
> > > > the number of decls live at the same time * LMUL <= 32. 
> > > > 
> > > > Based on this analysis, I set the vectorization factor in 
> > > > TARGET_VECTORIZE_PREFERRED_SIMD_MODE.
> > > > 
> > > > This simplistic algorithm (implemented in the RISC-V backend) works 
> > > > well for the testcases I wrote.
> > > > 
> > > > However, I can only choose the optimal vectorization factor for a 
> > > > whole function, not for a specific loop.
> > > > 
> > > > Here is the example:
> > > > 
> > > > void foo2 (int32_t *__restrict a,
> > > >           int32_t *__restrict b,
> > > >           int32_t *__restrict c,
> > > >           int32_t *__restrict a2,
> > > >           int32_t *__restrict b2,
> > > >           int32_t *__restrict c2,
> > > >           int32_t *__restrict a3,
> > > >           int32_t *__restrict b3,
> > > >           int32_t *__restrict c3,
> > > >           int32_t *__restrict a4,
> > > >           int32_t *__restrict b4,
> > > >           int32_t *__restrict c4,
> > > >           int32_t *__restrict a5,
> > > >           int32_t *__restrict b5,
> > > >           int32_t *__restrict c5,
> > > >           int n)
> > > > {
> > > > // Loop 1
> > > >     for (int i = 0; i < n; i++)
> > > >        a[i] = a[i] + b[i];
> > > > // Loop 2
> > > >     for (int i = 0; i < n; i++){
> > > >       a[i] = b[i] + c[i];
> > > >       a2[i] = b2[i] + c2[i];
> > > >       a3[i] = b3[i] + c3[i];
> > > >       a4[i] = b4[i] + c4[i];
> > > >       a5[i] = a[i] + a4[i];
> > > >       a[i] = a3[i] + a2[i]+ a5[i];
> > > >     }
> > > > }
> > > > 
> > > > For Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should use 
> > > > LMUL = 4 (since LMUL = 8 would cause vector register spills).
> > > > 
> > > > If I split loop 1 and loop 2 into two separate functions, my algorithm 
> > > > works well.
> > > > 
> > > > However, if both loops are in the same function, I end up picking 
> > > > LMUL = 4 for both loop 1 and loop 2 because, as I said above, the 
> > > > analysis is done per function rather than per loop.
> > > > 
> > > > I am struggling to find a good solution for this issue. Can we pass 
> > > > loop_vec_info through to the 'preferred_simd_mode' target hook?
> > >  
> > > That's not how it's currently designed to work - there's
> > > the autovectorize_vector_modes hook where you should provide a vector
> > > of modes the vectorizer iterates over and return VECT_COMPARE_COSTS
> > > if you want to evaluate costs between choices.  Your analysis should
> > > then happen in the finish_cost method.
> > >  
> > > That's how it's currently designed.  It might not be optimal for
> > > compile-time reasons when there are many modes; giving the target
> > > more control (and context) might be possible.
> > >  
> > > Richard.
> > >  
> > > > Thanks.
> > > > 
> > > > 
> > > > juzhe.zh...@rivai.ai
> > > > 
> > >  
> > > 
> >  
> > 
>  
> 
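
For reference, the register-pressure rule quoted above (number of values
live at the same time * LMUL <= 32) can be written down as a small
standalone helper.  This is only a sketch under stated assumptions: the
function name max_lmul_for_live_values is hypothetical and not a GCC
function, the candidate LMULs are restricted to 8, 4, 2 and 1, and how
the live count is computed per loop, or how the result would feed into
the cost comparison, is not shown.

```cpp
#include <cassert>

// RVV provides 32 vector registers (v0 to v31).
static const unsigned NUM_VECTOR_REGS = 32;

// Hypothetical helper: return the largest LMUL in {8, 4, 2, 1} for
// which LIVE_VALUES * LMUL <= NUM_VECTOR_REGS.  Falls back to 1 when
// even LMUL = 1 does not fit, in which case spills are unavoidable.
unsigned
max_lmul_for_live_values (unsigned live_values)
{
  const unsigned candidates[] = { 8, 4, 2, 1 };
  for (unsigned lmul : candidates)
    // Same test as live_values * lmul <= 32, written without the
    // multiplication so it cannot overflow.
    if (live_values <= NUM_VECTOR_REGS / lmul)
      return lmul;

  // Only reached when more than 32 values are live at once.
  assert (live_values > NUM_VECTOR_REGS);
  return 1;
}
```

For example, 3 simultaneously live values allow LMUL = 8 (3 * 8 = 24),
while 5 live values only allow LMUL = 4 (5 * 8 = 40 exceeds 32, but
5 * 4 = 20 fits).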

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
