On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:

> Thanks Richi.
> 
> I am trying to figure out how to adjust finish_cost to lower the LMUL.
> 
> For example:
> 
> void
> foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> {
>   for (int i = 0; i < n; i++)
>     a[i] = a[i] + b[i];
> }
> 
> preferred_simd_mode picks LMUL = 8 (RVVM8SImode).
> 
> Is it possible to adjust the cost in finish_cost so that the loop
> vectorizer picks LMUL = 4?

I see you have an autovectorize_vector_modes hook and you use
VECT_COMPARE_COSTS.  So the appropriate place would be to
amend your vector_costs::better_main_loop_than_p.
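
As a rough illustration of the kind of comparison meant here, the
following is a self-contained toy model, not GCC code.  The struct
candidate_cost, its fields, and the free function below are hypothetical
stand-ins for whatever the target's cost class records.  The sketch also
assumes, for simplicity, that the body cost is per vector iteration and
that the number of elements processed per iteration scales with LMUL.

```cpp
#include <cassert>

// Toy stand-in for per-candidate data a target cost class could record.
// None of these names are GCC's; they only illustrate the comparison.
struct candidate_cost
{
  int lmul;          // register group size of the candidate mode (1, 2, 4, 8)
  int live_vectors;  // peak number of vector values live in the loop body
  int body_cost;     // estimated cost of one vector iteration
};

constexpr int num_vector_regs = 32;  // RVV has 32 vector registers

// True if the candidate needs more vector registers than are available.
static bool
spills_p (const candidate_cost &c)
{
  return c.live_vectors * c.lmul > num_vector_regs;
}

// Return true if THIS_ONE should be preferred over OTHER as the main loop.
static bool
better_main_loop_than_p (const candidate_cost &this_one,
                         const candidate_cost &other)
{
  assert (this_one.lmul > 0 && other.lmul > 0);

  // A candidate that fits in the register file beats one that spills.
  if (spills_p (this_one) != spills_p (other))
    return !spills_p (this_one);

  // Otherwise compare cost per element, cross-multiplied to stay in
  // integers (elements per iteration are assumed proportional to LMUL).
  return this_one.body_cost * other.lmul < other.body_cost * this_one.lmul;
}
```

With two live vectors the LMUL = 8 candidate wins on cost per element.
With five live vectors LMUL = 8 would exceed the register file, so the
LMUL = 4 candidate is preferred.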

> I am experimenting with the following cost:
> 
>   if (loop_vinfo)
>     {
>       if (loop_vinfo->vector_mode == RVVM8SImode)
>         {
>           m_costs[vect_prologue] = 2;
>           m_costs[vect_body] = 20;
>           m_costs[vect_epilogue] = 2;
>         }
>       else
>         {
>           m_costs[vect_prologue] = 1;
>           m_costs[vect_body] = 1;
>           m_costs[vect_epilogue] = 1;
>         }
>     }
> 
> I increased the cost for LMUL = 8. The resulting codegen is odd:
> 
> foo:
> ble a2,zero,.L12
> addiw a5,a2,-1
> li a4,30
> sext.w t1,a2
> bleu a5,a4,.L7
> srliw a7,t1,5
> slli a7,a7,7
> li a4,32
> add a7,a7,a0
> mv a5,a0
> mv a3,a1
> vsetvli zero,a4,e32,m8,ta,ma
> .L4:
> vle32.v v8,0(a5)
> vle32.v v16,0(a3)
> vadd.vv v8,v8,v16
> vse32.v v8,0(a5)
> addi a5,a5,128
> addi a3,a3,128
> bne a5,a7,.L4
> andi a2,a2,-32
> beq t1,a2,.L14
> .L3:
> slli a4,a2,32
> subw a5,t1,a2
> srli a4,a4,32
> slli a5,a5,32
> slli a4,a4,2
> srli a5,a5,32
> add a0,a0,a4
> add a1,a1,a4
> vsetvli a4,a5,e8,m1,ta,ma
> vle32.v v8,0(a0)
> vle32.v v4,0(a1)
> vsetvli a2,zero,e32,m4,ta,ma
> vadd.vv v4,v4,v8
> vsetvli zero,a5,e32,m4,ta,ma
> vse32.v v4,0(a0)
> sub a3,a5,a4
> beq a5,a4,.L12
> slli a4,a4,2
> vsetvli zero,a3,e8,m1,ta,ma
> add a0,a0,a4
> add a1,a1,a4
> vle32.v v4,0(a0)
> vle32.v v8,0(a1)
> vsetvli a2,zero,e32,m4,ta,ma
> vadd.vv v4,v4,v8
> vsetvli zero,a3,e32,m4,ta,ma
> vse32.v v4,0(a0)
> .L12:
> ret
> .L7:
> li a2,0
> j .L3
> .L14:
> ret
> 
> I hope it can generate code like this:
> 
> foo:
> ble a2,zero,.L5
> mv a4,a0
> .L3:
> vsetvli a5,a2,e32,m4,ta,ma
> vle32.v v8,0(a0)
> vle32.v v4,0(a1)
> vsetvli a6,zero,e32,m4,ta,ma
> slli a3,a5,2
> vadd.vv v4,v4,v8
> sub a2,a2,a5
> vsetvli zero,a5,e32,m4,ta,ma
> vse32.v v4,0(a4)
> add a0,a0,a3
> add a1,a1,a3
> add a4,a4,a3
> bne a2,zero,.L3
> .L5:
> ret
> 
> I am experimenting with whether we can adjust the cost statically to make
> the loop vectorizer use LMUL = 4 even though preferred_simd_mode returns
> LMUL = 8. If we can do that, I think we can run the analysis and then
> adjust the cost according to its result.
>
> Thanks.
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Biener
> Date: 2023-08-31 15:38
> To: juzhe.zh...@rivai.ai
> CC: gcc; richard.sandiford
> Subject: Re: Question about dynamic choosing vectorization factor for RVV
> On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
>  
> > Hi, Richard and Richi.
> > 
> > Currently, we statically return the vectorization factor in
> > 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> > according to the compile option.
> > 
> > For example:
> > void
> > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > {
> >   for (int i = 0; i < n; i++)
> >     a[i] = a[i] + b[i];
> > }
> > 
> > with --param=riscv-autovec-lmul=m1:
> > 
> > vsetvli a5,a2,e32,m1,ta,ma
> > vle32.v v2,0(a0)
> > vle32.v v1,0(a1)
> > vsetvli a6,zero,e32,m1,ta,ma
> > slli a3,a5,2
> > vadd.vv v1,v1,v2
> > sub a2,a2,a5
> > vsetvli zero,a5,e32,m1,ta,ma
> > vse32.v v1,0(a4)
> > add a0,a0,a3
> > add a1,a1,a3
> > add a4,a4,a3
> > bne a2,zero,.L3
> > 
> > The 'vadd.vv' is only performing operations on a single register.
> > 
> > with --param=riscv-autovec-lmul=m8:
> > 
> >   vsetvli a5,a2,e8,m2,ta,ma
> >   vle32.v v16,0(a0)
> >   vle32.v v8,0(a1)
> >   vsetvli a6,zero,e32,m8,ta,ma
> >   slli a3,a5,2
> >   vadd.vv v8,v8,v16
> >   vsetvli zero,a2,e32,m8,ta,ma
> >   sub a2,a2,a5
> >   vse32.v v8,0(a4)
> >   add a0,a0,a3
> >   add a1,a1,a3
> >   add a4,a4,a3
> >   bne a2,zero,.L3
> > 
> > The 'vadd.vv' here is performing operations on 8 consecutive registers:
> > 
> > vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
> > 
> > Having users set the vectorization factor statically is not ideal.
> > 
> > We want GCC to choose the vectorization factor dynamically for
> > auto-vectorization, according to loop analysis.
> > 
> > Currently, I have implemented a simplistic loop analysis that analyzes
> > the live range of each local decl of the current function.
> > 
> > Here is the analysis: we have 32 vector registers for RVV.
> > So we calculate the live ranges of the current function's local decls
> > and require:
> > 
> > the number of decls live at the same time * LMUL <= 32.
> > According to this analysis, I set the vectorization factor in
> > TARGET_VECTORIZE_PREFERRED_SIMD_MODE.
> > 
> > This simplistic algorithm (implemented in the RISC-V backend) works well
> > for the testcases I produced.
> > 
> > However, I can only choose the optimal vectorization factor for a whole
> > function, not for a specific loop.
> > 
> > Here is the example:
> > 
> > void foo2 (int32_t *__restrict a,
> >           int32_t *__restrict b,
> >           int32_t *__restrict c,
> >           int32_t *__restrict a2,
> >           int32_t *__restrict b2,
> >           int32_t *__restrict c2,
> >           int32_t *__restrict a3,
> >           int32_t *__restrict b3,
> >           int32_t *__restrict c3,
> >           int32_t *__restrict a4,
> >           int32_t *__restrict b4,
> >           int32_t *__restrict c4,
> >           int32_t *__restrict a5,
> >           int32_t *__restrict b5,
> >           int32_t *__restrict c5,
> >           int n)
> > {
> > // Loop 1
> >     for (int i = 0; i < n; i++)
> >        a[i] = a[i] + b[i];
> > // Loop 2
> >     for (int i = 0; i < n; i++){
> >       a[i] = b[i] + c[i];
> >       a2[i] = b2[i] + c2[i];
> >       a3[i] = b3[i] + c3[i];
> >       a4[i] = b4[i] + c4[i];
> >       a5[i] = a[i] + a4[i];
> >       a[i] = a3[i] + a2[i]+ a5[i];
> >     }
> > }
> > 
> > For Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should choose
> > LMUL = 4 (since LMUL = 8 will cause vector register spills).
> > 
> > If I split loop 1 and loop 2 into 2 separate functions, my algorithm works
> > well.
> > 
> > However, if we put these 2 loops in the same function, I end up picking
> > LMUL = 4 for both loop 1 and loop 2 since, as I said above, I do the
> > analysis per function, not per loop.
> > 
> > I am struggling to find a good solution for this issue. Can we
> > pass loop_vec_info
> > to the 'preferred_simd_mode' target hook?
>  
> That's not how it's currently designed to work - there's
> the autovectorize_vector_modes hook where you should provide a vector
> of modes the vectorizer iterates over and return VECT_COMPARE_COSTS
> if you want to evaluate costs between choices.  Your analysis should
> then happen in the finish_cost method.
>  
> That's how it's currently designed.  It might not be optimal for
> compile-time reasons when there are many modes; giving the target
> more control (and context) might be possible.
>  
> Richard.
>  
> > Thanks.
> > 
> > 
> > juzhe.zh...@rivai.ai
> > 
>  
> 
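
For reference, the register-budget rule quoted above (number of decls
live at the same time * LMUL <= 32) can be written down as a small
self-contained sketch.  This is not the RISC-V backend's code: the
function names and the live counts used as inputs are hypothetical.  It
only contrasts applying the rule per loop with applying it once per
function.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

constexpr int rvv_vector_regs = 32;  // RVV has 32 vector registers

// Largest LMUL in {8, 4, 2, 1} such that live_vectors * LMUL <= 32.
static int
max_lmul_for (int live_vectors)
{
  assert (live_vectors > 0);
  for (int lmul : {8, 4, 2, 1})
    if (live_vectors * lmul <= rvv_vector_regs)
      return lmul;
  return 1;  // Even LMUL = 1 would spill; there is nothing smaller to try.
}

// Function-wide choice: the loop with the most live vectors decides the
// LMUL for every loop in the function.
static int
function_wide_lmul (const std::vector<int> &live_per_loop)
{
  assert (!live_per_loop.empty ());
  int worst = *std::max_element (live_per_loop.begin (),
                                 live_per_loop.end ());
  return max_lmul_for (worst);
}
```

Given made-up live counts of 2 for one loop and 5 for another, the
per-loop rule yields LMUL = 8 and LMUL = 4 respectively, while the
function-wide rule yields LMUL = 4 for both.  That is the behaviour
described for foo2 above.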

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
