Thanks Richi.

I am trying to figure out how to adjust finish_cost to lower the LMUL.

For example:

void
foo (int32_t *__restrict a, int32_t *__restrict b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = a[i] + b[i];
}

preferred_simd_mode picks LMUL = 8 (RVVM8SImode).

Is it possible to adjust the cost in finish_cost so that the loop vectorizer
picks LMUL = 4?

I am experimenting with the following costs:

  if (loop_vinfo)
    {
      if (loop_vinfo->vector_mode == RVVM8SImode)
        {
          m_costs[vect_prologue] = 2;
          m_costs[vect_body] = 20;
          m_costs[vect_epilogue] = 2;
        }
      else
        {
          m_costs[vect_prologue] = 1;
          m_costs[vect_body] = 1;
          m_costs[vect_epilogue] = 1;
        }
    }

I increased the cost for LMUL = 8. The resulting codegen is odd:

foo:
ble a2,zero,.L12
addiw a5,a2,-1
li a4,30
sext.w t1,a2
bleu a5,a4,.L7
srliw a7,t1,5
slli a7,a7,7
li a4,32
add a7,a7,a0
mv a5,a0
mv a3,a1
vsetvli zero,a4,e32,m8,ta,ma
.L4:
vle32.v v8,0(a5)
vle32.v v16,0(a3)
vadd.vv v8,v8,v16
vse32.v v8,0(a5)
addi a5,a5,128
addi a3,a3,128
bne a5,a7,.L4
andi a2,a2,-32
beq t1,a2,.L14
.L3:
slli a4,a2,32
subw a5,t1,a2
srli a4,a4,32
slli a5,a5,32
slli a4,a4,2
srli a5,a5,32
add a0,a0,a4
add a1,a1,a4
vsetvli a4,a5,e8,m1,ta,ma
vle32.v v8,0(a0)
vle32.v v4,0(a1)
vsetvli a2,zero,e32,m4,ta,ma
vadd.vv v4,v4,v8
vsetvli zero,a5,e32,m4,ta,ma
vse32.v v4,0(a0)
sub a3,a5,a4
beq a5,a4,.L12
slli a4,a4,2
vsetvli zero,a3,e8,m1,ta,ma
add a0,a0,a4
add a1,a1,a4
vle32.v v4,0(a0)
vle32.v v8,0(a1)
vsetvli a2,zero,e32,m4,ta,ma
vadd.vv v4,v4,v8
vsetvli zero,a3,e32,m4,ta,ma
vse32.v v4,0(a0)
.L12:
ret
.L7:
li a2,0
j .L3
.L14:
ret

I was hoping it would generate code like this:

foo:
ble a2,zero,.L5
mv a4,a0
.L3:
vsetvli a5,a2,e32,m4,ta,ma
vle32.v v8,0(a0)
vle32.v v4,0(a1)
vsetvli a6,zero,e32,m4,ta,ma
slli a3,a5,2
vadd.vv v4,v4,v8
sub a2,a2,a5
vsetvli zero,a5,e32,m4,ta,ma
vse32.v v4,0(a4)
add a0,a0,a3
add a1,a1,a3
add a4,a4,a3
bne a2,zero,.L3
.L5:
ret

I am experimenting to see whether we can adjust the cost statically to make
the loop vectorizer use LMUL = 4 even though preferred_simd_mode returns
LMUL = 8. If we can do that, I think we can run an analysis and then adjust
the cost according to its result.
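
To make the kind of analysis meant here concrete, below is a minimal standalone
sketch of the register-pressure rule from the quoted message further down
(values live at the same time * LMUL <= 32). It is a toy model, not GCC code:
pick_lmul, its parameter, and kNumVectorRegs are hypothetical names, and it
assumes the number of simultaneously live vector values has already been
computed for the loop.

```cpp
#include <cassert>

// Toy model of the rule quoted in this thread:
//   (values live at the same time) * LMUL <= 32 vector registers.
// pick_lmul and kNumVectorRegs are hypothetical names, not GCC code.
const unsigned kNumVectorRegs = 32;

// Return the largest LMUL in {8, 4, 2, 1} for which `max_live` values
// fit in the register file at once.  Falls back to 1 when even
// LMUL = 1 does not fit, since this model has no smaller choice.
unsigned pick_lmul (unsigned max_live)
{
  const unsigned candidates[] = {8, 4, 2, 1};
  for (unsigned lmul : candidates)
    if (max_live * lmul <= kNumVectorRegs)
      return lmul;
  return 1;
}
```

Under this model, pick_lmul (2) gives 8 and pick_lmul (5) gives 4. The idea
would be to raise the cost of any candidate mode whose LMUL exceeds the value
the analysis returns for that loop.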

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-08-31 15:38
To: juzhe.zh...@rivai.ai
CC: gcc; richard.sandiford
Subject: Re: Question about dynamic choosing vectorization factor for RVV
On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
 
> Hi, Richard and Richi.
> 
> Currently, we statically return the vectorization factor in
> 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> according to a compile option.
> 
> For example:
> void
> foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> {
>   for (int i = 0; i < n; i++)
>     a[i] = a[i] + b[i];
> }
> 
> with --param=riscv-autovec-lmul=m1:
> 
> vsetvli a5,a2,e32,m1,ta,ma
> vle32.v v2,0(a0)
> vle32.v v1,0(a1)
> vsetvli a6,zero,e32,m1,ta,ma
> slli a3,a5,2
> vadd.vv v1,v1,v2
> sub a2,a2,a5
> vsetvli zero,a5,e32,m1,ta,ma
> vse32.v v1,0(a4)
> add a0,a0,a3
> add a1,a1,a3
> add a4,a4,a3
> bne a2,zero,.L3
> 
> The 'vadd.vv' is only performing operations on a single register.
> 
> with --param=riscv-autovec-lmul=m8:
> 
>   vsetvli a5,a2,e8,m2,ta,ma
>   vle32.v v16,0(a0)
>   vle32.v v8,0(a1)
>   vsetvli a6,zero,e32,m8,ta,ma
>   slli a3,a5,2
>   vadd.vv v8,v8,v16
>   vsetvli zero,a2,e32,m8,ta,ma
>   sub a2,a2,a5
>   vse32.v v8,0(a4)
>   add a0,a0,a3
>   add a1,a1,a3
>   add a4,a4,a3
>   bne a2,zero,.L3
> 
> The 'vadd.vv' here is performing operations on 8 consecutive registers:
> 
> vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
> 
> Having users statically set the vectorization factor is not ideal.
> 
> We want GCC to dynamically choose the vectorization factor for
> auto-vectorization according to loop analysis.
> 
> Currently, I have implemented a simplistic loop analysis that analyzes the
> live range of each local decl of the current function.
> 
> Here is the analysis: we have 32 vector registers for RVV, so we compute
> the live range of each local decl in the current function and require:
> 
> the number of decls live at the same time * LMUL <= 32.
> According to this analysis, I set the vectorization factor in
> TARGET_VECTORIZE_PREFERRED_SIMD_MODE.
> 
> This simplistic algorithm (implemented in the RISC-V backend) works well for
> the testcases I produced.
> 
> However, I can only choose the optimal vectorization factor for a whole
> function, not for a specific loop.
> 
> Here is the example:
> 
> void foo2 (int32_t *__restrict a,
>           int32_t *__restrict b,
>           int32_t *__restrict c,
>           int32_t *__restrict a2,
>           int32_t *__restrict b2,
>           int32_t *__restrict c2,
>           int32_t *__restrict a3,
>           int32_t *__restrict b3,
>           int32_t *__restrict c3,
>           int32_t *__restrict a4,
>           int32_t *__restrict b4,
>           int32_t *__restrict c4,
>           int32_t *__restrict a5,
>           int32_t *__restrict b5,
>           int32_t *__restrict c5,
>           int n)
> {
> // Loop 1
>     for (int i = 0; i < n; i++)
>        a[i] = a[i] + b[i];
> // Loop 2
>     for (int i = 0; i < n; i++){
>       a[i] = b[i] + c[i];
>       a2[i] = b2[i] + c2[i];
>       a3[i] = b3[i] + c3[i];
>       a4[i] = b4[i] + c4[i];
>       a5[i] = a[i] + a4[i];
>       a[i] = a3[i] + a2[i]+ a5[i];
>     }
> }
> 
> For Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should use LMUL = 4
> (since LMUL = 8 would cause vector register spills).
> 
> If I split loop 1 and loop 2 into 2 separate functions, my algorithm works
> well.
> 
> However, if we put these 2 loops in the same function, I end up picking
> LMUL = 4 for both loop 1 and loop 2 since, as I said above, I do the analysis
> per function, not per loop.
> 
> I am struggling to find a good solution for this issue. Can we pass
> loop_vec_info through to the 'preferred_simd_mode' target hook?
 
That's not how it's currently designed to work - there's
the autovectorize_vector_modes hook where you should provide a vector
of modes the vectorizer iterates over and return VECT_COMPARE_COST
if you want to evaluate costs between choices.  Your analysis should
then happen in the finish_cost method.
 
That's how it's currently designed.  It might not be optimal for
compile-time reasons when there are many modes; giving the target
more control (and context) might be possible.
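
For readers unfamiliar with this flow, here is a toy standalone model of the
"iterate over candidate modes and compare costs" idea described above. It does
not use GCC's real types or hooks: Candidate, pick_cheapest, and the VF and
cost numbers in the usage note are hypothetical, and the tie-breaking rule is
an assumption of this sketch. The vectorizer's real comparison takes more into
account than the per-iteration body cost used here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one vectorization attempt; not a GCC type.
struct Candidate
{
  unsigned lmul;       // register group size used by the mode
  unsigned vf;         // scalar iterations covered per vector iteration
  unsigned body_cost;  // cost reported for one vector loop iteration
};

// Return the index of the candidate with the lowest body cost per scalar
// iteration.  body_cost / vf is compared by cross-multiplying, so no
// floating point is needed.  Ties keep the earlier candidate (an
// assumption of this sketch).  Precondition: cands is not empty.
std::size_t pick_cheapest (const std::vector<Candidate> &cands)
{
  std::size_t best = 0;
  for (std::size_t i = 1; i < cands.size (); ++i)
    {
      unsigned long long lhs = 1ULL * cands[i].body_cost * cands[best].vf;
      unsigned long long rhs = 1ULL * cands[best].body_cost * cands[i].vf;
      if (lhs < rhs)
        best = i;
    }
  return best;
}
```

With illustrative candidates {8, 32, 20} and {4, 16, 1}, the second entry wins
because 1/16 is lower than 20/32. With {8, 32, 2} and {4, 16, 1} the costs per
scalar iteration are equal, so the first entry is kept.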
 
Richard.
 
> Thanks.
> 
> 
> juzhe.zh...@rivai.ai
> 
 
-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
 
