Hi, Richard and Richi.

Currently, the vectorization factor we return from
'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
is fixed statically by a compile option.

For example:
void
foo (int32_t *__restrict a, int32_t *__restrict b, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = a[i] + b[i];
}

with --param=riscv-autovec-lmul=m1:

vsetvli a5,a2,e32,m1,ta,ma
vle32.v v2,0(a0)
vle32.v v1,0(a1)
vsetvli a6,zero,e32,m1,ta,ma
slli a3,a5,2
vadd.vv v1,v1,v2
sub a2,a2,a5
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v1,0(a4)
add a0,a0,a3
add a1,a1,a3
add a4,a4,a3
bne a2,zero,.L3

The 'vadd.vv' is only performing operations on a single register.

with --param=riscv-autovec-lmul=m8:

  vsetvli a5,a2,e8,m2,ta,ma
  vle32.v v16,0(a0)
  vle32.v v8,0(a1)
  vsetvli a6,zero,e32,m8,ta,ma
  slli a3,a5,2
  vadd.vv v8,v8,v16
  vsetvli zero,a2,e32,m8,ta,ma
  sub a2,a2,a5
  vse32.v v8,0(a4)
  add a0,a0,a3
  add a1,a1,a3
  add a4,a4,a3
  bne a2,zero,.L3

The 'vadd.vv' here is performing operations on 8 consecutive registers:

vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]

Having users set the vectorization factor statically is not ideal.

We want GCC to choose the vectorization factor dynamically, based on loop
analysis, when it auto-vectorizes.

So far I have implemented a simplistic analysis that computes the live range
of each local decl of the current function.

The analysis works as follows. RVV has 32 vector registers, so from the live
ranges of the current function's local decls I pick an LMUL that satisfies:

(number of decls live at the same time) * LMUL <= 32

Based on this analysis, I set the vectorization factor in
TARGET_VECTORIZE_PREFERRED_SIMD_MODE.
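
The constraint above can be sketched roughly as follows. This is only an
illustration, not the actual backend code: choose_lmul and RVV_NUM_VREGS are
hypothetical names, and I am assuming we want the largest LMUL in {8, 4, 2, 1}
that still satisfies the inequality.

```c
#include <assert.h>

/* RVV provides 32 architectural vector registers.  */
#define RVV_NUM_VREGS 32

/* Hypothetical helper (not a real GCC function): return the largest LMUL
   in {8, 4, 2, 1} such that MAX_LIVE * LMUL <= 32, where MAX_LIVE is the
   peak number of vector values live at the same time.  Falls back to
   LMUL = 1 when even that exceeds the register file.  */
int
choose_lmul (int max_live)
{
  assert (max_live >= 0);
  for (int lmul = 8; lmul > 1; lmul /= 2)
    if (max_live * lmul <= RVV_NUM_VREGS)
      return lmul;
  return 1;
}
```

For example, with at most 4 values live at once this picks LMUL = 8, and
with 5 to 8 values live it picks LMUL = 4.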

This simplistic algorithm (implemented in the RISC-V backend) works well for
the testcases I have produced.

However, it can only choose the optimal vectorization factor for a whole
function; it fails to do so for a specific loop.

Here is the example:

void foo2 (int32_t *__restrict a,
          int32_t *__restrict b,
          int32_t *__restrict c,
          int32_t *__restrict a2,
          int32_t *__restrict b2,
          int32_t *__restrict c2,
          int32_t *__restrict a3,
          int32_t *__restrict b3,
          int32_t *__restrict c3,
          int32_t *__restrict a4,
          int32_t *__restrict b4,
          int32_t *__restrict c4,
          int32_t *__restrict a5,
          int32_t *__restrict b5,
          int32_t *__restrict c5,
          int n)
{
// Loop 1
    for (int i = 0; i < n; i++)
       a[i] = a[i] + b[i];
// Loop 2
    for (int i = 0; i < n; i++){
      a[i] = b[i] + c[i];
      a2[i] = b2[i] + c2[i];
      a3[i] = b3[i] + c3[i];
      a4[i] = b4[i] + c4[i];
      a5[i] = a[i] + a4[i];
      a[i] = a3[i] + a2[i]+ a5[i];
    }
}

For Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should choose
LMUL = 4 (since LMUL = 8 would cause vector register spills).

If I split loop 1 and loop 2 into 2 separate functions, my algorithm works well.

However, if both loops are in the same function, I end up picking LMUL = 4
for both Loop 1 and Loop 2 because, as I said above, the analysis is done per
function, not per loop.
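
To make the difference concrete, here is a small sketch comparing the two
granularities. All names are hypothetical, and the per-loop live counts used
in the usage note (2 for Loop 1, 6 for Loop 2) are illustrative assumptions,
not numbers measured from the compiler.

```c
#include <assert.h>

#define RVV_NUM_VREGS 32

/* Hypothetical helper: largest LMUL in {8, 4, 2, 1} such that
   MAX_LIVE * LMUL <= 32, falling back to 1.  */
int
lmul_for_live (int max_live)
{
  assert (max_live >= 0);
  for (int lmul = 8; lmul > 1; lmul /= 2)
    if (max_live * lmul <= RVV_NUM_VREGS)
      return lmul;
  return 1;
}

/* Function-level analysis: a single LMUL derived from the peak live
   count over all loops, so the most register-hungry loop decides.  */
int
lmul_per_function (const int *loop_live, int nloops)
{
  int peak = 0;
  for (int i = 0; i < nloops; i++)
    if (loop_live[i] > peak)
      peak = loop_live[i];
  return lmul_for_live (peak);
}

/* Loop-level analysis: each loop gets an LMUL from its own live count.  */
int
lmul_per_loop (const int *loop_live, int loop)
{
  return lmul_for_live (loop_live[loop]);
}
```

With the assumed counts { 2, 6 }, lmul_per_function gives 4 for the whole
function, while lmul_per_loop gives 8 for Loop 1 and 4 for Loop 2. The
per-loop choice is what I would like to be able to make.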

I am struggling to find a good solution for this issue. Could we pass
loop_vec_info through to the 'preferred_simd_mode' target hook?

Thanks.


juzhe.zh...@rivai.ai
