Can I push code to GCC?

2021-05-10 Thread juzhe.zh...@rivai.ai
Hi, I am a compiler engineer working on GCC in China. Recently, I developed a 
new pattern for a vector widening multiply-accumulate (I call it widen fma), 
because the ISA in my project has such an instruction (vwmacc). I developed a 
fair amount of code, especially in the front end (GIMPLE and GENERIC) of GCC. 
This is useful for my project, but I am not sure whether it is useful for GCC 
overall. Can I submit the code? Then maybe you can review it and refine it to a 
better quality. 
Thank you!  



juzhe.zh...@rivai.ai


Re: Ju-Zhe Zhong and Robin Dapp as RISC-V reviewers

2023-07-17 Thread juzhe.zh...@rivai.ai
Thanks, Jeff. 
I will wait until Robin has updated his MAINTAINERS entry (since I don't know 
what information I need to put in).



juzhe.zh...@rivai.ai
 
From: Jeff Law
Date: 2023-07-18 00:54
To: GCC Development
CC: juzhe.zh...@rivai.ai; Robin Dapp
Subject: Ju-Zhe Zhong and Robin Dapp as RISC-V reviewers
I am pleased to announce that the GCC Steering Committee has appointed 
Ju-Zhe Zhong and Robin Dapp as reviewers for the RISC-V port.
 
Ju-Zhe and Robin, can you both update your MAINTAINERS entries appropriately?
 
Thanks,
Jeff
 


Question about wrapv-vect-reduc-dot-s8b.c

2023-08-30 Thread juzhe.zh...@rivai.ai
Hi, I have started enabling the "vect" testsuite for RISC-V.

I have a question from analyzing 'wrapv-vect-reduc-dot-s8b.c'.
It fails at:
FAIL: gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c scan-tree-dump-times vect 
"vect_recog_dot_prod_pattern: detected" 1
FAIL: gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c scan-tree-dump-times vect 
"vect_recog_widen_mult_pattern: detected" 1

They are each found "2" times instead.

That is because the first analysis attempt fails at vectorizing the conversion:

wrapv-vect-reduc-dot-s8b.c:29:14: missed:   conversion not supported by target.
wrapv-vect-reduc-dot-s8b.c:29:14: note:   vect_is_simple_use: operand X[i_14], 
type of def: internal
wrapv-vect-reduc-dot-s8b.c:29:14: note:   vect_is_simple_use: vectype 
vector([16,16]) signed char
wrapv-vect-reduc-dot-s8b.c:29:14: note:   vect_is_simple_use: operand X[i_14], 
type of def: internal
wrapv-vect-reduc-dot-s8b.c:29:14: note:   vect_is_simple_use: vectype 
vector([16,16]) signed char
wrapv-vect-reduc-dot-s8b.c:30:17: missed:   not vectorized: relevant stmt not 
supported: _2 = (short int) _1;
wrapv-vect-reduc-dot-s8b.c:29:14: missed:  bad operation or unsupported loop 
bound.

Here the loop vectorizer is trying to do the conversion from char -> short with 
the same nunits on both sides.
But we don't support the 'vec_unpack' patterns in the RISC-V backend, since I 
don't see a case where vec_unpack can improve the codegen of auto-vectorization 
for RVV.

To fix it, is it necessary to support 'vec_unpack'?

Thanks.


juzhe.zh...@rivai.ai


Re: Re: Question about wrapv-vect-reduc-dot-s8b.c

2023-08-30 Thread juzhe.zh...@rivai.ai
Thanks Richi.

>> both same units would be sext, not vec_unpacks_{lo,hi} - the vectorizer
Sorry, I made a mistake here; they do not have the same nunits.

wrapv-vect-reduc-dot-s8b.c:29:14: note:   get vectype for scalar type:  short 
int
wrapv-vect-reduc-dot-s8b.c:29:14: note:   vectype: vector([8,8]) short int
wrapv-vect-reduc-dot-s8b.c:29:14: note:   nunits = [8,8]
wrapv-vect-reduc-dot-s8b.c:29:14: note:   ==> examining statement: _1 = X[i_14];
wrapv-vect-reduc-dot-s8b.c:29:14: note:   precomputed vectype: vector([16,16]) 
signed char
wrapv-vect-reduc-dot-s8b.c:29:14: note:   nunits = [16,16]
wrapv-vect-reduc-dot-s8b.c:29:14: note:   ==> examining statement: _2 = (short 
int) _1;
wrapv-vect-reduc-dot-s8b.c:29:14: note:   get vectype for scalar type: short int
wrapv-vect-reduc-dot-s8b.c:29:14: note:   vectype: vector([8,8]) short int
wrapv-vect-reduc-dot-s8b.c:29:14: note:   get vectype for smallest scalar type: 
signed char
wrapv-vect-reduc-dot-s8b.c:29:14: note:   nunits vectype: vector([16,16]) 
signed char
wrapv-vect-reduc-dot-s8b.c:29:14: note:   nunits = [16,16]

It turns out that at the first analysis attempt, _2 picks vector([8,8]) short 
int while _1 picks vector([16,16]) signed char.

It seems that the first analysis attempt fails because we don't support 
vec_unpacks?
Then we end up matching these 2 checks "2" times:

> FAIL: gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c scan-tree-dump-times vect 
> "vect_recog_dot_prod_pattern: detected" 1
> FAIL: gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c scan-tree-dump-times vect 
> "vect_recog_widen_mult_pattern: detected" 1



juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-08-30 15:45
To: juzhe.zh...@rivai.ai
CC: gcc; Robin Dapp
Subject: Re: Question about wrapv-vect-reduc-dot-s8b.c
On Wed, 30 Aug 2023, juzhe.zh...@rivai.ai wrote:
 
> Hi, I start to enable "vect" testsuite for RISC-V.
> 
> I have a question when analyzing the 'wrapv-vect-reduc-dot-s8b.c'
> It failed at:
> FAIL: gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c scan-tree-dump-times vect 
> "vect_recog_dot_prod_pattern: detected" 1
> FAIL: gcc.dg/vect/wrapv-vect-reduc-dot-s8b.c scan-tree-dump-times vect 
> "vect_recog_widen_mult_pattern: detected" 1
> 
> They are found "2" times.
> 
> Since at the first time, it failed at the vectorization of conversion:
> 
> wrapv-vect-reduc-dot-s8b.c:29:14: missed:   conversion not supported by 
> target.
> wrapv-vect-reduc-dot-s8b.c:29:14: note:   vect_is_simple_use: operand 
> X[i_14], type of def: internal
> wrapv-vect-reduc-dot-s8b.c:29:14: note:   vect_is_simple_use: vectype 
> vector([16,16]) signed char
> wrapv-vect-reduc-dot-s8b.c:29:14: note:   vect_is_simple_use: operand 
> X[i_14], type of def: internal
> wrapv-vect-reduc-dot-s8b.c:29:14: note:   vect_is_simple_use: vectype 
> vector([16,16]) signed char
> wrapv-vect-reduc-dot-s8b.c:30:17: missed:   not vectorized: relevant stmt not 
> supported: _2 = (short int) _1;
> wrapv-vect-reduc-dot-s8b.c:29:14: missed:  bad operation or unsupported loop 
> bound.
> 
> Here loop vectorizer is trying to do the conversion from char -> short with 
> both same nunits.
> But we don't support 'vec_unpack' stuff in RISC-V backend since I don't see 
> the case that vec_unpack can optimize the codegen of autovectorizatio for RVV.
> 
> To fix it, is it necessary to support 'vec_unpack' ?
 
both same units would be sext, not vec_unpacks_{lo,hi} - the vectorizer
ties its hands by choosing vector types early and based on the number
of incoming/outgoing vectors it chooses one or the other method.
 
More precise dumping would probably help here but somewhere earlier you
should be able to see the vector type used for _2
 
Richard.
 


Re: Re: Question about wrapv-vect-reduc-dot-s8b.c

2023-08-30 Thread juzhe.zh...@rivai.ai
I am wondering whether there are situations where the 
vec_pack/vec_unpack/vec_widen_xxx/dot_prod patterns can be beneficial for RVV?
I once hit a situation where vec_unpack was beneficial while working on 
SELECT_VL, but I don't remember which case....



juzhe.zh...@rivai.ai
 
From: Robin Dapp
Date: 2023-08-30 16:06
To: Richard Biener; juzhe.zh...@rivai.ai
CC: rdapp.gcc; gcc
Subject: Re: Question about wrapv-vect-reduc-dot-s8b.c
>> To fix it, is it necessary to support 'vec_unpack' ?
> 
> both same units would be sext, not vec_unpacks_{lo,hi} - the vectorizer
> ties its hands by choosing vector types early and based on the number
> of incoming/outgoing vectors it chooses one or the other method.
> 
> More precise dumping would probably help here but somewhere earlier you
> should be able to see the vector type used for _2
We usually try with a "normal" mode like VNx4SI (RVVM1SI or so) and
then switch to VNx4QI (i.e. a mode that only determines the number of
units/elements) and have vectorize_related_mode return modes with the
same number of units.  This will then result in the sext/zext patterns
matching.  The first round where we try the normal mode will not match
those because the related mode has a different number of units.
 
So it's somewhat expected that the first try fails.
 
My dump shows that we vectorize, so IMHO no problem.  I can take a look
at this but it doesn't look like a case for pack/unpack.  
 
Regards
Robin
 


Question about dynamic choosing vectorization factor for RVV

2023-08-30 Thread juzhe.zh...@rivai.ai
Hi, Richard and Richi.

Currently, we statically return the vectorization factor in 
'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
according to the compile option.

For example:
void
foo (int32_t *__restrict a, int32_t *__restrict b, int n)
{
  for (int i = 0; i < n; i++)
a[i] = a[i] + b[i];
}

with --param=riscv-autovec-lmul=m1:

vsetvli a5,a2,e32,m1,ta,ma
vle32.v v2,0(a0)
vle32.v v1,0(a1)
vsetvli a6,zero,e32,m1,ta,ma
slli a3,a5,2
vadd.vv v1,v1,v2
sub a2,a2,a5
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v1,0(a4)
add a0,a0,a3
add a1,a1,a3
add a4,a4,a3
bne a2,zero,.L3

The 'vadd.vv' is only performing operations on a single register.

with --param=riscv-autovec-lmul=m8:

  vsetvli a5,a2,e8,m2,ta,ma
  vle32.v v16,0(a0)
  vle32.v v8,0(a1)
  vsetvli a6,zero,e32,m8,ta,ma
  slli a3,a5,2
  vadd.vv v8,v8,v16
  vsetvli zero,a2,e32,m8,ta,ma
  sub a2,a2,a5
  vse32.v v8,0(a4)
  add a0,a0,a3
  add a1,a1,a3
  add a4,a4,a3
  bne a2,zero,.L3

The 'vadd.vv' here is performing operations on 8 consecutive registers:

vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]

Having users statically set the vectorization factor is not ideal.

We want GCC to dynamically choose the vectorization factor for 
auto-vectorization according to loop analysis.

Currently, I have implemented a simplistic loop analysis that analyzes the live 
range of each local decl of the current function.

Here is the analysis: we have 32 vector registers for RVV, so we compute the 
live ranges of the current function's local decls and require:

the number of decls live at the same time * LMUL <= 32. 
According to this analysis, I set the vectorization factor in 
TARGET_VECTORIZE_PREFERRED_SIMD_MODE.

This simplistic algorithm (implemented in the RISC-V backend) works well for 
the testcases I produced.

However, I can only choose the optimal vectorization factor for the whole 
function, not for a specific loop.

Here is the example:

void foo2 (int32_t *__restrict a,
  int32_t *__restrict b,
  int32_t *__restrict c,
  int32_t *__restrict a2,
  int32_t *__restrict b2,
  int32_t *__restrict c2,
  int32_t *__restrict a3,
  int32_t *__restrict b3,
  int32_t *__restrict c3,
  int32_t *__restrict a4,
  int32_t *__restrict b4,
  int32_t *__restrict c4,
  int32_t *__restrict a5,
  int32_t *__restrict b5,
  int32_t *__restrict c5,
  int n)
{
// Loop 1
for (int i = 0; i < n; i++)
   a[i] = a[i] + b[i];
// Loop 2
for (int i = 0; i < n; i++){
  a[i] = b[i] + c[i];
  a2[i] = b2[i] + c2[i];
  a3[i] = b3[i] + c3[i];
  a4[i] = b4[i] + c4[i];
  a5[i] = a[i] + a4[i];
  a[i] = a3[i] + a2[i]+ a5[i];
}
}

For Loop 1 we can aggressively choose LMUL = 8, but Loop 2 should choose 
LMUL = 4 (since LMUL = 8 will cause vector register spills).

If I split loop 1 and loop 2 into 2 separate functions, my algorithm works well.

However, if we put these 2 loops in the same function, I end up picking LMUL = 4 
for both loop 1 and loop 2 since, as I said above, I do the analysis per 
function rather than per loop.

I am wondering whether we could find a good approach for this issue. Can we pass 
the loop_vec_info through to the 'preferred_simd_mode' target hook?

Thanks.


juzhe.zh...@rivai.ai


Re: Re: Question about dynamic choosing vectorization factor for RVV

2023-08-31 Thread juzhe.zh...@rivai.ai
Thanks Richi.

I am trying to figure out how to adjust finish_cost to lower the LMUL.

For example:

void
foo (int32_t *__restrict a, int32_t *__restrict b, int n)
{
  for (int i = 0; i < n; i++)
a[i] = a[i] + b[i];
}

preferred_simd_mode picks LMUL = 8 (RVVM8SImode).

Is it possible to adjust the cost in finish_cost to make the loop vectorizer 
pick LMUL = 4?

I am experimenting with this following cost:

  if (loop_vinfo)
{
  if (loop_vinfo->vector_mode == RVVM8SImode)
{
  m_costs[vect_prologue] = 2;
  m_costs[vect_body] = 20;
  m_costs[vect_epilogue] = 2;
}
  else
{
  m_costs[vect_prologue] = 1;
  m_costs[vect_body] = 1;
  m_costs[vect_epilogue] = 1;
}
}

I increased the LMUL = 8 cost, but the codegen is odd:

foo:
ble a2,zero,.L12
addiw a5,a2,-1
li a4,30
sext.w t1,a2
bleu a5,a4,.L7
srliw a7,t1,5
slli a7,a7,7
li a4,32
add a7,a7,a0
mv a5,a0
mv a3,a1
vsetvli zero,a4,e32,m8,ta,ma
.L4:
vle32.v v8,0(a5)
vle32.v v16,0(a3)
vadd.vv v8,v8,v16
vse32.v v8,0(a5)
addi a5,a5,128
addi a3,a3,128
bne a5,a7,.L4
andi a2,a2,-32
beq t1,a2,.L14
.L3:
slli a4,a2,32
subw a5,t1,a2
srli a4,a4,32
slli a5,a5,32
slli a4,a4,2
srli a5,a5,32
add a0,a0,a4
add a1,a1,a4
vsetvli a4,a5,e8,m1,ta,ma
vle32.v v8,0(a0)
vle32.v v4,0(a1)
vsetvli a2,zero,e32,m4,ta,ma
vadd.vv v4,v4,v8
vsetvli zero,a5,e32,m4,ta,ma
vse32.v v4,0(a0)
sub a3,a5,a4
beq a5,a4,.L12
slli a4,a4,2
vsetvli zero,a3,e8,m1,ta,ma
add a0,a0,a4
add a1,a1,a4
vle32.v v4,0(a0)
vle32.v v8,0(a1)
vsetvli a2,zero,e32,m4,ta,ma
vadd.vv v4,v4,v8
vsetvli zero,a3,e32,m4,ta,ma
vse32.v v4,0(a0)
.L12:
ret
.L7:
li a2,0
j .L3
.L14:
ret

I hope it can generate the code like this:

foo:
ble a2,zero,.L5
mv a4,a0
.L3:
vsetvli a5,a2,e32,m4,ta,ma
vle32.v v8,0(a0)
vle32.v v4,0(a1)
vsetvli a6,zero,e32,m4,ta,ma
slli a3,a5,2
vadd.vv v4,v4,v8
sub a2,a2,a5
vsetvli zero,a5,e32,m4,ta,ma
vse32.v v4,0(a4)
add a0,a0,a3
add a1,a1,a3
add a4,a4,a3
bne a2,zero,.L3
.L5:
ret

I am experimenting with whether we can adjust the cost statically to make the 
loop vectorizer use LMUL = 4 even though preferred_simd_mode returns LMUL = 8.
If we can do that, I think we can run the analysis and then adjust the cost 
according to it.

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-08-31 15:38
To: juzhe.zh...@rivai.ai
CC: gcc; richard.sandiford
Subject: Re: Question about dynamic choosing vectorization factor for RVV
On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
 
> Hi, Richard and Richi.
> 
> Currently, we are statically returning vectorization factor in 
> 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> according to compile option.
> 
> For example:
> void
> foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> {
>   for (int i = 0; i < n; i++)
> a[i] = a[i] + b[i];
> }
> 
> with --param=riscv-autovec-lmul = m1:
> 
> vsetvli a5,a2,e32,m1,ta,ma
> vle32.v v2,0(a0)
> vle32.v v1,0(a1)
> vsetvli a6,zero,e32,m1,ta,ma
> slli a3,a5,2
> vadd.vv v1,v1,v2
> sub a2,a2,a5
> vsetvli zero,a5,e32,m1,ta,ma
> vse32.v v1,0(a4)
> add a0,a0,a3
> add a1,a1,a3
> add a4,a4,a3
> bne a2,zero,.L3
> 
> The 'vadd.vv' is only performing operations on a single register.
> 
> with --param=riscv-autovec-lmul=m8:
> 
>   vsetvli a5,a2,e8,m2,ta,ma
>   vle32.v v16,0(a0)
>   vle32.v v8,0(a1)
>   vsetvli a6,zero,e32,m8,ta,ma
>   slli a3,a5,2
>   vadd.vv v8,v8,v16
>   vsetvli zero,a2,e32,m8,ta,ma
>   sub a2,a2,a5
>   vse32.v v8,0(a4)
>   add a0,a0,a3
>   add a1,a1,a3
>   add a4,a4,a3
>   bne a2,zero,.L3
> 
> The 'vadd.vv' here is performing operations on 8 consecutive registers:
> 
> vadd.vv [v8 - v15], [v8 - v15], [v16 - v23]
> 
> Users statically set the vectorization factor is not ideal.
> 
> We want GCC to dynamic choose vectorization factor to do the 
> auto-vectorization according to loop analysis.
> 
> Currently, I have implement simplistic loop analysis like analyze live range 
> of each local decl of current function.
> 
> Here is the analysis, we have 32 vector registers for RVV.
> So we calculate the live range of current function local decl:
> 
> the number of decls live at the same time * LMUL <= 32. 
> According to this analysis, I set the vectorization factor in 
> TARGET_VECTORIZE_PREFERRED_SIMD_MODE
> 
> Then this simplistic algorithm (implemented in RISC-V backend) work well for 
> the testcases I produces.
> 
> However, I can only choose optimal vectorization for whole function but 
> failed to specific loop.
> 
> Here is the example:
> 
> void foo2 (int32_t *__restrict a,
>   int32_t *__restrict b,
>   int32_t *__restrict c,
>   int32_t *__restrict a2,
>   int32_t *__restrict b2,
>   int32_t *__restrict c2,
>   int32_t

Re: Re: Question about dynamic choosing vectorization factor for RVV

2023-08-31 Thread juzhe.zh...@rivai.ai
Hi. Thanks Richard and Richi.

Now I have figured out how to choose a smaller LMUL.

void
costs::finish_cost (const vector_costs *scalar_costs)
{
  loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (m_vinfo);
  if (loop_vinfo)
{
  if (loop_vinfo->vector_mode == RVVM8SImode
  || riscv_v_ext_vls_mode_p (loop_vinfo->vector_mode))
{
  m_costs[vect_prologue] = 8;
  m_costs[vect_body] = 8;
  m_costs[vect_epilogue] = 8;
}
  else
{
  m_costs[vect_prologue] = 1;
  m_costs[vect_body] = 1;
  m_costs[vect_epilogue] = 1;
}
}
   // m_suggested_unroll_factor = 2;
  vector_costs::finish_cost (scalar_costs);
}

The previous odd code was due to VLS modes.

Now I can get LMUL = 4 by adjusting the cost:
vsetvli a5,a2,e32,m4,ta,ma
vle32.v v8,0(a0)
vle32.v v4,0(a1)
vsetvli a6,zero,e32,m4,ta,ma
slli a3,a5,2
vadd.vv v4,v4,v8
sub a2,a2,a5
vsetvli zero,a5,e32,m4,ta,ma
vse32.v v4,0(a4)
add a0,a0,a3
add a1,a1,a3
add a4,a4,a3
bne a2,zero,.L3

Fantastic architecture of GCC Vector Cost model!

Thanks a lot.


juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-08-31 19:20
To: juzhe.zh...@rivai.ai
CC: gcc; richard.sandiford
Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
 
> Thanks Richi.
> 
> I am trying to figure out how to adjust finish_cost to lower the LMUL
> 
> For example:
> 
> void
> foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> {
>   for (int i = 0; i < n; i++)
> a[i] = a[i] + b[i];
> }
> 
> preferred_simd_mode pick LMUL = 8 (RVVM8SImode)
> 
> Is is possible that we can adjust the COST in finish cost make Loop 
> vectorizer pick LMUL = 4?
 
I see you have a autovectorize_vector_modes hook and you use
VECT_COMPARE_COSTS.  So the appropriate place would be to
amend your vector_costs::better_main_loop_than_p.
 
> I am experimenting with this following cost:
> 
>   if (loop_vinfo)
> {
>   if (loop_vinfo->vector_mode == RVVM8SImode)
> {
>   m_costs[vect_prologue] = 2;
>   m_costs[vect_body] = 20;
>   m_costs[vect_epilogue] = 2;
> }
>   else
> {
>   m_costs[vect_prologue] = 1;
>   m_costs[vect_body] = 1;
>   m_costs[vect_epilogue] = 1;
> }
> }
> 
> I increase LMUL = 8 cost. The codegen is odd:
> 
> foo:
> ble a2,zero,.L12
> addiw a5,a2,-1
> li a4,30
> sext.w t1,a2
> bleu a5,a4,.L7
> srliw a7,t1,5
> slli a7,a7,7
> li a4,32
> add a7,a7,a0
> mv a5,a0
> mv a3,a1
> vsetvli zero,a4,e32,m8,ta,ma
> .L4:
> vle32.v v8,0(a5)
> vle32.v v16,0(a3)
> vadd.vv v8,v8,v16
> vse32.v v8,0(a5)
> addi a5,a5,128
> addi a3,a3,128
> bne a5,a7,.L4
> andi a2,a2,-32
> beq t1,a2,.L14
> .L3:
> slli a4,a2,32
> subw a5,t1,a2
> srli a4,a4,32
> slli a5,a5,32
> slli a4,a4,2
> srli a5,a5,32
> add a0,a0,a4
> add a1,a1,a4
> vsetvli a4,a5,e8,m1,ta,ma
> vle32.v v8,0(a0)
> vle32.v v4,0(a1)
> vsetvli a2,zero,e32,m4,ta,ma
> vadd.vv v4,v4,v8
> vsetvli zero,a5,e32,m4,ta,ma
> vse32.v v4,0(a0)
> sub a3,a5,a4
> beq a5,a4,.L12
> slli a4,a4,2
> vsetvli zero,a3,e8,m1,ta,ma
> add a0,a0,a4
> add a1,a1,a4
> vle32.v v4,0(a0)
> vle32.v v8,0(a1)
> vsetvli a2,zero,e32,m4,ta,ma
> vadd.vv v4,v4,v8
> vsetvli zero,a3,e32,m4,ta,ma
> vse32.v v4,0(a0)
> .L12:
> ret
> .L7:
> li a2,0
> j .L3
> .L14:
> ret
> 
> I hope it can generate the code like this:
> 
> foo:
> ble a2,zero,.L5
> mv a4,a0
> .L3:
> vsetvli a5,a2,e32,m4,ta,ma
> vle32.v v8,0(a0)
> vle32.v v4,0(a1)
> vsetvli a6,zero,e32,m4,ta,ma
> slli a3,a5,2
> vadd.vv v4,v4,v8
> sub a2,a2,a5
> vsetvli zero,a5,e32,m4,ta,ma
> vse32.v v4,0(a4)
> add a0,a0,a3
> add a1,a1,a3
> add a4,a4,a3
> bne a2,zero,.L3
> .L5:
> ret
> 
> I am experimenting whether we can adjust cost statically to make loop 
> vectorizer use LMUL = 4 even though preferred_simd_mode return LMUL = 8. 
> If we can do that, I think we can apply analysis and then adjust the 
> cost according to analysis.
>
> Thanks.
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Biener
> Date: 2023-08-31 15:38
> To: juzhe.zh...@rivai.ai
> CC: gcc; richard.sandiford
> Subject: Re: Question about dynamic choosing vectorization factor for RVV
> On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
>  
> > Hi, Richard and Richi.
> > 
> > Currently, we are statically returning vectorization factor in 
> > 'TARGET_VECTORIZE_PREFERRED_SIMD_MODE'
> > according to compile option.
> > 
> > For example:
> > void
> > foo (int32_t *__restrict a, int32_t *__restrict b, in

Re: Re: Question about dynamic choosing vectorization factor for RVV

2023-08-31 Thread juzhe.zh...@rivai.ai
Hi, Richi.

>> I don't think that's "good" use of the API.
You mean I should use 'better_main_loop_than_p'?
Yes, I plan to use it.

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-08-31 19:29
To: juzhe.zh...@rivai.ai
CC: gcc; richard.sandiford
Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
 
> Hi. Thanks Richard and Richi.
> 
> Now, I figure out how to choose smaller LMUL now.
> 
> void
> costs::finish_cost (const vector_costs *scalar_costs)
> {
>   loop_vec_info loop_vinfo = dyn_cast (m_vinfo);
>   if (loop_vinfo)
> {
>   if (loop_vinfo->vector_mode == RVVM8SImode
>   || riscv_v_ext_vls_mode_p (loop_vinfo->vector_mode))
> {
>   m_costs[vect_prologue] = 8;
>   m_costs[vect_body] = 8;
>   m_costs[vect_epilogue] = 8;
> }
>   else
> {
>   m_costs[vect_prologue] = 1;
>   m_costs[vect_body] = 1;
>   m_costs[vect_epilogue] = 1;
> }
> }
>// m_suggested_unroll_factor = 2;
>   vector_costs::finish_cost (scalar_costs);
> }
 
I don't think that's "good" use of the API.
 
> Previous odd codes are because of VLS modes
> 
> Now, I can get the LMUL = 4 by adjusting cost.
> vsetvli a5,a2,e32,m4,ta,ma
> vle32.v v8,0(a0)
> vle32.v v4,0(a1)
> vsetvli a6,zero,e32,m4,ta,ma
> slli a3,a5,2
> vadd.vv v4,v4,v8
> sub a2,a2,a5
> vsetvli zero,a5,e32,m4,ta,ma
> vse32.v v4,0(a4)
> add a0,a0,a3
> add a1,a1,a3
> add a4,a4,a3
> bne a2,zero,.L3
> 
> Fantastic architecture of GCC Vector Cost model!
> 
> Thanks a lot.
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Biener
> Date: 2023-08-31 19:20
> To: juzhe.zh...@rivai.ai
> CC: gcc; richard.sandiford
> Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
> On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
>  
> > Thanks Richi.
> > 
> > I am trying to figure out how to adjust finish_cost to lower the LMUL
> > 
> > For example:
> > 
> > void
> > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > {
> >   for (int i = 0; i < n; i++)
> > a[i] = a[i] + b[i];
> > }
> > 
> > preferred_simd_mode pick LMUL = 8 (RVVM8SImode)
> > 
> > Is is possible that we can adjust the COST in finish cost make Loop 
> > vectorizer pick LMUL = 4?
>  
> I see you have a autovectorize_vector_modes hook and you use
> VECT_COMPARE_COSTS.  So the appropriate place would be to
> amend your vector_costs::better_main_loop_than_p.
>  
> > I am experimenting with this following cost:
> > 
> >   if (loop_vinfo)
> > {
> >   if (loop_vinfo->vector_mode == RVVM8SImode)
> > {
> >   m_costs[vect_prologue] = 2;
> >   m_costs[vect_body] = 20;
> >   m_costs[vect_epilogue] = 2;
> > }
> >   else
> > {
> >   m_costs[vect_prologue] = 1;
> >   m_costs[vect_body] = 1;
> >   m_costs[vect_epilogue] = 1;
> > }
> > }
> > 
> > I increase LMUL = 8 cost. The codegen is odd:
> > 
> > foo:
> > ble a2,zero,.L12
> > addiw a5,a2,-1
> > li a4,30
> > sext.w t1,a2
> > bleu a5,a4,.L7
> > srliw a7,t1,5
> > slli a7,a7,7
> > li a4,32
> > add a7,a7,a0
> > mv a5,a0
> > mv a3,a1
> > vsetvli zero,a4,e32,m8,ta,ma
> > .L4:
> > vle32.v v8,0(a5)
> > vle32.v v16,0(a3)
> > vadd.vv v8,v8,v16
> > vse32.v v8,0(a5)
> > addi a5,a5,128
> > addi a3,a3,128
> > bne a5,a7,.L4
> > andi a2,a2,-32
> > beq t1,a2,.L14
> > .L3:
> > slli a4,a2,32
> > subw a5,t1,a2
> > srli a4,a4,32
> > slli a5,a5,32
> > slli a4,a4,2
> > srli a5,a5,32
> > add a0,a0,a4
> > add a1,a1,a4
> > vsetvli a4,a5,e8,m1,ta,ma
> > vle32.v v8,0(a0)
> > vle32.v v4,0(a1)
> > vsetvli a2,zero,e32,m4,ta,ma
> > vadd.vv v4,v4,v8
> > vsetvli zero,a5,e32,m4,ta,ma
> > vse32.v v4,0(a0)
> > sub a3,a5,a4
> > beq a5,a4,.L12
> > slli a4,a4,2
> > vsetvli zero,a3,e8,m1,ta,ma
> > add a0,a0,a4
> > add a1,a1,a4
> > vle32.v v4,0(a0)
> > vle32.v v8,0(a1)
> > vsetvli a2,zero,e32,m4,ta,ma
> > vadd.vv v4,v4,v8
> > vsetvli zero,a3,e32,m4,ta,ma
> > vse32.v v4,0(a0)
> > .L12:
> > ret
> > .L7:
> > li a2,0
> > j .L3
> > .L14:
> > ret
> > 
> > I hope it can generate the code like this:
> &

Re: Re: Question about dynamic choosing vectorization factor for RVV

2023-08-31 Thread juzhe.zh...@rivai.ai
Hi, Richi.

  /* Keep track of the VF for each mode.  Initialize all to 0 which indicates
 a mode has not been analyzed.  */
  auto_vec<poly_uint64, 8> cached_vf_per_mode;
  for (unsigned i = 0; i < vector_modes.length (); ++i)
cached_vf_per_mode.safe_push (0);

I saw this code; the 'cached_vf_per_mode' vector is allocated with size '8'.

But for RVV, I need to push the following modes:

RVVM8QI, RVVM4QI, RVVM2QI, RVVM1QI, V128QI, V64QI, V32QI, V16QI, V8QI, V4QI, 
V2QI

There are 11 modes.
Should I increase the number from 8 to 11?

Thanks.


juzhe.zh...@rivai.ai
 
From: Richard Biener
Date: 2023-08-31 19:29
To: juzhe.zh...@rivai.ai
CC: gcc; richard.sandiford
Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
 
> Hi. Thanks Richard and Richi.
> 
> Now, I figure out how to choose smaller LMUL now.
> 
> void
> costs::finish_cost (const vector_costs *scalar_costs)
> {
>   loop_vec_info loop_vinfo = dyn_cast (m_vinfo);
>   if (loop_vinfo)
> {
>   if (loop_vinfo->vector_mode == RVVM8SImode
>   || riscv_v_ext_vls_mode_p (loop_vinfo->vector_mode))
> {
>   m_costs[vect_prologue] = 8;
>   m_costs[vect_body] = 8;
>   m_costs[vect_epilogue] = 8;
> }
>   else
> {
>   m_costs[vect_prologue] = 1;
>   m_costs[vect_body] = 1;
>   m_costs[vect_epilogue] = 1;
> }
> }
>// m_suggested_unroll_factor = 2;
>   vector_costs::finish_cost (scalar_costs);
> }
 
I don't think that's "good" use of the API.
 
> Previous odd codes are because of VLS modes
> 
> Now, I can get the LMUL = 4 by adjusting cost.
> vsetvli a5,a2,e32,m4,ta,ma
> vle32.v v8,0(a0)
> vle32.v v4,0(a1)
> vsetvli a6,zero,e32,m4,ta,ma
> slli a3,a5,2
> vadd.vv v4,v4,v8
> sub a2,a2,a5
> vsetvli zero,a5,e32,m4,ta,ma
> vse32.v v4,0(a4)
> add a0,a0,a3
> add a1,a1,a3
> add a4,a4,a3
> bne a2,zero,.L3
> 
> Fantastic architecture of GCC Vector Cost model!
> 
> Thanks a lot.
> 
> 
> juzhe.zh...@rivai.ai
>  
> From: Richard Biener
> Date: 2023-08-31 19:20
> To: juzhe.zh...@rivai.ai
> CC: gcc; richard.sandiford
> Subject: Re: Re: Question about dynamic choosing vectorization factor for RVV
> On Thu, 31 Aug 2023, juzhe.zh...@rivai.ai wrote:
>  
> > Thanks Richi.
> > 
> > I am trying to figure out how to adjust finish_cost to lower the LMUL
> > 
> > For example:
> > 
> > void
> > foo (int32_t *__restrict a, int32_t *__restrict b, int n)
> > {
> >   for (int i = 0; i < n; i++)
> > a[i] = a[i] + b[i];
> > }
> > 
> > preferred_simd_mode pick LMUL = 8 (RVVM8SImode)
> > 
> > Is is possible that we can adjust the COST in finish cost make Loop 
> > vectorizer pick LMUL = 4?
>  
> I see you have a autovectorize_vector_modes hook and you use
> VECT_COMPARE_COSTS.  So the appropriate place would be to
> amend your vector_costs::better_main_loop_than_p.
>  
> > I am experimenting with this following cost:
> > 
> >   if (loop_vinfo)
> > {
> >   if (loop_vinfo->vector_mode == RVVM8SImode)
> > {
> >   m_costs[vect_prologue] = 2;
> >   m_costs[vect_body] = 20;
> >   m_costs[vect_epilogue] = 2;
> > }
> >   else
> > {
> >   m_costs[vect_prologue] = 1;
> >   m_costs[vect_body] = 1;
> >   m_costs[vect_epilogue] = 1;
> > }
> > }
> > 
> > I increase LMUL = 8 cost. The codegen is odd:
> > 
> > foo:
> > ble a2,zero,.L12
> > addiw a5,a2,-1
> > li a4,30
> > sext.w t1,a2
> > bleu a5,a4,.L7
> > srliw a7,t1,5
> > slli a7,a7,7
> > li a4,32
> > add a7,a7,a0
> > mv a5,a0
> > mv a3,a1
> > vsetvli zero,a4,e32,m8,ta,ma
> > .L4:
> > vle32.v v8,0(a5)
> > vle32.v v16,0(a3)
> > vadd.vv v8,v8,v16
> > vse32.v v8,0(a5)
> > addi a5,a5,128
> > addi a3,a3,128
> > bne a5,a7,.L4
> > andi a2,a2,-32
> > beq t1,a2,.L14
> > .L3:
> > slli a4,a2,32
> > subw a5,t1,a2
> > srli a4,a4,32
> > slli a5,a5,32
> > slli a4,a4,2
> > srli a5,a5,32
> > add a0,a0,a4
> > add a1,a1,a4
> > vsetvli a4,a5,e8,m1,ta,ma
> > vle32.v v8,0(a0)
> > vle32.v v4,0(a1)
> > vsetvli a2,zero,e32,m4,ta,ma
> > vadd.vv v4,v4,v8
> > vsetvli zero,a5,e32,m4,ta,ma
> > vse32.v v4,0(a0)
> > sub a3,a5,a4
> > beq a5,a4,.L12
> > slli a4,a4,2
> > vsetvli zero,a3,e8,m1,ta,ma
> > add a0

Re: Re: Lots of FAILs in gcc.target/riscv/rvv/autovec/*

2023-11-07 Thread juzhe.zh...@rivai.ai
I am sure that master GCC has a much better VSETVL strategy than GCC-13.

But a recent evaluation on our internal hardware shows that master GCC is 
overall worse than the previous RVV GCC I open sourced at:
https://github.com/riscv-collab/riscv-gcc/tree/riscv-gcc-rvv-next  (rvv-next)
That's odd, since I think I have supported all the middle-end features of 
rvv-next.

We are analyzing it and trying to figure out why. We must recover the 
performance in GCC-14.



juzhe.zh...@rivai.ai
 
From: Maxim Blinov
Date: 2023-11-08 12:31
To: Jeff Law
CC: gcc; kito.cheng; juzhe.zhong
Subject: Re: Lots of FAILs in gcc.target/riscv/rvv/autovec/*
I see, thanks for clarifying, that makes sense.
 
In that case, what about doing the inverse? I mean, are there unique
patches in the vendor branch, and would it be useful to try to
upstream them into master? My motivation is to get the best
autovectorized code for RISC-V.
 
I had a go at building the TSVC benchmark (in the llvm-test-suite[1]
repository) both with the master and vendor branch gcc, and noticed
that the vendor branch gcc generally beats master in generating more
vector instructions.
 
If I simply count the number of instances of each vector instruction,
the average across all 36 test cases of vendor vs master gcc features
the following most prominent differences:
 
- vmv.x.s:         48 vs   0 (+ 48)
- vle32.v:        150 vs  50 (+ 100)
- vrgather.vv:     61 vs   0 (+ 61)
- vslidedown.vi:   61 vs   0 (+ 61)
- vse32.v:        472 vs 213 (+ 259)
- vmsgtu.vi:       30 vs   0 (+ 30)
- vadd.vi:         80 vs  30 (+ 50)
- vlm.v:           18 vs   0 (+ 18)
- vsm.v:           16 vs   0 (+ 16)
- vmv4r.v:         21 vs   7 (+ 14)
 
(For reference, the benchmarks are all between 20k-30k in code size.
Built with `-march=rv64imafdcv -O3`.)
 
Of course that doesn't say anything about performance, but would it be
possible/fair to say that the vendor branch may still be better than
master for generating vectorized code for RISC-V?
 
What's interesting is that there's very little "regression" - I saw
only very few cases where the vendor branch removed a vector
instruction as compared to master gcc (the most often removed
instruction by the vendor branch, as compared to master, is
vsetvl/vsetvli.)
 
BR,
Maxim
 
[1]: 
https://github.com/llvm/llvm-test-suite/tree/main/MultiSource/Benchmarks/TSVC
 
On Tue, 7 Nov 2023 at 15:53, Jeff Law  wrote:
>
>
>
> On 11/7/23 05:50, Maxim Blinov wrote:
> > Hi all,
> >
> > I can see about 500 failing tests on the
> > vendors/riscv/gcc-13-with-riscv-opts, a mostly-full list at the bottom
> > of this email. It's mostly test cases scraping for vector
> > instructions.
> Correct.  There are generic vectorizer changes that would need to be
> ported over to that branch to make those tests pass.  I looked at this a
> few times and ultimately gave up in the rats nest of inter-dependent
> patches in the vectorizer.
>
>
> Given the lifetime of that branch is likely nearing its end, I don't
> think there's much value left in trying to port those changes over. Any
> such effort would likely be better spent nailing down issues on the trunk.
>
> jeff
 


Re: Re: Loop vectorizer optimization questions

2024-01-09 Thread juzhe.zh...@rivai.ai
I see. Thanks Tamar.

I am willing to investigate Arm's initial patch to see what else we need in 
that patch.

Since min/max reduction with index can improve SPEC performance, I will take a 
look at it for GCC-15.

Thanks a lot !



juzhe.zh...@rivai.ai
 
From: Tamar Christina
Date: 2024-01-09 16:59
To: 钟居哲
CC: richard.guenther; rdapp.gcc; gcc
Subject: Re: RE: Loop vectorizer optimization questions
Hi,
 
The 01/08/2024 22:46, 钟居哲 wrote:
> Oh. It's nice to see you have support min/max index reduction.
> 
> I knew your patch can handle this following:
> 
> 
> int idx = ii;
> int max = mm;
> for (int i = 0; i < n; ++i) {
>   int x = a[i];
>   if (max < x) {
> max = x;
> idx = i;
>   }
> }
> 
> But I wonder whether your patch can handle this:
> 
> int idx = ii;
> int max = mm;
> for (int i = 0; i < n; ++i) {
>   int x = a[i];
>   if (max <= x) {
> max = x;
> idx = i;
>   }
> }
> 
 
The last version of the patch we sent handled all conditionals:
 
https://inbox.sourceware.org/gcc-patches/db9pr08mb6603dccb35007d83c6736167f5...@db9pr08mb6603.eurprd08.prod.outlook.com/
 
There are some additional testcases in the patch for all these as well.
 
> Will you continue to work on min/max with index ?
 
I don't know if I'll have the free time to do so, that's the reason I haven't 
resent the new one.
The engineer who started it no longer works for Arm.
 
> Or you want me to continue this work base on your patch ?
> 
> I have an initial patch which roughly implemented LLVM's approach but turns 
> out Richi doesn't want me to apply LLVM's approach so your patch may be more 
> reasonable than LLVM's approach.
> 
 
When Richi reviewed it he wasn't against the approach in the patch 
https://inbox.sourceware.org/gcc-patches/nycvar.yfh.7.76.2105071320170.9...@zhemvz.fhfr.qr/
but he wanted the concept of a dependent reduction to be handled more 
generically, so we could extend it in the future.
 
I think, from looking at Richi's feedback, that he wants 
vect_recog_minmax_index_pattern to be more general. We've basically hardcoded 
the reduction type, but it could just be a property on STMT_VINFO.
 
Unless I'm mistaken, the patch already relies on first finding both reductions, 
but we immediately try to resolve the relationship using 
vect_recog_minmax_index_pattern.
Instead, I think what Richi wanted was for us to keep track of reductions that 
operate on the same induction variable and, after we finish analysing all 
reductions, see whether any of the reductions we kept track of can be combined.
 
Basically, just separate out the discovery and tying of the reductions.
 
Am I right here Richi?
 
I think the codegen part can mostly be used as is, though we might be able to 
do better for VLA.
 
So it should be fairly straightforward to go from that final patch to what 
Richi wants, but I just lack the time.
 
If you want to tackle it that would be great :)
 
Thanks,
Tamar
 
> Thanks.
> 
> juzhe.zh...@rivai.ai
> 
> From: Tamar Christina
> Date: 2024-01-09 01:50
> To: 钟居哲; gcc
> CC: rdapp.gcc; richard.guenther
> Subject: RE: Loop vectorizer optimization questions
> Subject: RE: Loop vectorizer optimization questions
> >
> > Also, another question is that I am working on min/max reduction with 
> > index, I
> > believe it should be in GCC-15, but I wonder
> > whether I can pre-post for review in stage 4, or I should post patch 
> > (min/max
> > reduction with index) when GCC-15 is open.
> >
> 
> FWIW, We tried to implement this 5 years ago 
> https://gcc.gnu.org/pipermail/gcc-patches/2019-November/534518.html
> and you'll likely get the same feedback if you aren't already doing so.
> 
> I think Richard would prefer to have a general framework for these kinds of 
> operations.  We never got around to doing so,
> and it's still on my list, but if you're taking care of it 
> 
> Just though I'd point out the previous feedback.
> 
> Cheers,
> Tamar
> 
> > Thanks.
> >
> >
> > juzhe.zh...@rivai.ai
 
--