Re: Re: [PATCH V2] VECT: Add SELECT_VL support

juzhe.zh...@rivai.ai Mon, 05 Jun 2023 00:23:59 -0700

Hi, Richard. Thanks for the comments.

>> If we use SELECT_VL to refer only to the target-independent ifn, I don't
>> see why this last bit is true.
Could you give me more detail on this? I am not sure I follow you.
Do you mean that the current SELECT_VL is not an appropriate IFN?

>>Like I said in the previous message,
>>when it comes to determining the length of each control, the approach we
>>take for MIN_EXPR IVs should work for SELECT_VL IVs.  The point is that,
>>in both cases, any inactive lanes are always the last lanes.
>>E.g. suppose that, for one particular iteration, SELECT_VL decides that
>>6 lanes should be active in a loop with VF==8.  If there is a 2-control
>>rgroup with 4 lanes each, the first control must be 4 and the second
>>control must be 2, just as if a MIN_EXPR had decided that 6 lanes of
>>the final iteration are active.
>>What I don't understand is why this isn't also a problem with the
>>fallback MIN_EXPR approach.  That is, with the same example as above,
>>but using MIN_EXPR IVs, I would have expected:
>>  VF == 8
>>  1-control rgroup "A":
>>    A set by MIN_EXPR IV
>>  2-control rgroup "B1", "B2":
>>    B1 = MIN (A, 4)
>>    B2 = A - B1
>>and so the vectors controlled by A, B1 and B2 would all have different
>>lengths.
>>Is the point that, when using MIN_EXPR, this only happens in the
>>final iteration?  And that you use a tail/epilogue loop for that,
>>so that the main loop body operates on full vectors only?
In general, I think your description is correct and comprehensive.
Let me share my understanding in more detail to make sure we are on the same page.

Take your example:

For one particular iteration, SELECT_VL decides that 6 lanes should be active
in a loop with VF==8, and there is a 2-control rgroup with 4 lanes each,
which means:

VF = 8;
each control VF = 4;

Total length = SELECT_VL (or MIN_EXPR) (remain, 8)
Then, IMHO, there are 3 ways to deduce the lengths of the 2 controls, based on
the flow we have already built.

Also, let me quote the "vsetvl" rule from the RVV ISA spec (with VF playing the
role of VLMAX):
ceil(AVL / 2) ≤ vl ≤ VF if VF < AVL < (2 * VF)
"AVL" is the number of elements we still need to process; "vl" is the actual
number of elements we will process in the current iteration.
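
To make the rule concrete, here is a minimal sketch that lists the vl values a "vsetvl"-style instruction may return. The function name `legal_vl_values` is hypothetical, VF stands in for VLMAX as in this message, and the two boundary cases (AVL ≤ VF and AVL ≥ 2 * VF) are taken from my reading of the RVV spec rather than from the text above.

```python
def legal_vl_values(avl, vlmax):
    """Return the set of vl values allowed for a requested length `avl`.

    avl   -- number of elements the program still wants to process
    vlmax -- maximum elements per iteration (called VF in this thread)
    """
    if avl <= vlmax:
        # The request fits: the hardware must grant exactly avl.
        return {avl}
    if avl < 2 * vlmax:
        # Flexible range: ceil(avl / 2) <= vl <= vlmax.
        lowest = (avl + 1) // 2  # integer form of ceil(avl / 2)
        return set(range(lowest, vlmax + 1))
    # Plenty of work left: the hardware must grant a full vector.
    return {vlmax}
```

With the numbers used in this thread, `legal_vl_values(6, 4)` gives `{3, 4}`, which is the freedom that Solution 1 below runs into.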

Solution 1:

Total length = SELECT_VL (remain, 8) ===> suppose Total length = 6

control 1 length = SELECT_VL (Total length, 4) ===> if we use the "vsetvl"
instruction to get the control 1 length, it can be 3 or 4, since the RVV ISA
only requires ceil(AVL / 2) ≤ vl ≤ VF if VF < AVL < (2 * VF), so the outcome
of SELECT_VL may be Total length / 2 = 3.
This depends on the hardware implementation of "vsetvli". Let's say some RVV
CPU prefers "even distribution"; then the outcome = 3.

control 2 length = Total length - control 1 length ===> 6 - 3 = 3 (if control
1 = 3) or 6 - 4 = 2 (if control 1 = 4).

Since the RVV ISA defines "vsetvli" this flexibly, either deduction is possible.
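
As a rough illustration of Solution 1, the sketch below enumerates every (control 1, control 2) pair that could come out when control 1 is produced by a SELECT_VL with "vsetvl"-like freedom. The name `solution1_control_lengths` is made up for this example, and the behaviour outside the flexible range is my assumption based on the RVV spec.

```python
def solution1_control_lengths(total, control_vf):
    """List possible (control 1, control 2) lengths under Solution 1.

    control 1 = SELECT_VL (total, control_vf), modelled with vsetvl freedom
    control 2 = total - control 1
    """
    if total <= control_vf:
        first_choices = [total]
    elif total < 2 * control_vf:
        # Any value in [ceil(total / 2), control_vf] is a legal outcome.
        first_choices = list(range((total + 1) // 2, control_vf + 1))
    else:
        first_choices = [control_vf]
    return [(c1, total - c1) for c1 in first_choices]
```

For Total length = 6 and control VF = 4 this returns `[(3, 3), (4, 2)]`, so the compiler cannot know at compile time which split the hardware picked.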

Solution 2:

Total length = SELECT_VL (remain, 8) ===> 6
control 1 length = MIN_EXPR (Total length, 4) ===> always 4, since MIN is used
control 2 length = Total length - control 1 length ===> 6 - 4 = 2
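
The MIN-based deduction generalizes naturally to N controls. The following is only a sketch of that arithmetic, not the code GCC generates; `min_expr_control_lengths` and its parameters are names I made up for illustration.

```python
def min_expr_control_lengths(total, lanes_per_control, n_controls):
    """Split `total` active lanes across controls so that any inactive
    lanes are always the last lanes."""
    lengths = []
    remaining = total
    for _ in range(n_controls):
        # Each control takes as many lanes as it can hold, in order.
        length = min(remaining, lanes_per_control)
        lengths.append(length)
        remaining -= length
    return lengths
```

For two controls this reduces to control 1 = MIN (total, 4) and control 2 = total - control 1, so `min_expr_control_lengths(6, 4, 2)` yields `[4, 2]` with no dependence on hardware behaviour.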

Solution 3 (Current flow):

Total length = MIN_EXPR (remain, 8) ===> 6 only when remain = 6 in the
tail/epilogue; otherwise it is always 8 in the loop body.
control 1 length = MIN_EXPR (Total length, 4) ===> MIN gives 4 whenever Total length >= 4
control 2 length = Total length - control 1 length ===> Total length - 4
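
To see that partial lengths only show up at the end with the MIN_EXPR flow, here is a small trace of the three lengths per iteration. It is a simplified model under my own assumptions (a single counted loop, lengths in elements); `min_expr_loop_trace` is a hypothetical helper, not something in the patch.

```python
def min_expr_loop_trace(n_elements, vf=8, control_vf=4):
    """Return (total, control 1, control 2) for each iteration of a
    MIN_EXPR-controlled loop over `n_elements` elements."""
    trace = []
    remain = n_elements
    while remain > 0:
        total = min(remain, vf)          # Total length = MIN_EXPR (remain, VF)
        control1 = min(total, control_vf)
        control2 = total - control1
        trace.append((total, control1, control2))
        remain -= total                  # decrement IV
    return trace
```

For 22 elements the trace is `[(8, 4, 4), (8, 4, 4), (6, 4, 2)]`: every iteration except the last one operates on full vectors.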

I'd say all 3 of these solutions work for RVV.
However, RVV length configuration is unlike IBM's, or ARM SVE's use of a mask.
(I consider mask and length to be the same thing here: both are used to
control each operation.)
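For example, ARM SVE has 16 predicate (mask) registers, of which P0-P7 can be
used as the governing predicate by most instructions. Whenever it generates a
mask, that mask can simply appear in the use list of an instruction, since ARM
SVE specifies the mask register in the instruction encoding.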

For example:
If we use solution 1 on a target that is controlled by length, where the length
is held in general registers, we can sketch the codegen as below.

max length = select_vl (vf=8)
length 1 = select_vl (vf=4)
length 2 = max length - length 1
...
load (... uses the general register holding length 1, let's say r0; r0 is
specified in the load encoding)
...
load (... uses the general register holding length 2, let's say r1; r1 is
specified in the load encoding)
...

However, for RVV, we don't specify the length in the instruction encoding.
Instead, we have only one VL register, and every time we want to change the
length, we need a "vsetvli".

So for solution 1, we will have:

max length = vsetvli (vf=8)
length 1 = vsetvli (vf=4)
length 2 = max length - length 1
...
vsetvli zero, length 1 <====== inserted by the "VSETVL" PASS of the RISC-V backend
load....
vsetvli zero, length 2 <====== inserted by the "VSETVL" PASS of the RISC-V backend
load....

The "vsetvli" instruction is much more expensive than a general scalar
instruction (for example, "min" is much cheaper than "vsetvli").
So I am 100% sure that solution 3 (the current MIN flow in GCC) is much better
than the above:

max length = min (vf=8) ===> "vsetvli" replaced by "min"
length 1 = min (vf=4) ===> "vsetvli" replaced by "min"
length 2 = max length - length 1
...
vsetvli zero, length 1 <====== inserted by the "VSETVL" PASS of the RISC-V backend
load....
vsetvli zero, length 2 <====== inserted by the "VSETVL" PASS of the RISC-V backend
load....

This is much better than Solution 1 and avoids switching the "VL" register
multiple times with "vsetvli".

OK, you may want to ask: if "min" is much cheaper than "vsetvli", why do we
need SELECT_VL?
The reason is that I want to optimize the special case (single-rgroup). A
single rgroup uses just a single length, unlike multiple-rgroup control, which
has multiple length calculation statements:

Current flow of single-rgroup:

...
length = min (vf)
...
vsetvli zero, length <=== inserted by the VSETVL PASS
load (pointer IV)
vadd.
...
pointer IV = pointer IV + VF

I want to optimize it into:

...
length = vsetvli (vf)
... <=== no need to insert a vsetvli.
load (pointer IV)
vadd.
...
pointer IV = pointer IV + length (adjusted to byte size).
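
The reason the pointer IV must advance by the selected length rather than by VF can be seen in a small simulation. This is only a sketch under my own assumptions: the pointer counts elements instead of bytes, the vector body is a plain "add 1", and `select_vl_loop`, `min_policy` and `even_policy` are hypothetical names standing in for different legal SELECT_VL behaviours.

```python
def select_vl_loop(data, vf, select_vl):
    """Run a single-rgroup loop where every IV advances by the chosen length."""
    out = []
    ptr = 0                # pointer IV, in elements rather than bytes
    remain = len(data)     # decrementing loop-control IV
    while remain > 0:
        length = select_vl(remain, vf)
        assert 0 < length <= min(remain, vf)
        # "load (pointer IV)" followed by "vadd" on `length` elements.
        out.extend(x + 1 for x in data[ptr:ptr + length])
        ptr += length      # pointer IV = pointer IV + length
        remain -= length
    return out


def min_policy(remain, vf):
    """SELECT_VL that behaves exactly like MIN_EXPR."""
    return min(remain, vf)


def even_policy(remain, vf):
    """SELECT_VL that evens out the last two iterations."""
    if remain <= vf:
        return remain
    if remain < 2 * vf:
        return (remain + 1) // 2
    return vf
```

Both policies process every element exactly once. With 13 elements and VF = 8, `min_policy` runs iterations of 8 and 5, while `even_policy` runs 7 and 6, and the results are identical.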

This flow is the same as the one in the RVV ISA spec and in LLVM.
Also, based on the "vsetvli" definition, it allows "even distribution" in the
last iterations.

Hope my description is clear, feel free to comment.
Thanks so much.

juzhe.zh...@rivai.ai

From: Richard Sandiford
Date: 2023-06-05 14:21
To: juzhe.zhong
CC: gcc-patches; rguenther
Subject: Re: [PATCH V2] VECT: Add SELECT_VL support
juzhe.zh...@rivai.ai writes:
> +  /* If we're using decrement IV approach in loop control, we can use output of
> +     SELECT_VL to adjust IV of loop control and data reference when it satisfies
> +     the following checks:
> +
> +     (a) SELECT_VL is supported by the target.
> +     (b) LOOP_VINFO is single-rgroup control.
> +     (c) non-SLP.
> +     (d) LOOP can not be unrolled.
> +
> +     Otherwise, we use MIN_EXPR approach.
> +
> +     1. We only apply SELECT_VL on single-rgroup since:
> +
> +     (1). Multiple-rgroup controls N vector loads/stores would need N pointer
> +   updates by variable amounts.
> +     (2). SELECT_VL allows flexible length (<=VF) in each iteration.
> +     (3). For decrement IV approach, we calculate the MAX length of the loop
> +   and then deduce the length of each control from this MAX length.
> +
> +     Base on (1), (2) and (3) situations, if we try to use SELECT_VL on
> +     multiple-rgroup control, we need to generate multiple SELECT_VL to
> +     carefully adjust length of each control.

If we use SELECT_VL to refer only to the target-independent ifn, I don't
see why this last bit is true.  Like I said in the previous message,
when it comes to determining the length of each control, the approach we
take for MIN_EXPR IVs should work for SELECT_VL IVs.  The point is that,
in both cases, any inactive lanes are always the last lanes.

E.g. suppose that, for one particular iteration, SELECT_VL decides that
6 lanes should be active in a loop with VF==8.  If there is a 2-control
rgroup with 4 lanes each, the first control must be 4 and the second
control must be 2, just as if a MIN_EXPR had decided that 6 lanes of
the final iteration are active.

I'm not saying the decision itself is wrong.  But I think the explanation
could be clearer.

> +     Such approach is very inefficient
> +     and unprofitable for targets that are using a standalone instruction
> +     to configure the length of each operation.
> +     E.g. RISC-V vector use 'vsetvl' to configure the length of each operation.

What I don't understand is why this isn't also a problem with the
fallback MIN_EXPR approach.  That is, with the same example as above,
but using MIN_EXPR IVs, I would have expected:

  VF == 8

  1-control rgroup "A":
    A set by MIN_EXPR IV

  2-control rgroup "B1", "B2":
    B1 = MIN (A, 4)
    B2 = A - B1

and so the vectors controlled by A, B1 and B2 would all have different
lengths.

Is the point that, when using MIN_EXPR, this only happens in the
final iteration?  And that you use a tail/epilogue loop for that,
so that the main loop body operates on full vectors only?

Thanks,
Richard
