"juzhe.zh...@rivai.ai" <juzhe.zh...@rivai.ai> writes:
> Hi, Richard. Thanks for the comments.
>
>>> If we use SELECT_VL to refer only to the target-independent ifn, I don't
>>> see why this last bit is true.
> Could you give me more details and information about this since I am not sure
> whether I catch up with you.
> You mean the current SELECT_VL is not an appropriate IFN?

No, I meant that the comment I quoted seemed to be saying that solution
3 wasn't possible. The comment seemed to say that we would need to do
solution 1.

>>>Like I said in the previous message,
>>>when it comes to determining the length of each control, the approach we
>>>take for MIN_EXPR IVs should work for SELECT_VL IVs. The point is that,
>>>in both cases, any inactive lanes are always the last lanes.
>>>E.g. suppose that, for one particular iteration, SELECT_VL decides that
>>>6 lanes should be active in a loop with VF==8. If there is a 2-control
>>>rgroup with 4 lanes each, the first control must be 4 and the second
>>>control must be 2, just as if a MIN_EXPR had decided that 6 lanes of
>>>the final iteration are active.
>>>What I don't understand is why this isn't also a problem with the
>>>fallback MIN_EXPR approach. That is, with the same example as above,
>>>but using MIN_EXPR IVs, I would have expected:
>>>    VF == 8
>>>    1-control rgroup "A":
>>>      A set by MIN_EXPR IV
>>>    2-control rgroup "B1", "B2":
>>>      B1 = MIN (A, 4)
>>>      B2 = A - B1
>>>and so the vectors controlled by A, B1 and B2 would all have different
>>>lengths.
>>>Is the point that, when using MIN_EXPR, this only happens in the
>>>final iteration? And that you use a tail/epilogue loop for that,
>>>so that the main loop body operates on full vectors only?
> In general, I think your description is correct and comprehensive.
> I'd like to share more my understanding to make sure we are on the same page.
>
> Take the example as you said:
>
> FOR one particular iteration, SELECT_VL decides that 6 lanes should be
> active in a loop with VF==8.
> and 2-control rgroup with 4 lanes each
> which means:
>
> VF = 8;
> each control VF = 4;
>
> Total length = SELECT_VL(or MIN_EXPR) (remain, 8)
> Then, IMHO, we can have 3 solutions to deduce the length of 2-control base on
> current flow we already built
>
> Also, let me share "vsetvl" ISA spec:
> ceil(AVL / 2) ≤ vl ≤ VF if VF < AVL < (2 * VF)
> "vl" is the number of the elements we need to process, "avl" = the actual
> number of elements we will process in the current iteration
>
> Solution 1:
>
> Total length = SELECT_VL (remain, 8) ===> suppose Total length value = 6
>
> control 1 length = SELECT_VL (Total length, 4) ===> If we use "vsetvl"
> instruction to get the control 1 length,
> it can be 3 or 4, since RVV ISA: ceil(AVL / 2) ≤ vl ≤ VF if AVL < (2 * VF),
> the outcome of SELECT_VL may be Total length / 2 = 3
> Depending on the hardware implementation of "vsetvli", Let's say some RVV CPU
> likes "even distribution" the outcome = 3
>
> control 2 length = Total length - control 1 length ===> 6 - 3 = 3 (if
> control 1 = 3) or 6 - 4 = 2 (if control 1 = 4) .
>
> Since RVV ISA gives the flexible definition of "vsetvli", we will end up with
> this deduction.

Yeah, this one wouldn't work, for reasons discussed previously.
I was thinking only about solutions 2 and 3.

> Solution 2:
>
> Total length = SELECT_VL (remain, 8) ====> 6
> control 1 length = MIN_EXPR (Total length, 4) ====> since use MIN, so always
> 4
> control 2 length = Total length - control 1 length ===> 6 - 4 = 2
>
> Solution 3 (Current flow):
>
> Total length = MIN_EXPR (remain, 8) ====> 6 only when the remain = 6 in
> tail/epilogue, otherwise, it always be 8 in loop body.
> control 1 length = MIN_EXPR (Total length, 4) ====> since use MIN, so always
> 4
> control 2 length = Total length - control 1 length ===> Total length - 4
>
> I'd like to say these 3 solutions all work for RVV.
> However, RVV length configuration unlike IBM or ARM SVE using a mask.
> (I would like to say mask or length they are the same thing, use for control of
> each operations).
> For example, ARM SVE has 8 mask registers, whenever it generate a mask, it
> can be include in use list in the instructions, since ARM SVE use encoding to
> specify the mask
> register.
>
> For example:
> If we are using solution 1 in a target that control by length and length is
> specified in general registers, we can simulate the codegen as below.
>
> max length = select_vl (vf=8)
> length 1 = select_vl (vf=4)
> length 2 = max length - length 1
> ...
> load (...use general register which storing length 1 let's said r0, r0 is
> specified in the load encoding)
> ...
> load (...use general register which storing length 2 let's said r1, r1 is
> specified in the load encoding)
> ....
>
> However, for RVV, we don't specify the length in the instructions encoding.
> Instead, we have only one VL register, and every time we want to change the
> length, we need "vsetvli"
>
> So for solution 1, we will have:
>
> max length = vsetvli (vf=8)
> length 1 = vsetvli (vf=4)
> length 2 = max length - length 1
> ...
> vsetvli zero, length 1 <======insert by "VSETVL" PASS of RISC-V backend
> load....
> vsetvli zero, length 2 <======insert by "VSETVL" PASS of RISC-V backend
> load....
>
> "vsetvli" instruction is the instruction much more expensive than the general
> scalar instruction (for example "min" is much cheaper than "vsetvli").
> So I am 100% sure that solution 3 (current MIN flow in GCC) is much better
> than above:
>
> max length = min (vf=8) ===> replaced "vsetvli" by "min"
> length 1 = min (vf=4) ===> replaced "vsetvli" by "min"
> length 2 = max length - length 1
> ...
> vsetvli zero, length 1 <======insert by "VSETVL" PASS of RISC-V backend
> load....
> vsetvli zero, length 2 <======insert by "VSETVL" PASS of RISC-V backend
> load....

Well, it depends on *why* the loop has a 2-control rgroup.
There are two possibilities:

(a) The riscv target has asked the vectoriser to unroll the loop 2 times
    (via the unrolling hook). In this case there will be a 2-control
    rgroup but no 1-control rgroup.

(b) The loop operates on a mixture of data element sizes and the loop
    operates on fully-populated vectors. In that case, there will be
    a 1-control rgroup for the narrowest element size and a 2-control
    rgroup for the next widest element size.

Your example describes what would happen for (a), whereas I was thinking
about (b).

For (b), there would be three controls in total, even for solution 3.
So there would be three vsetvlis rather than two. That matches the
number of vsetvlis for solution 2.

When comparing solutions 2 and 3 for case (b), is solution 3 still
better? E.g. is "vsetvli zero" cheaper than "vsetvli <gpr>"?

Thanks,
Richard
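
For reference, the per-control length derivation that solutions 2 and 3
share (B1 = MIN (A, 4), B2 = A - B1) can be sketched in plain C as below.
This is only an illustration of the arithmetic, not the vectoriser's
actual code: the names split_lengths and min_u are hypothetical, and the
sketch assumes every control in the rgroup covers the same number of lanes.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two unsigned values; stands in for MIN_EXPR.
   (Hypothetical helper, not a GCC function.)  */
static unsigned
min_u (unsigned a, unsigned b)
{
  return a < b ? a : b;
}

/* Split TOTAL active lanes across NCONTROLS controls of LANES lanes
   each.  Earlier controls are filled first, so any inactive lanes are
   always the last lanes.  (Hypothetical helper for illustration.)  */
static void
split_lengths (unsigned total, unsigned lanes, size_t ncontrols,
               unsigned *lengths)
{
  /* The total can never exceed the capacity of the whole rgroup.  */
  assert (total <= lanes * ncontrols);

  unsigned remaining = total;
  for (size_t i = 0; i < ncontrols; i++)
    {
      lengths[i] = min_u (remaining, lanes);
      remaining -= lengths[i];
    }
}
```

With a total of 6 and two controls of 4 lanes each, this gives lengths
4 and 2, which is the split described earlier in the thread. The total
itself could come from either SELECT_VL (solution 2) or MIN_EXPR
(solution 3); only the split is fixed.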