Hi Juzhe,

> Hi, Tamar.
>
> This is an amazing auto-vectorization flow.
>
> I am thinking about whether RVV can also get benefits from this
> optimization.  IMHO, RVV should also be using this flow.
>
> So, to allow RVV (target uses len as loop_control and mask as flow control),
> I am not sure whether we can do this (feel free to correct me if I am wrong):
>
> +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +        vect_record_loop_mask (loop_vinfo, masks, ncopies, truth_type, NULL);
>
> Maybe it can be:
>
>   if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) {
>     if (mask_loop_p)
>       vect_record_loop_mask
>     else
>       vect_record_loop_len
>   }
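
Spelled out, that would be something along these lines (an untested sketch on
my part; the "lens" collection and the trailing arguments to
vect_record_loop_len are guesses and may need adjusting):

  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
    {
      if (masked_loop_p)
        /* Fully masked targets (e.g. SVE): record the control mask.  */
        vect_record_loop_mask (loop_vinfo, masks, ncopies, truth_type, NULL);
      else
        /* Length-controlled targets (e.g. RVV): record a loop len instead.
           The "lens" collection and the trailing factor argument are
           assumptions on my side.  */
        vect_record_loop_len (loop_vinfo, lens, ncopies, truth_type, 1);
    }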

Yeah, that should be the only change required.  I started this patch before
the loop_len change made it in and just rebased recently 😊

> +      tree cond = gimple_assign_lhs (new_stmt);
> +      if (masked_loop_p)
> +        {
> +          tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies, truth_type, 0);
> +          cond = prepare_vec_mask (loop_vinfo, TREE_TYPE (mask), mask, cond,
> +                                   &cond_gsi);
> +        }
> +
> +      tree t = fold_build2 (NE_EXPR, boolean_type_node, cond,
> +                            build_zero_cst (truth_type));
>
> From my understanding, you are using final_mask = loop_mask (WHILE_ULT)
> && control_mask (comparison).
> Then you test final_mask using NE_EXPR.  Am I right?

Yeah, that's right; it's creating the mask for partial iterations.  The only
other constraint is being able to reduce a boolean mask using inclusive OR,
but that's optional and is only needed if one side of the comparison produces
more than one copy (so it's only checked then).

> For RVV, I am thinking about whether we can have a good way to do this
> testing.  Not sure whether we can have something like
> LEN_TEST_MASK_NE (loop_len, control_mask...)

Hmm, is just the vect_record_loop_len change not enough?  I haven't followed
the masking implementation in RVV in detail, but I assume it follows the
general principle that ANDing an operation with a mask creates a masked
operation?  That is to say, I thought LOOP_LEN was only for the loop control,
which doesn't change here.

> I am not saying that we should support "early break" auto-vectorization for
> RVV (loop_len && control_mask).
> I am just writing some comments trying to figure out how I can adapt your
> work for RVV in the future.

Yes, happy to help; the more uses it gets, the more bugs I can fix 😊

Cheers,
Tamar

> Thanks.
>
>
> juzhe.zh...@rivai.ai
>
> From: Li, Pan2
> Date: 2023-06-28 22:21
> To: juzhe.zh...@rivai.ai
> Subject: FW: [PATCH v5 0/19] Support early break/return auto-vectorization
> FYI.
>
> -----Original Message-----
> From: Gcc-patches <gcc-patches-bounces+pan2.li=intel....@gcc.gnu.org>
> On Behalf Of Tamar Christina via Gcc-patches
> Sent: Wednesday, June 28, 2023 9:41 PM
> To: gcc-patches@gcc.gnu.org
> Cc: n...@arm.com; rguent...@suse.de; j...@ventanamicro.com
> Subject: [PATCH v5 0/19] Support early break/return auto-vectorization
>
> Hi All,
>
> This patch adds initial support for early break vectorization in GCC.
> The support is added for any target that implements a vector cbranch optab;
> this includes both fully masked and non-masked targets.
>
> Depending on the operation, the vectorizer may also require support for
> boolean mask reductions using Inclusive OR.  This is however only checked
> when the comparison would produce multiple statements.
>
> Concretely the kind of loops supported are of the form:
>
>  for (int i = 0; i < N; i++)
>    {
>      <statements1>
>      if (<condition>)
>        {
>          ...
>          <action>;
>        }
>      <statements2>
>    }
>
> where <action> can be:
>  - break
>  - return
>  - goto
>
> Any number of statements can be used before the <action> occurs.
>
> Since this is an initial version for GCC 14, it has the following
> limitations and features:
>
> - Only fixed sized iterations and buffers are supported.  That is to say any
>   vectors loaded or stored must be to statically allocated arrays with known
>   sizes.  N must also be known.  This limitation is because our primary
>   target for this optimization is SVE.  For VLA SVE we can't easily do
>   cross-page iteration checks.  The result is likely to also not be
>   beneficial.
>   For that reason we punt support for variable buffers till we have
>   First-Faulting support in GCC.
> - Any stores in <statements1> should not be to the same objects as in
>   <condition>.  Loads are fine as long as they don't have the possibility to
>   alias.  More concretely, we block RAW dependencies when the intermediate
>   value can't be separated from the store, or the store itself can't be
>   moved.
> - The number of loop iterations must be known; this is just a temporary
>   limitation that I intend to address in GCC 14 itself as follow-on patches.
> - Prologue peeling, alignment peeling and loop versioning are supported.
> - Fully masked loops, unmasked loops and partially masked loops are
>   supported.
> - Any number of loop early exits are supported.
> - The early exit must be before the natural loop exit/latch.  The vectorizer
>   is designed in a way that propagates phi-nodes downwards; as such,
>   supporting this inverted control flow is hard.
> - No support for epilogue vectorization.  The only epilogue supported is the
>   scalar final one.  Epilogue vectorization would also not be profitable.
> - Early breaks are only supported for inner loop vectorization.
>
> I have pushed a branch to refs/users/tnfchris/heads/gcc-14-early-break
>
> With the help of IPA and LTO this still gets hit quite often.  During
> bootstrap it hit rather frequently.  Additionally TSVC s332, s481 and s482
> all pass now since these are tests for support for early exit vectorization.
>
> This implementation does not support completely handling the early break
> inside the vector loop itself, but instead supports adding checks such that
> if we know that we have to exit in the current iteration then we branch to
> scalar code to actually do the final VF iterations, which handles all the
> code in <action>.
>
> niters analysis and the majority of the vectorizer with hardcoded single_exit
> have been updated with the use of a new function vec_loop_iv which returns
> the exit the vectorizer wants to use as the main IV exit.
>
> For niters, this exit is what determines the overall iterations, as that is
> the O(iters) for the loop.
>
> For the scalar loop we know that whatever exit you take you have to perform
> at most VF iterations.  For vector code we only care about the state of
> fully performed iterations and reset the scalar code to the (partially)
> remaining loop.
>
> This new version of the patch does the majority of the work in a new,
> rewritten loop peeling.  This new function maintains LCSSA all the way
> through and no longer requires the touch-up functions the vectorizer used to
> incrementally adjust them later on.  This means that aside from IV updates
> and guard edge updates, the early exit code is identical to the single exit
> cases.
>
> When the loop is peeled during the copying I have to go to great lengths to
> keep the dominators up to date.  All exits from the first loop are rewired
> to the loop header of the second loop.  But this can change the immediate
> dominator.
>
> The dominators can change again when we wire in the loop guard; as such,
> peeling now returns a list of dominators that need to be updated if a new
> guard edge is added.
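>
> To make the high-level idea concrete before the CFG details below, here is
> a rough hand-written scalar sketch of the shape the transformation aims for
> (illustration only, not the vectorizer's actual output; VF is just a
> placeholder constant standing in for the vector length):
>
> #define N 803
> #define VF 4
> unsigned vect_a[N];
> unsigned vect_b[N];
>
> unsigned test4_sketch (unsigned x)
> {
>   int i = 0;
>   /* "Vector" loop: process full blocks of VF iterations, but only when no
>      lane in the block would take the early exit.  */
>   for (; i + VF <= N; i += VF)
>     {
>       int any_exit = 0;
>       for (int j = 0; j < VF; j++)   /* stands in for the vector compare */
>         any_exit |= (vect_a[i + j] > x);
>       if (any_exit)
>         break;                       /* branch to the scalar code below */
>       for (int j = 0; j < VF; j++)
>         {
>           vect_b[i + j] = x + i + j;
>           vect_a[i + j] = x;
>         }
>     }
>   /* Scalar code: at most VF iterations (plus any remainder), with the early
>      exit handled exactly as in the original loop.  */
>   for (; i < N; i++)
>     {
>       vect_b[i] = x + i;
>       if (vect_a[i] > x)
>         break;
>       vect_a[i] = x;
>     }
>   return 0;
> }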
>
> For the loop peeling we rewrite the loop form:
>
>                    Header
>                     ---
>                     |x|
>                      2
>                      |
>                      v
>              -------3<------
>  early exit  |      |      |
>              v      v      | latch
>              7      4----->6
>              |      |
>              |      v
>              |      8
>              |      |
>              |      v
>              ------>5
>
> into
>
>                    Header
>                     ---
>                     |x|
>                      2
>                      |
>                      v
>              -------3<------
>  early exit  |      |      |
>              v      v      | latch
>              7      4----->6
>              |      |
>              |      v
>              |      8
>              |      |
>              |      v
>              |    New Header
>              |      ---
>              ----->|x|
>                     9
>                     |
>                     v
>              ------10<-----
>  early exit  |      |     |
>              v      v     | latch
>             14     11---->13
>              |      |
>              |      v
>              |     12
>              |      |
>              |      v
>              ------> 5
>
> That is to say, the first vector loop executes so long as the early exit
> isn't needed.  Once the exit is taken, the scalar code will perform at most
> VF extra iterations.  The exact number depends on peeling, the iteration
> start, and which exit was taken (natural or early).  For this scalar loop,
> all early exits are treated the same.
>
> When we vectorize, any statement that is not related to the early break
> itself and that would be incorrect to execute before the break (i.e. has
> side effects) is moved to after the break.  If this is not possible we
> decline to vectorize.
>
> This means that we check at the start of iterations whether we are going to
> exit or not.  During the analysis phase we check whether we are allowed to
> do this moving of statements.  Note that we only move the scalar statements,
> and we do so after peeling, just before we start transforming statements.
>
> Codegen:
>
> for e.g.
>
> #define N 803
> unsigned vect_a[N];
> unsigned vect_b[N];
>
> unsigned test4(unsigned x)
> {
>  unsigned ret = 0;
>  for (int i = 0; i < N; i++)
>  {
>    vect_b[i] = x + i;
>    if (vect_a[i] > x)
>      break;
>    vect_a[i] = x;
>
>  }
>  return ret;
> }
>
> We generate for Adv. SIMD:
>
> test4:
>         adrp    x2, .LC0
>         adrp    x3, .LANCHOR0
>         dup     v2.4s, w0
>         add     x3, x3, :lo12:.LANCHOR0
>         movi    v4.4s, 0x4
>         add     x4, x3, 3216
>         ldr     q1, [x2, #:lo12:.LC0]
>         mov     x1, 0
>         mov     w2, 0
>         .p2align 3,,7
> .L3:
>         ldr     q0, [x3, x1]
>         add     v3.4s, v1.4s, v2.4s
>         add     v1.4s, v1.4s, v4.4s
>         cmhi    v0.4s, v0.4s, v2.4s
>         umaxp   v0.4s, v0.4s, v0.4s
>         fmov    x5, d0
>         cbnz    x5, .L6
>         add     w2, w2, 1
>         str     q3, [x1, x4]
>         str     q2, [x3, x1]
>         add     x1, x1, 16
>         cmp     w2, 200
>         bne     .L3
>         mov     w7, 3
> .L2:
>         lsl     w2, w2, 2
>         add     x5, x3, 3216
>         add     w6, w2, w0
>         sxtw    x4, w2
>         ldr     w1, [x3, x4, lsl 2]
>         str     w6, [x5, x4, lsl 2]
>         cmp     w0, w1
>         bcc     .L4
>         add     w1, w2, 1
>         str     w0, [x3, x4, lsl 2]
>         add     w6, w1, w0
>         sxtw    x1, w1
>         ldr     w4, [x3, x1, lsl 2]
>         str     w6, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         add     w4, w2, 2
>         str     w0, [x3, x1, lsl 2]
>         sxtw    x1, w4
>         add     w6, w1, w0
>         ldr     w4, [x3, x1, lsl 2]
>         str     w6, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         str     w0, [x3, x1, lsl 2]
>         add     w2, w2, 3
>         cmp     w7, 3
>         beq     .L4
>         sxtw    x1, w2
>         add     w2, w2, w0
>         ldr     w4, [x3, x1, lsl 2]
>         str     w2, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         str     w0, [x3, x1, lsl 2]
> .L4:
>         mov     w0, 0
>         ret
>         .p2align 2,,3
> .L6:
>         mov     w7, 4
>         b       .L2
>
> and for SVE:
>
> test4:
>         adrp    x2, .LANCHOR0
>         add     x2, x2, :lo12:.LANCHOR0
>         add     x5, x2, 3216
>         mov     x3, 0
>         mov     w1, 0
>         cntw    x4
>         mov     z1.s, w0
>         index   z0.s, #0, #1
>         ptrue   p1.b, all
>         ptrue   p0.s, all
>         .p2align 3,,7
> .L3:
>         ld1w    z2.s, p1/z, [x2, x3, lsl 2]
>         add     z3.s, z0.s, z1.s
>         cmplo   p2.s, p0/z, z1.s, z2.s
>         b.any   .L2
>         st1w    z3.s, p1, [x5, x3, lsl 2]
>         add     w1, w1, 1
>         st1w    z1.s, p1, [x2, x3, lsl 2]
>         add     x3, x3, x4
>         incw    z0.s
>         cmp     w3, 803
>         bls     .L3
> .L5:
>         mov     w0, 0
>         ret
>         .p2align 2,,3
> .L2:
>         cntw    x5
>         mul     w1, w1, w5
>         cbz     w5, .L5
>         sxtw    x1, w1
>         sub     w5, w5, #1
>         add     x5, x5, x1
>         add     x6, x2, 3216
>         b       .L6
>         .p2align 2,,3
> .L14:
>         str     w0, [x2, x1, lsl 2]
>         cmp     x1, x5
>         beq     .L5
>         mov     x1, x4
> .L6:
>         ldr     w3, [x2, x1, lsl 2]
>         add     w4, w0, w1
>         str     w4, [x6, x1, lsl 2]
>         add     x4, x1, 1
>         cmp     w0, w3
>         bcs     .L14
>         mov     w0, 0
>         ret
>
> On the workloads this work is based on we see between a 2-3x performance
> uplift using this patch.
>
> Follow-up plan:
> - Boolean vectorization has several shortcomings.  I've filed PR110223 with
>   the bigger ones that cause vectorization to fail with this patch.
> - SLP support.  This is planned for GCC 15, as for the majority of cases
>   building SLP itself fails.  This means I'll need to spend time making this
>   more robust first.  Additionally it requires:
>     * Adding support for vectorizing CFG (gconds)
>     * Support for CFG to differ between vector and scalar loops.
>   Both of these would be disruptive to the tree, and I suspect I'll be
>   handling fallout from this patch for a while.  So I plan to work on the
>   surrounding building blocks first for the remainder of the year.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> Also ran across various workloads with no issues.
>
> When closer to acceptance I will run on other targets as well and clean up
> related testsuite fallouts there.
>
> --- inline copy of patch --
> --