Hi Juzhe,

> Hi, Tamar.
>
> This is an amazing auto-vectorization flow.
>
> I am thinking about whether RVV can also get benefits from this
> optimization.  IMHO, RVV should also be using this flow.
>
> So, to allow RVV (target uses len as loop_control and mask as flow control),
> I am not sure whether we can do this (feel free to correct me if I am wrong):
>
> +      if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
> +        vect_record_loop_mask (loop_vinfo, masks, ncopies, truth_type, NULL);
>
> Maybe it can be:
>
>   if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)) {
>     if (mask_loop_p)
>       vect_record_loop_mask
>     else
>       vect_record_loop_len
>   }
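
Spelled out, that would be something along these lines (an untested sketch on
my part; the "lens" collection and the trailing arguments to
vect_record_loop_len are guesses and may need adjusting):

  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
    {
      if (masked_loop_p)
        /* Fully masked targets (e.g. SVE): record the control mask.  */
        vect_record_loop_mask (loop_vinfo, masks, ncopies, truth_type, NULL);
      else
        /* Length-controlled targets (e.g. RVV): record a loop len instead.
           The "lens" collection and the trailing factor argument are
           assumptions on my side.  */
        vect_record_loop_len (loop_vinfo, lens, ncopies, truth_type, 1);
    }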

Yeah, that should be the only change required.  I started this patch before
the loop_len change made it in and just rebased recently 😊

> +      tree cond = gimple_assign_lhs (new_stmt);
> +      if (masked_loop_p)
> +        {
> +          tree mask = vect_get_loop_mask (loop_vinfo, gsi, masks, ncopies, truth_type, 0);
> +          cond = prepare_vec_mask (loop_vinfo, TREE_TYPE (mask), mask, cond,
> +                                   &cond_gsi);
> +        }
> +
> +      tree t = fold_build2 (NE_EXPR, boolean_type_node, cond,
> +                            build_zero_cst (truth_type));
>
> From my understanding, you are using final_mask = loop_mask (WHILE_ULT)
> && control_mask (comparison).
> Then you test final_mask using NE_EXPR.  Am I right?

Yeah, that's right; it's creating the mask for partial iterations.  The only
other constraint is being able to reduce a boolean mask using inclusive OR,
but that's optional and is only needed if one side of the comparison produces
more than one copy (so it's only checked then).

> For RVV, I am thinking about whether we can have a good way to do this
> testing.  Not sure whether we can have something like
> LEN_TEST_MASK_NE (loop_len, control_mask...)

Hmm, is just the vect_record_loop_len change not enough?  I haven't followed
the masking implementation in RVV in detail, but I assume it follows the
general principle that ANDing an operation with a mask creates a masked
operation?  That is to say, I thought LOOP_LEN was only for the loop control,
which doesn't change here.

> I am not saying that we should support "early break" auto-vectorization for
> RVV (loop_len && control_mask).
> I am just writing some comments trying to figure out how I can adapt your
> work for RVV in the future.

Yes, happy to help; the more uses it gets, the more bugs I can fix 😊

Cheers,
Tamar

> Thanks.
>
>
> juzhe.zh...@rivai.ai
>
> From: Li, Pan2
> Date: 2023-06-28 22:21
> To: juzhe.zh...@rivai.ai
> Subject: FW: [PATCH v5 0/19] Support early break/return auto-vectorization
> FYI.
>
> -----Original Message-----
> From: Gcc-patches <gcc-patches-bounces+pan2.li=intel....@gcc.gnu.org>
> On Behalf Of Tamar Christina via Gcc-patches
> Sent: Wednesday, June 28, 2023 9:41 PM
> To: gcc-patches@gcc.gnu.org
> Cc: n...@arm.com; rguent...@suse.de; j...@ventanamicro.com
> Subject: [PATCH v5 0/19] Support early break/return auto-vectorization
>
> Hi All,
>
> This patch adds initial support for early break vectorization in GCC.
> The support is added for any target that implements a vector cbranch optab;
> this includes both fully masked and non-masked targets.
>
> Depending on the operation, the vectorizer may also require support for
> boolean mask reductions using Inclusive OR.  This is however only checked
> when the comparison would produce multiple statements.
>
> Concretely the kind of loops supported are of the form:
>
>  for (int i = 0; i < N; i++)
>    {
>      <statements1>
>      if (<condition>)
>        {
>          ...
>          <action>;
>        }
>      <statements2>
>    }
>
> where <action> can be:
>  - break
>  - return
>  - goto
>
> Any number of statements can be used before the <action> occurs.
>
> Since this is an initial version for GCC 14, it has the following
> limitations and features:
>
> - Only fixed sized iterations and buffers are supported.  That is to say any
>   vectors loaded or stored must be to statically allocated arrays with known
>   sizes.  N must also be known.  This limitation is because our primary
>   target for this optimization is SVE.  For VLA SVE we can't easily do
>   cross-page iteration checks.  The result is likely to also not be
>   beneficial.
>   For that reason we punt support for variable buffers till we have
>   First-Faulting support in GCC.
> - Any stores in <statements1> should not be to the same objects as in
>   <condition>.  Loads are fine as long as they don't have the possibility to
>   alias.  More concretely, we block RAW dependencies when the intermediate
>   value can't be separated from the store, or the store itself can't be
>   moved.
> - The number of loop iterations must be known; this is just a temporary
>   limitation that I intend to address in GCC 14 itself as follow-on patches.
> - Prologue peeling, alignment peeling and loop versioning are supported.
> - Fully masked loops, unmasked loops and partially masked loops are
>   supported.
> - Any number of loop early exits are supported.
> - The early exit must be before the natural loop exit/latch.  The vectorizer
>   is designed in a way that propagates phi-nodes downwards; as such,
>   supporting this inverted control flow is hard.
> - No support for epilogue vectorization.  The only epilogue supported is the
>   scalar final one.  Epilogue vectorization would also not be profitable.
> - Early breaks are only supported for inner loop vectorization.
>
> I have pushed a branch to refs/users/tnfchris/heads/gcc-14-early-break
>
> With the help of IPA and LTO this still gets hit quite often.  During
> bootstrap it hit rather frequently.  Additionally TSVC s332, s481 and s482
> all pass now since these are tests for support for early exit vectorization.
>
> This implementation does not support completely handling the early break
> inside the vector loop itself, but instead supports adding checks such that
> if we know that we have to exit in the current iteration then we branch to
> scalar code to actually do the final VF iterations, which handles all the
> code in <action>.
>
> niters analysis and the majority of the vectorizer with hardcoded single_exit
> have been updated with the use of a new function vec_loop_iv which returns
> the exit the vectorizer wants to use as the main IV exit.
>
> For niters, this exit is what determines the overall iterations, as that is
> the O(iters) for the loop.
>
> For the scalar loop we know that whatever exit you take you have to perform
> at most VF iterations.  For vector code we only care about the state of
> fully performed iterations and reset the scalar code to the (partially)
> remaining loop.
>
> This new version of the patch does the majority of the work in a new,
> rewritten loop peeling.  This new function maintains LCSSA all the way
> through and no longer requires the touch-up functions the vectorizer used to
> incrementally adjust them later on.  This means that aside from IV updates
> and guard edge updates, the early exit code is identical to the single exit
> cases.
>
> When the loop is peeled during the copying I have to go to great lengths to
> keep the dominators up to date.  All exits from the first loop are rewired
> to the loop header of the second loop.  But this can change the immediate
> dominator.
>
> The dominators can change again when we wire in the loop guard; as such,
> peeling now returns a list of dominators that need to be updated if a new
> guard edge is added.
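>
> To make the high-level idea concrete before the CFG details below, here is
> a rough hand-written scalar sketch of the shape the transformation aims for
> (illustration only, not the vectorizer's actual output; VF is just a
> placeholder constant standing in for the vector length):
>
> #define N 803
> #define VF 4
> unsigned vect_a[N];
> unsigned vect_b[N];
>
> unsigned test4_sketch (unsigned x)
> {
>   int i = 0;
>   /* "Vector" loop: process full blocks of VF iterations, but only when no
>      lane in the block would take the early exit.  */
>   for (; i + VF <= N; i += VF)
>     {
>       int any_exit = 0;
>       for (int j = 0; j < VF; j++)   /* stands in for the vector compare */
>         any_exit |= (vect_a[i + j] > x);
>       if (any_exit)
>         break;                       /* branch to the scalar code below */
>       for (int j = 0; j < VF; j++)
>         {
>           vect_b[i + j] = x + i + j;
>           vect_a[i + j] = x;
>         }
>     }
>   /* Scalar code: at most VF iterations (plus any remainder), with the early
>      exit handled exactly as in the original loop.  */
>   for (; i < N; i++)
>     {
>       vect_b[i] = x + i;
>       if (vect_a[i] > x)
>         break;
>       vect_a[i] = x;
>     }
>   return 0;
> }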
>
> For the loop peeling we rewrite the loop form:
>
>                    Header
>                     ---
>                     |x|
>                      2
>                      |
>                      v
>              -------3<------
>  early exit  |      |      |
>              v      v      | latch
>              7      4----->6
>              |      |
>              |      v
>              |      8
>              |      |
>              |      v
>              ------>5
>
> into
>
>                    Header
>                     ---
>                     |x|
>                      2
>                      |
>                      v
>              -------3<------
>  early exit  |      |      |
>              v      v      | latch
>              7      4----->6
>              |      |
>              |      v
>              |      8
>              |      |
>              |      v
>              |    New Header
>              |      ---
>              ----->|x|
>                     9
>                     |
>                     v
>              ------10<-----
>  early exit  |      |     |
>              v      v     | latch
>             14     11---->13
>              |      |
>              |      v
>              |     12
>              |      |
>              |      v
>              ------> 5
>
> That is to say, the first vector loop executes so long as the early exit
> isn't needed.  Once the exit is taken, the scalar code will perform at most
> VF extra iterations.  The exact number depends on peeling, the iteration
> start, and which exit was taken (natural or early).  For this scalar loop,
> all early exits are treated the same.
>
> When we vectorize, any statement that is not related to the early break
> itself and that would be incorrect to execute before the break (i.e. has
> side effects) is moved to after the break.  If this is not possible we
> decline to vectorize.
>
> This means that we check at the start of iterations whether we are going to
> exit or not.  During the analysis phase we check whether we are allowed to
> do this moving of statements.  Note that we only move the scalar statements,
> and we do so after peeling, just before we start transforming statements.
>
> Codegen:
>
> for e.g.
>
> #define N 803
> unsigned vect_a[N];
> unsigned vect_b[N];
>
> unsigned test4(unsigned x)
> {
>  unsigned ret = 0;
>  for (int i = 0; i < N; i++)
>  {
>    vect_b[i] = x + i;
>    if (vect_a[i] > x)
>      break;
>    vect_a[i] = x;
>
>  }
>  return ret;
> }
>
> We generate for Adv. SIMD:
>
> test4:
>         adrp    x2, .LC0
>         adrp    x3, .LANCHOR0
>         dup     v2.4s, w0
>         add     x3, x3, :lo12:.LANCHOR0
>         movi    v4.4s, 0x4
>         add     x4, x3, 3216
>         ldr     q1, [x2, #:lo12:.LC0]
>         mov     x1, 0
>         mov     w2, 0
>         .p2align 3,,7
> .L3:
>         ldr     q0, [x3, x1]
>         add     v3.4s, v1.4s, v2.4s
>         add     v1.4s, v1.4s, v4.4s
>         cmhi    v0.4s, v0.4s, v2.4s
>         umaxp   v0.4s, v0.4s, v0.4s
>         fmov    x5, d0
>         cbnz    x5, .L6
>         add     w2, w2, 1
>         str     q3, [x1, x4]
>         str     q2, [x3, x1]
>         add     x1, x1, 16
>         cmp     w2, 200
>         bne     .L3
>         mov     w7, 3
> .L2:
>         lsl     w2, w2, 2
>         add     x5, x3, 3216
>         add     w6, w2, w0
>         sxtw    x4, w2
>         ldr     w1, [x3, x4, lsl 2]
>         str     w6, [x5, x4, lsl 2]
>         cmp     w0, w1
>         bcc     .L4
>         add     w1, w2, 1
>         str     w0, [x3, x4, lsl 2]
>         add     w6, w1, w0
>         sxtw    x1, w1
>         ldr     w4, [x3, x1, lsl 2]
>         str     w6, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         add     w4, w2, 2
>         str     w0, [x3, x1, lsl 2]
>         sxtw    x1, w4
>         add     w6, w1, w0
>         ldr     w4, [x3, x1, lsl 2]
>         str     w6, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         str     w0, [x3, x1, lsl 2]
>         add     w2, w2, 3
>         cmp     w7, 3
>         beq     .L4
>         sxtw    x1, w2
>         add     w2, w2, w0
>         ldr     w4, [x3, x1, lsl 2]
>         str     w2, [x5, x1, lsl 2]
>         cmp     w0, w4
>         bcc     .L4
>         str     w0, [x3, x1, lsl 2]
> .L4:
>         mov     w0, 0
>         ret
>         .p2align 2,,3
> .L6:
>         mov     w7, 4
>         b       .L2
>
> and for SVE:
>
> test4:
>         adrp    x2, .LANCHOR0
>         add     x2, x2, :lo12:.LANCHOR0
>         add     x5, x2, 3216
>         mov     x3, 0
>         mov     w1, 0
>         cntw    x4
>         mov     z1.s, w0
>         index   z0.s, #0, #1
>         ptrue   p1.b, all
>         ptrue   p0.s, all
>         .p2align 3,,7
> .L3:
>         ld1w    z2.s, p1/z, [x2, x3, lsl 2]
>         add     z3.s, z0.s, z1.s
>         cmplo   p2.s, p0/z, z1.s, z2.s
>         b.any   .L2
>         st1w    z3.s, p1, [x5, x3, lsl 2]
>         add     w1, w1, 1
>         st1w    z1.s, p1, [x2, x3, lsl 2]
>         add     x3, x3, x4
>         incw    z0.s
>         cmp     w3, 803
>         bls     .L3
> .L5:
>         mov     w0, 0
>         ret
>         .p2align 2,,3
> .L2:
>         cntw    x5
>         mul     w1, w1, w5
>         cbz     w5, .L5
>         sxtw    x1, w1
>         sub     w5, w5, #1
>         add     x5, x5, x1
>         add     x6, x2, 3216
>         b       .L6
>         .p2align 2,,3
> .L14:
>         str     w0, [x2, x1, lsl 2]
>         cmp     x1, x5
>         beq     .L5
>         mov     x1, x4
> .L6:
>         ldr     w3, [x2, x1, lsl 2]
>         add     w4, w0, w1
>         str     w4, [x6, x1, lsl 2]
>         add     x4, x1, 1
>         cmp     w0, w3
>         bcs     .L14
>         mov     w0, 0
>         ret
>
> On the workloads this work is based on we see between a 2-3x performance
> uplift using this patch.
>
> Follow-up plan:
> - Boolean vectorization has several shortcomings.  I've filed PR110223 with
>   the bigger ones that cause vectorization to fail with this patch.
> - SLP support.  This is planned for GCC 15, as for the majority of cases
>   building SLP itself fails.  This means I'll need to spend time making this
>   more robust first.  Additionally it requires:
>     * Adding support for vectorizing CFG (gconds)
>     * Support for CFG to differ between vector and scalar loops.
>   Both of these would be disruptive to the tree, and I suspect I'll be
>   handling fallout from this patch for a while.  So I plan to work on the
>   surrounding building blocks first for the remainder of the year.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> Also ran across various workloads with no issues.
>
> When closer to acceptance I will run on other targets as well and clean up
> related testsuite fallouts there.
>
> --- inline copy of patch --
> --