On Mon, 15 Jul 2024, Tamar Christina wrote:

> Hi All,
> 
> This RFC document covers at a high level how to extend early break support in
> GCC to support SLP and how this will be extended in the future to support
> full control flow in GCC.
> 
> The basic idea in this is based on the paper "All You Need Is Superword-Level
> Parallelism: Systematic Control-Flow Vectorization with SLP"[1] but it is
> adapted to fit within the GCC vectorizer.

An interesting read - I think the approach is viable for loop 
vectorization, where we schedule the whole vectorized loop, but difficult
for basic-block vectorization since they seem to re-build the whole
function from scratch.  They also do not address how to code-generate
predicated but not vectorized code, or how they decide to handle mixed
vector/non-vector code at all.  For example, I don't think any current
CPU architecture supports a full set of predicated _scalar_ ops, and not
every scalar op would have a vector equivalent in case one would use
single-lane vectors.

For GCC's basic-block vectorization the main outstanding issue is
likewise one of scheduling and of dependences on scalar code (live lane
extracts, vector builds from scalars).

> What will not be covered is the support for First-Faulting Loads or
> alignment peeling, as these are being done by a different engineer.
> 
> == Concept and Theory ==
> 
> Supporting Early Break in SLP requires the same theory as general control
> flow; the only difference is that Early Break can be supported on non-masked
> architectures, while full control flow support requires a fully masked
> architecture.
> 
> In GCC 14 Early break was added for non-SLP by teaching the vectorizer to
> branch to a scalar epilogue whenever an early exit is taken.  This means the
> vectorizer itself doesn't need to know how to perform side-effects or final
> reductions.

With current GCC we avoid the need for predicated stmts by using the scalar
epilogue and a branch-on-all-true/false stmt.  To make that semantically
valid, stmts are moved around.
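To make the stmt movement concrete, here is a scalar stand-in (a hedged model, not GCC code; names like early_break_model and VF are made up) for the GCC 14 non-SLP shape described below: side effects are delayed to the latch, and any exiting "vector" block branches to a scalar epilogue that redoes it in source order:

```c
#include <assert.h>

#define N 16
#define VF 4

static unsigned vect_a[N], vect_b[N];

/* Scalar model of the GCC 14 non-SLP early-break scheme: the store to
   vect_b (the side effect) is delayed until the whole VF-wide block is
   known not to exit; on a would-be exit we branch to a scalar epilogue
   that re-executes the block in original source order.  */
static int early_break_model (unsigned x)
{
  int i = 0;
  for (; i + VF <= N; i += VF)
    {
      unsigned tmp[VF];
      int exit_p = 0;
      for (int l = 0; l < VF; l++)      /* compute; no side effects yet */
	{
	  tmp[l] = x + i + l;
	  exit_p |= (vect_a[i + l] * 2 != x);
	}
      if (exit_p)
	break;                          /* branch to the scalar epilogue */
      for (int l = 0; l < VF; l++)      /* side effects, now in the latch */
	vect_b[i + l] = tmp[l];
    }
  for (; i < N; i++)                    /* scalar epilogue, source order */
    {
      vect_b[i] = x + i;
      if (vect_a[i] * 2 != x)
	return i;
    }
  return N;
}
```

Note the semantics match the scalar loop: the iteration that exits still performs the store that precedes the check, because the epilogue redoes it.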

> The additional benefit of this is that it works for any target providing a
> vector cbranch optab implementation since we don't require masking support.
> In order for this to work we need code motion that moves side effects
> (primarily stores) into the latch basic block, i.e. we delay any side
> effects up to a point where we know the full vector iteration would have
> been performed.  For this to work, however, we had to disable support for
> epilogue vectorization, since when the scalar statements are moved we break
> the link to the original BB they were in.  This means that the stmt_vinfo
> for the stores that need to be moved will no longer be valid for epilogue
> vectorization, and the only way to recover this would be to perform a full
> dataflow analysis again.  We decided against this as the plan of record was
> to switch to SLP.
> 
> -- Predicated SSA --
> 
> The core of the proposal is to support a form of Predicated SSA (PSSA) in the
> vectorizer [2]. The idea is to assign a control predicate to every SSA
> statement.  This control predicate denotes their dependencies wrt the BB
> they are in and defines the dependencies between statements.
> 
> As an example the following early break code:
> 
> unsigned test4(unsigned x)
> {
>  unsigned ret = 0;
>  for (int i = 0; i < N; i++)
>  {
>    vect_b[i] = x + i;
>    if (vect_a[i]*2 != x)
>      break;
>    vect_a[i] = x;
>  
>  }
>  return ret;
> }
> 
> is transformed into
> 
> unsigned test4(unsigned x)
> {
>  unsigned ret = 0;
>  for (int i = 0; i < N; i++)
>  {
>    vect_b[i] = x + i;      :: !(vect_a[i]*2 != x)

why's this not 'true'?

>    if (vect_a[i]*2 != x)   :: true
>      break;                :: vect_a[i]*2 != x
>    vect_a[i] = x;          :: !(vect_a[i]*2 != x)
>  
>  }
>  return ret;
> }
> 
> A more complicated example:
> 
> int foo (int n)
> {
>     int res = 0;
>     for (int i = 0; i < n; i++)
>       {
>         y[i] = x[i] * 2;      :: true
>         res += x[i] + y[i];   :: true
>  
>         if (x[i] > 5)         :: true
>           break;              :: x[i] > 5
>  
>         if (z[i] > 5)         :: !(x[i] > 5)
>           break;              :: !(x[i] > 5) && (z[i] > 5)
>       }
>  
>     return res;
> }
> 
> any statement guarded by a block with a non-true predicate can have its
> predicate simplified, so the above can be turned into
> 
> int foo (int n)
> {
>     int res = 0;
>     for (int i = 0; i < n; i++)
>       {
>         y[i] = x[i] * 2;      :: true
>         res += x[i] + y[i];   :: true
>  
>         if (x[i] > 5)         :: true
>           break;              :: x[i] > 5
>  
>         if (z[i] > 5)         :: !(x[i] > 5)
>           break;              :: (z[i] > 5)
>       }
>  
>     return res;
> }
> 
> if we make the predicates a *gimple, where true is just NULL;

I'd make predicates an SSA name instead (or possibly abstract this
so we can more easily change our minds) - the disadvantage is
that gcond * doesn't have an SSA def.

We might want to look into tree-if-conv.cc (remotely possibly into
gimple-predicate-analysis.{h,cc}) since there are multiple places in
GCC where we need to maintain a composition of individual predicates
(and possibly compare / simplify such compositions).
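To illustrate the kind of composition such a predicate framework maintains (a hedged sketch only; control_predicate is a made-up name, not an API from tree-if-conv.cc or gimple-predicate-analysis): the predicate of a statement that follows k earlier early-exit conditions is the conjunction of their negations, which is exactly the "!(c0) && !(c1)" chains in the examples above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: compose the control predicate of a statement
   dominated by k earlier early-exit conditions by AND-ing in the
   negation of each exit condition, starting from "true".  */
static bool control_predicate (const bool *exit_conds, int k)
{
  bool pred = true;                 /* block-entry predicate is "true" */
  for (int i = 0; i < k; i++)
    pred = pred && !exit_conds[i];  /* and-in the negated exit cond */
  return pred;
}
```

A real implementation would of course build and simplify symbolic predicate trees rather than evaluate booleans.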

> Lastly examples such as:
> 
> unsigned test4(unsigned x, unsigned y)
> {
>  unsigned ret = 0;
>  unsigned sum = 0;
>  for (int i = 0; i < N; i++)
>  {
>    vect_b[i] = x + i;
>    if (vect_a[i] > x)
>      return vect_a[i];
>  
>   vect_b[i] += x + i;
>   if (vect_a[i] > y)
>       return sum;
>   sum += vect_a[i];
>   vect_a[i] = x;
>  }
>  return ret;
> }
> 
> are tagged as:
> 
> unsigned test4(unsigned x, unsigned y)
> {
>  unsigned ret = 0;
>  unsigned sum = 0;
>  for (int i = 0; i < N; i++)
>  {
>    vect_b[i] = x + i;    :: !(vect_a[i] > y)
>    if (vect_a[i] > x)    :: true
>      return vect_a[i];   :: vect_a[i] > x
>  
>   vect_b[i] += x + i;    :: !(vect_a[i] > y)
>   if (vect_a[i] > y)     :: !(vect_a[i] > x)
>       return sum;        :: vect_a[i] > y
>   sum += vect_a[i];      :: !(vect_a[i] > y)
>   vect_a[i] = x;         :: !(vect_a[i] > y)
>  }
>  return ret;
> }

I'll note all your examples are early-exit ones.  Whatever we build should
ideally be able to handle cases we currently if-convert.

> This representation has a number of benefits:
> 
> 1. code-motion becomes just SLP scheduling, where scheduling has to take
>    care not to schedule blocks across each other when they have a data
>    dependency. This becomes simpler when we build the SLP tree, where we
>    can merge blocks with the same control dependencies.
> 2. this representation also helps for the case where we want to stay inside
>    the vector loop, as the predicate tells us which mask we need to use to
>    compute the reduction values.
> 3. this representation is easily extendable to general control flow support
>    since the statements will all be explicitly guarded by a predicate.  The
>    PHI node can be given a special type, to indicate to codegen that a
>    branch should be generated.

Indeed.

Now, we're for example currently missing the dependence on the loop
mask computation for fully-masked loops as we're not representing its
computation nor the implicit uses on SLP nodes.

Each predicate would be a dependence on the predicate computation itself,
is that predicate computation supposed to be a SLP node?

> Conditions which read from the same sources can be merged (SLP'ed) if the
> statements in the early exit block are the same.
> 
> That is, the above can be handled as:
> 
> unsigned test4(unsigned x, unsigned y)
> {
>  unsigned ret = 0;
>  unsigned sum = 0;
>  for (int i = 0; i < N; i++)
>  {
>    if (vect_a[i] > x)    :: true
>      return vect_a[i];   :: vect_a[i] > x
>  
>    if (vect_a[i] > y)    :: !(vect_a[i] > x)
>      return sum;         :: vect_a[i] > y
>  
>    vect_b[i] = x + i;    :: !(vect_a[i] > y)
>    vect_b[i] += x + i;   :: !(vect_a[i] > y)
>    sum += vect_a[i];     :: !(vect_a[i] > y)
>    vect_a[i] = x;        :: !(vect_a[i] > y)
>  }
>  return ret;
> }
> 
> But the following:
> 
> unsigned test4(unsigned x, unsigned y)
> {
>  unsigned ret = 0;
>  unsigned sum = 0;
>  for (int i = 0; i < N; i++)
>  {
>    vect_b[i] = x + i;    :: !(vect_a[i] > y)
>    if (vect_a[i] > x)    :: true
>      break;              :: vect_a[i] > x
>  
>   vect_b[i] += x + i;    :: !(vect_a[i] > y)
>   if (vect_a[i] > y)     :: !(vect_a[i] > x)
>      break;              :: vect_a[i] > y
>   sum += vect_a[i];      :: !(vect_a[i] > y)
>   vect_a[i] = x;         :: !(vect_a[i] > y)
>  }
>  return ret;
> }
> 
> becomes:
> 
> unsigned test4(unsigned x, unsigned y)
> {
>  unsigned ret = 0;
>  unsigned sum = 0;
>  for (int i = 0; i < N; i++)
>  {
>    if (vect_a[i] > x || vect_a[i] > y)    :: true
>      break;              :: vect_a[i] > x || vect_a[i] > y
>  
>    vect_b[i] = x + i;    :: !(vect_a[i] > y)
>    vect_b[i] += x + i;   :: !(vect_a[i] > y)
>    sum += vect_a[i];     :: !(vect_a[i] > y)
>    vect_a[i] = x;        :: !(vect_a[i] > y)
>  }
>  return ret;
> }
> 
> In addition for this to work the vectorizer must be changed to generate code
> solely based on the SLP tree and not directly from the BB.

Yep - see my comments on the paper above and about the existing issues
with the BB vectorizer.  I'll remind myself that in future SLP build
should start with a single-lane SLP node for each SSA def so we'd have
SLP nodes for both the scalar and the vector part of a BB which might
"solve" this if we chose to never "schedule" scalar SLP nodes
(we currently tend to create cyclic dependences in some cases).

> Note that the control predicates should be easy to determine through the
> post-order dominance tree.

The paper (and the talk) talks about loops and the need to represent
them with a special "AST" node.  I think without doing that this restricts
us to non-loopy CFGs (like no outer loop vectorization)?

> == Implementation ==
> 
> The high level steps for implementing this proposal, in the order I'll
> work on them, are:
> 
> 1. Detach SLP codegen from BB
> 2. Teach SLP build to assign control predicates to relevant statements.
>    It might be easier to do this during loop analysis, but I don't think
>    that'll work for BB SLP. Thoughts?
> 3. Teach optimize SLP to merge blocks with similar control predicates
> 4. Teach SLP scheduling to move blocks
> 5. Teach vectorizable_early_break to produce new CFG

Let me ask you to start with 0. the minimum viable "hack" to make
SLP of early break work.  For 1. to 5. I'd first concentrate on loop SLP
vectorization because there "detached codegen" is trivial (ha, joking
of course).  I'd probably go as far as to re-do SLP discovery between
0. and tackling 1. - SLP discovery for loops, that is.

I think for 0. we want to be able to attach a [set of] predicates to
each SLP node.  With that we can represent loop masking and early
break conditions.  Those [set of] predicates would be an alternate
unsorted vector of "SLP children" and the SLP children would be
the predicate computation SLP nodes.  We'd have the chance to
combine the set of predicates down to one by adding intermediate
merging SLP nodes (in the end code generation just accepts one
predicate but discovery for simplicity(?) would add many, or two).

For early break and without 1. we'd have to "fix" the position of
some SLP nodes in the CFG, namely the "cbranch" SLP node (maybe
that's just a predicate compute node of special kind).  I think
I mentioned adding an optional gimple_stmt_iterator to SLP nodes
to support this (or special-case the cbranch node during scheduling
by looking at the gcond * that's inevitably there).  The loop
mask compute would get a gsi_after_labels for example.

In principle if we have this as 0. then 1. should become just
an engineering exercise.

I'm trying to make you focus on 0. as I'm really eager to make
the only-SLP transition this stage1 and doing all the neat stuff
will take longer.

> == Switch vectorization support ==
> 
> The following code
> 
> short a[100];
>  
> int foo(int n, int counter)
> {
>    for (int i = 0; i < n; i++)
>      {
>         if (a[i] == 1 || a[i] == 2 || a[i] == 7 || a[i] == 4)
>           return 1;
>      }
>     return 0;
> }
> 
> fails to vectorize at -O3 -march=armv9-a because in GIMPLE the if is rewritten
> into a switch statement:
> 
>   <bb 2> [local count: 114863530]:
>   if (n_6(D) > 0)
>     goto <bb 6>; [94.50%]
>   else
>     goto <bb 11>; [5.50%]
>  
>   <bb 6> [local count: 108546036]:
>  
>   <bb 3> [local count: 1014686025]:
>   # i_3 = PHI <i_8(7), 0(6)>
>   _1 = a[i_3];
>   switch (_1) <default: <L9> [94.50%], case 1 ... 2: <L11> [5.50%], case 4: <L11> [5.50%], case 7: <L11> [5.50%]>
>  
>   <bb 9> [local count: 55807731]:
> <L11>:
>   goto <bb 5>; [100.00%]
>  
>   <bb 4> [local count: 958878295]:
> <L9>:
>   i_8 = i_3 + 1;
>   if (n_6(D) > i_8)
>     goto <bb 7>; [94.50%]
>   else
>     goto <bb 12>; [5.50%]
>  
>   <bb 12> [local count: 52738306]:
>   goto <bb 8>; [100.00%]
>  
>   <bb 7> [local count: 906139989]:
>   goto <bb 3>; [100.00%]
>  
>   <bb 11> [local count: 6317494]:
>  
>   <bb 8> [local count: 59055800]:
>  
>   <bb 5> [local count: 114863531]:
>   # _5 = PHI <1(9), 0(8)>
> <L10>:
>   return _5;
> 
> However, such switch statements, where all the entries lead to the same
> destination, are easy to vectorize.  In SVE we have the MATCH instruction
> that can be used here, and for other targets we can duplicate the constants
> and lower the switch to a series of compares and ORRs.
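Spelling out the proposed lowering on the example above (a hedged sketch of what if-conversion could emit, not actual GCC output; match_any is a made-up name): the multi-case switch becomes a chain of equality compares reduced with OR, which vectorizes into compare/ORR sequences or, on SVE, a single MATCH:

```c
#include <assert.h>

/* Scalar form of the lowered switch from the example: all cases lead
   to the same destination, so the test reduces to an OR of equality
   compares.  */
static int match_any (short v)
{
  return (v == 1) | (v == 2) | (v == 7) | (v == 4);
}
```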
> 
> This is similar to what's done when the values don't fit inside a switch:
> 
> short a[100];
>  
> int foo(int n, int counter, short x, short b, short c)
> {
>    for (int i = 0; i < n; i++)
>      {
>         if (a[i] == x || a[i] == b || a[i] == 7 || a[i] == c)
>           return 1;
>      }
>     return 0;
> }
> 
> is vectorized as:
> 
> .L4:
>         incb    x5
>         incw    z30.s, all, mul #2
>         cmp     w6, w1
>         bcc     .L15
> .L6:
>         ld1h    z31.h, p7/z, [x5]
>         cmpeq   p15.h, p7/z, z31.h, z27.h
>         cmpeq   p11.h, p7/z, z31.h, z28.h
>         cmpeq   p14.h, p7/z, z31.h, #7
>         cmpeq   p12.h, p7/z, z31.h, z29.h
>         orr     p15.b, p7/z, p15.b, p11.b
>         orr     p14.b, p7/z, p14.b, p12.b
>         inch    x1
>         orr     p15.b, p7/z, p15.b, p14.b
>         ptest   p13, p15.b
>         b.none  .L4
>         umov    w1, v30.s[0]
> .L3:
>         sxtw    x1, w1
>         b       .L7
> 
> which is great, but it should be an SVE MATCH instruction as well.
> 
> This kind of code shows up through the use of std::find_if as well:
> 
> #include <bits/stdc++.h>
> using namespace std;
>   
> bool pred(int i) { return i == 1 || i == 2 || i == 7 || i == 4; }
>  
> int foo(vector<int> vec)
> {
>     vector<int>::iterator it;
>  
>     it = find_if(vec.begin(), vec.end(), pred);
>   
>     return *it;
> }
> 
> and once the unrolled loop is gone we should be able to vectorize it.
> 
> My proposal is to create two new IFNs and optabs:
> 
> IFN_MATCH and IFN_NMATCH, and have a vectorizer pattern recognize the ORR
> reduction cases as MATCH and have ifcvt lower the switch into ORR
> statements.
> 
> Such switch statements only have two branches and so are identical to cbranch
> for codegen and vectorization.
> 
> The unrolled ORRs are replaced by MATCH and so the gimple becomes:
> 
> a = .MATCH (...)
> if (a != {0,..})
> and so no other change is needed for codegen.

This sounds separate from the rest, I think teaching if-conversion to
lower (small, or simple by some measure) switches is the way to go
for now.  There's existing code to produce ifs from switch and those
can be if-converted in the classical way (in the if (vectorized) loop
copy only, of course).

An alternative with IFN_MATCH sounds good in principle.
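For reference, the semantics the proposed IFN_MATCH would presumably have, modeled on SVE MATCH (a hedged scalar sketch; ifn_match_model is a made-up name, not the IFN's real interface): lane i of the output mask is set when lane i of the input equals any element of the match table:

```c
#include <assert.h>

/* Scalar model of MATCH-style semantics: out[i] is 1 iff in[i] equals
   any of the m entries in table.  */
static void ifn_match_model (const short *in, int n,
			     const short *table, int m, int *out)
{
  for (int i = 0; i < n; i++)
    {
      out[i] = 0;
      for (int j = 0; j < m; j++)
	if (in[i] == table[j])
	  out[i] = 1;
    }
}
```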

> == Better cbranch support ==
> 
> At the moment, one of the big downsides of re-using the existing cbranch is
> that in the target we can't tell whether the result of the cbranch is
> actually needed or not.
> 
> i.e. for SVE we can't tell if the predicate is needed.  For the cases where
> we don't stay inside the vector loop we can generate more efficient code if
> we know that the loop only cares about any or all bits set and doesn't need
> to know which one.
> 
> For this reason I propose adding new optabs cbranch_any_ and cbranch_all_
> and have emit_cmp_and_jump_insns lower to these when appropriate.

Hmm, but isn't this then more a vec_cmpeq_any that produces a scalar
rather than a vector, followed by a regular scalar compare-and-jump?  That
is, does SVE have such a compare instruction?  Can you show the current
code-gen and how the improved one would look?

> Are the general idea and steps OK?

See above.  Thanks for the write-up.

Richard.

> If so I'll start implementation now.
> 
> Thanks,
> Tamar
> 
> [1] Yishen Chen, Charith Mendis, and Saman Amarasinghe. 2022. All
> You Need Is Superword-Level Parallelism: Systematic Control-Flow
> Vectorization with SLP. In Proceedings of the 43rd ACM SIGPLAN
> International Conference on Programming Language Design and
> Implementation (PLDI '22), June 13-17, 2022, San Diego, CA, USA.
> ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3519939.3523701
> 
> [2] Predicated Static Single Assignment.
> Lori Carter, Beth Simon, Brad Calder, Larry Carter, Jeanne Ferrante.
> Department of Computer Science and Engineering,
> University of California, San Diego.
> flcarter,esimon,calder,carter,ferran...@cs.ucsd.edu
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
