On Tue, 30 Jul 2024, Tamar Christina wrote: > > -----Original Message----- > > From: Richard Biener <rguent...@suse.de> > > Sent: Thursday, July 18, 2024 10:00 AM > > To: Tamar Christina <tamar.christ...@arm.com> > > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Sandiford > > <richard.sandif...@arm.com> > > Subject: RE: [RFC][middle-end] SLP Early break and control flow support in > > GCC > > > > On Wed, 17 Jul 2024, Tamar Christina wrote: > > > > > > -----Original Message----- > > > > From: Richard Biener <rguent...@suse.de> > > > > Sent: Tuesday, July 16, 2024 4:08 PM > > > > To: Tamar Christina <tamar.christ...@arm.com> > > > > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Sandiford > > > > <richard.sandif...@arm.com> > > > > Subject: Re: [RFC][middle-end] SLP Early break and control flow support > > > > in GCC > > > > > > > > On Mon, 15 Jul 2024, Tamar Christina wrote: > > > > > > > > > Hi All, > > > > > > > > > > This RFC document covers at a high level how to extend early break > > > > > support in > > > > > GCC to support SLP and how this will be extended in the future to > > > > > support > > > > > full control flow in GCC. > > > > > > > > > > The basic idea in this is based on the paper "All You Need Is > > > > > Superword-Level > > > > > Parallelism: Systematic Control-Flow Vectorization with SLP"[1] but > > > > > it is > > > > > adapted to fit within the GCC vectorizer. > > > > > > > > An interesting read - I think the approach is viable for loop > > > > vectorization where we schedule the whole vectorized loop but difficult > > > > for basic-block vectorization as they seem to re-build the whole > > > > function > > > > from scratch. They also do not address how to code-generate predicated > > > > not vectorized code or how they decide to handle mixed vector/non-vector > > > > code at all. 
For example I don't think any current CPU architecture > > > > supports a full set of predicated _scalar_ ops and not every scalar > > > > op would have a vector equivalent in case one would use single-lane > > > > vectors. > > > > > > Hmm I'm guessing you mean here, they don't address for BB vectorization > > > how to deal with the fact that you may not always be able to vectorize the > > > entire function up from the seed? I thought today we dealt with that by > > > splitting during discovery? Can we not do the same? i.e. treat them as > > > externals? > > > > Currently not all scalar stmts are even reached by the seeds we use. > > They seem to simply rewrite the whole function into predicated form > > and I don't see how that is a transform that gets you back 1:1 the old > > code (or code of at least the same quality) if not all statements end > > up vectorized? > > > > Sure we could build up single-lane SLP instances for "everything", > > but then how do you code-generate these predicated single-lane SLP > > instances? > > > > > > > > > > For GCCs basic-block vectorization the main outstanding issue is one > > > > of scheduling and dependences with scalar code (live lane extracts, > > > > vector builds from scalars) as well. > > > > > > > > > What will not be covered is the support for First-Faulting Loads nor > > > > > alignment peeling as these are done by a different engineer. > > > > > > > > > > == Concept and Theory == > > > > > > > > > > Supporting Early Break in SLP requires the same theory as general > > > > > control > > flow, > > > > > the only difference is in that Early break can be supported for > > > > > non-masked > > > > > architectures while full control flow support requires a fully masked > > > > > architecture. > > > > > > > > > > In GCC 14 Early break was added for non-SLP by teaching the > > > > > vectorizer to > > > > branch > > > > > to a scalar epilogue whenever an early exit is taken. 
This means the > > > > > vectorizer > > > > > itself doesn't need to know how to perform side-effects or final > > > > > reductions. > > > > With current GCC we avoid the need of predicated stmts by using the > > > > scalar > > > > epilog and a branch-on-all-true/false stmt. To make that semantically > > > > valid stmts are moved around. > > > > > The additional benefit of this is that it works for any target > > > > > providing a > > > > > vector cbranch optab implementation since we don't require masking > > support. > > In > > > > > order for this to work we need to have code motion that moves side > > > > > effects > > > > > (primarily stores) into the latch basic block. i.e. we delay any > > > > > side-effects > > > > > up to a point where we know the full vector iteration would have been > > > > performed. > > > > > For this to work, however, we had to disable support for epilogue > > > > > vectorization > > as > > > > > when the scalar statements are moved we break the link to the > > > > > original BB > > they > > > > > were in. This means that the stmt_vinfo for the stores that need to > > > > > be moved > > > > > will no longer be valid for epilogue vectorization and the only way > > > > > to recover > > > > > this would be to perform a full dataflow analysis again. We decided > > > > > against > > > > > this as the plan of record was to switch to SLP. > > > > > > > > > > -- Predicated SSA -- > > > > > > > > > > The core of the proposal is to support a form of Predicated SSA > > > > > (PSSA) in the > > > > > vectorizer [2]. The idea is to assign a control predicate to every SSA > > > > > statement. This control predicate denotes their dependencies wrt > > > > > the BB > > they > > > > > are in and defines the dependencies between statements. 
> > > > > > > > > > As an example the following early break code: > > > > > > > > > > unsigned test4(unsigned x) > > > > > { > > > > > unsigned ret = 0; > > > > > for (int i = 0; i < N; i++) > > > > > { > > > > > vect_b[i] = x + i; > > > > > if (vect_a[i]*2 != x) > > > > > break; > > > > > vect_a[i] = x; > > > > > > > > > > } > > > > > return ret; > > > > > } > > > > > > > > > > is transformed into > > > > > > > > > > unsigned test4(unsigned x) > > > > > { > > > > > unsigned ret = 0; > > > > > for (int i = 0; i < N; i++) > > > > > { > > > > > vect_b[i] = x + i; :: !(vect_a[i]*2 != x) > > > > > > > > why's this not 'true'? > > > > > > > > > > I think there are three cases here: > > > > > > 1. control flow with no exit == true, in this case we > > > can perform statements with side-effects in place as > > > you never exit the loop early. > > > 2. early break non-masked, this one requires the store > > > to be moved to the latch, or the block with the last > > > early break. This is what we do today, and the predicate > > > would cause it to be moved. > > > 3. early break masked, In this case I think we need an exit > > > block that performs any side effects masked. Since on every > > > exit you must still perform the stores, but based on the union > > > of the masks you have so far. The final one should still be moved > > > to the latch block. 
> > > > > > > > if (vect_a[i]*2 != x) :: true > > > > > break; :: vect_a[i]*2 != x > > > > > vect_a[i] = x; :: !(vect_a[i]*2 != x) > > > > > > > > > > } > > > > > return ret; > > > > > } > > > > > > > > > > A more complicated example: > > > > > > > > > > int foo (int n) > > > > > { > > > > > int res = 0; > > > > > for (int i = 0; i < n; i++) > > > > > { > > > > > y[i] = x[i] * 2; :: true > > > > > res += x[i] + y[i]; :: true > > > > > > > > > > if (x[i] > 5) :: true > > > > > break; :: x[i] > 5 > > > > > > > > > > if (z[i] > 5) :: !(x[i] > 5) > > > > > break; :: !(x[i] > 5) && (z[i] > 5) > > > > > } > > > > > > > > > > return res; > > > > > } > > > > > > > > > > any statement guarded by a block with a non-true predicate can be > > > > > simplified, > > so > > > > > the above can be turned into > > > > > > > > > > int foo (int n) > > > > > { > > > > > int res = 0; > > > > > for (int i = 0; i < n; i++) > > > > > { > > > > > y[i] = x[i] * 2; :: true > > > > > res += x[i] + y[i]; :: true > > > > > > > > > > if (x[i] > 5) :: true > > > > > break; :: x[i] > 5 > > > > > > > > > > if (z[i] > 5) :: !(x[i] > 5) > > > > > break; :: (z[i] > 5) > > > > > } > > > > > > > > > > return res; > > > > > } > > > > > > > > > > if we make the predicates a *gimple, where true is just NULL; > > > > > > > > I'd make predicates a SSA name instead (or possibly abstract this > > > > so we can more easily change our minds) - the disadvantage is > > > > that gcond * doesn't have a SSA def. > > > > > > > > We might want to look into tree-if-conv.cc (remotely possibly into > > > > gimple-predicate-analysis.{h,cc}) since there are multiple places in > > > > GCC where we need to maintain a composition of individual predicates > > > > (and possibly compare / simplify such compositions). > > > > > > > > > > Aha, good shout. 
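The predicated execution of the test4 example above can be exercised in plain scalar C. The following is a hedged sketch (not GCC code, and the vector factor, array sizes, and helper names are all invented for illustration): the side-effecting stores carry the block predicate `!(vect_a[i]*2 != x)`, so they are only committed in the latch once no lane takes the early exit; on an exit we fall back to a scalar epilogue, as GCC 14 does for non-SLP early break.

```c
#include <assert.h>
#include <string.h>

#define N 16
#define VF 4
static unsigned vect_a[N], vect_b[N];

/* Scalar reference: exactly the loop body of test4 above.  */
static void test4_scalar (unsigned x)
{
  for (int i = 0; i < N; i++)
    {
      vect_b[i] = x + i;
      if (vect_a[i] * 2 != x)
        break;
      vect_a[i] = x;
    }
}

/* Emulated vector loop: compute the exit predicate for all VF lanes
   first; the stores are predicated on no lane exiting and thus only
   committed in the latch.  On an exit, branch to the scalar epilogue. */
static void test4_vector (unsigned x)
{
  int i = 0;
  for (; i + VF <= N; i += VF)
    {
      int exit_p = 0;                      /* cbranch predicate */
      for (int l = 0; l < VF; l++)
        exit_p |= (vect_a[i + l] * 2 != x);
      if (exit_p)
        break;                             /* to scalar epilogue */
      for (int l = 0; l < VF; l++)         /* latch: delayed stores */
        {
          vect_b[i + l] = x + i + l;
          vect_a[i + l] = x;
        }
    }
  for (; i < N; i++)                       /* scalar epilogue */
    {
      vect_b[i] = x + i;
      if (vect_a[i] * 2 != x)
        break;
      vect_a[i] = x;
    }
}

/* Initialize so the first K iterations survive the break (x even),
   run both versions, and compare every side effect.  */
static int side_effects_agree (unsigned x, int k)
{
  unsigned ref_a[N], ref_b[N];
  for (int i = 0; i < N; i++)
    {
      vect_a[i] = (i < k ? x / 2 : x / 2 + 1);
      vect_b[i] = 99;
    }
  test4_scalar (x);
  memcpy (ref_a, vect_a, sizeof ref_a);
  memcpy (ref_b, vect_b, sizeof ref_b);
  for (int i = 0; i < N; i++)
    {
      vect_a[i] = (i < k ? x / 2 : x / 2 + 1);
      vect_b[i] = 99;
    }
  test4_vector (x);
  return memcmp (ref_a, vect_a, sizeof ref_a) == 0
         && memcmp (ref_b, vect_b, sizeof ref_b) == 0;
}
```

Note the re-execution of `vect_b[i] = x + i` on the breaking iteration is safe because the delayed stores were never committed for a partially executed vector iteration; this is exactly why the store must sit in the latch.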
> > > > > > > > Lastly examples such as: > > > > > > > > > > unsigned test4(unsigned x, unsigned y) > > > > > { > > > > > unsigned ret = 0; > > > > > unsigned sum = 0; > > > > > for (int i = 0; i < N; i++) > > > > > { > > > > > vect_b[i] = x + i; > > > > > if (vect_a[i] > x) > > > > > return vect_a[i]; > > > > > > > > > > vect_b[i] += x + i; > > > > > if (vect_a[i] > y) > > > > > return sum; > > > > > sum += vect_a[i]; > > > > > vect_a[i] = x; > > > > > } > > > > > return ret; > > > > > } > > > > > > > > > > are tagged as: > > > > > > > > > > unsigned test4(unsigned x, unsigned y) > > > > > { > > > > > unsigned ret = 0; > > > > > unsigned sum = 0; > > > > > for (int i = 0; i < N; i++) > > > > > { > > > > > vect_b[i] = x + i; :: !(vect_a[i] > y) > > > > > if (vect_a[i] > x) :: true > > > > > return vect_a[i]; :: vect_a[i] > x > > > > > > > > > > vect_b[i] += x + i; :: !(vect_a[i] > y) > > > > > if (vect_a[i] > y) :: !(vect_a[i] > x) > > > > > return sum; :: vect_a[i] > y > > > > > sum += vect_a[i]; :: !(vect_a[i] > y) > > > > > vect_a[i] = x; :: !(vect_a[i] > y) > > > > > } > > > > > return ret; > > > > > } > > > > > > > > I'll note you have all early-exit examples. Whatever we build should > > > > ideally able to handle cases we currently if-convert. > > > > > > > > > > Yeah, I think cases we currently if-convert and don't if-convert can be > > > handled approximately the same way. Though we might want to have > > > some kind of lowering phase that transforms the block into conditional > > > statements when we can. Probably in optimize_slp? > > > > But isn't this handled by the rewrite to predicated form? Of course > > you would need to handle PHIs then (with early exit we have/allow no > > in-loop PHIs), but those are simply COND_EXPRs (or blends when > > vectorized). > > > > > > > This representation has a number of benefits: > > > > > > > > > > 1. 
code-motion becomes just SLP scheduling, where scheduling has to > > > > > take > > > > account > > > > > not to schedule blocks across each other which have a data > > > > > dependency. This > > > > > becomes simpler since when we build the SLP tree we can merge blocks > > > > > with the > > > > same > > > > > control dependencies. > > > > > 2. this representation also helps for the case where we want to stay > > > > > inside the > > > > > vector loop, as the predicate tells us which mask we need to use > > > > > to compute > > > > > the reduction values. > > > > > 3. this representation is easily extendable to general control flow > > > > > support > > > > > since the statements will all be explicitly guarded by a > > > > > predicate. The PHI > > > > > node can be given a special type, to indicate to codegen that a > > > > > branch > > should > > > > > be generated. > > > > > > > > Indeed. > > > > > > > > Now, we're for example currently missing the dependence on the loop > > > > mask computation for fully-masked loops as we're not representing its > > > > computation nor the implicit uses on SLP nodes. > > > > > > > > Each predicate would be a dependence on the predicate computation > > > > itself, > > > > is that predicate computation supposed to be a SLP node? > > > > > > This is where I was hoping to get some feedback into Richard S's plans. > > > My understanding is that he'll be working on BB SLP support for VLA and it > > > looks to me like we could use the same bookkeeping here. > > > > I think you are talking about allowing a subset of active lanes from the > > real vector? For BB SLP the static constant mask is kind-of implicit > > with the number of lanes in the SLP node, the main issue we have here > > currently is that we have to mask (or fill with safe values) the inactive > > lanes when we have a "gap in the vector". 
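Point 1 above, scheduling that must not move blocks across a data dependency, is in essence a topological sort over the dependence graph. A minimal sketch, where the node numbering and edge list are made up for the example and predicates would simply appear as extra edges:

```c
#include <assert.h>

#define NNODES 5

/* dep[i][j] != 0 means node j must be scheduled before node i
   (i consumes a value or predicate produced by j).  */
static int schedule (const int dep[NNODES][NNODES], int order[NNODES])
{
  int emitted[NNODES] = { 0 };
  int n = 0;
  while (n < NNODES)
    {
      int progress = 0;
      for (int i = 0; i < NNODES; i++)
        {
          if (emitted[i])
            continue;
          int ready = 1;
          for (int j = 0; j < NNODES; j++)
            if (dep[i][j] && !emitted[j])
              ready = 0;
          if (ready)
            {
              order[n++] = i;
              emitted[i] = 1;
              progress = 1;
            }
        }
      if (!progress)
        return -1;   /* dependence cycle: cannot schedule */
    }
  return 0;
}

static int pos_of (const int order[NNODES], int node)
{
  for (int n = 0; n < NNODES; n++)
    if (order[n] == node)
      return n;
  return -1;
}

/* Invented example: 0 = load, 1 = predicate compute, 2 = add (uses 0),
   3 = masked store (uses 2 and predicate 1), 4 = cbranch (uses 1).  */
static const int example_dep[NNODES][NNODES] = {
  [2] = { [0] = 1 },
  [3] = { [1] = 1, [2] = 1 },
  [4] = { [1] = 1 },
};

/* Every producer (including predicate producers) precedes its users.  */
static int check_example (void)
{
  int order[NNODES];
  if (schedule (example_dep, order) != 0)
    return 0;
  return pos_of (order, 0) < pos_of (order, 2)
         && pos_of (order, 2) < pos_of (order, 3)
         && pos_of (order, 1) < pos_of (order, 3)
         && pos_of (order, 1) < pos_of (order, 4);
}
```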
But sure, we could trigger > > the masking logic itself by adding a all-ones mask (all lanes in the > > SLP node are active), but the main issue will be to support "gaps" > > at the end of vectors. > > > > > My initial thought was to indeed just have a new node similar to how > > > TWO_OPERANDS is implemented. Except instead of having a permute node > > > it would have a mask node. This should also be usable when we're > > > building > > > an SLP code from statements where we could build the tree if masking is > > > applied: > > > i.e. > > > > > > a[i] = b[i] > > > a[i+1] = b[i+1] * 5 > > > > > > should be SLP'able by masking out the second operand. > > > > > > I think an SLP node makes sense because we then don't need to keep a > > > separate > > CFG. > > > > Good. > > > > > > > > > > > conditions which read from the same sources can be merged (SLP'ed) if > > > > > the > > > > > statements in the early exit block are the same. > > > > > > > > > > That is, the above can be handled as: > > > > > > > > > > unsigned test4(unsigned x, unsigned y) > > > > > { > > > > > unsigned ret = 0; > > > > > unsigned sum = 0; > > > > > for (int i = 0; i < N; i++) > > > > > { > > > > > if (vect_a[i] > x) :: true > > > > > return vect_a[i]; :: vect_a[i] > x > > > > > > > > > > if (vect_a[i] > y) :: !(vect_a[i] > x) > > > > > return sum; :: vect_a[i] > y > > > > > > > > > > vect_b[i] = x + i; :: !(vect_a[i] > y) > > > > > vect_b[i] += x + i; :: !(vect_a[i] > y) > > > > > sum += vect_a[i]; :: !(vect_a[i] > y) > > > > > vect_a[i] = x; :: !(vect_a[i] > y) > > > > > } > > > > > return ret; > > > > > } > > > > > > > > > > But the following: > > > > > > > > > > unsigned test4(unsigned x, unsigned y) > > > > > { > > > > > unsigned ret = 0; > > > > > unsigned sum = 0; > > > > > for (int i = 0; i < N; i++) > > > > > { > > > > > vect_b[i] = x + i; :: !(vect_a[i] > y) > > > > > if (vect_a[i] > x) :: true > > > > > break; :: vect_a[i] > x > > > > > > > > > > vect_b[i] += x + i; :: !(vect_a[i] > 
y) > > > > > if (vect_a[i] > y) :: !(vect_a[i] > x) > > > > > break; :: vect_a[i] > y > > > > > sum += vect_a[i]; :: !(vect_a[i] > y) > > > > > vect_a[i] = x; :: !(vect_a[i] > y) > > > > > } > > > > > return ret; > > > > > } > > > > > > > > > > becomes: > > > > > > > > > > unsigned test4(unsigned x, unsigned y) > > > > > { > > > > > unsigned ret = 0; > > > > > unsigned sum = 0; > > > > > for (int i = 0; i < N; i++) > > > > > { > > > > > if (vect_a[i] > x || vect_a[i] > y) :: true > > > > > break; :: vect_a[i] > x > > > > > > > > > > vect_b[i] = x + i; :: !(vect_a[i] > y) > > > > > vect_b[i] += x + i; :: !(vect_a[i] > y) > > > > > sum += vect_a[i]; :: !(vect_a[i] > y) > > > > > vect_a[i] = x; :: !(vect_a[i] > y) > > > > > } > > > > > return ret; > > > > > } > > > > > > > > > > In addition for this to work the vectorizer must be changed to > > > > > generate code > > > > > solely based on the SLP tree and not directly from the BB. > > > > > > > > Yep - see my comments on the paper above and about the existing issues > > > > with the BB vectorizer. I'll remind myself that in future SLP build > > > > should start with a single-lane SLP node for each SSA def so we'd have > > > > SLP nodes for both the scalar and the vector part of a BB which might > > > > "solve" this if we chose to never "schedule" scalar SLP nodes > > > > (we currently tend to create cyclic dependences in some cases). > > > > > > > > > Note that the control predicates should be easy to determine > > > > > through > > the > > > > > post-order dominance tree. > > > > > > > > The paper (and the talk) talks about loops and the need to represent > > > > them with a special "AST" node. I think without doing that this > > > > restricts > > > > us to non-loopy CFGs (like no outer loop vectorization)? > > > > > > Yeah, the outer loop vectorization approach from the paper also looked > > interesting. > > > The approach would theoretically allow for loop merging as well. 
Though > > > I think > > that > > > requires you to try BB SLP before loop vectorization. > > > > Note in the talk (I didn't read the paper thoroughly) they "cheat" and > > perform loop fusion after rewriting into conditional IL and only then > > perform SLP rather than doing the fusion on-demand. > > > > As said I'd first see predication as a way to get rid of loop > > if-conversion, merging that with the vectorizer itself, and of course > > to get early break supported with SLP. I'd look at expanding BB > > vectorization later - similar for the idea of rewriting SLP discovery > > (which predication will likely require to some extent). I'm not sure > > we want to go fully no-CFG, I'd rather have some SLP nodes "tied" to > > a spot in the CFG for now and get "CFG aware SLP scheduling" this way. > > I guess the problem here is that this way we won't be able to "SLP" CFG, > At the moment, in GIMPLE compound expressions get broken up into > multiple exits. So if (a[i] != 0 ||a[i+1] !=0) return; becomes two exits. > > The idea with going CFG-less is that we should be able to SLP the control > flow. We should also be able to re-arrange the order of exits to e.g. test > the cheapest one first.
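Both of these rewrites, merging two exits into one cbranch on the union of their conditions and testing a cheaper condition first, can be sanity-checked in scalar C. A hedged sketch with the two exit conditions borrowed from the earlier foo example; all function names are invented, and the claim checked is only that the taken exit is unchanged (side effects on the breaking iteration are replayed by the scalar epilogue):

```c
#include <assert.h>

/* Exit taken by the original control flow: exit 1 has priority.  */
static int exit_taken_orig (unsigned a, unsigned z)
{
  if (a > 5)
    return 1;
  if (z > 5)
    return 2;
  return 0;  /* no exit this iteration */
}

/* Merged form: one cbranch on the union of the conditions; once it
   fires, the original predicates still decide which exit is taken.
   The (hypothetically) cheaper z-test is evaluated first.  */
static int exit_taken_merged (unsigned a, unsigned z)
{
  int any = (z > 5) || (a > 5);   /* union, cheap test first */
  if (!any)
    return 0;
  return a > 5 ? 1 : 2;           /* priority from original order */
}

/* Exhaustive check on a small domain.  */
static int exits_agree (void)
{
  for (unsigned a = 0; a < 12; a++)
    for (unsigned z = 0; z < 12; z++)
      if (exit_taken_orig (a, z) != exit_taken_merged (a, z))
        return 0;
  return 1;
}
```

The point of the sketch is that reordering the *tests* is legal exactly because the union of the exit predicates is commutative, while the *choice* of destination still needs the individual predicates in original priority order.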
Note that this is an optimization problem even for scalar code: having a unified view of conditions with control flow and (partially) if-converted sub-conditions and the ability to re-associate that and emit control dependent statements in correct places is something we cannot do there either. > So I think I agree that at least for version 0 we should > stick with requiring the CFG; no explicit CFG is probably better for future > versions? It's something worth exploring. > > For example dependence analysis relies on a vector load to be emitted > > at the earliest scalar load position of the set of loads vectorized > > and for the store to be emitted at the latest scalar store position > > of the set of stores vectorized. Without also representing memory > > data dependences in the SLP graph we have to preserve that > > (you'd have to add SLP edges for virtual operands). The paper doesn't > > really cover this detail very explicitly AFAICS (how memory dependences > > are preserved when re-writing the IL). > > Yeah, I noticed this gap as well, they don't seem to handle statements with > side-effects. So for that I had extended it with my own scheme. I believe > this essentially just requires different predicates for statements with > side-effects. > > At least on paper it works out, so it's a simple extension of the idea. Sure - you just need to model all kinds of dependences as extra predicates or data dependences. The simplified .MEM_n SSA is a linear number of edges but "true" dependences are of course quadratic. > > > > > > > > > > > > > > == Implementation == > > > > > > > > > > The high level steps for this implementation and the > > > > > order I'll > > > > > work in are: > > > > > > > > > > 1. Detach SLP codegen from BB > > > > > 2. Teach build SLP to assign control predicates to relevant > > > > > statements. It might > > > > > be easier to do this during loop analysis, but I don't think > > > > > that'll work > > > > > for BB SLP. Thoughts? 
> > > > > 3. Teach optimize SLP to merge blocks with similar control predicates > > > > > 4. Teach SLP scheduling to move blocks > > > > > 5. Teach vectorizable_early_break to produce new CFG > > > > > > > > Let me ask you to start with 0. the minimum viable "hack" to make > > > > SLP of early break work. For 1. to 5. I'd first concentrate on loop SLP > > > > vectorization because there "detached codegen" is trivial (ha, joking > > > > of course). I'd probably go as far as to re-do SLP discovery between > > > > 0. and tackling 1. - SLP discovery for loops, that is. > > > > > > Sure, that's fair enough. I have some Arm specific patches to get out > > > first, > > > which I'll do this week and can start on this next Tuesday. > > > > > > I just wanted to outline what I'm thinking for the longer term plan. > > > > Sounds good. Just to re-iterate my plan is to separate SLP discovery > > for loop vectorization and BB vectorization - I'll motivate that > > with some of the SLP-only patches that are queued to after when I > > have managed to make myself happy with the load/store permute handling ... > > > > > > > > > > I think for 0. we want to be able to attach a [set of] predicates to > > > > each SLP node. With that we can represent loop masking and early > > > > break conditions. Those [set of] predicates would be an alternate > > > > unsorted vector of "SLP children" and the SLP children would be > > > > the predicate computation SLP nodes. We'd have the chance to > > > > combine the set of predicates down to one by adding intermediate > > > > merging SLP nodes (in the end code generation just accepts one > > > > predicate but discovery for simplicity(?) would add many, or two). > > > > > > > > > > So if I understood correctly, during discovery do we maintain breaks > > > in the same instance or do we represent them using these merge nodes > > > which are essentially a sort of PHI node? 
So they represent your branch > > > in control flow and the predicates the jump locations? > > > > I think stmts after a break will pick up the predicate SLP nodes from > > the break but yes, the break stmt itself (the cbranch) would be the > > seed of a separate SLP instance. Caching of the predicate SLP > > nodes should probably be tied to the basic-block (similar to how > > if-conversion does this). Whether to build SLP nodes to "and" > > two exit predicates or whether to record them in a vector of > > and_predicates in each SLP node is probably something that experiments > > will tell - having a representation of predicates not "lowered" to > > SLP operations is probably more convenient for simplification and/or > > merging with loop mask predicates, but for codegen and costing having > > the actual predicate merging ops spelled out is preferred. > > > > > > For early break and without 1. we'd have to "fix" the position of > > > > some SLP nodes in the CFG, namely the "cbranch" SLP node (maybe > > > > that's just a predicate compute node of special kind). I think > > > > I mentioned adding an optional gimple_stmt_iterator to SLP nodes > > > > to support this (or special-case the cbranch node during scheduling > > > > by looking at the gcond * that's inevitably there). The loop > > > > mask compute would get a gsi_after_labels for example. > > > > > > > > In principle if we have this as 0. then 1. should become just > > > > an engineering exercise. > > > > > > OK, and so if I understand correctly, code motion becomes a matter of > > > pushing nodes down the instance until the merge predicate simplifies? > > > > > > Or do we keep them in place, and just materialize them in the correct > > > BB? > > > > > > I imagine that for 0 not doing 1 would be easier as we can just re-use > > > the scalar BB as is today and not have to go down the path of being > > > able to generate new CFG yet. 
> > > > Yes, as said I'd first do only 0 and record fixed scheduling points > > for SLP nodes where that's necessary (like for the cbranch node). > > It might be that the code motion that's required for correctness(!) > > happens "magically", but I'm not sure we want to rely on that ;) > > ISTR we only push stmts down after the exits, so predicating them > > with the in-loop block predicate should get them scheduled appropriately? > > > > > > > > > > I'm trying to make you focus on 0. as I'm really eager to make > > > > the only-SLP transition this stage1 and doing all the neat stuff > > > > will take longer. > > > > > > That is fair, I'll move it up my priority list. I'm hoping to get as much > > > of > > > this done during this stage-1 (spent a lot of time thinking about it). > > > > > > So I'll get my AArch64 patches out and start on 0. I'll continue on > > > IV opts in the background. > > > > Thanks a lot. > > > > > > > > > > > == Switch vectorization support == > > > > > > > > > > The following code > > > > > > > > > > short a[100]; > > > > > > > > > > int foo(int n, int counter) > > > > > { > > > > > for (int i = 0; i < n; i++) > > > > > { > > > > > if (a[i] == 1 || a[i] == 2 || a[i] == 7 || a[i] == 4) > > > > > return 1; > > > > > } > > > > > return 0; > > > > > } > > > > > > > > > > fails to vectorize at -O3 -march=armv9-a because in GIMPLE the if is > > > > > rewritten > > > > > into a switch statement: > > > > > > > > > > <bb 2> [local count: 114863530]: > > > > > if (n_6(D) > 0) > > > > > goto <bb 6>; [94.50%] > > > > > else > > > > > goto <bb 11>; [5.50%] > > > > > > > > > > <bb 6> [local count: 108546036]: > > > > > > > > > > <bb 3> [local count: 1014686025]: > > > > > # i_3 = PHI <i_8(7), 0(6)> > > > > > _1 = a[i_3]; > > > > > switch (_1) <default: <L9> [94.50%], case 1 ... 
2: <L11> [5.50%], > > > > > case 4: > > <L11> > > > > [5.50%], case 7: <L11> [5.50%]> > > > > > > > > > > <bb 9> [local count: 55807731]: > > > > > <L11>: > > > > > goto <bb 5>; [100.00%] > > > > > > > > > > <bb 4> [local count: 958878295]: > > > > > <L9>: > > > > > i_8 = i_3 + 1; > > > > > if (n_6(D) > i_8) > > > > > goto <bb 7>; [94.50%] > > > > > else > > > > > goto <bb 12>; [5.50%] > > > > > > > > > > <bb 12> [local count: 52738306]: > > > > > goto <bb 8>; [100.00%] > > > > > > > > > > <bb 7> [local count: 906139989]: > > > > > goto <bb 3>; [100.00%] > > > > > > > > > > <bb 11> [local count: 6317494]: > > > > > > > > > > <bb 8> [local count: 59055800]: > > > > > > > > > > <bb 5> [local count: 114863531]: > > > > > # _5 = PHI <1(9), 0(8)> > > > > > <L10>: > > > > > return _5; > > > > > > > > > > However such switch statements, where all the entries lead to the same > > > > > destination are easy to vectorize. In SVE we have the MATCH > > > > > instruction that > > > > > can be used here, and for others we can duplicate the constants and > > > > > lower > > the > > > > > switch to a series of compare and ORRs. 
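The compare-and-OR lowering of such a same-destination switch is easy to check in scalar C; a minimal sketch, where the per-lane form is what a vectorizer would apply to a whole vector before reducing the mask with ORR:

```c
#include <assert.h>

/* Classification done by the switch in the GIMPLE dump above.  */
static int hits_switch (short v)
{
  switch (v)
    {
    case 1: case 2: case 7: case 4:
      return 1;
    default:
      return 0;
    }
}

/* The same test lowered to compares and ORs, one "lane" at a time.  */
static int hits_ored (short v)
{
  return (v == 1) | (v == 2) | (v == 7) | (v == 4);
}

/* Exhaustive agreement check over a small domain.  */
static int lowering_ok (void)
{
  for (short v = -8; v < 16; v++)
    if (hits_switch (v) != hits_ored (v))
      return 0;
  return 1;
}
```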
> > > > > > > > > > This is similar as what's done for when the values don't fit inside a > > > > > switch: > > > > > > > > > > short a[100]; > > > > > > > > > > int foo(int n, int counter, short x, short b, short c) > > > > > { > > > > > for (int i = 0; i < n; i++) > > > > > { > > > > > if (a[i] == x || a[i] == b || a[i] == 7 || a[i] == c) > > > > > return 1; > > > > > } > > > > > return 0; > > > > > } > > > > > > > > > > is vectorized as: > > > > > > > > > > .L4: > > > > > incb x5 > > > > > incw z30.s, all, mul #2 > > > > > cmp w6, w1 > > > > > bcc .L15 > > > > > .L6: > > > > > ld1h z31.h, p7/z, [x5] > > > > > cmpeq p15.h, p7/z, z31.h, z27.h > > > > > cmpeq p11.h, p7/z, z31.h, z28.h > > > > > cmpeq p14.h, p7/z, z31.h, #7 > > > > > cmpeq p12.h, p7/z, z31.h, z29.h > > > > > orr p15.b, p7/z, p15.b, p11.b > > > > > orr p14.b, p7/z, p14.b, p12.b > > > > > inch x1 > > > > > orr p15.b, p7/z, p15.b, p14.b > > > > > ptest p13, p15.b > > > > > b.none .L4 > > > > > umov w1, v30.s[0] > > > > > .L3: > > > > > sxtw x1, w1 > > > > > b .L7 > > > > > > > > > > which is great! but should be an SVE MATCH instruction as well. > > > > > > > > > > This kind of code shows up through the use of std::find_if as well: > > > > > > > > > > #include <bits/stdc++.h> > > > > > using namespace std; > > > > > > > > > > bool pred(int i) { return i == 1 || i == 2 || i == 7 || i == 4; } > > > > > > > > > > int foo(vector<int> vec) > > > > > { > > > > > vector<int>::iterator it; > > > > > > > > > > it = find_if(vec.begin(), vec.end(), pred); > > > > > > > > > > return *it; > > > > > } > > > > > > > > > > and once the unrolled loop is gone we should be able to vectorize it. > > > > > > > > > > My proposal is to create two new IFNs and optabs: > > > > > > > > > > IFN_MATCH, IFN_NMATCH and have a vectorizer pattern recognize the ORR > > > > reduction > > > > > cases into MATCH and have ifcvt lower the switch into ORR statements. 
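A hedged sketch of the semantics the proposed IFN_MATCH would have (the function names and lane count here are invented, not the real internal-fn interface): per lane, test membership of the lane value in a small set of needles, yielding a lane mask; the early-break cbranch then only asks whether that mask is non-zero.

```c
#include <assert.h>

#define LANES 8

/* Per-lane set membership: the proposed .MATCH semantics.  */
static void match_lanes (const short *vals, const short *needles,
                         int n_needles, unsigned char *mask)
{
  for (int l = 0; l < LANES; l++)
    {
      mask[l] = 0;
      for (int k = 0; k < n_needles; k++)
        mask[l] |= (vals[l] == needles[k]);
    }
}

/* All the cbranch needs afterwards: is any lane set?  */
static int any_lane (const unsigned char *mask)
{
  int any = 0;
  for (int l = 0; l < LANES; l++)
    any |= mask[l];
  return any;
}

/* Needles taken from the pred() example above; lane values invented. */
static int match_demo (void)
{
  short vals[LANES] = { 0, 3, 5, 6, 8, 9, 10, 7 };
  short needles[4] = { 1, 2, 7, 4 };
  unsigned char mask[LANES];
  match_lanes (vals, needles, 4, mask);
  /* only the last lane (value 7) matches */
  return any_lane (mask) && mask[7] == 1 && mask[0] == 0;
}
```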
Such switch statements only have two branches and so are identical to > > cbranch > > > > > for codegen and vectorization. > > > > > > > > > > The unrolled ORRs are replaced by MATCH and so the gimple becomes: > > > > > > > > > > a = .MATCH (...) > > > > > if (a != {0,..}) > > > > > and so no other change is needed for codegen. > > > > > > > > This sounds separate from the rest; I think teaching if-conversion to > > > > lower (small, or simple by some measure) switches is the way to go > > > > for now. There's existing code to produce ifs from switch and those > > > > can be if-converted in the classical way (in the if (vectorized) loop > > > > copy only, of course). > > > > > > Ack, it was just something we noticed in some cases which blocked > > > vectorization > > > which seemed easy to address 😊 > > > > > > > > > > An alternative with IFN_MATCH sounds good in principle. > > > > > > > > > == Better cbranch support == > > > > > > > > > > At the moment, one of the big downsides of re-using the existing > > > > > cbranch is > > that > > > > > in the target we can't tell whether the result of the cbranch is > > > > > actually needed > > > > > or not. > > > > > > > > > > i.e. for SVE we can't tell if the predicate is needed. For the cases > > > > > where we > > > > > don't stay inside the vector loop we can generate more efficient code > > > > > if we > > know > > > > > that the loop only cares about any or all bits set and doesn't need > > > > > to know > > > > > which one. > > > > > > > > > > For this reason I propose adding new optabs cbranch_any_ and > > > > > cbranch_all_ > > and > > > > > have emit_cmp_and_jump_insns lower to these when appropriate. > > > > Hmm, but isn't this then more a vec_cmpeq_any that produces a scalar > > > > rather than a vector and then a regular scalar compare-and-jump? That > > > > is, > > > > does SVE have such a compare instruction? 
Can you show the current > > > > code-gen and what the improved one would look like? > > > > > > Ah so the point here is that we don't need the scalar results, the optabs > > > would > > only > > > care about the CC flags. > > > > > > Consider the following loop: > > > > > > #ifndef N > > > #define N 800 > > > #endif > > > unsigned vect_a[N]; > > > unsigned vect_b[N]; > > > > > > unsigned test4(unsigned x) > > > { > > > unsigned ret = 0; > > > for (int i = 0; i < N; i++) > > > { > > > vect_b[i] = x + i; > > > if (vect_a[i]*2 != x) > > > break; > > > vect_a[i] = x; > > > > > > } > > > return ret; > > > } > > > > > > For early break, where we branch to scalar code we don't care about the > > > result of > > the mask, > > > as it's unused: > > > > > > so in gimple we have: > > > > > > mask_patt_6.15_69 = vect_cst__60 != vect__4.14_67; > > > vec_mask_and_72 = loop_mask_64 & mask_patt_6.15_69; > > > if (vec_mask_and_72 != { 0, ... }) > > > > > > which generates: > > > > > > cmpne p14.s, p7/z, z30.s, z29.s > > > > p7 is the loop mask here? p14.s the output of the masked compare? > > > > > ptest p15, p14.b > > > > and p15 is all-zero? > > P15 here is all true; in SVE cbranch emits a ptest, and in this case a ptrue > is fine > since we only care about 0 and != 0. > > I'll explain more below. > > > > > > b.none .L3 > > > > > > notice the ptest. In SVE if the size of a predicate of an operation does > > > not match > > that of > > > the governing predicate a ptest is required to reshape the flags. > > > > > > However, for early breaks all we care about are any or all. So the exact > > > shape of > > the predicate > > > register doesn't matter and since vector compares already set flags we > > > already > > have the answer > > > we need. 
> > > So the above should be:
> > >
> > >   cmpne p14.s, p7/z, z30.s, z29.s
> > >   b.none .L3
> >
> > Hmm, I don't really see anything missing on the GIMPLE side - doesn't
> > the required info all translate to RTL and thus instruction combine
> > should be able to handle this?
>
> For SVE alone we could potentially do it, but when using SVE instructions
> for Adv. SIMD we can't.  Adv. SIMD requires us to compress the vector
> results into an integer and compare that with 0 to get the CC flags.  SVE
> doesn't have this problem, because in SVE a vector compare sets the flags.
> So in this case, we don't care about p14 at all.
>
> But if we, say, want to use an SVE compare to replace an Adv. SIMD
> compare, the sequence of instructions will be longer than what combine can
> match.  And it becomes very hairy: I've tried it before, and because of
> the various combinations of compares (the vector and the flag-setting
> ones) it ends up with a lot of complicated patterns.

Is that something worth optimizing?

> > > This makes a big difference in loops for performance.  We have a small
> > > CC optimization pass that tries to optimize such cases:
> > > https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=5a783f42d77b2f00a1ed171c119b020e8df8e521
> > >
> > > But this case is hard to detect, as you need to know that the
> > > predicate is not used and that we only care about any/all.  This is
> > > information we have in GIMPLE, so we can more readily do it at expand
> > > time.
> >
> > But the ptest p15, p14.b is the same as vec_mask_and_72 != { 0, ... }
> > so I don't see where information is missing?  I probably don't
> > understand the "shape" thing.
>
> For SVE, the uses of the compares for early break don't care about the
> mask value; that is, we don't care about the predicate result of
> vec_mask_and_72 != { 0, ... }.
>
> Because for SVE the vector compare sets the CC flags indicating one of 3
> options:
> 1. none == the result predicate is all false
> 2.
first == the first element in the predicate is true
> 3. !last == the last element in the predicate is not true
>
> This gives early break the ability to just branch based on the CC, since
> we only care about none or !none, which is abstracted as the pseudo branch
> instructions b.any and b.none.
>
> A ptest is used to reshape the CC into the correct form when the size of
> the governing predicate doesn't match that of the compare results.
>
> For none and any this doesn't matter, but it changes the meaning for first
> and !last.  But because early break only requires none and any, the ptest
> is never needed.

So why's it not "DCEd" then?  I guess I still don't get it - it seems like
an architectural detail GIMPLE should know nothing about and that you need
to solve in the backend.

Richard.

> Now for Adv. SIMD, what we typically do when an SVE instruction can be
> used in place of an Adv. SIMD sequence is deal with it at expand time.
> But for early break that's not possible, because cbranch doesn't see the
> compare.  So we are forced to expand to the Adv. SIMD equivalent and later
> try to match it up with an SVE instruction.  This results in more
> instructions than combine can handle.  The new optab would allow us to do
> this easily, as we don't ever have to emit the Adv. SIMD code first.
>
> And for SVE it will also tell us that the result value of the comparison
> is not needed.  I.e. it conveys that all we care about is branching based
> on a compare; how you do it is up to you.
>
> Cheers,
> Tamar
>
> >
> > Richard.
> >
> > > The additional benefit of the optab is that it allows us to more
> > > easily use SVE instructions to do the flag setting when we choose
> > > Adv. SIMD but SVE is available.  Since we know we don't need to
> > > materialize an Adv. SIMD Boolean vector, we can avoid materializing
> > > the conversion from an SVE predicate to an Adv. SIMD mask vector.
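The three flag conditions listed above can be modelled in C.  A sketch (lane count and names are illustrative; it assumes all lanes are active, since a narrower governing predicate is exactly what the ptest re-shaping handles for first/!last) showing why an early-break branch needs only the none/any bit and never the predicate register itself:

```c
#include <assert.h>
#include <stdbool.h>

#define LANES 4  /* illustrative predicate width */

/* Model of the CC conditions an SVE flag-setting compare produces:
     none  - no lane of the result predicate is true
     first - the first lane is true
     nlast - the last lane is not true
   Early break consults only none (b.none) or its negation (b.any), so
   the predicate value itself can be dead.  */
struct sve_cc { bool none, first, nlast; };

static struct sve_cc cc_from_pred(const bool p[LANES])
{
  struct sve_cc cc = { true, p[0], !p[LANES - 1] };
  for (int i = 0; i < LANES; i++)
    if (p[i])
      cc.none = false;  /* any true lane clears "none" */
  return cc;
}
```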
> > >
> > > Cheers,
> > > Tamar
> > >
> > > > > Are the general idea and steps OK?
> > > >
> > > > See above.  Thanks for the write-up.
> > > >
> > > > Richard.
> > > >
> > > > > If so I'll start implementation now.
> > > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > [1] Yishen Chen, Charith Mendis, and Saman Amarasinghe. 2022. All
> > > > > You Need Is Superword-Level Parallelism: Systematic Control-Flow
> > > > > Vectorization with SLP. In Proceedings of the 43rd ACM SIGPLAN
> > > > > International Conference on Programming Language Design and
> > > > > Implementation (PLDI '22), June 13-17, 2022, San Diego, CA, USA.
> > > > > ACM, New York, NY, USA, 15 pages.
> > > > > https://doi.org/10.1145/3519939.3523701
> > > > >
> > > > > [2] Predicated Static Single Assignment
> > > > > Lori Carter, Beth Simon, Brad Calder, Larry Carter, Jeanne Ferrante
> > > > > Department of Computer Science and Engineering
> > > > > University of California, San Diego
> > > > > flcarter,esimon,calder,carter,ferran...@cs.ucsd.edu

--
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)