Hi Richi, > -----Original Message----- > From: Richard Biener <rguent...@suse.de> > Sent: Friday, October 11, 2019 8:02 AM > To: Tamar Christina <tamar.christ...@arm.com> > Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; o...@ucw.cz > Subject: Re: [PATCH 1/2][GCC][RFC][middle-end]: Add SLP pattern matcher. > > On Tue, 8 Oct 2019, Tamar Christina wrote: > > > Hi Richi, > > > > Thanks for the review, I've added some comments inline. > > > > The 10/07/2019 12:15, Richard Biener wrote: > > > On Tue, 1 Oct 2019, Tamar Christina wrote: > > > > > > > Hi All, > > > > > > > > This adds a framework to allow pattern matchers to be written at > > > > based on the SLP tree. The difference between this one and the > > > > one in tree-vect-patterns is that this matcher allows the matching > > > > of an arbitrary number of parallel statements and replacing of an > arbitrary number of children or statements. > > > > > > > > Any relationship created by the SLP pattern matcher will be undone if > SLP fails. > > > > > > > > The pattern matcher can also cancel all permutes depending on what > > > > the pattern requested it to do. As soon as one pattern requests > > > > the permutes to be cancelled all permutes are cancelled. > > > > > > > > Compared to the previous pattern matcher this one will work for an > > > > arbitrary group size and will match at any arbitrary node in the > > > > SLP tree. The only requirement is that the entire node is matched or > rejected. > > > > > > > > vect_build_slp_tree_1 is a bit more lenient in what it accepts as > > > > "compatible operations" but the matcher cannot be because in cases > > > > where you match the order of the operands may be changed. So all > operands must be changed or none. > > > > > > > > Furthermore the matcher relies on canonization of the operations > > > > inside the SLP tree and on the fact that floating math operations are > > > > not > commutative. > > > > This means that matching a pattern does not need to try all > > > > alternatives or combinations and that arguments will always be in > > > > the same order if it's the same operation. > > > > > > > > The pattern matcher also ignored uninteresting nodes such as type > > > > casts, loads and stores. Doing so is essential to keep the runtime > > > > down. > > > > > > > > Each matcher is allowed a post condition that can be run to > > > > perform any changes to the SLP tree as needed before the patterns > > > > are created and may also abort the creation of the patterns. > > > > > > > > When a pattern is matched it is not immediately created but > > > > instead it is deferred until all statements in the node have been > > > > analyzed. Only if all post conditions are true, and all > > > > statements will be replaced will the patterns be created in batch. > > > > This allows us to not have to undo any work if the pattern fails but > > > > also > makes it so we traverse the tree only once. > > > > > > > > When a new pattern is created it is a marked as a pattern to the > > > > statement it is replacing and be marked as used in the current SLP > > > > scope. If SLP fails then relationship is undone and the relevancy > restored. > > > > > > > > Each pattern matcher can detect any number of pattern it wants. > > > > The only constraint is that the optabs they produce must all have the > same arity. > > > > > > > > The pattern matcher supports instructions that have no scalar form > > > > as they are added as pattern statements to the stmt. The BB is > > > > left untouched and so the scalar loop is untouched. > > > > > > > > Bootstrapped on aarch64-none-linux-gnu and no issues. > > > > No regression testing done yet. > > > > > > If you split out the introduction of SLP_TREE_REF_COUNT you can > > > commit that right now (sorry for being too lazy there...). > > > > > > > I'll split those off :) > > > > > One overall comment - you do pattern matching after SLP tree > > > creation (good) but still do it before the whole SLP graph is > > > created (bad). Would it be possible to instead do it as a separate > > > phase in vect_analyze_slp, looping over all instances (the instances > > > present entries into the single unified SLP graph now), avoiding to > > > visit "duplicates"? > > > > > > > It should be, the only issue I can see is that build SLP may fail > > because of an unsupported permute, or because it can use load lanes. > > If I'm understanding it correctly you wouldn't get SLP vectorization > > in those cases so then the matching can't work? So it would limit it a it > > more. > > True. As said later ideally the pattern matching woudl happen before or > rather as part of SLP building so that the operations then actually match (and > we then don't need that plus/minus SLP_TREE_TWO_OPERATORS handling). > At the moment it sits in a quite awkward place which may contribute to its > complexity. > > > > In patch 1/2 I see references (mostly in variable names) to > > > "complex" which is IMHO too specific. > > > > > > > Sorry, missed those during my cleanup. > > > > > I'd also think that a useful first pattern to support would be what > > > we do with SLP_TREE_TWO_OPERATORS, where code generation inserts > > > extra blends. I have yet to dive into the complex pattern details > > > to see if that's feasible or if you maybe re-use that... > > > > Oh, I hadn't thought of that. I'll take a look. > > > > > You seem to at least rely on the SLP build succeeding with the mixed > > > plus/minus cases? Which also restricts the kind of patterns we can > > > recognize I guess. Plus it shows the chicken-and-egg issue here - > > > we'd like to detect the patterns _first_ and then build the SLP > > > trees (or rather combine lanes into larger groups). > > > Doing it after the fact makes it no more powerful than doing it as > > > folding post vectorization? > > > > It's true that I do rely on build SLP succeeding, and there is one > > case I know of where SLP build fails when I didn't expect it to. But > > I was operating under the impression that that's just a case that needs > better support in SLP build. > > > > > > There were two other approaches I had considered here before landing on > this one: > > > > 1) Do the pattern matching as part of SLP build. This would get around the > issue that > > SLP build would have failed when the pattern could apply, but comes with > a much higher > > overhead on SLP build as this now needs to back-track or keep a list of > possible alternatives. > > either way you're going to lose more in space and/or time over doing the > pattern match post build. > > The FMA/FMS cases are matching a subtree essentially. > > > > 2) Doing it post vectorization is also more work as you have to rewrite all > uses and defs. > > if I'm not mistaken I have to keep the SSA names matching at that point. > But also there are/can be > > target specific gimple coming out of vectorizable_*, like Arm target's > > use > of .LOAD_LANES instead of > > a load and permute. This means I would have to handle target specific > things as well instead of > > doing it at the SLP tree level where I'd have less issues. At least for > > basic > things like loads. > > > > Also you could have the case where the target doesn't support a > particular permute but does support > > an instruction that impicitly has the permute, in which can vectorization > won't happen. > > You could in principle use the pattern matcher to work around this. > > > > And lastly, (I'm not 100% sure about this one) but doing it post > vectorization means you've done it > > after cloning no? Usually the use of these instructions results in an > iteration doing half the number > > of operations each iteration. The permuted version usually end up > needing larger loads, which means > > that I'd have to maintain the amount of iterations as that's fixed at > > that > point no? > > True. > > > That is to say, the permuted version on a complex double pair for > > instance > has to load two complex > > numbers in order to get the pair or real and imaginare numbers to do the > operation on. So it processes > > 32-bytes each iteration. The version with the instruction only needs a > single number at a time, so it > > can do a normal 16-byte load. > > > > The benefit here is that you then don't need a trailing loop as the > compiler sees you always have a > > multiple of 16-bytes in your iteration. > > > > I may be misunderstanding this, but doing it post vectorization means I'd > have to do two 16-byte loads > > since it's too late to change the iteration count? > > Yes. > > > This also means that a hand unrolled loop doing e.g. two complex > additions with a 90* rotation > > ends up with 4 loads instead of just the one (as doing before cloning > makes it not clone as it can merge > > the operations into one.). But I may be way off here.. This is my > > current understanding of the code :) > > > > > > +typedef hash_map <stmt_vec_info, slp_tree> > > > + ssa_name_def_to_slp_tree_map_t; > > > + > > > > > > this seems to be unused. > > > > > > + FOR_EACH_VEC_ELT_FROM (scalar_stmts, i, stmt_info, n - 1) > > > + { > > > ... > > > + work.stmt_infos = scalar_stmts.begin () + (i - (n - 1)); > > > + work.idx = i; > > > > > > so this tries to match patterns on ARITY arbitrary aligned but > > > adjacent scalar stmts [in a brute force way]. But then you > > > immediately fail when one matching fails. So I think this loop > > > should be written like > > > > > > for (unsigned i = n - 1; i < scalar_stmts.length (); i += n) > > > { > > > ... > > > > > > to make this fact clearer. A missed optimization here is to > > > consider pre/post shuffling of the group to make more cases match. > > > > > > In this loop you also do the target support check which could > > > possibly be done upfront in the pattern matcher itself to save > > > compile-time? It also seems to be required that patterns match a > > > single IFN call? > > > > > > > The patterns can return any number of IFNs, for instance the FMA also > > detects CONJ_FMA and FMS since the calculations are quite similar and > only differ in one node. > > > > > Looking at the complex patterns I am worried and confused about the > > > transform phase - just quoting/commenting on random pieces: > > > > > > + FOR_EACH_VEC_ELT (scalar_stmts, i, scalar_stmt) > > > + { > > > + if (defs.contains (scalar_stmt)) > > > + { > > > > > > this is quadratic - vec::contains does a linear search. > > > > True, I figured it wasn't an issue since defs will always be quite > > small, it's always 2 * IFN arity, but... > > > > > > > > arg_map->put (scalar_stmt, vect_split_slp_tree (node, i)); > > > + found_p = true; > > > > > > it seems that you are re-doing the match here, something that should > > > have been done in the first phase of pattern matching already. > > > > ... You are right. I no longer need to do this here. I missed this during > cleanup. > > > > > May I suggest to restructure the pattern matchers in a way that you > > > have a > > > > > > class slp_pattern > > > { > > > virtual match() = 0; > > > virtual transform() = 0; > > > }; > > > > > > and derive from that so you can have a pattern specific state you > > > can transfer from match to transform? Iteration over patterns then > > > either becomes ad-hoc or you find a way to iterate over an "array of > > > types" with our C++04 features creating a new instance when you > > > match to hold this state? > > > > > > > Ah yeah, Thanks! I keep forgetting that GCC uses C++ code :) > > ;) > > > > I also wonder what you are doing here - shouldn't it be "simply" > > > replacing a subgraph of the SLP tree with a single new SLP node that > > > now has those IFNs as "scalar" stmts (they are not really scalars > > > anymore because of that arity issue). This also means that > > > code-generation might better not go the "traditional" > > > way but instead we use a new vect_transform_slp_pattern function > > > which does the natural thing and we'll just have the pattern IFN > > > recorded directly in the slp_tree structure? (I realize the > > > complication of vect_get_slp_defs using the scalar stmts to identify > > > vectorized operands defs) > > > > > > That said, I still think the same result can be achieved by > > > post-vectorizer pattern matching. I also think that doing pattern > > > matching after the SLP tree build is backwards. > > > My vision is that we'd do more general graph matching on the SSA > > > graph forming the SLP tree rather than the current ad-hoc matching > > > starting from special "sinks". > > > > > > > This was also one of the original things I looked at, though I had > > figured the amount of work you have to do to do the subgraph matching > > is a lot more complicated. I couldn't really think of any way of > > minimizing back tracking etc into something reasonably efficient. > > > > Matching post build means I don't really have to backtrack because I > > can use assumptions of vect_get_slp_defs to my advantage and just > > access each child node in constant time. > > Yeah, I realize how you've done it is probably at the easiest and most > flexible > point. Still I fear we're engineering us into a corner here :/ > > You know I'm in the process of re-doing the vectorizer in terms of SLP-only. > For the SLP build to better support BB vectorization and partial loop > vectorization my idea is to start with a set of SLP instances 1:1 mapping the > SSA graph (SLP instances are acutally the entries into the now shared SLP > graph - it's no longer a tree) and then perform SLP node merging, forming > SLP nodes with group sizes > 1. During that, or rather as part of that, SLP > pattern matching would come in, merging some of the SLP nodes, forming > the group-size 2 complex math opts.
Ah yeah it would indeed fit here nicely since it wouldn't have to do the mini partitioning itself at that time. > > But rewriting SLP build is quite late on the plan of the rewrite... > so in the current SLP build the pattern matching would hook into > vect_build_slp_tree_1 when we discover alt_stmt_code. Of course as you > say it's likely more complex... > > I wonder if re-synthesizing complex-type operations in regular pattern > matching would help? > > That said - I'm fine with going with your approach but only if you promise to > help once I run into it during the SLP rewrite... Of course!, more then happy to help! Thanks! I'll make the other changes and send up the full patch series then. Thanks, Tamar > > Thanks, > Richard. > > > Kind Regards, > > Tamar > > > > > Thanks, > > > Richard. > > > > > > > > > > Thanks, > > > > Tamar > > > > > > > > gcc/ChangeLog: > > > > > > > > 2019-10-01 Tamar Christina <tamar.christ...@arm.com> > > > > > > > > * tree-vect-loop.c (vect_dissolve_slp_only_patterns): New. > > > > (vect_dissolve_slp_only_groups): Use macro. > > > > * tree-vect-patterns.c (vect_mark_pattern_stmts): Expose symbol. > > > > * tree-vect-slp.c (vect_free_slp_tree): Add control of > > > > recursion and > how > > > > to free. > > > > (ssa_name_def_to_slp_tree_map_t): New. > > > > (vect_create_new_slp_node, vect_build_slp_tree): Use macro. > > > > (vect_create_slp_patt_stmt): New. > > > > (vect_match_slp_patterns_2): New. > > > > (vect_match_slp_patterns): New. > > > > (vect_analyze_slp_instance): Call vect_match_slp_patterns and > undo > > > > permutes. > > > > (vect_detect_hybrid_slp_stmts): Dissolve relationships created > > > > for > SLP. > > > > * tree-vectorizer.h (SLP_TREE_REF_COUNT): New. > > > > (vect_mark_pattern_stmts): New. > > > > > > > > > > > > > > -- > > > Richard Biener <rguent...@suse.de> > > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 > > > Nuernberg, Germany; GF: Felix Imendörffer; HRB 247165 (AG München) > > > > > > -- > Richard Biener <rguent...@suse.de> > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 > Nuernberg, Germany; GF: Felix Imendörffer; HRB 247165 (AG München)