Hi Richi,

> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Friday, October 11, 2019 8:02 AM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; o...@ucw.cz
> Subject: Re: [PATCH 1/2][GCC][RFC][middle-end]: Add SLP pattern matcher.
> 
> On Tue, 8 Oct 2019, Tamar Christina wrote:
> 
> > Hi Richi,
> >
> > Thanks for the review, I've added some comments inline.
> >
> > The 10/07/2019 12:15, Richard Biener wrote:
> > > On Tue, 1 Oct 2019, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > This adds a framework to allow pattern matchers to be written at
> > > > based on the SLP tree.  The difference between this one and the
> > > > one in tree-vect-patterns is that this matcher allows the matching
> > > > of an arbitrary number of parallel statements and replacing of an
> arbitrary number of children or statements.
> > > >
> > > > Any relationship created by the SLP pattern matcher will be undone if
> SLP fails.
> > > >
> > > > The pattern matcher can also cancel all permutes depending on what
> > > > the pattern requested it to do.  As soon as one pattern requests
> > > > the permutes to be cancelled all permutes are cancelled.
> > > >
> > > > Compared to the previous pattern matcher this one will work for an
> > > > arbitrary group size and will match at any arbitrary node in the
> > > > SLP tree.  The only requirement is that the entire node is matched or
> rejected.
> > > >
> > > > vect_build_slp_tree_1 is a bit more lenient in what it accepts as
> > > > "compatible operations" but the matcher cannot be because in cases
> > > > where you match the order of the operands may be changed.  So all
> operands must be changed or none.
> > > >
> > > > Furthermore the matcher relies on canonization of the operations
> > > > inside the SLP tree and on the fact that floating math operations are 
> > > > not
> commutative.
> > > > This means that matching a pattern does not need to try all
> > > > alternatives or combinations and that arguments will always be in
> > > > the same order if it's the same operation.
> > > >
> > > > The pattern matcher also ignored uninteresting nodes such as type
> > > > casts, loads and stores.  Doing so is essential to keep the runtime 
> > > > down.
> > > >
> > > > Each matcher is allowed a post condition that can be run to
> > > > perform any changes to the SLP tree as needed before the patterns
> > > > are created and may also abort the creation of the patterns.
> > > >
> > > > When a pattern is matched it is not immediately created but
> > > > instead it is deferred until all statements in the node have been
> > > > analyzed.  Only if all post conditions are true, and all
> > > > statements will be replaced will the patterns be created in batch.
> > > > This allows us to not have to undo any work if the pattern fails but 
> > > > also
> makes it so we traverse the tree only once.
> > > >
> > > > When a new pattern is created it is a marked as a pattern to the
> > > > statement it is replacing and be marked as used in the current SLP
> > > > scope.  If SLP fails then relationship is undone and the relevancy
> restored.
> > > >
> > > > Each pattern matcher can detect any number of pattern it wants.
> > > > The only constraint is that the optabs they produce must all have the
> same arity.
> > > >
> > > > The pattern matcher supports instructions that have no scalar form
> > > > as they are added as pattern statements to the stmt.  The BB is
> > > > left untouched and so the scalar loop is untouched.
> > > >
> > > > Bootstrapped on aarch64-none-linux-gnu and no issues.
> > > > No regression testing done yet.
> > >
> > > If you split out the introduction of SLP_TREE_REF_COUNT you can
> > > commit that right now (sorry for being too lazy there...).
> > >
> >
> > I'll split those off :)
> >
> > > One overall comment - you do pattern matching after SLP tree
> > > creation (good) but still do it before the whole SLP graph is
> > > created (bad).  Would it be possible to instead do it as a separate
> > > phase in vect_analyze_slp, looping over all instances (the instances
> > > present entries into the single unified SLP graph now), avoiding to
> > > visit "duplicates"?
> > >
> >
> > It should be, the only issue I can see is that build SLP may fail
> > because of an unsupported permute, or because it can use load lanes.
> > If I'm understanding it correctly you wouldn't get SLP vectorization
> > in those cases so then the matching can't work? So it would limit it a it 
> > more.
> 
> True.  As said later ideally the pattern matching woudl happen before or
> rather as part of SLP building so that the operations then actually match (and
> we then don't need that plus/minus SLP_TREE_TWO_OPERATORS handling).
> At the moment it sits in a quite awkward place which may contribute to its
> complexity.
> 
> > > In patch 1/2 I see references (mostly in variable names) to
> > > "complex" which is IMHO too specific.
> > >
> >
> > Sorry, missed those during my cleanup.
> >
> > > I'd also think that a useful first pattern to support would be what
> > > we do with SLP_TREE_TWO_OPERATORS, where code generation inserts
> > > extra blends.  I have yet to dive into the complex pattern details
> > > to see if that's feasible or if you maybe re-use that...
> >
> > Oh, I hadn't thought of that. I'll take a look.
> >
> > > You seem to at least rely on the SLP build succeeding with the mixed
> > > plus/minus cases?  Which also restricts the kind of patterns we can
> > > recognize I guess.  Plus it shows the chicken-and-egg issue here -
> > > we'd like to detect the patterns _first_ and then build the SLP
> > > trees (or rather combine lanes into larger groups).
> > > Doing it after the fact makes it no more powerful than doing it as
> > > folding post vectorization?
> >
> > It's true that I do rely on build SLP succeeding, and there is one
> > case I know of where SLP build fails when I didn't expect it to.  But
> > I was operating under the impression that that's just a case that needs
> better support in SLP build.
> >
> >
> > There were two other approaches I had considered here before landing on
> this one:
> >
> > 1) Do the pattern matching as part of SLP build.  This would get around the
> issue that
> >    SLP build would have failed when the pattern could apply, but comes with
> a much higher
> >    overhead on SLP build as this now needs to back-track or keep a list of
> possible alternatives.
> >    either way you're going to lose more in space and/or time over doing the
> pattern match post build.
> >    The FMA/FMS cases are matching a subtree essentially.
> >
> > 2) Doing it post vectorization is also more work as you have to rewrite all
> uses and defs.
> >    if I'm not mistaken I have to keep the SSA names matching at that point.
> But also there are/can be
> >    target specific gimple coming out of vectorizable_*, like Arm target's 
> > use
> of .LOAD_LANES instead of
> >    a load and permute.  This means I would have to handle target specific
> things as well instead of
> >    doing it at the SLP tree level where I'd have less issues.  At least for 
> > basic
> things like loads.
> >
> >    Also you could have the case where the target doesn't support a
> particular permute but does support
> >    an instruction that impicitly has the permute, in which can vectorization
> won't happen.
> >    You could in principle use the pattern matcher to work around this.
> >
> >    And lastly, (I'm not 100% sure about this one) but doing it post
> vectorization means you've done it
> >    after cloning no? Usually the use of these instructions results in an
> iteration doing half the number
> >    of operations each iteration. The permuted version usually end up
> needing larger loads, which means
> >    that I'd have to maintain the amount of iterations as that's fixed at 
> > that
> point no?
> 
> True.
> 
> >    That is to say, the permuted version on a complex double pair for 
> > instance
> has to load two complex
> >    numbers in order to get the pair or real and imaginare numbers to do the
> operation on.  So it processes
> >    32-bytes each iteration.  The version with the instruction only needs a
> single number at a time, so it
> >    can do a normal 16-byte load.
> >
> >    The benefit here is that you then don't need a trailing loop as the
> compiler sees you always have a
> >    multiple of 16-bytes in your iteration.
> >
> >    I may be misunderstanding this, but doing it post vectorization means I'd
> have to do two 16-byte loads
> >    since it's too late to change the iteration count?
> 
> Yes.
> 
> >    This also means that a hand unrolled loop doing e.g. two complex
> additions with a 90* rotation
> >    ends up with 4 loads instead of just the one (as doing before cloning
> makes it not clone as it can merge
> >    the operations into one.).  But I may be way off here.. This is my
> > current understanding of the code :)
> > >
> > > +typedef hash_map <stmt_vec_info, slp_tree>
> > > +  ssa_name_def_to_slp_tree_map_t;
> > > +
> > >
> > > this seems to be unused.
> > >
> > > +  FOR_EACH_VEC_ELT_FROM (scalar_stmts, i, stmt_info, n - 1)
> > > +    {
> > > ...
> > > +      work.stmt_infos = scalar_stmts.begin () + (i - (n - 1));
> > > +      work.idx = i;
> > >
> > > so this tries to match patterns on ARITY arbitrary aligned but
> > > adjacent scalar stmts [in a brute force way].  But then you
> > > immediately fail when one matching fails.  So I think this loop
> > > should be written like
> > >
> > >  for (unsigned i = n - 1; i < scalar_stmts.length (); i += n)
> > >    {
> > > ...
> > >
> > > to make this fact clearer.  A missed optimization here is to
> > > consider pre/post shuffling of the group to make more cases match.
> > >
> > > In this loop you also do the target support check which could
> > > possibly be done upfront in the pattern matcher itself to save
> > > compile-time?  It also seems to be required that patterns match a
> > > single IFN call?
> > >
> >
> > The patterns can return any number of IFNs, for instance the FMA also
> > detects CONJ_FMA and FMS since the calculations are quite similar and
> only differ in one node.
> >
> > > Looking at the complex patterns I am worried and confused about the
> > > transform phase - just quoting/commenting on random pieces:
> > >
> > > +  FOR_EACH_VEC_ELT (scalar_stmts, i, scalar_stmt)
> > > +    {
> > > +      if (defs.contains (scalar_stmt))
> > > +       {
> > >
> > > this is quadratic - vec::contains does a linear search.
> >
> > True, I figured it wasn't an issue since defs will always be quite
> > small, it's always 2 * IFN arity, but...
> >
> > >
> > >          arg_map->put (scalar_stmt, vect_split_slp_tree (node, i));
> > > +         found_p = true;
> > >
> > > it seems that you are re-doing the match here, something that should
> > > have been done in the first phase of pattern matching already.
> >
> > ... You are right. I no longer need to do this here.  I missed this during
> cleanup.
> >
> > > May I suggest to restructure the pattern matchers in a way that you
> > > have a
> > >
> > >  class slp_pattern
> > >  {
> > >    virtual match() = 0;
> > >    virtual transform() = 0;
> > >  };
> > >
> > > and derive from that so you can have a pattern specific state you
> > > can transfer from match to transform?  Iteration over patterns then
> > > either becomes ad-hoc or you find a way to iterate over an "array of
> > > types" with our C++04 features creating a new instance when you
> > > match to hold this state?
> > >
> >
> > Ah yeah, Thanks! I keep forgetting that GCC uses C++ code :)
> 
> ;)
> 
> > > I also wonder what you are doing here - shouldn't it be "simply"
> > > replacing a subgraph of the SLP tree with a single new SLP node that
> > > now has those IFNs as "scalar" stmts (they are not really scalars
> > > anymore because of that arity issue).  This also means that
> > > code-generation might better not go the "traditional"
> > > way but instead we use a new vect_transform_slp_pattern function
> > > which does the natural thing and we'll just have the pattern IFN
> > > recorded directly in the slp_tree structure?  (I realize the
> > > complication of vect_get_slp_defs using the scalar stmts to identify
> > > vectorized operands defs)
> > >
> > > That said, I still think the same result can be achieved by
> > > post-vectorizer pattern matching.  I also think that doing pattern
> > > matching after the SLP tree build is backwards.
> > > My vision is that we'd do more general graph matching on the SSA
> > > graph forming the SLP tree rather than the current ad-hoc matching
> > > starting from special "sinks".
> > >
> >
> > This was also one of the original things I looked at, though I had
> > figured the amount of work you have to do to do the subgraph matching
> > is a lot more complicated.  I couldn't really think of any way of
> > minimizing back tracking etc into something reasonably efficient.
> >
> > Matching post build means I don't really have to backtrack because I
> > can use assumptions of vect_get_slp_defs to my advantage and just
> > access each  child node in constant time.
> 
> Yeah, I realize how you've done it is probably at the easiest and most 
> flexible
> point.  Still I fear we're engineering us into a corner here :/
> 
> You know I'm in the process of re-doing the vectorizer in terms of SLP-only.
> For the SLP build to better support BB vectorization and partial loop
> vectorization my idea is to start with a set of SLP instances 1:1 mapping the
> SSA graph (SLP instances are acutally the entries into the now shared SLP
> graph - it's no longer a tree) and then perform SLP node merging, forming
> SLP nodes with group sizes > 1.  During that, or rather as part of that, SLP
> pattern matching would come in, merging some of the SLP nodes, forming
> the group-size 2 complex math opts.

Ah yeah it would indeed fit here nicely since it wouldn't have to do the mini
partitioning itself at that time.

> 
> But rewriting SLP build is quite late on the plan of the rewrite...
> so in the current SLP build the pattern matching would hook into
> vect_build_slp_tree_1 when we discover alt_stmt_code.  Of course as you
> say it's likely more complex...
> 
> I wonder if re-synthesizing complex-type operations in regular pattern
> matching would help?
> 
> That said - I'm fine with going with your approach but only if you promise to
> help once I run into it during the SLP rewrite...

Of course!, more then happy to help! Thanks! I'll make the other changes and
send up the full patch series then.

Thanks,
Tamar

> 
> Thanks,
> Richard.
> 
> > Kind Regards,
> > Tamar
> >
> > > Thanks,
> > > Richard.
> > >
> > >
> > > > Thanks,
> > > > Tamar
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > 2019-10-01  Tamar Christina  <tamar.christ...@arm.com>
> > > >
> > > >         * tree-vect-loop.c (vect_dissolve_slp_only_patterns): New.
> > > >         (vect_dissolve_slp_only_groups): Use macro.
> > > >         * tree-vect-patterns.c (vect_mark_pattern_stmts): Expose symbol.
> > > >         * tree-vect-slp.c (vect_free_slp_tree): Add control of 
> > > > recursion and
> how
> > > >         to free.
> > > >         (ssa_name_def_to_slp_tree_map_t): New.
> > > >         (vect_create_new_slp_node, vect_build_slp_tree): Use macro.
> > > >         (vect_create_slp_patt_stmt): New.
> > > >         (vect_match_slp_patterns_2): New.
> > > >         (vect_match_slp_patterns): New.
> > > >         (vect_analyze_slp_instance): Call vect_match_slp_patterns and
> undo
> > > >         permutes.
> > > >         (vect_detect_hybrid_slp_stmts): Dissolve relationships created 
> > > > for
> SLP.
> > > >         * tree-vectorizer.h (SLP_TREE_REF_COUNT): New.
> > > >         (vect_mark_pattern_stmts): New.
> > > >
> > > >
> > >
> > > --
> > > Richard Biener <rguent...@suse.de>
> > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> > > Nuernberg, Germany; GF: Felix Imendörffer; HRB 247165 (AG München)
> >
> >
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409
> Nuernberg, Germany; GF: Felix Imendörffer; HRB 247165 (AG München)

Reply via email to