On Sun, Nov 11, 2018 at 11:26 AM Tamar Christina <tamar.christ...@arm.com> wrote:
>
> Hi All,
>
> This patch adds support for SLP vectorization of Complex number arithmetic
> with rotations along the Argand plane.
>
> For this to work it has to recognize two statements in parallel as it needs
> to match against operations towards both the real and imaginary numbers.
> The instructions generated by this change are also only available in their
> vector forms.  As such I add them as pattern statements to the stmt.  The
> BB is left untouched and so the scalar loop is untouched.
>
> The instructions also require the loads to be contiguous and so when a
> match is made, and the code decides it is able to do the replacement, it
> re-organizes the SLP tree such that the loads are contiguous.  Since this
> operation cannot be undone it only does this if it's sure that the
> resulting loads can be made contiguous.
>
> It gets this guarantee by only allowing the replacement if between the
> matched expression and the loads there are no other expressions it doesn't
> know, aside from type casts.
>
> When a match occurs over multiple expressions, the dead statements are
> immediately removed from the tree to prevent verification failures later.
>
> Because the pattern matching is done after SLP analysis has analyzed the
> usage of the instruction, it also marks the new instructions as used and
> the old ones as unused.
>
> When a replacement is done a new internal function is generated which the
> back-end has to expand to the proper instruction sequences.
>
> For now, this patch only adds support for Complex addition with rotate and
> Complex FMLA with rotation of 0 and 180.  However the intention is to add
> support for Complex subtraction and Complex multiplication in the future.
>
> Concretely, this generates
>
>         ldr     q1, [x1, x3]
>         ldr     q2, [x0, x3]
>         ldr     q0, [x2, x3]
>         fcmla   v0.2d, v1.2d, v2.2d, #180
>         fcmla   v0.2d, v1.2d, v2.2d, #90
>         str     q0, [x2, x3]
>         add     x3, x3, 16
>         cmp     x3, 3200
>         bne     .L2
>         ret
>
> now instead of
>
>         add     x3, x2, 31
>         sub     x4, x3, x1
>         sub     x3, x3, x0
>         cmp     x4, 62
>         mov     x4, 62
>         ccmp    x3, x4, 0, hi
>         bls     .L5
>         mov     x3, x0
>         mov     x0, x1
>         add     x1, x2, 3200
>         .p2align 3,,7
> .L3:
>         ld2     {v16.2d - v17.2d}, [x2]
>         ld2     {v2.2d - v3.2d}, [x3], 32
>         ld2     {v0.2d - v1.2d}, [x0], 32
>         mov     v7.16b, v17.16b
>         fmul    v6.2d, v0.2d, v3.2d
>         fmla    v7.2d, v1.2d, v3.2d
>         fmla    v6.2d, v1.2d, v2.2d
>         fmls    v7.2d, v2.2d, v0.2d
>         fadd    v4.2d, v6.2d, v16.2d
>         mov     v5.16b, v7.16b
>         st2     {v4.2d - v5.2d}, [x2], 32
>         cmp     x2, x1
>         bne     .L3
>         ret
> .L5:
>         add     x4, x2, 8
>         add     x6, x0, 8
>         add     x5, x1, 8
>         mov     x3, 0
>         .p2align 3,,7
> .L2:
>         ldr     d1, [x6, x3]
>         ldr     d4, [x1, x3]
>         ldr     d5, [x5, x3]
>         ldr     d3, [x0, x3]
>         fmul    d2, d4, d1
>         ldr     d0, [x4, x3]
>         fmadd   d0, d5, d1, d0
>         ldr     d1, [x2, x3]
>         fmadd   d2, d5, d3, d2
>         fmsub   d0, d4, d3, d0
>         fadd    d1, d1, d2
>         str     d1, [x2, x3]
>         str     d0, [x4, x3]
>         add     x3, x3, 16
>         cmp     x3, 3200
>         bne     .L2
>         ret
>
> Bootstrap and Regtests on aarch64-none-linux-gnu, arm-none-gnueabihf and
> x86_64-pc-linux-gnu are still ongoing but the previous patch showed no
> regressions.
>
> The instructions have also been tested on aarch64-none-elf and arm-none-eabi
> on an Armv8.3-a model with -march=armv8.3-a+fp16 and all tests pass.
>
> Ok for trunk?
I first have a few high-level questions.  Complex addition when the complex
values are in vectors looks trivial to me and maps to vector addition.  Your
md.texi description of fcadd mentions a rotation 'm' but doesn't further
explain the details - I suppose fcadd@var{m}@var{n}3 really means
fcadd90@var{n}3 and fcadd270@var{n}3, thus the rotation being encoded into
the pattern name rather than having a mode m and an explicit operand?  If
that is so please list those patterns explicitly and separately.

Then I'm not sure why the vectorizer needs to be bothered with this?  Can
the instructions not be recognized by combine from the "rotate" and the
vector add?  That is, what happens if the user writes generic vector code
for this?

Can you also explicitly spell out what "rotation by 90 or 270" means for
the operation?  If I decipher the text OK then only one operand (operand 2)
is rotated and operand 1 is unchanged?  Or is the result rotated instead?

So it looks like the patch adds a general facility to recognize patterns on
the SLP graph after SLP discovery.  This complicates matters a lot it seems.

Comments (unordered) about the actual patch:

+static slp_tree
+vect_match_slp_patterns_1 (slp_tree node, vec_info *vinfo,
+                          unsigned int group_size, complex_pattern_t patt_info)
+{
+  int i;
+  stmt_vec_info stmt_info;
+  hash_set<stmt_vec_info> exempt;
+  auto_vec<tree> v1, v2;
+  vec<stmt_vec_info> scalar_stmts = SLP_TREE_SCALAR_STMTS (node);
+  bool keep_looking = false;
+
+  if (group_size != 2)
+    return node;

this seems quite arbitrary given once the user unrolls the loops you'll be
faced with two complex operations.  Did you at least think about how to
generalize this without trying all permutations?  What about vector ISAs
where two or more complex numbers fit inside a vector?  (SVE?)

+  /* If we haven't matched anything, look deeper.
     */
+  if (keep_looking)
+    {
+      slp_tree child;
+      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
+       SLP_TREE_CHILDREN (node)[i]
+         = vect_match_slp_patterns_1 (child, vinfo, group_size, patt_info);
+    }

usually pattern matching should work from defs to uses, thus recurse first?
And,

+  keep_looking = gimple_assign_cast_p (stmt_0)
+                || gimple_assign_load_p (stmt_0)
+                || gimple_store_p (stmt_0);

loads will never have any SLP children.  Stores are always the very first
SLP node only.  This effectively means you never look deeper than the very
first arithmetic operation, just skipping an eventual widening/shortening?!

+  /* Now push new statements.  */
+  SLP_TREE_SCALAR_STMTS (node).truncate (0);
+
+  gcall *call_stmt_0
+    = vect_create_complex_patt_stmt (node, opstmt_info_0, ifn,
+                                    v1, vinfo, type, vectype);
+  gcall *call_stmt_1
+    = vect_create_complex_patt_stmt (node, opstmt_info_1, ifn,
+                                    v2, vinfo, type, vectype);

do not perform the SLP tree modification in those helpers, that looks odd.
In that function I see you do something like

+  vect_mark_pattern_stmts (stmt_info, call_stmt, vectype);
+  STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
+  STMT_VINFO_RELEVANT (call_stmt_info) = vect_used_in_scope;

I don't think that will fly given that the decision on whether the stmt is
used only by SLP is done only later (hybrid SLP).  Usually patterns allow
the original pieces still to be used, not sure why you try to be clever and
not allow this.  They might end up being used by other SLP instances as
well which are either already detected or only will be discovered later.

+  /* Re-order the loads, first in the SLP tree.  */
+  vect_delete_slp_nodes (&node, node);
+  vect_balance_load_nodes (node, group_size);
+  SLP_TREE_TWO_OPERATORS (node) = false;

+static bool
+vect_is_complex_transform_valid (slp_tree node, hash_set<stmt_vec_info> exempt)
+{

this function doesn't look generic - why's it not simply called from the
pattern recognition function?
I also do not understand why there cannot be any operations between the
complex add and the loads?

IMHO vect_delete_slp_nodes_1 shouldn't do any stmt removal (there's nothing
to DSE, only intermediate scalar operations become dead).  Again you are
hard-coding knowledge of the specific patterns in this function, instead
the pattern recognition function should deal with this.  Note the SLP tree
is now a graph so there can be multiple uses of a node.

Overall I'm not convinced the SLP pattern matching will work in the way you
implemented it.  At least it has to be moved later until after
vect_detect_hybrid_slp where we know whether stmts are also used by non-SLP
parts and when we have discovered all SLP instances.  Then the pattern
detection needs to apply more broadly, thus you should relax the
constraints you put on earlier and later operations.

I'd not do any of the tying to loads or order of loads - as far as I
understand the instructions do not perform loads themselves.  And your
pictures of the rotation effect only mention changed operations, not
shuffles.  That is, optimizing shuffles should be done by combine later (or
by the much sought SLP shuffle optimization).

For the add you probably can get it all done by combine.  For the FMA did
you think about handling it in regular pattern recognition instead?  Your
matching is so constrained that this shouldn't be too difficult.

Thanks,
Richard.

> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 2018-11-11  Tamar Christina  <tamar.christ...@arm.com>
>
>         * doc/md.texi (fcadd, fcmla): New.
>         * doc/sourcebuild.texi (vect_complex_rot_): New.
>         * internal-fn.def (FCOMPLEX_ADD_ROT90): New.
>         (FCOMPLEX_ADD_ROT270): New.
>         (FCOMPLEX_FMA_ROT0): New.
>         (FCOMPLEX_FMA_ROT180): New.
>         * optabs.def (fcadd90_optab, fcadd270_optab,
>         fcmla0_optab, fcmla180_optab): New.
>         * tree-vect-patterns.c (vect_mark_pattern_stmts): Export function.
>         * tree-vect-slp.c (vect_create_complex_patt_stmt): New.
>         (vect_is_complex_transform_valid): New.
>         (vect_reorder_based_on_load_chain): New.
>         (vect_determine_place_in_load_1): New.
>         (vect_determine_place_in_load): New.
>         (vect_balance_load_nodes_1): New.
>         (vect_balance_load_nodes): New.
>         (vect_match_expression_p): New.
>         (vect_detect_pair_op): New.
>         (vect_delete_slp_nodes_1): New.
>         (vect_delete_slp_nodes): New.
>         (vect_match_slp_patterns_1): New.
>         (vect_match_call_complex_add): New.
>         (vect_match_call_complex_mla_1): New.
>         (vect_match_call_complex_mla): New.
>         (vect_match_slp_patterns): New.
>         (vect_analyze_slp_instance): Use it.
>         * tree-vectorizer.h (vect_mark_pattern_stmts): Export function.
>
> --