On Sun, Nov 11, 2018 at 11:26 AM Tamar Christina <tamar.christ...@arm.com> wrote:
>
> Hi All,
>
> This patch adds support for SLP vectorization of Complex number arithmetic
> with rotations along the Argand plane.
>
> For this to work it has to recognize two statements in parallel as it needs
> to match against operations towards both the real and imaginary numbers.
> The instructions generated by this change are also only available in their
> vector forms.  As such I add them as pattern statements to the stmt.  The
> BB is left untouched and so the scalar loop is untouched.
>
> The instructions also require the loads to be contiguous and so when a
> match is made, and the code decides it is able to do the replacement, it
> re-organizes the SLP tree such that the loads are contiguous.  Since this
> operation cannot be undone it only does this if it's sure that the
> resulting loads can be made contiguous.
>
> It gets this guarantee by only allowing the replacement if between the
> matched expression and the loads there are no other expressions it doesn't
> know, aside from type casts.
>
> When a match occurs over multiple expressions, the dead statements are
> immediately removed from the tree to prevent verification failures later.
>
> Because the pattern matching is done after SLP analysis has analyzed the
> usage of the instruction, it also marks the new instructions as used and
> the old ones as unused.
>
> When a replacement is done a new internal function is generated which the
> back-end has to expand to the proper instruction sequences.
>
> For now, this patch only adds support for Complex addition with rotate and
> Complex FMLA with rotation of 0 and 180.  However the intention is to add
> support for Complex subtraction and Complex multiplication in the future.
>
> Concretely, this generates
>
>         ldr     q1, [x1, x3]
>         ldr     q2, [x0, x3]
>         ldr     q0, [x2, x3]
>         fcmla   v0.2d, v1.2d, v2.2d, #180
>         fcmla   v0.2d, v1.2d, v2.2d, #90
>         str     q0, [x2, x3]
>         add     x3, x3, 16
>         cmp     x3, 3200
>         bne     .L2
>         ret
>
> now instead of
>
>         add     x3, x2, 31
>         sub     x4, x3, x1
>         sub     x3, x3, x0
>         cmp     x4, 62
>         mov     x4, 62
>         ccmp    x3, x4, 0, hi
>         bls     .L5
>         mov     x3, x0
>         mov     x0, x1
>         add     x1, x2, 3200
>         .p2align 3,,7
> .L3:
>         ld2     {v16.2d - v17.2d}, [x2]
>         ld2     {v2.2d - v3.2d}, [x3], 32
>         ld2     {v0.2d - v1.2d}, [x0], 32
>         mov     v7.16b, v17.16b
>         fmul    v6.2d, v0.2d, v3.2d
>         fmla    v7.2d, v1.2d, v3.2d
>         fmla    v6.2d, v1.2d, v2.2d
>         fmls    v7.2d, v2.2d, v0.2d
>         fadd    v4.2d, v6.2d, v16.2d
>         mov     v5.16b, v7.16b
>         st2     {v4.2d - v5.2d}, [x2], 32
>         cmp     x2, x1
>         bne     .L3
>         ret
> .L5:
>         add     x4, x2, 8
>         add     x6, x0, 8
>         add     x5, x1, 8
>         mov     x3, 0
>         .p2align 3,,7
> .L2:
>         ldr     d1, [x6, x3]
>         ldr     d4, [x1, x3]
>         ldr     d5, [x5, x3]
>         ldr     d3, [x0, x3]
>         fmul    d2, d4, d1
>         ldr     d0, [x4, x3]
>         fmadd   d0, d5, d1, d0
>         ldr     d1, [x2, x3]
>         fmadd   d2, d5, d3, d2
>         fmsub   d0, d4, d3, d0
>         fadd    d1, d1, d2
>         str     d1, [x2, x3]
>         str     d0, [x4, x3]
>         add     x3, x3, 16
>         cmp     x3, 3200
>         bne     .L2
>         ret
>
> Bootstrap and Regtests on aarch64-none-linux-gnu, arm-none-gnueabihf and
> x86_64-pc-linux-gnu are still ongoing but the previous patch showed no
> regressions.
>
> The instructions have also been tested on aarch64-none-elf and arm-none-eabi
> on an Armv8.3-a model with -march=armv8.3-a+fp16 and all tests pass.
>
> Ok for trunk?
I first have a few high-level questions.  Complex addition when the complex
values are in vectors looks trivial to me and maps to vector addition.  Your
md.texi description of fcadd mentions a rotation 'm' but doesn't further
explain the details - I suppose fcadd@var{m}@var{n}3 really means
fcadd90@var{n}3 and fcadd270@var{n}3, thus the rotation being encoded into
the pattern name rather than having a mode m and an explicit operand?  If
that is so please list those patterns explicitly and separately.

Then I'm not sure why the vectorizer needs to be bothered with this?  Can
the instructions not be recognized by combine from the "rotate" and the
vector add?  That is, what happens if the user writes generic vector code
for this?

Can you also explicitly spell out what "rotation by 90 or 270" means for
the operation?  If I decipher the text OK then only one operand (operand 2)
is rotated and operand 1 is unchanged?  Or is the result rotated instead?

So it looks like the patch adds a general facility to recognize patterns on
the SLP graph after SLP discovery.  This complicates matters a lot it seems.

Comments (unordered) about the actual patch:

+static slp_tree
+vect_match_slp_patterns_1 (slp_tree node, vec_info *vinfo,
+                          unsigned int group_size, complex_pattern_t patt_info)
+{
+  int i;
+  stmt_vec_info stmt_info;
+  hash_set<stmt_vec_info> exempt;
+  auto_vec<tree> v1, v2;
+  vec<stmt_vec_info> scalar_stmts = SLP_TREE_SCALAR_STMTS (node);
+  bool keep_looking = false;
+
+  if (group_size != 2)
+    return node;

this seems quite arbitrary given once the user unrolls the loops you'll be
faced with two complex operations.  Did you at least think about how to
generalize this without trying all permutations?  What about vector ISAs
where two or more complex numbers fit inside a vector?  (SVE?)

+  /* If we haven't matched anything, look deeper.
     */
+  if (keep_looking)
+    {
+      slp_tree child;
+      FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (node), i, child)
+       SLP_TREE_CHILDREN (node)[i]
+         = vect_match_slp_patterns_1 (child, vinfo, group_size, patt_info);
+    }

usually pattern matching should work from defs to uses, thus recurse first?
And,

+  keep_looking = gimple_assign_cast_p (stmt_0)
+                || gimple_assign_load_p (stmt_0)
+                || gimple_store_p (stmt_0);

loads will never have any SLP children.  Stores are always the very first
SLP node only.  This effectively means you never look deeper than the very
first arithmetic operation, just skipping an eventual widening/shortening?!

+  /* Now push new statements.  */
+  SLP_TREE_SCALAR_STMTS (node).truncate (0);
+
+  gcall *call_stmt_0
+    = vect_create_complex_patt_stmt (node, opstmt_info_0, ifn,
+                                    v1, vinfo, type, vectype);
+  gcall *call_stmt_1
+    = vect_create_complex_patt_stmt (node, opstmt_info_1, ifn,
+                                    v2, vinfo, type, vectype);

do not perform the SLP tree modification in those helpers, that looks odd.
In that function I see you do something like

+  vect_mark_pattern_stmts (stmt_info, call_stmt, vectype);
+  STMT_VINFO_RELEVANT (stmt_info) = vect_unused_in_scope;
+  STMT_VINFO_RELEVANT (call_stmt_info) = vect_used_in_scope;

I don't think that will fly given that the decision on whether the stmt is
used only by SLP is done only later (hybrid SLP).  Usually patterns allow
the original pieces still to be used, not sure why you try to be clever and
not allow this.  They might end up being used by other SLP instances as
well which are either already detected or only will be discovered later.

+  /* Re-order the loads, first in the SLP tree.  */
+  vect_delete_slp_nodes (&node, node);
+  vect_balance_load_nodes (node, group_size);
+  SLP_TREE_TWO_OPERATORS (node) = false;

+static bool
+vect_is_complex_transform_valid (slp_tree node, hash_set<stmt_vec_info> exempt)
+{

this function doesn't look generic - why's it not simply called from the
pattern recognition function?
I also do not understand why there cannot be any operations between the
complex add and the loads?

IMHO vect_delete_slp_nodes_1 shouldn't do any stmt removal (there's nothing
to DSE, only intermediate scalar operations become dead).  Again you are
hard-coding knowledge of the specific patterns in this function, instead
the pattern recognition function should deal with this.  Note the SLP tree
is now a graph so there can be multiple uses of a node.

Overall I'm not convinced the SLP pattern matching will work in the way you
implemented it.  At least it has to be moved later until after
vect_detect_hybrid_slp where we know whether stmts are also used by non-SLP
parts and when we have discovered all SLP instances.  Then the pattern
detection needs to apply more broadly, thus you should relax the
constraints you put on earlier and later operations.

I'd not do any of the tying to loads or order of loads - as far as I
understand the instructions do not perform loads themselves.  And your
pictures of the rotation effect only mention changed operations, not
shuffles.  That is, optimizing shuffles should be done by combine later (or
by the much sought SLP shuffle optimization).

For the add you probably can get it all done by combine.  For the FMA did
you think about handling it in regular pattern recognition instead?  Your
matching is so constrained that this shouldn't be too difficult.

Thanks,
Richard.

> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 2018-11-11  Tamar Christina  <tamar.christ...@arm.com>
>
>         * doc/md.texi (fcadd, fcmla): New.
>         * doc/sourcebuild.texi (vect_complex_rot_): New.
>         * internal-fn.def (FCOMPLEX_ADD_ROT90): New.
>         (FCOMPLEX_ADD_ROT270): New.
>         (FCOMPLEX_FMA_ROT0): New.
>         (FCOMPLEX_FMA_ROT180): New.
>         * optabs.def (fcadd90_optab, fcadd270_optab,
>         fcmla0_optab, fcmla180_optab): New.
>         * tree-vect-patterns.c (vect_mark_pattern_stmts): Export function.
>         * tree-vect-slp.c (vect_create_complex_patt_stmt): New.
>         (vect_is_complex_transform_valid): New.
>         (vect_reorder_based_on_load_chain): New.
>         (vect_determine_place_in_load_1): New.
>         (vect_determine_place_in_load): New.
>         (vect_balance_load_nodes_1): New.
>         (vect_balance_load_nodes): New.
>         (vect_match_expression_p): New.
>         (vect_detect_pair_op): New.
>         (vect_delete_slp_nodes_1): New.
>         (vect_delete_slp_nodes): New.
>         (vect_match_slp_patterns_1): New.
>         (vect_match_call_complex_add): New.
>         (vect_match_call_complex_mla_1): New.
>         (vect_match_call_complex_mla): New.
>         (vect_match_slp_patterns): New.
>         (vect_analyze_slp_instance): Use it.
>         * tree-vectorizer.h (vect_mark_pattern_stmts): Export function.
>
> --