The following is a prototype for how to represent load/store-lanes within SLP.  For now I've settled on having a single load node with multiple permute nodes, one for each loaded lane, and a single store node with a single permute node feeding it.  For

  for (int i = 0; i < 1024; ++i)
    {
      a[2*i] = b[2*i] + 7;
      a[2*i+1] = b[2*i+1] * 3;
    }

you have the following SLP graph where I explain how things are set up
and code-generated:

t.c:3:21: note: SLP graph after lowering permutations:
t.c:3:21: note: node 0x578af60 (max_nunits=1, refcnt=1) vector([4,4]) int
t.c:3:21: note: op template: *_6 = _7;
t.c:3:21: note: stmt 0 *_6 = _7;
t.c:3:21: note: stmt 1 *_12 = _13;
t.c:3:21: note: children 0x578aff8

This is the store node; it's marked with ldst_lanes = true during SLP
discovery.  This node code-generates

  .MASK_LEN_STORE_LANES (vectp_a.10_30, 32B, { -1, ... }, loop_len_38, 0, vect_array.9);

t.c:3:21: note: node 0x578aff8 (max_nunits=1, refcnt=1) vector([4,4]) int
t.c:3:21: note: op: VEC_PERM_EXPR
t.c:3:21: note: { }
t.c:3:21: note: lane permutation { 0[0] 1[0] }
t.c:3:21: note: children 0x578ab38 0x578ad98

This is the store-lane shuffling node; it's also marked with
ldst_lanes = true during SLP discovery.  This node code-generates

  vect_array.9[0] = vect__7.7_35;
  vect_array.9[1] = vect__13.8_33;

and it records virtual defs as vectorized defs.  This is a bit awkward;
on the store side it would be nicer to elide 'vect_array' and instead
make .MASK_LEN_STORE_LANES "varargs", having the vector defs as
arguments.

t.c:3:21: note: node 0x578ab38 (max_nunits=4, refcnt=2) vector([4,4]) int
t.c:3:21: note: op template: _7 = _5 + 7;
t.c:3:21: note: stmt 0 _7 = _5 + 7;
t.c:3:21: note: children 0x578abd0 0x578ac68
t.c:3:21: note: node (constant) 0x578ac68 (max_nunits=1, refcnt=1)
t.c:3:21: note: { 7 }
t.c:3:21: note: node 0x578b978 (max_nunits=2, refcnt=2) vector(2) int
t.c:3:21: note: op template: _13 = _11 * 3;
t.c:3:21: note: stmt 0 _13 = _11 * 3;
t.c:3:21: note: children 0x578b4b8 0x578b8e0
t.c:3:21: note: node (constant) 0x578b8e0 (max_nunits=1, refcnt=1)
t.c:3:21: note: { 3 }
t.c:3:21: note: node 0x578abd0 (max_nunits=4, refcnt=2) vector([4,4]) int
t.c:3:21: note: op: VEC_PERM_EXPR
t.c:3:21: note: stmt 0 _5 = *_4;
t.c:3:21: note: lane permutation { 0[0] }
t.c:3:21: note: children 0x578b090
t.c:3:21: note: node 0x578b4b8 (max_nunits=2, refcnt=2) vector(2) int
t.c:3:21: note: op: VEC_PERM_EXPR
t.c:3:21: note: stmt 0 _11 = *_10;
t.c:3:21: note: lane permutation { 0[1] }
t.c:3:21: note: children 0x578bbd8

The above two are the load-lane shufflings; they are marked with
ldst_lanes during load permutation lowering.  They code-generate

  _34 = vect_array.6[1];
  _36 = vect_array.6[0];

and they pick up vect_array.6 by looking at the virtual operand def in
the child node.  As with the store side this is awkward but much harder
to avoid since we lack support for multiple SSA defs.  What could be
done, though, is to write vect_array.6 into SSA and use BIT_FIELD_REF
for the extracts (yes, that would require ARRAY_TYPE gimple registers).

t.c:3:21: note: node 0x578b090 (max_nunits=4, refcnt=3) vector([4,4]) int
t.c:3:21: note: op template: _5 = *_4;
t.c:3:21: note: stmt 0 _5 = *_4;
t.c:3:21: note: stmt 1 _11 = *_10;
t.c:3:21: note: load permutation { 0 1 }

This is the load itself, marked ldst_lanes during load permutation
lowering.  It code-generates

  vect_array.6 = .MASK_LEN_LOAD_LANES (vectp_b.4_40, 32B, { -1, ... }, loop_len_38, 0);

The patch below, tested with just the above testcase on riscv, on
aarch64 with and without SVE, and on x86_64 (which has no
load/store-lanes), seems to work, though the theoretical support for
multi-lane lanes (aka using ld3 for 3 lane pairs and a total of 6
lanes) is definitely broken.
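
For reference, a source loop needing such multi-lane lanes would look
roughly like the following (a made-up example, not among what I tested
above); adjacent lane pairs perform the same operation, so an ld3/st3
on double-width elements could cover the three lane pairs, i.e.
ld_lanes_lanes would be 2 rather than 1 in the terms of the patch:

  void foo (int * __restrict a, int * __restrict b)
  {
    for (int i = 0; i < 1024; ++i)
      {
        /* Three two-lane groups; lanes within a group share the
           operation, so SLP discovery forms three two-lane nodes.  */
        a[6*i+0] = b[6*i+0] + 1;
        a[6*i+1] = b[6*i+1] + 1;
        a[6*i+2] = b[6*i+2] * 2;
        a[6*i+3] = b[6*i+3] * 2;
        a[6*i+4] = b[6*i+4] - 3;
        a[6*i+5] = b[6*i+5] - 3;
      }
  }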
Code generation for AdvSIMD is

        ld2     {v30.4s - v31.4s}, [x1], 32
        shl     v29.4s, v31.4s, 1
        add     v28.4s, v27.4s, v30.4s
        add     v29.4s, v29.4s, v31.4s
        st2     {v28.4s - v29.4s}, [x0], 32

for SVE

        ld2w    {z30.s - z31.s}, p7/z, [x1, x2, lsl 2]
        add     z30.s, z30.s, #7
        adr     z31.s, [z31.s, z31.s, lsl 1]
        st2w    {z30.s - z31.s}, p7, [x0, x2, lsl 2]

and for RVV

        vsetvli zero,a3,e32,m1,ta,ma
        vlseg2e32.v     v4,(a1)
        vsetvli t1,zero,e32,m1,ta,ma
        sub     a5,a5,a4
        add     a1,a1,a2
        vmul.vv v3,v1,v5
        vadd.vi v2,v4,7
        vsetvli zero,a3,e32,m1,ta,ma
        vsseg2e32.v     v2,(a0)

The way this SLP representation forces IFN and array-access code
generation to be separated introduces problems not present with
non-SLP.  Keeping the operation in one node would be manageable on the
store side by recording an incoming lane permutation for it.  On the
load side we could code-generate everything from the load node and
make a special contract between the "extract" and the load (but I
suppose that's already present with the patch below); the extract
would then become just a forwarder doing no code generation.

I'm posting this as-is; I will probably try the alternative to decide
which is more ugly.  I guess I'm hoping for magic other ideas here ;)

	* tree-vectorizer.h (_slp_tree::ldst_lanes): New flag to mark
	load, store and permute nodes.
	* tree-vect-slp.cc (vect_build_slp_instance): For stores, iff
	the target prefers store-lanes, discover single-lane sub-groups,
	do not perform interleaving lowering but mark the nodes with
	ldst_lanes.
	(vect_lower_load_permutations): When the target supports load
	lanes and the loads all fit the pattern, split out a single
	level of permutes only and mark the load and permute nodes
	with ldst_lanes.
	(vectorizable_slp_permutation_1): Code-generate the
	load-/store-lane permute parts.
	* tree-vect-stmts.cc (get_group_load_store_type): Support
	load/store-lanes for SLP.
	(vectorizable_store): Hack code generation for store-lanes.
	(vectorizable_load): Hack code generation for load-lanes.
---
 gcc/tree-vect-slp.cc   | 147 +++++++++++++++++++++++++++++++++++++++--
 gcc/tree-vect-stmts.cc |  68 ++++++++++++++-----
 gcc/tree-vectorizer.h  |   3 +
 3 files changed, 197 insertions(+), 21 deletions(-)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 2dc6d365303..a5648cacd70 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -120,6 +120,7 @@ _slp_tree::_slp_tree ()
   SLP_TREE_SIMD_CLONE_INFO (this) = vNULL;
   SLP_TREE_DEF_TYPE (this) = vect_uninitialized_def;
   SLP_TREE_CODE (this) = ERROR_MARK;
+  this->ldst_lanes = false;
   SLP_TREE_VECTYPE (this) = NULL_TREE;
   SLP_TREE_REPRESENTATIVE (this) = NULL;
   SLP_TREE_REF_COUNT (this) = 1;
@@ -3601,9 +3602,22 @@ vect_build_slp_instance (vec_info *vinfo,
          size >= 1.  */
       else if (is_a <loop_vec_info> (vinfo)
                && (i > 0 && i < group_size)
-               && !vect_slp_prefer_store_lanes_p (vinfo,
-                                                  stmt_info, group_size, i))
-        {
+               /*&& !vect_slp_prefer_store_lanes_p (vinfo,
+                                                  stmt_info, group_size, i)*/)
+        {
+          /* There are targets that cannot do even/odd interleaving schemes
+             so they absolutely need to use load/store-lanes.  For now
+             force single-lane SLP for them - they would be happy with
+             uniform power-of-two lanes (but depending on element size),
+             but even if we can use 'i' as indicator we would need to
+             backtrack when later lanes fail to discover with the same
+             granularity.  */
+          bool want_store_lanes
+            = vect_slp_prefer_store_lanes_p (vinfo,
+                                             stmt_info, group_size, 1 /*i*/);
+          if (want_store_lanes)
+            i = 1;
+
           if (dump_enabled_p ())
             dump_printf_loc (MSG_NOTE, vect_location,
                              "Splitting SLP group at stmt %u\n", i);
@@ -3637,7 +3651,10 @@ vect_build_slp_instance (vec_info *vinfo,
                                                      (max_nunits, end - start));
               rhs_nodes.safe_push (node);
               start = end;
-              end = group_size;
+              if (want_store_lanes)
+                end = start + 1;
+              else
+                end = group_size;
             }
           else
             {
@@ -3711,7 +3728,14 @@ vect_build_slp_instance (vec_info *vinfo,
          we end up with two inputs.  That scheme makes sure we end
          up with permutes satisfying the restriction of requiring at
          most two vector inputs to produce a single vector output
-         when the number of lanes is even.  */
+         when the number of lanes is even.
+         Leave the merge as is when we want to use store-lanes.  */
+      if (want_store_lanes)
+        {
+          perm->ldst_lanes = 1;
+          node->ldst_lanes = 1;
+        }
+      else
       while (SLP_TREE_CHILDREN (perm).length () > 2)
         {
           /* When we have three equal sized groups left the pairwise
@@ -4057,6 +4081,36 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
   if (exact_log2 (group_lanes) == -1
       && group_lanes != 3)
     return;
 
+  /* Verify if all load permutations can be implemented with a suitably
+     large element load-lanes operation.  */
+  unsigned ld_lanes_lanes = SLP_TREE_LANES (loads[0]);
+  if (exact_log2 (ld_lanes_lanes) == -1
+      /* ??? Need to handle ld_lanes_lanes != 1 correctly.  */
+      || vect_load_lanes_supported (SLP_TREE_VECTYPE (loads[0]),
+                                    group_lanes, false) == IFN_LAST)
+    ld_lanes_lanes = 0;
+  else
+    for (slp_tree load : loads)
+      {
+        if (SLP_TREE_LANES (load) != ld_lanes_lanes)
+          {
+            ld_lanes_lanes = 0;
+            break;
+          }
+        unsigned first = SLP_TREE_LOAD_PERMUTATION (load)[0];
+        if (first % ld_lanes_lanes != 0)
+          {
+            ld_lanes_lanes = 0;
+            break;
+          }
+        for (unsigned i = 1; i < SLP_TREE_LANES (load); ++i)
+          if (SLP_TREE_LOAD_PERMUTATION (load)[i] != first + i)
+            {
+              ld_lanes_lanes = 0;
+              break;
+            }
+      }
+
   for (slp_tree load : loads)
     {
       /* Leave masked or gather loads alone for now.  */
@@ -4071,7 +4125,8 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
          with a non-1:1 load permutation around instead of canonicalizing
          those into a load and a permute node.  Removing this early
          check would do such canonicalization.  */
-      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
+      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2
+          && ld_lanes_lanes == 0)
         continue;
 
       /* First build (and possibly re-use) a load node for the
@@ -4104,6 +4159,12 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
         final_perm.quick_push
           (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
 
+      if (ld_lanes_lanes != 0)
+        {
+          l0->ldst_lanes = true;
+          load->ldst_lanes = true;
+        }
+      else
       while (1)
         {
           unsigned group_lanes = SLP_TREE_LANES (l0);
@@ -9705,6 +9766,8 @@ vect_add_slp_permutation (vec_info *vinfo, gimple_stmt_iterator *gsi,
   node->push_vec_def (perm_stmt);
 }
 
+extern tree create_vector_array (tree elem_type, unsigned HOST_WIDE_INT nelems);
+
 /* Subroutine of vectorizable_slp_permutation.  Check whether the target
    can perform permutation PERM on the (1 or 2) input nodes in CHILDREN.
    If GSI is nonnull, emit the permutation there.
@@ -9758,6 +9821,78 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
 
   gcc_assert (perm.length () == SLP_TREE_LANES (node));
 
+  /* Load/store-lanes permute.  */
+  if (node->ldst_lanes)
+    {
+      if (!gsi)
+        /* ??? We should have this flag set on the permute before a store
+           or the permutes after a load.  Check vect_load_lanes_supported
+           or leave that to the store?  */
+        return 1;
+      /* For the load case there should be a single child and
+         it's vector def should be the virtual SSA def of the
+         .LOAD_LANES call.  Create the vect__5.7_48 = vect_array.6[n]
+         load here plus punning of the vector type (in case an element
+         represents multiple lanes).  */
+      if (children.length () == 1)
+        {
+          slp_tree child = children[0];
+          unsigned vec_num = (SLP_TREE_LANE_PERMUTATION (node)[0].second
+                              / SLP_TREE_LANES (node));
+          for (unsigned i = 0; i < SLP_TREE_NUMBER_OF_VEC_STMTS (node); ++i)
+            {
+              tree def = SLP_TREE_VEC_DEFS (child)[i];
+              gcc_assert (virtual_operand_p (def));
+              tree decl = gimple_call_lhs (SSA_NAME_DEF_STMT (def));
+              tree vec = make_ssa_name (TREE_TYPE (TREE_TYPE (decl)));
+              gassign *new_stmt
+                = gimple_build_assign (vec,
+                                       build4 (ARRAY_REF, TREE_TYPE (vec),
+                                               decl, build_int_cstu
+                                                 (unsigned_type_node, vec_num),
+                                               NULL_TREE, NULL_TREE));
+              vect_finish_stmt_generation (vinfo, NULL, new_stmt, gsi);
+              if (!useless_type_conversion_p (vectype, TREE_TYPE (vec)))
+                {
+                  vec = make_ssa_name (vectype);
+                  new_stmt = gimple_build_assign (vec, build1 (VIEW_CONVERT_EXPR, vectype, gimple_assign_lhs (new_stmt)));
+                  vect_finish_stmt_generation (vinfo, NULL, new_stmt, gsi);
+                }
+              node->push_vec_def (vec);
+            }
+          return SLP_TREE_VEC_DEFS (node).length ();
+        }
+      /* For the store case create the array and the set of stores from
+         all children vector defs.  The vector def should then be the
+         virtual def of the last store into the array.  */
+      else
+        {
+          unsigned vec_num = SLP_TREE_VEC_DEFS (children[0]).length ();
+          for (unsigned i = 0; i < vec_num; ++i)
+            {
+              tree decl = create_vector_array (op_vectype,
+                                               SLP_TREE_LANES (node)
+                                               / SLP_TREE_LANES (children[0]));
+              tree last_vdef = NULL_TREE;
+              for (unsigned j = 0; j < children.length (); ++j)
+                {
+                  slp_tree child = children[j];
+                  tree ref = build4 (ARRAY_REF, op_vectype, decl,
+                                     build_int_cstu (unsigned_type_node, j),
+                                     NULL_TREE, NULL_TREE);
+                  gimple *new_stmt
+                    = gimple_build_assign (ref, SLP_TREE_VEC_DEFS (child)[i]);
+                  vect_finish_stmt_generation (vinfo, NULL, new_stmt, gsi);
+                  last_vdef = make_ssa_name (gimple_vop (cfun));
+                  SSA_NAME_DEF_STMT (last_vdef) = new_stmt;
+                  gimple_set_vdef (new_stmt, last_vdef);
+                }
+              node->push_vec_def (last_vdef);
+            }
+          return vec_num;
+        }
+    }
+
   /* REPEATING_P is true if every output vector is guaranteed to use the
      same permute vector.  We can handle that case for both variable-length
      and constant-length vectors, but we only handle other cases for
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index fdcda0d2aba..7169ad4f9c9 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -146,7 +146,7 @@ record_stmt_cost (stmt_vector_for_cost *body_cost_vec, int count,
 
 /* Return a variable of type ELEM_TYPE[NELEMS].  */
 
-static tree
+tree
 create_vector_array (tree elem_type, unsigned HOST_WIDE_INT nelems)
 {
   return create_tmp_var (build_array_type_nelts (elem_type, nelems),
@@ -1508,7 +1508,8 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
   unsigned int nvectors;
 
   if (slp_node)
-    nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
+    /* ??? Incorrect for multi-lane lanes.  */
+    nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) / group_size;
   else
     nvectors = vect_get_num_copies (loop_vinfo, vectype);
 
@@ -2069,6 +2070,14 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
              is irrelevant for them.  */
           *alignment_support_scheme = dr_unaligned_supported;
         }
+      /* Try using LOAD/STORE_LANES.  */
+      else if (slp_node->ldst_lanes
+               && (*lanes_ifn
+                     = (vls_type == VLS_LOAD
+                        ? vect_load_lanes_supported (vectype, group_size, masked_p)
+                        : vect_store_lanes_supported (vectype, group_size,
+                                                      masked_p))) != IFN_LAST)
+        *memory_access_type = VMAT_LOAD_STORE_LANES;
       else
         *memory_access_type = VMAT_CONTIGUOUS;
 
@@ -8762,7 +8771,7 @@ vectorizable_store (vec_info *vinfo,
 
   if (memory_access_type == VMAT_LOAD_STORE_LANES)
     {
-      gcc_assert (!slp && grouped_store);
+      //gcc_assert (!slp && grouped_store);
       unsigned inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector stores, we'd like to cost with
          the total number of them once instead of cost each one by one. */
@@ -8784,7 +8793,7 @@ vectorizable_store (vec_info *vinfo,
               op = vect_get_store_rhs (next_stmt_info);
               if (costing_p)
                 update_prologue_cost (&prologue_cost, op);
-              else
+              else if (!slp)
                 {
                   vect_get_vec_defs_for_operand (vinfo, next_stmt_info,
                                                  ncopies, op,
@@ -8799,15 +8808,15 @@ vectorizable_store (vec_info *vinfo,
         {
           if (mask)
             {
-              vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
-                                             mask, &vec_masks,
-                                             mask_vectype);
+              if (slp_node)
+                vect_get_slp_defs (mask_node, &vec_masks);
+              else
+                vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
+                                               mask, &vec_masks,
+                                               mask_vectype);
               vec_mask = vec_masks[0];
             }
 
-          /* We should have catched mismatched types earlier.  */
-          gcc_assert (
-            useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
           dataref_ptr
             = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
                                         NULL, offset, &dummy,
@@ -8819,11 +8828,17 @@ vectorizable_store (vec_info *vinfo,
               gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
               /* DR_CHAIN is then used as an input to
                  vect_permute_store_chain().  */
+              if (!slp)
+                {
+                  /* We should have catched mismatched types earlier.  */
+                  gcc_assert (
+                    useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
               for (i = 0; i < group_size; i++)
                 {
                   vec_oprnd = (*gvec_oprnds[i])[j];
                   dr_chain[i] = vec_oprnd;
                 }
+              }
               if (mask)
                 vec_mask = vec_masks[j];
               dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr, gsi,
@@ -8836,8 +8851,16 @@ vectorizable_store (vec_info *vinfo,
                   continue;
                 }
 
+              tree vec_array;
+              if (slp)
+                {
+                  tree def = SLP_TREE_VEC_DEFS (SLP_TREE_CHILDREN (slp_node)[0])[j];
+                  vec_array = TREE_OPERAND (gimple_assign_lhs (SSA_NAME_DEF_STMT (def)), 0);
+                }
+              else
+                {
               /* Get an array into which we can store the individual vectors.  */
-              tree vec_array = create_vector_array (vectype, vec_num);
+              vec_array = create_vector_array (vectype, vec_num);
 
               /* Invalidate the current contents of VEC_ARRAY.  This should
                  become an RTL clobber too, which prevents the vector registers
@@ -8851,6 +8874,7 @@ vectorizable_store (vec_info *vinfo,
                   write_vector_array (vinfo, stmt_info, gsi, vec_oprnd,
                                       vec_array, i);
                 }
+              }
 
               tree final_mask = NULL;
               tree final_len = NULL;
@@ -8917,9 +8941,10 @@ vectorizable_store (vec_info *vinfo,
 
               /* Record that VEC_ARRAY is now dead.  */
               vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
-              if (j == 0)
+              if (j == 0 && !slp)
                 *vec_stmt = new_stmt;
-              STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+              if (!slp)
+                STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
             }
 
           if (costing_p)
@@ -10765,12 +10790,14 @@ vectorizable_load (vec_info *vinfo,
     {
       gcc_assert (alignment_support_scheme == dr_aligned
                   || alignment_support_scheme == dr_unaligned_supported);
-      gcc_assert (grouped_load && !slp);
+      //gcc_assert (grouped_load && !slp);
 
       unsigned int inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector loads, we'd like to cost with
          the total number of them once instead of cost each one by one. */
       unsigned int n_adjacent_loads = 0;
+      if (slp_node)
+        ncopies = slp_node->vec_stmts_size / vec_num;
       for (j = 0; j < ncopies; j++)
         {
           if (costing_p)
@@ -10884,6 +10911,15 @@ vectorizable_load (vec_info *vinfo,
               gimple_call_set_nothrow (call, true);
               vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 
+              if (slp)
+                {
+                  tree def = copy_ssa_name (gimple_vuse (call));
+                  gimple_set_vdef (call, def);
+                  SSA_NAME_DEF_STMT (def) = call;
+                  slp_node->push_vec_def (def);
+                }
+              else
+                {
               dr_chain.create (vec_num);
               /* Extract each vector into an SSA_NAME.  */
               for (i = 0; i < vec_num; i++)
@@ -10900,8 +10936,10 @@ vectorizable_load (vec_info *vinfo,
               vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
 
               dr_chain.release ();
+              }
 
-              *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
+              if (!slp_node)
+                *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
             }
 
           if (costing_p)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8eb3ec4df86..ac288541c51 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -222,6 +222,9 @@ struct _slp_tree {
   unsigned int lanes;
   /* The operation of this node.  */
   enum tree_code code;
+  /* Whether uses of this load or feeders of this store are suitable
+     for load/store-lanes.  */
+  bool ldst_lanes;
 
   int vertex;
 
-- 
2.35.3