On Thu, 13 Mar 2025, Richard Sandiford wrote:

> vect_slp_prefer_store_lanes_p allows an SLP tree to be split even
> if the tree could use store-lanes, provided that one of the new
> groups would operate on full vectors for each scalar iteration.
> That heuristic is no longer firing for gcc.target/aarch64/pr99873_2.c.
>
> The test contains:
>
> void __attribute ((noipa))
> foo (uint64_t *__restrict x, uint64_t *__restrict y, int n)
> {
>   for (int i = 0; i < n; i += 4)
>     {
>       x[i] += y[i];
>       x[i + 1] += y[i + 1];
>       x[i + 2] |= y[i + 2];
>       x[i + 3] |= y[i + 3];
>     }
> }
>
> and wants us to use one V2DI operation for the first two elements and
> another V2DI operation for the second two, rather than LD4s and ST4s.
> This gives:
>
> .L3:
>         ldp     q31, q0, [x0]
>         add     w3, w3, 1
>         ldp     q29, q30, [x1], 32
>         orr     v30.16b, v0.16b, v30.16b
>         add     v31.2d, v29.2d, v31.2d
>         stp     q31, q30, [x0], 32
>         cmp     w2, w3
>         bhi     .L3
>
> instead of:
>
> .L4:
>         ld4     {v28.2d - v31.2d}, [x2]
>         ld4     {v24.2d - v27.2d}, [x3], 64
>         add     v24.2d, v28.2d, v24.2d
>         add     v25.2d, v29.2d, v25.2d
>         orr     v26.16b, v30.16b, v26.16b
>         orr     v27.16b, v31.16b, v27.16b
>         st4     {v24.2d - v27.2d}, [x2], 64
>         cmp     x2, x5
>         bne     .L4
>
> The first loop only handles half the amount of data per iteration,
> but it requires far fewer internal permutations.
>
> One reason the heuristic no longer fired appears to be a typo: the
> call to vect_slp_prefer_store_lanes_p was passing "1" as the new
> group size, instead of "i".
>
> However, even with that fixed, vect_analyze_slp later falls back on
> single-lane SLP with load/store lanes.  I think that heuristic too
> should use vect_slp_prefer_store_lanes_p (but it otherwise looks good).
>
> The question is whether every load should pass
> vect_slp_prefer_store_lanes_p or whether just one is enough.
> I don't have an example that would make the call either way,
> so I went for the latter, given that it's the smaller change
> from the status quo.
>
> This also appears to improve fotonik3d and roms from SPEC2017
> (cross-checked against two different systems).
>
> Bootstrapped & regression-tested on aarch64-linux-gnu, where it
> fixes the pr99873_2.c regression.  OK to install?
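For concreteness, here is a minimal standalone sketch of the group-size
arithmetic described above, applied to the test case.  This is
illustrative only: prefer_store_lanes_sketch is a made-up name, and the
real vect_slp_prefer_store_lanes_p works on vec_info/stmt_vec_info
operands and also asks the target whether store-lanes is supported at
all.

#include <stdbool.h>
#include <stdio.h>

/* Sketch: return true if store-lanes should be preferred over
   splitting a group of GROUP_SIZE accesses at NEW_GROUP_SIZE,
   given vectors of VECTOR_LANES elements.  */
static bool
prefer_store_lanes_sketch (unsigned int group_size,
                           unsigned int new_group_size,
                           unsigned int vector_lanes)
{
  /* Allow the split (i.e. do not prefer store-lanes) if one of the
     two new groups would fill whole vectors in each scalar
     iteration.  */
  if (new_group_size % vector_lanes == 0
      || (group_size - new_group_size) % vector_lanes == 0)
    return false;
  return true;
}

int
main (void)
{
  /* pr99873_2.c: a group of four uint64_t accesses; V2DI has two
     lanes.  Splitting 4 into 2+2 fills a V2DI per scalar iteration,
     so the split is allowed and ST4 is not forced...  */
  printf ("new size 2: %d\n", prefer_store_lanes_sketch (4, 2, 2));
  /* ...whereas the typo passed 1 as the new group size, and a 1+3
     split leaves partial vectors on both sides, so store-lanes kept
     winning.  */
  printf ("new size 1: %d\n", prefer_store_lanes_sketch (4, 1, 2));
  return 0;
}

If this reading is right, the same arithmetic explains why fixing the
"1" restores the LDP/STP loop shown above.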
RISC-V also depends heavily on load/store-lanes, so please run this
through the riscv CI as well.  Otherwise this is all heuristics, so
I'm fine with this change.

Thanks,
Richard.

> Richard
>
>
> gcc/
> 	* tree-vect-slp.cc (vect_build_slp_instance): Pass the new group
> 	size (i) rather than 1 to vect_slp_prefer_store_lanes_p.
> 	(vect_analyze_slp): Only force the use of load-lanes and
> 	store-lanes if that is preferred for at least one load/store pair.
> ---
>  gcc/tree-vect-slp.cc | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 9e09f8e980b..ecb4a6521de 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -4161,7 +4161,7 @@ vect_build_slp_instance (vec_info *vinfo,
>  	   && ! STMT_VINFO_SLP_VECT_ONLY (stmt_info)
>  	   && compare_step_with_zero (vinfo, stmt_info) > 0
>  	   && vect_slp_prefer_store_lanes_p (vinfo, stmt_info, NULL_TREE,
> -					     masked_p, group_size, 1));
> +					     masked_p, group_size, i));
>    if (want_store_lanes || force_single_lane)
>      i = 1;
>
> @@ -5095,7 +5095,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size,
>  	  && !SLP_INSTANCE_TREE (instance)->ldst_lanes)
>  	{
>  	  slp_tree slp_root = SLP_INSTANCE_TREE (instance);
> -	  int group_size = SLP_TREE_LANES (slp_root);
> +	  unsigned int group_size = SLP_TREE_LANES (slp_root);
>  	  tree vectype = SLP_TREE_VECTYPE (slp_root);
>
>  	  stmt_vec_info rep_info = SLP_TREE_REPRESENTATIVE (slp_root);
> @@ -5138,6 +5138,7 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size,
>  	  if (loads_permuted)
>  	    {
>  	      bool can_use_lanes = true;
> +	      bool prefer_load_lanes = false;
>  	      FOR_EACH_VEC_ELT (loads, j, load_node)
>  		if (STMT_VINFO_GROUPED_ACCESS
>  		      (SLP_TREE_REPRESENTATIVE (load_node)))
> @@ -5165,9 +5166,21 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size,
>  			can_use_lanes = false;
>  			break;
>  		      }
> +		    /* Make sure that the target would prefer store-lanes
> +		       for at least one of the loads.
> +
> +		       ??? Perhaps we should instead require this for
> +		       all loads?  */
> +		    prefer_load_lanes
> +		      = (prefer_load_lanes
> +			 || SLP_TREE_LANES (load_node) == group_size
> +			 || (vect_slp_prefer_store_lanes_p
> +			     (vinfo, stmt_vinfo,
> +			      STMT_VINFO_VECTYPE (stmt_vinfo), masked,
> +			      group_size, SLP_TREE_LANES (load_node))));
>  		  }
>
> -	      if (can_use_lanes)
> +	      if (can_use_lanes && prefer_load_lanes)
>  		{
>  		  if (dump_enabled_p ())
>  		    dump_printf_loc (MSG_NOTE, vect_location,

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)