On Tue, 12 Nov 2024, Tamar Christina wrote: > > -----Original Message----- > > From: Richard Biener <rguent...@suse.de> > > Sent: Tuesday, November 12, 2024 8:31 AM > > To: Tamar Christina <tamar.christ...@arm.com> > > Cc: gcc-patches@gcc.gnu.org; RISC-V CI <patchworks...@rivosinc.com> > > Subject: RE: [PATCH][v2] tree-optimization/117502 - VMAT_STRIDED_SLP vs > > VMAT_ELEMENTWISE when considering gather > > > > On Tue, 12 Nov 2024, Tamar Christina wrote: > > > > > > -----Original Message----- > > > > From: Richard Biener <rguent...@suse.de> > > > > Sent: Monday, November 11, 2024 10:13 AM > > > > To: gcc-patches@gcc.gnu.org > > > > Cc: RISC-V CI <patchworks...@rivosinc.com>; Tamar Christina > > > > <tamar.christ...@arm.com> > > > > Subject: [PATCH][v2] tree-optimization/117502 - VMAT_STRIDED_SLP vs > > > > VMAT_ELEMENTWISE when considering gather > > > > > > > > The following treats both the same when considering to use gather or > > > > scatter for single-element interleaving accesses. > > > > > > > > This will cause > > > > > > > > FAIL: gcc.target/aarch64/sve/sve_iters_low_2.c scan-tree-dump-not vect > > > > "LOOP VECTORIZED" > > > > > > > > where we now vectorize the loop with VNx4QI, I'll leave it to ARM folks > > > > to investigate whether that's OK and to adjust the testcase or to see > > > > where to adjust things to make the testcase not vectorized again. The > > > > original fix for which the testcase was introduced is still efffective. > > > > > > No, it shouldn't vectorize. It looks like this is a vectorizer bug, with > > > this > > > change we no longer see load costing for vect_body. > > > As In, the vectorizer never asks for backend costing of the loads inside > > > the > > > vector body. > > > > I would doubt that, I see > > > > *_7 4 times scalar_load costs 16 in body > > *_8 1 times vector_load costs 4 in body > > _9 != 0 1 times vector_stmt costs 2 in body > > <unknown> 1 times vector_load costs 4 in prologue > > .MASK_STORE (*_3, 32B, _28, t_19) 1 times vector_store costs 1 in body > > > > so there is the vector_load cost in the body > > Hmm didn't look at the dump file heh, > But they never reach the backend costing hook though.. I'll track down why..
It should be passed to the target via vect_slp_analyze_operations doing if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo)) { add_stmt_costs (loop_vinfo->vector_costs, &cost_vec); cost_vec.release (); Richard. > Thanks, > Tamar > > > > > > There are some curious things, like the costing done by vect_get_load_cost > > doesn't seem > > > to ever get slp_node so the cost structure used by > > > vect_slp_analyze_operations > > > always has cost->node NULL. The same is true for almost all the > > record_stmt_cost > > > calls in vectorizable_load. So it's unclear to me if this behavior is > > > expected or a > > bug. > > > > Yes, slp_node is only passed down in very few select cases, so this is > > expected (but you get a stmt_vinfo instead). > > > > > Updating vect_get_load_cost to save the SLP node still doesn't get it > > > though. > > > > > > Also looking at the costing code, are these not wrong?: > > > > > > if (costing_p) > > > { > > > unsigned int cnunits = vect_nunits_for_cost (vectype); > > > inside_cost > > > = record_stmt_cost (cost_vec, cnunits, scalar_load, > > > stmt_info, 0, vect_body); > > > continue; > > > } > > > > > > As if the VMAT is on the slp_node the costing needs to get it: > > > > > > if (costing_p) > > > { > > > unsigned int cnunits = vect_nunits_for_cost (vectype); > > > inside_cost > > > = record_stmt_cost (cost_vec, cnunits, scalar_load, > > > stmt_info, slp_node, 0, vect_body); > > > continue; > > > } > > > > > > And a few others? > > > > They are not "wrong", but indeed the information the target gets can be > > incomplete if it wants to look at the VMAT_*. The costing API is far > > from perfect. > > > > > But I haven't yet figured out why the vectorizer no longer asks to costs > > > the loads > > > and stores in the vector loop. > > > > It does - for me at least, but yes, without slp_node and thus without > > the target knowing it's a gather load. > > > > > If it's a bug I'll keep debugging, if it's not, how should the backend > > > cost these? > > > > If the backend needs to know it's a gather then we need to pass down > > the slp_node. Note we present gather to the backend as N scalar loads, > > this is from times where we did not have VMAT_* at all. Possibly > > when we know the target has a gather we don't need this but could > > expect the target to look at the VMAT_* and thus simply use > > a single vector_load for each actual gather we emit. Changing this > > requires changing all targets of course. > > > > As said, the costing API is a mess. But it's only step 2, after > > only-SLP, to rework. > > > > Richard. > > > > > > > > Thanks, > > > Tamar > > > > > > > > Bootstrap and regtest running on x86_64-unknown-linux-gnu. > > > > > > > > PR tree-optimization/117502 > > > > * tree-vect-stmts.cc (get_group_load_store_type): Also consider > > > > VMAT_STRIDED_SLP when checking to use gather/scatter for > > > > single-element interleaving access. > > > > * tree-vect-loop.cc (update_epilogue_loop_vinfo): > > > > STMT_VINFO_STRIDED_P > > > > can be classified as VMAT_GATHER_SCATTER, so update DR_REF for > > > > those as well. > > > > --- > > > > gcc/tree-vect-loop.cc | 1 + > > > > gcc/tree-vect-stmts.cc | 3 ++- > > > > 2 files changed, 3 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > > > > index 6cfce5aa7e1..f50ee2e958e 100644 > > > > --- a/gcc/tree-vect-loop.cc > > > > +++ b/gcc/tree-vect-loop.cc > > > > @@ -12295,6 +12295,7 @@ update_epilogue_loop_vinfo (class loop > > *epilogue, > > > > tree advance) > > > > refs that get_load_store_type classified as > > > > VMAT_GATHER_SCATTER. */ > > > > auto vstmt_vinfo = vect_stmt_to_vectorize (stmt_vinfo); > > > > if (STMT_VINFO_MEMORY_ACCESS_TYPE (vstmt_vinfo) == > > > > VMAT_GATHER_SCATTER > > > > + || STMT_VINFO_STRIDED_P (vstmt_vinfo) > > > > || STMT_VINFO_GATHER_SCATTER_P (vstmt_vinfo)) > > > > { > > > > /* ??? As we copy epilogues from the main loop incremental > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > > > > index fa44e19f163..b09c016949f 100644 > > > > --- a/gcc/tree-vect-stmts.cc > > > > +++ b/gcc/tree-vect-stmts.cc > > > > @@ -2274,7 +2274,8 @@ get_group_load_store_type (vec_info *vinfo, > > > > stmt_vec_info stmt_info, > > > > on nearby locations. Or, even if it's a win over scalar code, > > > > it might not be a win over vectorizing at a lower VF, if that > > > > allows us to use contiguous accesses. */ > > > > - if (*memory_access_type == VMAT_ELEMENTWISE > > > > + if ((*memory_access_type == VMAT_ELEMENTWISE > > > > + || *memory_access_type == VMAT_STRIDED_SLP) > > > > && single_element_p > > > > && loop_vinfo > > > > && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo, > > > > -- > > > > 2.43.0 > > > > > > > -- > > Richard Biener <rguent...@suse.de> > > SUSE Software Solutions Germany GmbH, > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)