On Tue, 12 Nov 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguent...@suse.de>
> > Sent: Tuesday, November 12, 2024 8:31 AM
> > To: Tamar Christina <tamar.christ...@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; RISC-V CI <patchworks...@rivosinc.com>
> > Subject: RE: [PATCH][v2] tree-optimization/117502 - VMAT_STRIDED_SLP vs
> > VMAT_ELEMENTWISE when considering gather
> > 
> > On Tue, 12 Nov 2024, Tamar Christina wrote:
> > 
> > > > -----Original Message-----
> > > > From: Richard Biener <rguent...@suse.de>
> > > > Sent: Monday, November 11, 2024 10:13 AM
> > > > To: gcc-patches@gcc.gnu.org
> > > > Cc: RISC-V CI <patchworks...@rivosinc.com>; Tamar Christina
> > > > <tamar.christ...@arm.com>
> > > > Subject: [PATCH][v2] tree-optimization/117502 - VMAT_STRIDED_SLP vs
> > > > VMAT_ELEMENTWISE when considering gather
> > > >
> > > > The following treats both the same when considering to use gather or
> > > > scatter for single-element interleaving accesses.
> > > >
> > > > This will cause
> > > >
> > > > FAIL: gcc.target/aarch64/sve/sve_iters_low_2.c scan-tree-dump-not vect
> > > > "LOOP VECTORIZED"
> > > >
> > > > where we now vectorize the loop with VNx4QI, I'll leave it to ARM folks
> > > > to investigate whether that's OK and to adjust the testcase or to see
> > > > where to adjust things to make the testcase not vectorized again.  The
> > > > original fix for which the testcase was introduced is still efffective.
> > >
> > > No, it shouldn't vectorize. It looks like this is a vectorizer bug,  with 
> > > this
> > > change we no longer see load costing for vect_body.
> > > As In, the vectorizer never asks for backend costing of the loads inside 
> > > the
> > > vector body.
> > 
> > I would doubt that, I see
> > 
> > *_7 4 times scalar_load costs 16 in body
> > *_8 1 times vector_load costs 4 in body
> > _9 != 0 1 times vector_stmt costs 2 in body
> > <unknown> 1 times vector_load costs 4 in prologue
> > .MASK_STORE (*_3, 32B, _28, t_19) 1 times vector_store costs 1 in body
> > 
> > so there is the vector_load cost in the body
> 
> Hmm didn't look at the dump file heh,
> But they never reach the backend costing hook though.. I'll track down why..

It should be passed to the target via vect_slp_analyze_operations doing

         if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
            {
              add_stmt_costs (loop_vinfo->vector_costs, &cost_vec);
              cost_vec.release ();

Richard.

> Thanks,
> Tamar
> 
> > 
> > > There are some curious things, like the costing done by vect_get_load_cost
> > doesn't seem
> > > to ever get slp_node so the cost structure used by 
> > > vect_slp_analyze_operations
> > > always has cost->node NULL.  The same is true for almost all the
> > record_stmt_cost
> > > calls in vectorizable_load.  So it's unclear to me if this behavior is 
> > > expected or a
> > bug.
> > 
> > Yes, slp_node is only passed down in very few select cases, so this is
> > expected (but you get a stmt_vinfo instead).
> > 
> > > Updating vect_get_load_cost to save the SLP node still doesn't get it 
> > > though.
> > >
> > > Also looking at the costing code, are these not wrong?:
> > >
> > >             if (costing_p)
> > >               {
> > >                 unsigned int cnunits = vect_nunits_for_cost (vectype);
> > >                 inside_cost
> > >                   = record_stmt_cost (cost_vec, cnunits, scalar_load,
> > >                                       stmt_info, 0, vect_body);
> > >                 continue;
> > >               }
> > >
> > > As if the VMAT is on the slp_node the costing needs to get it:
> > >
> > >             if (costing_p)
> > >               {
> > >                 unsigned int cnunits = vect_nunits_for_cost (vectype);
> > >                 inside_cost
> > >                   = record_stmt_cost (cost_vec, cnunits, scalar_load,
> > >                                       stmt_info, slp_node, 0, vect_body);
> > >                 continue;
> > >               }
> > >
> > > And a few others?
> > 
> > They are not "wrong", but indeed the information the target gets can be
> > incomplete if it wants to look at the VMAT_*.  The costing API is far
> > from perfect.
> > 
> > > But I haven't yet figured out why the vectorizer no longer asks to costs 
> > > the loads
> > > and stores in the vector loop.
> > 
> > It does - for me at least, but yes, without slp_node and thus without
> > the target knowing it's a gather load.
> > 
> > > If it's a bug I'll keep debugging, if it's not, how should the backend 
> > > cost these?
> > 
> > If the backend needs to know it's a gather then we need to pass down
> > the slp_node.  Note we present gather to the backend as N scalar loads,
> > this is from times where we did not have VMAT_* at all.  Possibly
> > when we know the target has a gather we don't need this but could
> > expect the target to look at the VMAT_* and thus simply use
> > a single vector_load for each actual gather we emit.  Changing this
> > requires changing all targets of course.
> > 
> > As said, the costing API is a mess.  But it's only step 2, after
> > only-SLP, to rework.
> > 
> > Richard.
> > 
> > >
> > > Thanks,
> > > Tamar
> > > >
> > > > Bootstrap and regtest running on x86_64-unknown-linux-gnu.
> > > >
> > > >         PR tree-optimization/117502
> > > >         * tree-vect-stmts.cc (get_group_load_store_type): Also consider
> > > >         VMAT_STRIDED_SLP when checking to use gather/scatter for
> > > >         single-element interleaving access.
> > > >         * tree-vect-loop.cc (update_epilogue_loop_vinfo):
> > > > STMT_VINFO_STRIDED_P
> > > >         can be classified as VMAT_GATHER_SCATTER, so update DR_REF for
> > > >         those as well.
> > > > ---
> > > >  gcc/tree-vect-loop.cc  | 1 +
> > > >  gcc/tree-vect-stmts.cc | 3 ++-
> > > >  2 files changed, 3 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > index 6cfce5aa7e1..f50ee2e958e 100644
> > > > --- a/gcc/tree-vect-loop.cc
> > > > +++ b/gcc/tree-vect-loop.cc
> > > > @@ -12295,6 +12295,7 @@ update_epilogue_loop_vinfo (class loop
> > *epilogue,
> > > > tree advance)
> > > >          refs that get_load_store_type classified as 
> > > > VMAT_GATHER_SCATTER.  */
> > > >        auto vstmt_vinfo = vect_stmt_to_vectorize (stmt_vinfo);
> > > >        if (STMT_VINFO_MEMORY_ACCESS_TYPE (vstmt_vinfo) ==
> > > > VMAT_GATHER_SCATTER
> > > > +         || STMT_VINFO_STRIDED_P (vstmt_vinfo)
> > > >           || STMT_VINFO_GATHER_SCATTER_P (vstmt_vinfo))
> > > >         {
> > > >           /* ???  As we copy epilogues from the main loop incremental
> > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > index fa44e19f163..b09c016949f 100644
> > > > --- a/gcc/tree-vect-stmts.cc
> > > > +++ b/gcc/tree-vect-stmts.cc
> > > > @@ -2274,7 +2274,8 @@ get_group_load_store_type (vec_info *vinfo,
> > > > stmt_vec_info stmt_info,
> > > >       on nearby locations.  Or, even if it's a win over scalar code,
> > > >       it might not be a win over vectorizing at a lower VF, if that
> > > >       allows us to use contiguous accesses.  */
> > > > -  if (*memory_access_type == VMAT_ELEMENTWISE
> > > > +  if ((*memory_access_type == VMAT_ELEMENTWISE
> > > > +       || *memory_access_type == VMAT_STRIDED_SLP)
> > > >        && single_element_p
> > > >        && loop_vinfo
> > > >        && vect_use_strided_gather_scatters_p (stmt_info, loop_vinfo,
> > > > --
> > > > 2.43.0
> > >
> > 
> > --
> > Richard Biener <rguent...@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Reply via email to