On Fri, 24 May 2024, Richard Biener wrote:

> This is the second merge proposed from the SLP vectorizer branch.
> I have again managed without adding and using --param vect-single-lane-slp
> but instead this provides always enabled functionality.
> 
> This makes us use SLP reductions (a group of reductions) for the
> case where the group size is one.  This basically means we try
> to use SLP for all reductions.
> 
> I've kept the series close to changes how they are on the branch
> but in the end I'll squash it, having separate commits for review
> eventually helps identifying common issues we will run into.  In
> particular we lack full SLP support for several reduction kinds
> and the branch has more enabling patches than in this series.
> For example 4/5 makes sure we use shifts and direct opcode
> reductions in the reduction epilog for SLP reductions but doesn't
> bother to try covering the general case but enables it only
> for the single-element group case to avoid regressions
> in gcc.dg/vect/reduc-{mul,or}_[12].c testcases.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu, I've also
> successfully built SPEC CPU 2017.  This posting should trigger
> arm & riscv pre-checkin CI.
> 
> There's one ICE in gcc.target/i386/pr51235.c I discovered late
> that I will investigate and address after the weekend.

I've fixed this now.

On aarch64 and arm there's

FAIL: gcc.dg/vect/slp-reduc-3.c scan-tree-dump-times vect "VEC_PERM_EXPR" 
0

which is a testism, I _think_ due to a bogus vect_load_lanes check
in that line.  The code is as expected not using a SLP reduction of
two lanes due to the widen-sum pattern used.  It might be that we
somehow fail to use load-lanes when vectorizing the load with SLP
which means that for SLP reductions we fail to consider
load-lanes as override.  I think we should leave this FAIL, we need to
work to get load-lanes vectorization from SLP anyway.  To fix this
the load-permutation followup I have in the works will be necessary.

I also see

FAIL: gcc.target/aarch64/sve/dot_1.c scan-assembler-times \\twhilelo\\t 8
FAIL: gcc.target/aarch64/sve/reduc_4.c scan-assembler-not \\tfadd\\t
FAIL: gcc.target/aarch64/sve/sad_1.c scan-assembler-times 
\\tudot\\tz[0-9]+\\.s, z[0-9]+\\.b, z[0-9]+\\.b\\n 2

but scan-assemblers are not my favorite.  For example dot_1.c has
twice as many whilelo, but I'm not sure what goes wrong.

There are quite some regressions reported for RISC-V, I looked at the
ICEs and fixed them but I did not investigate any of the assembly
scanning FAILs.

I'll re-spin the series with the fixes tomorrow.
If anybody wants to point out something I should investigate please
speak up.

Thanks,
Richard.

> This change should be more straight-forward than the previous one,
> still comments are of course welcome.  After pushed I will followup
> with changes to enable single-lane SLP reductions for various
> COND_EXPR reductions as well as double-reduction support and
> in-order reduction support (also all restricted to single-lane
> for the moment).
>
> Thanks,
> Richard.
> 
> --
> 
> The following performs single-lane SLP discovery for reductions.
> This exposes a latent issue with reduction SLP in outer loop
> vectorization and makes gcc.dg/vect/vect-outer-4[fgkl].c FAIL
> execution.
> 
>       * tree-vect-slp.cc (vect_build_slp_tree_2): Only multi-lane
>       discoveries are reduction chains and need special backedge
>       treatment.
>       (vect_analyze_slp): Fall back to single-lane SLP discovery
>       for reductions.  Make sure to try single-lane SLP reduction
>       for all reductions as fallback.
> ---
>  gcc/tree-vect-slp.cc | 71 +++++++++++++++++++++++++++++++++-----------
>  1 file changed, 54 insertions(+), 17 deletions(-)
> 
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index c7ed520b629..73cc69d85ce 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -1907,7 +1907,8 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
>           /* Reduction chain backedge defs are filled manually.
>              ???  Need a better way to identify a SLP reduction chain PHI.
>              Or a better overall way to SLP match those.  */
> -         if (all_same && def_type == vect_reduction_def)
> +         if (stmts.length () > 1
> +             && all_same && def_type == vect_reduction_def)
>             skip_args[loop_latch_edge (loop)->dest_idx] = true;
>         }
>       else if (def_type != vect_internal_def)
> @@ -3905,9 +3906,10 @@ vect_analyze_slp (vec_info *vinfo, unsigned 
> max_tree_size)
>         }
>  
>        /* Find SLP sequences starting from groups of reductions.  */
> -      if (loop_vinfo->reductions.length () > 1)
> +      if (loop_vinfo->reductions.length () > 0)
>       {
> -       /* Collect reduction statements.  */
> +       /* Collect reduction statements we can combine into
> +          a SLP reduction.  */
>         vec<stmt_vec_info> scalar_stmts;
>         scalar_stmts.create (loop_vinfo->reductions.length ());
>         for (auto next_info : loop_vinfo->reductions)
> @@ -3920,25 +3922,60 @@ vect_analyze_slp (vec_info *vinfo, unsigned 
> max_tree_size)
>                    reduction path.  In that case we'd have to reverse
>                    engineer that conversion stmt following the chain using
>                    reduc_idx and from the PHI using reduc_def.  */
> -               && STMT_VINFO_DEF_TYPE (next_info) == vect_reduction_def
> -               /* Do not discover SLP reductions for lane-reducing ops, that
> -                  will fail later.  */
> -               && (!(g = dyn_cast <gassign *> (STMT_VINFO_STMT (next_info)))
> +               && STMT_VINFO_DEF_TYPE (next_info) == vect_reduction_def)
> +             {
> +               /* Do not discover SLP reductions combining lane-reducing
> +                  ops, that will fail later.  */
> +               if (!(g = dyn_cast <gassign *> (STMT_VINFO_STMT (next_info)))
>                     || (gimple_assign_rhs_code (g) != DOT_PROD_EXPR
>                         && gimple_assign_rhs_code (g) != WIDEN_SUM_EXPR
> -                       && gimple_assign_rhs_code (g) != SAD_EXPR)))
> -             scalar_stmts.quick_push (next_info);
> +                       && gimple_assign_rhs_code (g) != SAD_EXPR))
> +                 scalar_stmts.quick_push (next_info);
> +               else
> +                 {
> +                   /* Do SLP discovery for single-lane reductions.  */
> +                   vec<stmt_vec_info> stmts;
> +                   vec<stmt_vec_info> roots = vNULL;
> +                   vec<tree> remain = vNULL;
> +                   stmts.create (1);
> +                   stmts.quick_push (next_info);
> +                   vect_build_slp_instance (vinfo,
> +                                            slp_inst_kind_reduc_group,
> +                                            stmts, roots, remain,
> +                                            max_tree_size, &limit,
> +                                            bst_map, NULL);
> +                 }
> +             }
>           }
> -       if (scalar_stmts.length () > 1)
> +       /* Save for re-processing on failure.  */
> +       vec<stmt_vec_info> saved_stmts = scalar_stmts.copy ();
> +       vec<stmt_vec_info> roots = vNULL;
> +       vec<tree> remain = vNULL;
> +       if (scalar_stmts.length () <= 1
> +           || !vect_build_slp_instance (loop_vinfo,
> +                                        slp_inst_kind_reduc_group,
> +                                        scalar_stmts, roots, remain,
> +                                        max_tree_size, &limit, bst_map,
> +                                        NULL))
>           {
> -           vec<stmt_vec_info> roots = vNULL;
> -           vec<tree> remain = vNULL;
> -           vect_build_slp_instance (loop_vinfo, slp_inst_kind_reduc_group,
> -                                    scalar_stmts, roots, remain,
> -                                    max_tree_size, &limit, bst_map, NULL);
> +           if (scalar_stmts.length () <= 1)
> +             scalar_stmts.release ();
> +           /* Do SLP discovery for single-lane reductions.  */
> +           for (auto stmt_info : saved_stmts)
> +             {
> +               vec<stmt_vec_info> stmts;
> +               vec<stmt_vec_info> roots = vNULL;
> +               vec<tree> remain = vNULL;
> +               stmts.create (1);
> +               stmts.quick_push (vect_stmt_to_vectorize (stmt_info));
> +               vect_build_slp_instance (vinfo,
> +                                        slp_inst_kind_reduc_group,
> +                                        stmts, roots, remain,
> +                                        max_tree_size, &limit,
> +                                        bst_map, NULL);
> +             }
> +           saved_stmts.release ();
>           }
> -       else
> -         scalar_stmts.release ();
>       }
>      }
>  
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)

Reply via email to