On Sat, 25 Oct 2025, Tamar Christina wrote:

> Consider this simple loop
> 
> long long arr[1024];
> long long *f()
> {
>     int i;
>     for (i = 0; i < 1024; i++)
>       if (arr[i] == 42)
>         break;
>     return arr + i;
> }
> 
> where today we generate this at -O3:
> 
> .L2:
>         add     v29.4s, v29.4s, v25.4s
>         add     v28.4s, v28.4s, v26.4s
>         cmp     x2, x1
>         beq     .L9
> .L6:
>         ldp     q30, q31, [x1], 32
>         cmeq    v30.2d, v30.2d, v27.2d
>         cmeq    v31.2d, v31.2d, v27.2d
>         addhn   v31.2s, v31.2d, v30.2d
>         fmov    x3, d31
>         cbz     x3, .L2
> 
> which is highly inefficient.  This loop has 3 IVs (PR119577): one normal
> scalar one and two vector ones, one counting up and one counting down
> (PR115120), and it has forced unrolling due to an increase in VF because of
> the mismatch in modes between the IVs and the loop body (PR119860).
> 
> This patch fixes all three of these issues and we now generate:
> 
> .L2:
>         add     w2, w2, 2
>         cmp     w2, 1024
>         beq     .L13
> .L5:
>         ldr     q31, [x1]
>         add     x1, x1, 16
>         cmeq    v31.2d, v31.2d, v30.2d
>         umaxp   v31.4s, v31.4s, v31.4s
>         fmov    x0, d31
>         cbz     x0, .L2
> 
> or with sve
> 
> .L3:
>         add     x1, x1, x3
>         whilelo p7.d, w1, w2
>         b.none  .L11
> .L4:
>         ld1d    z30.d, p7/z, [x0, x1, lsl 3]
>         cmpeq   p7.d, p7/z, z30.d, z31.d
>         ptest   p15, p7.b
>         b.none  .L3
> 
> which shows that the new scalar IV is efficiently merged with the loop
> control one by IVopts.
> 
> To accomplish this the patch reworks how we handle "forced live inductions"
> with regard to vectorization.
> 
> Prior to this change, when we vectorize a loop with an early break, any
> induction variables would be forced live.  Forcing live means that even
> though the values aren't used inside the loop, we must preserve them such
> that when we start the scalar loop we can pass the correct initial values.
> 
> However this had several side-effects:
> 
> 1. We must be able to vectorize the induction.
> 2. The induction variable participates in VF determination.  This would
>    often lead to a higher VF than would normally have been needed.  As such
>    the vector loops become less profitable.
> 3. IVcanon, for loops with a constant iteration count, inserts a downward
>    counting IV in addition to the upward one in order to support things like
>    doloops.  Normally this duplicate IV is removed by IVopts, but IVopts
>    doesn't understand vector inductions.  As such we end up with 3 IVs.
> 
> This patch fixes all three of these by choosing instead to "vectorize" the
> forced live IVs as scalar statements.  This means we recreate the scalar IV
> and advance it inside the loop by step * VF rather than by the step.
> 
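> As a rough illustration in plain C (only a sketch of the idea, not actual
> vectorizer output; VF is picked arbitrarily and the names are made up), the
> recreated IV behaves like:
> 
>   long long arr[1024];
> 
>   long long *f_sketch (void)
>   {
>     const int VF = 2;              /* assumed vectorization factor */
>     int i = 0;                     /* the recreated scalar IV */
>     for (; i < 1024; i += 1 * VF)  /* bumped by step * VF per vector iter */
>       {
>         int hit = 0;
>         for (int lane = 0; lane < VF; lane++)  /* stands in for the vector body */
>           if (arr[i + lane] == 42)
>             hit = 1;
>         if (hit)
>           break;  /* early exit: i is the start of the last vector iteration;
>                      the scalar loop re-runs from here to find the element.  */
>       }
>     return arr + i;
>   }
> 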
> We have to create a new scalar IV because both LEN and Masked based
> vectorization can have loops where the VF changes between loop iterations,
> which makes it impossible to determine in the early exits how many elements
> you've actually processed.  For LEN based loops it is easy to see how, and
> for Masked loops this can happen with First Faults, in which case you can
> get a partial result which needs to be handled.
> 
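> Sketched in C-like pseudo-code (select_vl here is just a placeholder
> spelling for the .SELECT_VL internal function, not a real API), the bump in
> the LEN case therefore becomes
> 
>   len = select_vl (remaining, VF);   /* lanes processed this iteration */
>   i_scal += step * len;
> 
> rather than the fixed i_scal += step * VF of the non-LEN case.
> 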
> The new scalar IVs are represented as an SLP tree which has only DEFs and no
> scalar statements:
> 
> note:   === vect_analyze_slp ===
> note:   Final scalar def SLP tree for instance 0x52717e0:
> note:   node 0x52a7f40 (max_nunits=1, refcnt=1)
> note:   op template: i_10 = PHI <i_7(7), 0(2)>
> note:         { }
> 
> This SLP tree is treated essentially as an external def for most of the SLP
> handling, but is completely ignored by vect_make_slp_decision as it has no
> statements.
> 
> As we now have a scalar SLP node type, all code handling forced live
> inductions is removed, and all code scattered around vectorizable_induction
> and vectorizable_live_operation is moved to a central location in a new
> function, vectorizable_scalar_induction, which is plumbed into the relevant
> places in statement analysis and code generation.
> 
> This makes the logic easier to follow as we no longer need to prepare
> statements in different parts of the code for use in
> vectorizable_live_operation.
> 
> The new code also supports all induction types.  As mentioned before, the
> different IVs are later harmonized by IVopts based on addressing mode costs,
> so we get better codegen as well, particularly for SVE which has support for
> complex addressing modes.
> 
> Lastly, this also includes an easier-to-follow fix for a latent bug on trunk:
> we were not correctly handling the main exit IVs because we were unable to
> distinguish between main exits that require the last element and those that
> require the one-before-last element.
> 
> Bootstrapped and regtested on aarch64-none-linux-gnu,
> arm-none-linux-gnueabihf, and x86_64-pc-linux-gnu
> -m32 and -m64 with no issues.
> 
> Ok for master?

Comments below.  In general I like that we move special cases to
special places.  I also believe it's good to generate better code for
vect_used_only_live (or "forced live") inductions.

I was hoping that what exactly we put into 
LOOP_VINFO_EARLY_BREAKS_LIVE_IVS would become clearer.  IIRC
this is about vect_update_ivs_after_vectorizer not handling the
early break situation correctly.  So I had the idea that we'd
get the "actually executed number of scalar iterations" which
we in vect_update_ivs_after_vectorizer simply compute by
vector_niter * vf from the new magic IVs and thus not need the
other IVs live at all for early break (just the magic new one)?
But then this mixes with trying to handle vect_used_only_live,
thus actually live IVs?  It feels like disentangling both issues
would have helped me here.

> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>       PR tree-optimization/115120
>       PR tree-optimization/119577
>       PR tree-optimization/119860
>       * tree-vect-loop-manip.cc (vect_do_peeling): Store niters_vector_mult_vf
>       inside loop_vinfo.
>       * tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Add some comments.
>       (vect_update_nonlinear_iv): Use unsigned_type_for so that the function
>       works for both vector and scalar types.
>       (vect_build_plus_adjustment, dissolve_scalar_iv_phi_nodes,
>       vectorizable_scalar_induction): New.
>       (vectorizable_induction, vectorizable_live_operation): Remove early
>       break IV handling code.
>       * tree-vect-slp.cc (vect_analyze_slp): Create new scalar IV SLP nodes.
>       (vect_slp_analyze_node_operations, vect_slp_analyze_operations,
>       vect_schedule_slp_node, vect_schedule_scc): Support scalar IV nodes and
>       instances.
>       * tree-vect-stmts.cc (vect_stmt_relevant_p): Don't force early break IVs
>       live but instead mark them.
>       (can_vectorize_live_stmts): Remove analysis of forced live nodes as they
>       no longer exist.
>       (vect_analyze_stmt, vect_transform_stmt): Support scalar inductions.
>       * tree-vectorizer.h (enum stmt_vec_info_type): Add scalar_iv_info_type.
>       (enum slp_instance_kind): Add slp_inst_kind_scalar_iv.
>       (class _loop_vec_info): Add niters_vector_mult_vf.
>       (LOOP_VINFO_VECTOR_NITERS_VF, vectorizable_scalar_induction): New.
> 
> gcc/testsuite/ChangeLog:
> 
>       PR tree-optimization/115120
>       PR tree-optimization/119577
>       PR tree-optimization/119860
>       * gcc.dg/vect/vect-early-break_39.c: Update.
>       * gcc.target/aarch64/sve/peel_ind_9.c: Update.
> 
> ---
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c b/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> index b3f40b8c9ba49e41bd283e46a462238c3b5825ef..bc862ad20e68db8f3c0ba6facf47e13a56a7cd6d 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-early-break_39.c
> @@ -23,5 +23,6 @@ unsigned test4(unsigned x, unsigned n)
>   return ret;
>  }
>  
> -/* cannot safely vectorize this due due to the group misalignment.  */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 0 "vect" } } */
> +/* AArch64 will scalarize the load and is able to vectorize it.  */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 1 "vect" { target aarch64*-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops in function" 0 "vect" { target { ! aarch64*-*-* } } } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> index cc904e88170f072e1d3c6be86643d99a7cd5cb12..14c7ea07b28e16dddebe4dc3743f90b32723c324 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/peel_ind_9.c
> @@ -20,6 +20,7 @@ foo (void)
>  }
>  
>  /* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> -/* Peels using a scalar loop.  */
> -/* { dg-final { scan-tree-dump-not "pfa_iv_offset" "vect" } } */
> +/* Peels using fully masked loop.  */
> +/* { dg-final { scan-tree-dump "pfa_iv_offset" "vect" } } */
> +/* { dg-final { scan-tree-dump "misalignment for fully-masked loop" "vect" } } */
>  /* { dg-final { scan-tree-dump "Alignment of access forced using peeling" "vect" } } */
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index 96ca273c24680556f16cdc9e465f490d7fcdb8a4..345d09d46185cb9e85e0b0d80ddff784a2802837 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -3578,6 +3578,10 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree 
> niters, tree nitersm1,
>        else
>       vect_gen_vector_loop_niters_mult_vf (loop_vinfo, *niters_vector,
>                                            &niters_vector_mult_vf);
> +
> +      /* Store niters_vector_mult_vf for later use.  */
> +      LOOP_VINFO_VECTOR_NITERS_VF (loop_vinfo) = niters_vector_mult_vf;
> +
>        /* Update IVs of original loop as if they were advanced by
>        niters_vector_mult_vf steps.  */
>        gcc_checking_assert (vect_can_advance_ivs_p (loop_vinfo));
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 9320bf8e878d22faa5e202311649ffc05dbd6094..003c801c0193365f29fda0ffafd5b06c1ea8ee29 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -2596,6 +2596,9 @@ again:
>        if (SLP_TREE_DEF_TYPE (SLP_INSTANCE_TREE (instance)) != 
> vect_internal_def)
>       continue;
>  
> +      if (SLP_TREE_SCALAR_STMTS (SLP_INSTANCE_TREE (instance)).is_empty ())
> +     continue;
> +
>        stmt_vec_info vinfo;
>        vinfo = SLP_TREE_SCALAR_STMTS (SLP_INSTANCE_TREE (instance))[0];
>        if (! STMT_VINFO_GROUPED_ACCESS (vinfo))
> @@ -8945,6 +8948,11 @@ vect_peel_nonlinear_iv_init (gimple_seq* stmts, tree 
> init_expr,
>    unsigned prec = TYPE_PRECISION (type);
>    switch (induction_type)
>      {
> +    /* neg inductions are typically not used for loop termination conditions
> +       but are typically implemented as b = -b.  That is, every scalar
> +       iteration b is negated.  That means that for the initial value of b we
> +       will have to determine whether the number of skipped iterations is a
> +       multiple of 2 because every 2 scalar iterations we are back at "b".  */
>      case vect_step_op_neg:
>        if (TREE_INT_CST_LOW (skip_niters) % 2)
>       init_expr = gimple_build (stmts, NEGATE_EXPR, type, init_expr);
> @@ -9060,9 +9068,7 @@ vect_update_nonlinear_iv (gimple_seq* stmts, tree 
> vectype,
>      case vect_step_op_mul:
>        {
>       /* Use unsigned mult to avoid UD integer overflow.  */
> -     tree uvectype
> -       = build_vector_type (unsigned_type_for (TREE_TYPE (vectype)),
> -                            TYPE_VECTOR_SUBPARTS (vectype));
> +     tree uvectype = unsigned_type_for (vectype);
>       vec_def = gimple_convert (stmts, uvectype, vec_def);
>       vec_step = gimple_convert (stmts, uvectype, vec_step);
>       vec_def = gimple_build (stmts, MULT_EXPR, uvectype,
> @@ -9404,6 +9410,423 @@ vectorizable_nonlinear_induction (loop_vec_info 
> loop_vinfo,
>    return true;
>  }
>  
> +/* Create an adjustment from BASE to BASE + OFFSET with type TYPE.
> +   If BASE and OFFSET are not the same type, emit conversions into STMTS.  If
> +   BASE is a POINTER_TYPE_P then use a POINTER_PLUS_EXPR instead of PLUS_EXPR
> +   and convert OFFSET to the appropriate type.  */
> +static tree
> +vect_build_plus_adjustment (gimple_seq *stmts, tree type, tree base,
> +                         tree offset)
> +{
> +  if (POINTER_TYPE_P (type))
> +    {
> +      offset = gimple_convert (stmts, sizetype, offset);
> +      return gimple_build (stmts, POINTER_PLUS_EXPR, type, base,
> +                        gimple_convert (stmts, sizetype, offset));
> +    }
> +  else
> +    {
> +      offset = gimple_convert (stmts, type, offset);
> +      return gimple_build (stmts, PLUS_EXPR, type, base, offset);
> +    }
> +}
> +
> +/* This function is only useful for updating PHI nodes wrt early break
> +   blocks.  This function updates blocks such as
> +
> +   BB x:
> +     y = PHI<DEF, DEF, ...>
> +
> +   into
> +     y = NEW_IV
> +
> +   or
> +     y = RECOMP_IV
> +
> +   depending on whether the value occurs in a block where we expect the value
> +   of the current scalar iteration or the previous one.  If we need the value
> +   of RECOMP_IV then E_STMTS are first emitted in order to create the values.
> +   Otherwise we elide it to keep the entries in the emitted BB cleaner.  */
> +void static
> +dissolve_scalar_iv_phi_nodes (loop_vec_info loop_vinfo, tree def,
> +                           tree new_iv, tree recomp_iv, gimple_seq &e_stmts)
> +{
> +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +
> +  imm_use_iterator imm_iter;
> +  gimple *use_stmt;
> +  use_operand_p use_p;
> +  FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, def)
> +    if (!is_gimple_debug (use_stmt)
> +     && !flow_bb_inside_loop_p (loop, gimple_bb (use_stmt)))
> +      FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
> +         {

The indenting is off here.

> +           edge e = gimple_phi_arg_edge (as_a <gphi *> (use_stmt),
> +                                         phi_arg_index_from_use (use_p));
> +           gcc_assert (loop_exit_edge_p (loop, e));
> +           auto exit_gsi = gsi_last_nondebug_bb (e->dest);
> +           auto stmt = gsi_stmt (exit_gsi);
> +           /* We need to insert at the end, but can't do so across the
> +              jump.  */
> +           if (stmt && !is_a <gcond *>(stmt))
> +             gsi_next (&exit_gsi);
> +           tree lhs_phi = gimple_phi_result (use_stmt);
> +           auto gsi = gsi_for_stmt (use_stmt);
> +           remove_phi_node (&gsi, false);
> +           tree iv_var = new_iv;
> +           if (recomp_iv
> +               && !LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo)
> +               && LOOP_VINFO_IV_EXIT (loop_vinfo) == e)
> +             {
> +               /* Emit any extra statement that may be needed to use
> +                  recomp_iv.  */
> +               if (e_stmts)
> +                 gsi_insert_seq_before (&exit_gsi, e_stmts, GSI_SAME_STMT);
> +               iv_var = recomp_iv;
> +               e_stmts = NULL;
> +             }
> +           gimple *copy = gimple_build_assign (lhs_phi, iv_var);
> +           gsi_insert_before (&exit_gsi, copy, GSI_SAME_STMT);
> +           break;
> +         }
> +}
> +
> +/* Function vectorizable_scalar_induction
> +
> +   Check if STMT_INFO performs a scalar induction computation that can be
> +   used by early break vectorization where we need to know the starting
> +   value of the IV.  If VEC_STMT is also passed, vectorize the induction
> +   PHI: create
> +   a "vectorized" scalar phi to replace it, put it in VEC_STMT, and add it to
> +   the same basic block.
> +   Return true if STMT_INFO is vectorizable in this way.  */
> +
> +bool
> +vectorizable_scalar_induction (loop_vec_info loop_vinfo,
> +                            stmt_vec_info stmt_info,
> +                            slp_tree slp_node,
> +                            stmt_vector_for_cost *cost_vec)
> +{
> +  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  class loop *iv_loop;
> +  tree vec_def;
> +  edge pe = loop_preheader_edge (loop);
> +  tree vec_init;
> +  gphi *induction_phi;
> +  tree induc_def, vec_dest;
> +  tree init_expr, step_expr, iv_step;
> +  tree niters_skip = NULL_TREE;
> +  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +  gimple_stmt_iterator si;
> +
> +  gphi *phi = dyn_cast <gphi *> (stmt_info->stmt);

as_a <gphi *>

> +
> +  enum vect_induction_op_type induction_type
> +    = STMT_VINFO_LOOP_PHI_EVOLUTION_TYPE (stmt_info);
> +
> +  /* FORNOW. Only handle nonlinear induction in the same loop.  */
> +  if (nested_in_vect_loop_p (loop, stmt_info)
> +      && induction_type != vect_step_op_add)
> +    {
> +      if (dump_enabled_p ())
> +     dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                      "nonlinear induction in nested loop.\n");

do we handle early breaks in nested cycles?

> +      return false;
> +    }
> +
> +  iv_loop = loop;
> +  gcc_assert (iv_loop == (gimple_bb (phi))->loop_father);
> +
> +  step_expr = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (stmt_info);
> +  init_expr = vect_phi_initial_value (phi);
> +  gcc_assert (init_expr != NULL);
> +
> +  /* A scalar IV with no step means it doesn't evolve.  Just
> +     set it to 0.  This makes follow-up adjustments easier as 0 just folds
> +     them away.  */

Hmm, it means it isn't an induction we were able to analyze?  Because
we put even non-vect_induction_def PHIs into 
LOOP_VINFO_EARLY_BREAKS_LIVE_IVS?  The condition is

  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
      && is_a <gphi *> (stmt)
      && gimple_bb (stmt) == LOOP_VINFO_LOOP (loop_vinfo)->header
      && ((! VECTORIZABLE_CYCLE_DEF (STMT_VINFO_DEF_TYPE (stmt_info))
          && ! *live_p)
          || STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def))

and you build a scalar induction SLP instance when the def type
isn't vect_induction_def.  It could be a !live PHI of a reduction
or an unknown def type?  Should we play safe here and fail if the
def type isn't vect_induction_def?  STMT_VINFO_LOOP_PHI_EVOLUTION_TYPE
is initialized to zero (vect_step_op_add).


> +  if (!step_expr)
> +    build_zero_cst (TREE_TYPE (init_expr));
> +
> +  if (cost_vec) /* transformation not required.  */
> +    {
> +      switch (induction_type)
> +     {
> +       case vect_step_op_add:
> +       case vect_step_op_mul:
> +       case vect_step_op_shl:
> +       case vect_step_op_shr:
> +       case vect_step_op_neg:
> +         break;
> +       default:
> +         {
> +           if (dump_enabled_p ())
> +             dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                      "Unsupported scalar induction type for early break.");
> +           return false;
> +         }
> +     }
> +
> +      /* We don't perform any costing here because it's impossible to tell
> +      the sequence of instructions needed for the different induction types.
> +      In addition the expectation is that IVopts will unify the IVs so the
> +      final cost isn't known here yet.  Lastly most of the cost models will
> +      interpret scalar instructions during vect_body as vector statements and
> +      as such the cost of the loop becomes quite unrealistic.   */
> +
> +      SLP_TREE_TYPE (slp_node) = scalar_iv_info_type;
> +      DUMP_VECT_SCOPE ("vectorizable_scalar_induction");
> +      return true;
> +    }
> +
> +  /* Transform.  */
> +
> +  /* Compute a scalar variable that represents the number of scalar
> +     iterations the vector code has performed at the end of the relevant
> +     exit.  For early exits we transform to the value at the start of the
> +     last vector iteration.  For the non-early exit the value depends on
> +     whether the main exit needs i or the next value of i, i.e. the last
> +     value or the value after the last.  This is determined by which LCSSA
> +     variable is found in the latch exit.  */
> +
> +  if (dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location, "transform scalar induction phi.\n");
> +
> +  pe = loop_preheader_edge (iv_loop);
> +  /* Find the first insertion point in the BB.  */
> +  basic_block bb = gimple_bb (phi);
> +  si = gsi_after_labels (bb);
> +
> +  gimple_seq stmts = NULL;
> +  gimple_seq init_stmts = NULL;
> +  gimple_seq iv_stmts = NULL;
> +
> +  niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> +  tree ty_niters_skip = niters_skip ? TREE_TYPE (niters_skip) : NULL_TREE;
> +
> +  /* Create the induction-phi that defines the induction-operand.  */
> +  tree scalar_type = TREE_TYPE (PHI_RESULT (phi));
> +  vec_dest = vect_get_new_vect_var (scalar_type, vect_scalar_var, "scal_iv_");
> +  induction_phi = create_phi_node (vec_dest, iv_loop->header);
> +  induc_def = PHI_RESULT (induction_phi);
> +
> +  /* Create the iv update inside the loop.  */
> +  stmts = NULL;
> +  tree tree_vf = build_int_cst (scalar_type, vf);
> +  if (SCALAR_FLOAT_TYPE_P (scalar_type))
> +    tree_vf = gimple_convert (&init_stmts, scalar_type, tree_vf);

should be in the else of the SELECT_VL_P case

> +
> +  /* For loop len targets we have to use .SELECT_VL (ivtmp_33, VF); instead
> +     of just += VF as the VF can change in between two loop iterations.  */
> +  if (LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
> +    {
> +      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
> +      tree_vf = vect_get_loop_len (loop_vinfo, NULL, lens, 1,
> +                                NULL_TREE, 0, 0);
> +    }
> +
> +  /* Create the following def-use cycle:
> +     loop prolog:
> +     scalar_init = ...
> +     scalar_step = ...
> +     loop:
> +     scalar_iv = PHI <scalar_init, vec_loop>
> +     ...
> +     STMT
> +     ...
> +     vec_loop = scalar_iv + scalar_step;  */
> +  switch (induction_type)
> +  {
> +    case vect_step_op_add:
> +      {
> +     if (niters_skip)
> +       vec_init
> +         = vect_build_plus_adjustment (&init_stmts, scalar_type, init_expr,
> +                     gimple_convert (&init_stmts, scalar_type,
> +                       gimple_build (&init_stmts, MINUS_EXPR, ty_niters_skip,
> +                                     build_zero_cst (ty_niters_skip),
> +                                     niters_skip)));
> +     else
> +       vec_init = init_expr;
> +
> +     /* Let's do step * VF as the induction step to get a chance to CSE it.  */
> +     vec_def
> +         = vect_build_plus_adjustment (&stmts, scalar_type, induc_def,
> +             gimple_build (&stmts, MULT_EXPR, scalar_type, step_expr,
> +                           tree_vf));
> +     break;
> +      }
> +    case vect_step_op_mul:
> +    case vect_step_op_shl:
> +    case vect_step_op_shr:
> +    case vect_step_op_neg:
> +      {
> +     if (niters_skip)
> +       vec_init = vect_peel_nonlinear_iv_init (&init_stmts, init_expr,
> +                                               niters_skip, step_expr,
> +                                               induction_type);
> +     else
> +       vec_init = init_expr;
> +
> +     iv_step = vect_create_nonlinear_iv_step (&init_stmts, init_expr, vf,
> +                                              induction_type);
> +     vec_def = vect_update_nonlinear_iv (&stmts, scalar_type, induc_def,
> +                                         iv_step, induction_type);
> +     break;
> +      }
> +    default:
> +      gcc_unreachable ();
> +  }
> +
> +  /* If early break then we have to create a new PHI which we can use as
> +     an offset to adjust the induction reduction in early exits.
> +
> +     This is because when peeling for alignment using masking, the first
> +     few elements of the vector can be inactive.  As such if we find the
> +     entry in the first iteration we have to adjust the starting point of
> +     the scalar code.
> +
> +     We do this by creating a new scalar PHI that keeps track of whether
> +     we are the first iteration of the loop (with the additional masking)
> +     or whether we have taken a loop iteration already.
> +
> +    The generated sequence:
> +
> +    pre-header:
> +     bb1:
> +       i_1 = <number of leading inactive elements>
> +
> +     header:
> +     bb2:
> +       i_2 = PHI <i_1(bb1), 0(latch)>
> +       …
> +
> +     early-exit:
> +     bb3:
> +       i_3 = iv_step * i_2 + PHI<vector-iv>
> +
> +     The first part of the adjustment to create i_1 and i_2 are done here
> +     and the last part creating i_3 is done in
> +     vectorizable_live_operations when the induction extraction is
> +     materialized.  */
> +  if (niters_skip
> +      && !LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo))
> +    {
> +      tree ty_skip_niters = TREE_TYPE (niters_skip);
> +      tree break_lhs_phi
> +     = vect_get_new_vect_var (ty_skip_niters, vect_scalar_var,
> +                              "pfa_iv_offset");
> +      gphi *nphi = create_phi_node (break_lhs_phi, bb);
> +      add_phi_arg (nphi, niters_skip, pe, UNKNOWN_LOCATION);
> +      add_phi_arg (nphi, build_zero_cst (ty_skip_niters),
> +                    loop_latch_edge (iv_loop), UNKNOWN_LOCATION);
> +
> +      LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo) = PHI_RESULT (nphi);
> +    }
> +

Hmm, I thought we adjust induction initial values with the PFA offset
already above?

> +  /* Write the init_stmts in the loop-preheader block.  */
> +  auto psi = gsi_last_nondebug_bb (pe->src);
> +  gsi_insert_seq_after (&psi, init_stmts, GSI_LAST_NEW_STMT);
> +  /* Write the adjustments in the loop header block.  */
> +  gsi_insert_seq_before (&si, stmts, GSI_SAME_STMT);
> +  tree induc_step_def
> +    = gimple_phi_arg_def_from_edge (phi, loop_latch_edge (iv_loop));
> +
> +  /* Set the arguments of the phi node:  */
> +  add_phi_arg (induction_phi, vec_init, pe, UNKNOWN_LOCATION);
> +  add_phi_arg (induction_phi, vec_def, loop_latch_edge (iv_loop),
> +            UNKNOWN_LOCATION);
> +
> +  /* If we've done any peeling, calculate the peeling adjustment needed to
> +     the final IV.  */
> +  if (LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo))
> +    {
> +      tree step_expr
> +     = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (stmt_info);
> +      tree break_lhs_phi
> +     = LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo);
> +      tree ty_skip_niters = TREE_TYPE (break_lhs_phi);
> +
> +      /* Now create the PHI for the outside loop usage to
> +      retrieve the value for the offset counter.  */
> +      tree rphi_step
> +     = gimple_convert (&iv_stmts, ty_skip_niters, step_expr);
> +      tree tmp2
> +     = gimple_build (&iv_stmts, MULT_EXPR,
> +                     ty_skip_niters, rphi_step,
> +                     break_lhs_phi);
> +
> +      induc_def = vect_build_plus_adjustment (&iv_stmts, TREE_TYPE (induc_def),
> +                                           induc_def, tmp2);
> +
> +    basic_block exit_bb = NULL;
> +    /* Identify the early exit merge block.  I wish we had stored this.  */
> +    for (auto e : get_loop_exit_edges (iv_loop))
> +      if (e != LOOP_VINFO_IV_EXIT (loop_vinfo))
> +     exit_bb = e->dest;
> +
> +    gcc_assert (exit_bb);
> +    auto exit_gsi = gsi_after_labels (exit_bb);
> +    gsi_insert_seq_before (&exit_gsi, iv_stmts, GSI_SAME_STMT);
> +  }
> +
> +  tree indec_vec_def = vec_def;
> +  tree recomp_induc_def = NULL_TREE;
> +  gimple_seq e_stmts = NULL;
> +  if (LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo))
> +    indec_vec_def = induc_def;
> +  else
> +    {
> +      /* When doing early-break we have to account for the situation where
> +      the loop structure is essentially:
> +
> +      x_1 = PHI<x, y>
> +      ...
> +      x_2 = x_1 + step
> +
> +     and the value returned in the latch exit is x_1 instead of x_2.   This
> +     happens a lot with Fortran because its arrays aren't 0 based.  We will
> +     generate the statements, but only emit them if they are needed.
> +
> +     We use niters_vector_mult_vf because helpers like
> +     vect_gen_vector_loop_niters_mult_vf have already calculated the correct
> +     number of vector iterations in this scenario which makes the adjustment
> +     easier.  If the starting index ends up being 0 then this is all folded
> +     away.  */

Are we confusing live inductions with inductions made live due to early
break splitting the loop (I guess a similar situation would occur with
epilogue vectorization for non-live inductions, but there the IV
peeling code does the update)?  How do live inductions get here?  If
returning x_1 the PHI should be *live_p and thus no early break scalar
IV created?

Or is this supposed to be general handling of vect_used_only_live
inductions (or, for nested cycles, vect_used_in_outer[_by_reduction])?

> +      tree niters_vf
> +     = gimple_convert (&e_stmts, scalar_type,
> +                       LOOP_VINFO_VECTOR_NITERS_VF (loop_vinfo));
> +      tree step = gimple_convert (&e_stmts, scalar_type, step_expr);
> +
> +      /* n_minus_1 = max(niters_vf - 1, 0) to be safe when niters_vf == 0.  */
> +      tree n_minus_1 = gimple_build (&e_stmts, MINUS_EXPR, scalar_type,
> +                                  niters_vf, build_one_cst (scalar_type));
> +
> +      /* delta = (niters_vf - 1) * step.   */
> +      tree delta = gimple_build (&e_stmts, MULT_EXPR, scalar_type, n_minus_1,
> +                              step);
> +
> +      /* j_exit = init_expr + (niters_vf - 1) * step.  */
> +      recomp_induc_def = vect_build_plus_adjustment (&e_stmts, scalar_type,
> +                                                  init_expr, delta);
> +    }
> +
> +  /* We have to dissolve the PHI back to an assignment since PHIs are always
> +     at the start of the block.  This is safe due to all early exits being
> +     pushed to the same block.  As such the PHI elements are all the same.  */
> +  dissolve_scalar_iv_phi_nodes (loop_vinfo, PHI_RESULT (phi), induc_def,
> +                             recomp_induc_def, e_stmts);

So we're basically inserting compensation code on the exit edge(s?).  
Doing it this way sounds a bit ugly.  The "usual" way would be to
transform

<want to modify _2>

# _1 = PHI <_2, _2>

into

# _3 = PHI <_2, _2>
modify _3
_1 = last modify stmt;

or replace all uses of _1 with the modify result.  Btw, you seem to
insert IVs and uses outside in the exit_bb but you don't add a LC
PHI node there?  So what PHIs are you dissolving here?  It's quite
hard to follow the code generation in this function.

> +
> +  /* Rewrite any usage of the latch iteration PHI if present.  */
> +  dissolve_scalar_iv_phi_nodes (loop_vinfo, induc_step_def, indec_vec_def,
> +                             NULL_TREE, e_stmts);
> +
> +  slp_node->push_vec_def (induction_phi);

Your SLP build already did this?  You are probably lucky that the
quick push done here succeeds.  Is this "vector def" actually
used somewhere?  If not I'd simply not populate it at all, but for
"scalar SLP instances" rely on SLP_TREE_SCALAR_STMTS.

> +
> +  if (dump_enabled_p ())
> +    dump_printf_loc (MSG_NOTE, vect_location,
> +                  "transform scalar induction: created def-use cycle: %G%T",
> +                  (gimple *) induction_phi, vec_def);
> +  return true;
> +}
> +
>  /* Function vectorizable_induction
>  
>     Check if STMT_INFO performs an induction computation that can be vectorized.
> @@ -9690,53 +10113,6 @@ vectorizable_induction (loop_vec_info loop_vinfo,
>                                  LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo));
>        peel_mul = gimple_build_vector_from_val (&init_stmts,
>                                              step_vectype, peel_mul);
> -
> -      /* If early break then we have to create a new PHI which we can use as
> -      an offset to adjust the induction reduction in early exits.
> -
> -      This is because when peeling for alignment using masking, the first
> -      few elements of the vector can be inactive.  As such if we find the
> -      entry in the first iteration we have adjust the starting point of
> -      the scalar code.
> -
> -      We do this by creating a new scalar PHI that keeps track of whether
> -      we are the first iteration of the loop (with the additional masking)
> -      or whether we have taken a loop iteration already.
> -
> -      The generated sequence:
> -
> -      pre-header:
> -        bb1:
> -          i_1 = <number of leading inactive elements>
> -
> -        header:
> -        bb2:
> -          i_2 = PHI <i_1(bb1), 0(latch)>
> -          …
> -
> -        early-exit:
> -        bb3:
> -          i_3 = iv_step * i_2 + PHI<vector-iv>
> -
> -      The first part of the adjustment to create i_1 and i_2 are done here
> -      and the last part creating i_3 is done in
> -      vectorizable_live_operations when the induction extraction is
> -      materialized.  */
> -      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> -       && !LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo))
> -     {
> -       auto skip_niters = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
> -       tree ty_skip_niters = TREE_TYPE (skip_niters);
> -       tree break_lhs_phi = vect_get_new_vect_var (ty_skip_niters,
> -                                                   vect_scalar_var,
> -                                                   "pfa_iv_offset");
> -       gphi *nphi = create_phi_node (break_lhs_phi, bb);
> -       add_phi_arg (nphi, skip_niters, pe, UNKNOWN_LOCATION);
> -       add_phi_arg (nphi, build_zero_cst (ty_skip_niters),
> -                    loop_latch_edge (iv_loop), UNKNOWN_LOCATION);
> -
> -       LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo) = PHI_RESULT (nphi);
> -     }

Hmm, I see - this was pre-existing.

>      }
>    tree step_mul = NULL_TREE;
>    unsigned ivn;
> @@ -10312,8 +10688,7 @@ vectorizable_live_operation (vec_info *vinfo, 
> stmt_vec_info stmt_info,
>                to the latch then we're restarting the iteration in the
>                scalar loop.  So get the first live value.  */
>             bool early_break_first_element_p
> -             = (all_exits_as_early_p || !main_exit_edge)
> -                && STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def;
> +             = all_exits_as_early_p || !main_exit_edge;
>             if (early_break_first_element_p)
>               {
>                 tmp_vec_lhs = vec_lhs0;
> @@ -10322,52 +10697,13 @@ vectorizable_live_operation (vec_info *vinfo, 
> stmt_vec_info stmt_info,
>  
>             gimple_stmt_iterator exit_gsi;
>             tree new_tree
> -             = vectorizable_live_operation_1 (loop_vinfo,
> -                                              e->dest, vectype,
> -                                              slp_node, bitsize,
> -                                              tmp_bitstart, tmp_vec_lhs,
> -                                              lhs_type, &exit_gsi);
> +               = vectorizable_live_operation_1 (loop_vinfo,
> +                                                e->dest, vectype,
> +                                                slp_node, bitsize,
> +                                                tmp_bitstart, tmp_vec_lhs,
> +                                                lhs_type, &exit_gsi);
>  
>             auto gsi = gsi_for_stmt (use_stmt);
> -           if (early_break_first_element_p
> -               && LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo))
> -             {
> -               tree step_expr
> -                 = STMT_VINFO_LOOP_PHI_EVOLUTION_PART (stmt_info);
> -               tree break_lhs_phi
> -                 = LOOP_VINFO_MASK_NITERS_PFA_OFFSET (loop_vinfo);
> -               tree ty_skip_niters = TREE_TYPE (break_lhs_phi);
> -               gimple_seq iv_stmts = NULL;
> -
> -               /* Now create the PHI for the outside loop usage to
> -                  retrieve the value for the offset counter.  */
> -               tree rphi_step
> -                 = gimple_convert (&iv_stmts, ty_skip_niters, step_expr);
> -               tree tmp2
> -                 = gimple_build (&iv_stmts, MULT_EXPR,
> -                                 ty_skip_niters, rphi_step,
> -                                 break_lhs_phi);
> -
> -               if (POINTER_TYPE_P (TREE_TYPE (new_tree)))
> -                 {
> -                   tmp2 = gimple_convert (&iv_stmts, sizetype, tmp2);
> -                   tmp2 = gimple_build (&iv_stmts, POINTER_PLUS_EXPR,
> -                                        TREE_TYPE (new_tree), new_tree,
> -                                        tmp2);
> -                 }
> -               else
> -                 {
> -                   tmp2 = gimple_convert (&iv_stmts, TREE_TYPE (new_tree),
> -                                          tmp2);
> -                   tmp2 = gimple_build (&iv_stmts, PLUS_EXPR,
> -                                        TREE_TYPE (new_tree), new_tree,
> -                                        tmp2);
> -                 }
> -
> -               new_tree = tmp2;
> -               gsi_insert_seq_before (&exit_gsi, iv_stmts, GSI_SAME_STMT);
> -             }
> -
>             tree lhs_phi = gimple_phi_result (use_stmt);
>             remove_phi_node (&gsi, false);
>             gimple *copy = gimple_build_assign (lhs_phi, new_tree);
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 9698709f5671971c35a50a16a258874beb44514a..e050f34d2578ed9168ff30ec02ca746c132a08fe 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -5620,6 +5620,44 @@ vect_analyze_slp (vec_info *vinfo, unsigned 
> max_tree_size,
>             }
>         }
>  
> +      /* Find and create slp instances for inductions that have been forced
> +      live due to early break.  */
> +      edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo));
> +      for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo))
> +       {
> +         gphi *phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info));
> +         tree def = gimple_phi_arg_def_from_edge (phi, latch_e);
> +
> +         slp_tree node = vect_create_new_slp_node (vNULL);
> +         SLP_TREE_VECTYPE (node) = NULL_TREE;
> +         SLP_TREE_LANES (node) = 1;
> +         SLP_TREE_DEF_TYPE (node) = vect_internal_def;
> +         SLP_TREE_VEC_DEFS (node).safe_push (def);
> +         SLP_TREE_REPRESENTATIVE (node) = stmt_info;

any reason you do not populate SLP_TREE_SCALAR_STMTS?  The PHI itself
is the actual value of lane 0, so it would be a perfect fit?  I
noticed this because of the hunk in vect_slp_analyze_operations
dealing with empty scalar-stmts (which is OK, we _can_ have no
stmts there - I had to work around issues with the reduction chain
code, but hit that elsewhere IIRC).

> +
> +         /* Create a new SLP instance.  */
> +         slp_instance new_instance = XNEW (class _slp_instance);
> +         SLP_INSTANCE_TREE (new_instance) = node;
> +         SLP_INSTANCE_LOADS (new_instance) = vNULL;
> +         SLP_INSTANCE_ROOT_STMTS (new_instance) = vNULL;
> +         SLP_INSTANCE_REMAIN_DEFS (new_instance) = vNULL;
> +         SLP_INSTANCE_KIND (new_instance) = slp_inst_kind_scalar_iv;
> +         new_instance->reduc_phis = NULL;
> +         new_instance->cost_vec = vNULL;
> +         new_instance->subgraph_entries = vNULL;
> +
> +         vinfo->slp_instances.safe_push (new_instance);
> +
> +         if (dump_enabled_p ())
> +           {
> +             dump_printf_loc (MSG_NOTE, vect_location,
> +                              "Final scalar def SLP tree for instance %p:\n",
> +                              (void *) new_instance);
> +             vect_print_slp_graph (MSG_NOTE, vect_location,
> +                                   SLP_INSTANCE_TREE (new_instance));
> +           }
> +       }
> +
>        /* Find SLP sequences starting from gconds.  */
>        for (auto cond : LOOP_VINFO_LOOP_CONDS (loop_vinfo))
>       {
> @@ -5664,48 +5702,6 @@ vect_analyze_slp (vec_info *vinfo, unsigned 
> max_tree_size,
>                                            "SLP build failed.\n");
>           }
>       }
> -
> -     /* Find and create slp instances for inductions that have been forced
> -        live due to early break.  */
> -     edge latch_e = loop_latch_edge (LOOP_VINFO_LOOP (loop_vinfo));
> -     for (auto stmt_info : LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo))
> -       {
> -         vec<stmt_vec_info> stmts;
> -         vec<stmt_vec_info> roots = vNULL;
> -         vec<tree> remain = vNULL;
> -         gphi *phi = as_a<gphi *> (STMT_VINFO_STMT (stmt_info));
> -         tree def = gimple_phi_arg_def_from_edge (phi, latch_e);
> -         stmt_vec_info lc_info = loop_vinfo->lookup_def (def);
> -         if (lc_info)
> -           {
> -             stmts.create (1);
> -             stmts.quick_push (vect_stmt_to_vectorize (lc_info));
> -             if (! vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group,
> -                                            stmts, roots, remain,
> -                                            max_tree_size, &limit,
> -                                            bst_map, force_single_lane))
> -               return opt_result::failure_at (vect_location,
> -                                              "SLP build failed.\n");
> -           }
> -         /* When the latch def is from a different cycle this can only
> -            be a induction.  Build a simple instance for this.
> -            ???  We should be able to start discovery from the PHI
> -            for all inductions, but then there will be stray
> -            non-SLP stmts we choke on as needing non-SLP handling.  */
> -         auto_vec<stmt_vec_info, 1> tem;
> -         tem.quick_push (stmt_info);
> -         if (!bst_map->get (tem))
> -           {
> -             stmts.create (1);
> -             stmts.quick_push (stmt_info);
> -             if (! vect_build_slp_instance (vinfo, slp_inst_kind_reduc_group,
> -                                            stmts, roots, remain,
> -                                            max_tree_size, &limit,
> -                                            bst_map, force_single_lane))
> -               return opt_result::failure_at (vect_location,
> -                                              "SLP build failed.\n");
> -           }
> -       }
>      }
>  
>    hash_set<slp_tree> visited_patterns;
> @@ -8542,6 +8538,7 @@ vect_slp_analyze_node_operations (vec_info *vinfo, 
> slp_tree node,
>       insertion place.  */
>    if (res
>        && !seen_non_constant_child
> +      && SLP_INSTANCE_KIND (node_instance) != slp_inst_kind_scalar_iv
>        && SLP_TREE_SCALAR_STMTS (node).is_empty ())
>      {
>        if (dump_enabled_p ())
> @@ -8986,8 +8983,10 @@ vect_slp_analyze_operations (vec_info *vinfo)
>         stmt_vec_info stmt_info;
>         if (!SLP_INSTANCE_ROOT_STMTS (instance).is_empty ())
>           stmt_info = SLP_INSTANCE_ROOT_STMTS (instance)[0];
> -       else
> +       else if (!SLP_TREE_SCALAR_STMTS (node).is_empty ())
>           stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
> +       else
> +         stmt_info = SLP_TREE_REPRESENTATIVE (node);
>         if (is_a <loop_vec_info> (vinfo))
>           {
>             if (dump_enabled_p ())
> @@ -11617,7 +11616,8 @@ vect_schedule_slp_node (vec_info *vinfo,
>  
>    stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
>  
> -  gcc_assert (SLP_TREE_VEC_DEFS (node).is_empty ());
> +  gcc_assert (SLP_TREE_VEC_DEFS (node).is_empty ()
> +           || SLP_INSTANCE_KIND (instance) == slp_inst_kind_scalar_iv);
>    if (SLP_TREE_VECTYPE (node))
>      SLP_TREE_VEC_DEFS (node).create (vect_get_num_copies (vinfo, node));
>  
> @@ -11636,7 +11636,8 @@ vect_schedule_slp_node (vec_info *vinfo,
>    else if (!SLP_TREE_PERMUTE_P (node)
>          && (SLP_TREE_TYPE (node) == cycle_phi_info_type
>              || SLP_TREE_TYPE (node) == induc_vec_info_type
> -            || SLP_TREE_TYPE (node) == phi_info_type))
> +            || SLP_TREE_TYPE (node) == phi_info_type
> +            || SLP_TREE_TYPE (node) == scalar_iv_info_type))
>      {
>        /* For PHI node vectorization we do not use the insertion iterator.  */
>        si = gsi_none ();
> @@ -12024,7 +12025,8 @@ vect_schedule_scc (vec_info *vinfo, slp_tree node, 
> slp_instance instance,
>    maxdfs++;
>  
>    /* Leaf.  */
> -  if (SLP_TREE_DEF_TYPE (node) != vect_internal_def)
> +  if (SLP_TREE_DEF_TYPE (node) != vect_internal_def
> +      || SLP_INSTANCE_KIND (instance) == slp_inst_kind_scalar_iv)
>      {
>        info->on_stack = false;
>        vect_schedule_slp_node (vinfo, node, instance);
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 83acbb3ff67ccdd4a39606850a23f483d6a4b1fb..5bddd7f37da4ea1048998b2c82ed464aa10a6730 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -435,7 +435,7 @@ vect_stmt_relevant_p (stmt_vec_info stmt_info, 
> loop_vec_info loop_vinfo,
>                        "vec_stmt_relevant_p: PHI forced live for "
>                        "early break.\n");
>        LOOP_VINFO_EARLY_BREAKS_LIVE_IVS (loop_vinfo).safe_push (stmt_info);
> -      *live_p = true;
> +      return true;
>      }
>  
>    if (*live_p && *relevant == vect_unused_in_scope
> @@ -12750,17 +12750,12 @@ can_vectorize_live_stmts (vec_info *vinfo,
>                         bool vec_stmt_p,
>                         stmt_vector_for_cost *cost_vec)
>  {
> -  loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
>    stmt_vec_info slp_stmt_info;
>    unsigned int i;
>    FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (slp_node), i, slp_stmt_info)
>      {
>        if (slp_stmt_info
> -       && (STMT_VINFO_LIVE_P (slp_stmt_info)
> -           || (loop_vinfo
> -               && LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
> -               && STMT_VINFO_DEF_TYPE (slp_stmt_info)
> -               == vect_induction_def))
> +       && STMT_VINFO_LIVE_P (slp_stmt_info)
>         && !vectorizable_live_operation (vinfo, slp_stmt_info, slp_node,
>                                          slp_node_instance, i,
>                                          vec_stmt_p, cost_vec))
> @@ -12796,6 +12791,21 @@ vect_analyze_stmt (vec_info *vinfo,
>                                    stmt_info->stmt);
>      }
>  
> +  /* Check if it's a scalar IV that we can codegen.  Scalar IVs aren't
> +     forced live as we don't want the vectorizer to analyze it as we don't
> +     set e.g. vectype and we don't want it to be used in determining VF.  */
> +  if (SLP_INSTANCE_KIND (node_instance) == slp_inst_kind_scalar_iv
> +      && is_a <loop_vec_info> (vinfo))
> +    {
> +      if (!vectorizable_scalar_induction (as_a <loop_vec_info> (vinfo),
> +                                       stmt_info, node, cost_vec))
> +     return opt_result::failure_at (stmt_info->stmt,
> +                                    "not vectorized:"
> +                                    " scalar IV not supported: %G",
> +                                    stmt_info->stmt);
> +      return opt_result::success ();
> +    }
> +

Given we're keying off the whole instance I'd prefer if you check
this in vect_slp_analyze_operations in the iteration over SLP instances.
It feels more at home there.  It also makes it obvious that this SLP
instance forms a completely separate subgraph (no side-entries, etc.).
This should save some special casing you put in elsewhere - looking
at the SLP instance kind in general is unreliable due to side-entries.

>    /* Skip stmts that do not need to be vectorized.  */
>    if (!STMT_VINFO_RELEVANT_P (stmt_info)
>        && !STMT_VINFO_LIVE_P (stmt_info))
> @@ -12852,6 +12862,7 @@ vect_analyze_stmt (vec_info *vinfo,
>        gcall *call = dyn_cast <gcall *> (stmt_info->stmt);
>        gcc_assert (SLP_TREE_VECTYPE (node)
>                 || gimple_code (stmt_info->stmt) == GIMPLE_COND
> +               || SLP_INSTANCE_KIND (node_instance) == 
> slp_inst_kind_scalar_iv

given the early outs above, this should never trigger?

>                 || (call && gimple_call_lhs (call) == NULL_TREE));
>      }
>  
> @@ -13031,6 +13042,12 @@ vect_transform_stmt (vec_info *vinfo,
>        gcc_assert (done);
>        break;
>  
> +    case scalar_iv_info_type:
> +      done = vectorizable_scalar_induction (as_a <loop_vec_info> (vinfo),
> +                                         stmt_info, slp_node, NULL);
> +      gcc_assert (done);
> +      break;
> +
>      case permute_info_type:
>        done = vectorizable_slp_permutation (vinfo, gsi, slp_node, NULL);
>        gcc_assert (done);
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 905a29142d3eb8077ab9fb29b3cceb04834848fe..d0aabebccdf0e308999d39378ebd0c1b503b50f7 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -243,6 +243,7 @@ enum stmt_vec_info_type {
>    phi_info_type,
>    recurr_info_type,
>    loop_exit_ctrl_vec_info_type,
> +  scalar_iv_info_type,
>    permute_info_type
>  };
>  
> @@ -392,7 +393,8 @@ enum slp_instance_kind {
>      slp_inst_kind_reduc_chain,
>      slp_inst_kind_bb_reduc,
>      slp_inst_kind_ctor,
> -    slp_inst_kind_gcond
> +    slp_inst_kind_gcond,
> +    slp_inst_kind_scalar_iv
>  };
>  
>  /* SLP instance is a sequence of stmts in a loop that can be packed into
> @@ -1236,6 +1238,10 @@ public:
>       happen.  */
>    auto_vec<gimple*> early_break_vuses;
>  
> +  /* The number of scalar iterations performed as vector in the case the loop
> +     exits from the main exit block.  This can be an SSA name or a constant.  */
> +  tree niters_vector_mult_vf;
> +
>    /* Record statements that are needed to be live for early break vectorization
>       but may not have an LC PHI node materialized yet in the exits.  */
>    auto_vec<stmt_vec_info> early_break_live_ivs;
> @@ -1306,6 +1312,7 @@ public:
>    (L)->early_break_live_ivs
>  #define LOOP_VINFO_EARLY_BRK_DEST_BB(L)    (L)->early_break_dest_bb
>  #define LOOP_VINFO_EARLY_BRK_VUSES(L)      (L)->early_break_vuses
> +#define LOOP_VINFO_VECTOR_NITERS_VF(L)     (L)->niters_vector_mult_vf
>  #define LOOP_VINFO_LOOP_CONDS(L)           (L)->conds
>  #define LOOP_VINFO_LOOP_IV_COND(L)         (L)->loop_iv_cond
>  #define LOOP_VINFO_NO_DATA_DEPENDENCIES(L) (L)->no_data_dependencies
> @@ -2705,6 +2712,8 @@ extern bool vectorizable_recurr (loop_vec_info, 
> stmt_vec_info,
>  extern bool vectorizable_early_exit (loop_vec_info, stmt_vec_info,
>                                    gimple_stmt_iterator *,
>                                    slp_tree, stmt_vector_for_cost *);
> +extern bool vectorizable_scalar_induction (loop_vec_info, stmt_vec_info,
> +                                        slp_tree, stmt_vector_for_cost *);
>  extern bool vect_emulated_vector_p (tree);
>  extern bool vect_can_vectorize_without_simd_p (tree_code);
>  extern bool vect_can_vectorize_without_simd_p (code_helper);
> 
> 
> 

-- 
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
