[RFC] Support single lane SLP early break

Tamar Christina Tue, 20 Aug 2024 03:54:07 -0700

Hi,

I've been working on a prototype of moving early break to SLP.


As we discussed on IRC, I decided to first try adding the gconds as SLP
instance roots and starting SLP discovery from them.

This works great and doesn't require any changes to build_slp.  It also has the
additional benefit that we can easily (as a follow-up) add groups of gconds,
try to SLP the roots together if the operations are the same, and decompose
the tree based on the roots if not.

So it looks like using the roots is the best approach.  However, I've hit some
issues that I could solve, but doing so would require me to modify large chunks
of code, and I would like your input before I start.

1. Roots are currently not analyzed or code-gened through vectorizable_*.
   This looks to be because the only things used as roots so far are things
   that all targets support (like constructors) or that will be lowered by
   veclower later.  This is easy to fix: I can work roots into the analysis
   part in vect_slp_analyze_node_operations and pass enough information to
   vectorize_slp_instance_root_stmt to be able to use vectorizable_early_exit.
   I have a prototype of this currently working, but it's a hack and I need to
   do it properly if it's the way you'd like to go.

2. Consider the loop:

#ifndef N
#define N 800
#endif
unsigned vect_a[N];
unsigned vect_b[N];

unsigned test4(unsigned x)
{
 unsigned ret = 0;
 for (int i = 0; i < N; i++)
 {
   vect_b[i] = x + i;
   if (vect_a[i]*2 != x)
     break;
   vect_a[i] = x;

 }
 return ret;
}

The build part looks like:

note:   === vect_analyze_slp ===
note:   Analyzing vectorizable control flow: if (patt_6 != 0)
note:   Starting SLP discovery for
note:     patt_6 = _4 != x_9(D);
note:   starting SLP discovery for node 0x5141280
note:   Build SLP for patt_6 = _4 != x_9(D);
note:   precomputed vectype: vector(4) <signed-boolean:32>
note:   nunits = 4
note:   vect_is_simple_use: operand x_9(D), type of def: external
note:   vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xfffffffe VALUE 0x0
_3 * 2, type of def: internal
note:   starting SLP discovery for node 0x51413a0
note:   Build SLP for _4 = _3 * 2;
note:   precomputed vectype: vector(4) unsigned int
note:   nunits = 4
note:   vect_is_simple_use: operand # VUSE <.MEM_10>
vect_aD.4416[i_15], type of def: internal
note:   vect_is_simple_use: operand 2, type of def: constant
note:   vect_is_simple_use: operand # VUSE <.MEM_10>
vect_aD.4416[i_15], type of def: internal
note:   vect_is_simple_use: operand 2, type of def: constant
note:   starting SLP discovery for node 0x5141430
note:   Build SLP for _3 = vect_a[i_15];
note:   precomputed vectype: vector(4) unsigned int
note:   nunits = 4
note:   SLP discovery for node 0x5141430 succeeded
note:   SLP discovery for node 0x51413a0 succeeded
note:   SLP discovery for node 0x5141280 succeeded
note:   SLP size 3 vs. limit 10.
note:   Final SLP tree for instance 0x5208e30:
note:   node 0x5141280 (max_nunits=4, refcnt=2) vector(4) <signed-boolean:32>
note:   op template: patt_6 = _4 != x_9(D);
note:      stmt 0 patt_6 = _4 != x_9(D);
note:      children 0x5141310 0x51413a0
note:   node (external) 0x5141310 (max_nunits=1, refcnt=1)
note:      { x_9(D) }
note:   node 0x51413a0 (max_nunits=4, refcnt=2) vector(4) unsigned int
note:   op template: _4 = _3 * 2;
note:      stmt 0 _4 = _3 * 2;
note:      children 0x5141430 0x51414c0
note:   node 0x5141430 (max_nunits=4, refcnt=2) vector(4) unsigned int
note:   op template: _3 = vect_a[i_15];
note:      stmt 0 _3 = vect_a[i_15];
note:      load permutation { 0 }
note:   node (constant) 0x51414c0 (max_nunits=1, refcnt=1)
note:      { 2 }

and codegen:

note:  ------>vectorizing statement: patt_6 = _4 != x_9(D);
note:  transform statement.
note:  vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xfffffffe VALUE 0x0
       _3 * 2, type of def: internal
note:  vect_is_simple_use: vectype vector(4) unsigned int
note:  vect_is_simple_use: operand x_9(D), type of def: external
note:  vect_get_vec_defs_for_operand: _4
note:  vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xfffffffe VALUE 0x0
       _3 * 2, type of def: internal
note:    def_stmt =  _4 = _3 * 2;
note:  vect_get_vec_defs_for_operand: x_9(D)
note:  vect_is_simple_use: operand x_9(D), type of def: external
note:  created new init_stmt: vect_cst__72 = {x_9(D), x_9(D), x_9(D), x_9(D)};
note:  add new stmt: mask_patt_6.25_73 = vect__4.24_71 != vect_cst__72;
note:  ------>vectorizing statement: if (patt_6 != 0)
note:  transform statement.
note:   === vectorizable_early_exit ===
note:   vect_is_simple_use: operand _4 != x_9(D), type of def: internal
note:   vect_is_simple_use: vectype vector(4) <signed-boolean:32>
note:   transform early-exit.
note:   vect_is_simple_use: operand _4 != x_9(D), type of def: internal
note:   vect_is_simple_use: vectype vector(4) <signed-boolean:32>
note:   vect_is_simple_use: operand 0, type of def: constant
note:   vect_get_vec_defs_for_operand: patt_6
note:   vect_is_simple_use: operand _4 != x_9(D), type of def: internal
note:     def_stmt =  patt_6 = _4 != x_9(D);
note:   vect_get_vec_defs_for_operand: 0
note:   vect_is_simple_use: operand 0, type of def: constant
note:   created new init_stmt: vect_cst__74 = { 0, 0, 0, 0 };
note:   add new stmt: cmp_75 = mask_patt_6.25_73 ^ vect_cst__74;

So far so good.
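
For readers less familiar with early break vectorization, the shape of the
generated check can be approximated in plain C.  This is only a scalar
emulation under stated assumptions, not GCC output: the names scalar_loop,
blocked_loop and loops_agree are hypothetical, the arrays are passed as
parameters instead of being globals, and the functions return the exit index
(the original test4 always returns 0) so the two variants can be compared.
Each block of VF lanes evaluates the comparison for all lanes, reduces the
mask to a single "any lane exits" flag, and only commits the stores when no
lane wants to leave the loop.

```c
#include <assert.h>
#include <string.h>

#define N 800
#define VF 4   /* lanes per block, matching vector(4) in the dump */

/* Reference: the scalar loop from the example, returning the exit index.  */
static int
scalar_loop (unsigned *a, unsigned *b, unsigned x)
{
  int i;
  for (i = 0; i < N; i++)
    {
      b[i] = x + i;
      if (a[i] * 2 != x)
        break;
      a[i] = x;
    }
  return i;
}

/* Blocked emulation: test the exit condition for VF lanes at once and
   commit the block only when no lane wants to exit.  */
static int
blocked_loop (unsigned *a, unsigned *b, unsigned x)
{
  int i = 0;
  for (; i + VF <= N; i += VF)
    {
      unsigned any = 0;
      for (int l = 0; l < VF; l++)
        any |= (a[i + l] * 2 != x);   /* per-lane compare, OR-reduced */
      if (any)
        break;                        /* redo this block in scalar code */
      for (int l = 0; l < VF; l++)
        {
          b[i + l] = x + i + l;
          a[i + l] = x;
        }
    }
  /* Scalar tail: finds the exact exit iteration and its side effects.  */
  for (; i < N; i++)
    {
      b[i] = x + i;
      if (a[i] * 2 != x)
        break;
      a[i] = x;
    }
  return i;
}

/* Returns 1 if both variants agree on the exit index and on both arrays
   when the first mismatching element is at BREAK_AT (N means none).  */
static int
loops_agree (unsigned x, int break_at)
{
  static unsigned a1[N], b1[N], a2[N], b2[N];
  assert (x % 2 == 0);   /* so that (x / 2) * 2 == x for the other lanes */
  for (int i = 0; i < N; i++)
    {
      a1[i] = a2[i] = x / 2 + (i == break_at);
      b1[i] = b2[i] = 0;
    }
  int e1 = scalar_loop (a1, b1, x);
  int e2 = blocked_loop (a2, b2, x);
  return e1 == e2
         && memcmp (a1, a2, sizeof a1) == 0
         && memcmp (b1, b2, sizeof b1) == 0;
}
```

Calling loops_agree (10, 5), for example, places the first mismatch in the
second block, so the blocked variant leaves its main loop at i = 4 and the
scalar tail finishes the remaining iterations.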

However, things go wrong during the SLP vect_detect_hybrid_slp analysis:

note:   === vect_update_vf_for_slp ===
note:   Loop contains SLP and non-SLP stmts
note:   Updating vectorization factor to 4.
note:  vectorization_factor = 4, niters = 800

This has a couple of reasons:

1. The stores are non-grouped stores and so are never considered for SLP.

For now I've worked around this by adding the following to vect_analyze_slp:

/* Find SLP sequences starting from non-grouped stores.  */
for (auto dr : LOOP_VINFO_DATAREFS (vinfo))
  if (DR_IS_WRITE (dr))
    {
      stmt_vec_info dr_info = vinfo->lookup_stmt (DR_STMT (dr));
      if (!dr_info)
        continue;

      vect_analyze_slp_instance (vinfo, bst_map, dr_info,
                                 slp_inst_kind_store, max_tree_size,
                                 &limit);
    }

With this, SLP discovery follows single-lane stores.  But I'm not sure I
understand why this is needed: I thought your earlier work to transition to
SLP-only would have already covered single-stream stores.

The above works, but I am unsure whether it's the best solution or whether I'm
missing something.

2. The second thing that goes wrong is that, because the same IV is used by
    the early exit and the main exit, the main exit is now pulled into analysis:

note:   === vect_detect_hybrid_slp ===
note:   Processing hybrid candidate : ivtmp_14 = ivtmp_7 - 1;
note:   Found loop_vect use: if (ivtmp_14 != 0)
note:   Processing hybrid candidate : i_12 = i_15 + 1;
note:   Marked SLP consumed stmt pure: i_12 = i_15 + 1;
note:   Processing hybrid candidate : ivtmp_7 = PHI <ivtmp_14(6), 800(2)>
note:   Found loop_vect use: ivtmp_14 = ivtmp_7 - 1;
note:   Processing hybrid candidate : if (patt_6 != 0)
note:   Found loop_vect sink: if (patt_6 != 0)
note:   marking hybrid: patt_6 = _4 != x_9(D);
note:   marking hybrid: _4 = _3 * 2;
note:   marking hybrid: _3 = vect_a[i_15];
note:   marking hybrid: i_15 = PHI <i_12(6), 0(2)>
note:   marking hybrid: i_12 = i_15 + 1;

Is the solution here that I treat LOOP_VINFO_IV_EXIT as a sink as well, and
forcibly ignore it?

I think this would match what the analysis code later does:

note:   ==> examining statement: if (ivtmp_14 != 0)
note:   irrelevant.

This is the part I'm having the most trouble with.  Today I believe we never
analyse the main loop exit because nothing pulls it into the analysis.
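
The "marking hybrid" lines in the dump above can be mimicked with a small toy
model.  This is a simplified sketch, not GCC's vect_detect_hybrid_slp: struct
stmt, propagate, mark_hybrid and the skip parameter are all hypothetical, and
the pure_slp flags are an assumption read off the dump.  Every statement that
is not pure SLP acts as a sink, and any pure-SLP definition reachable from a
sink through operands is marked hybrid.  The skip parameter only lets one
experiment with ignoring a given sink; it does not claim to be the right fix.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_OPS 2

/* Hypothetical, simplified statement record (not GCC's stmt_vec_info).  */
struct stmt
{
  const char *text;
  int ops[MAX_OPS];   /* indices of the defining statements, -1 if none */
  bool pure_slp;      /* assumed to be covered by the SLP tree */
  bool hybrid;        /* result of mark_hybrid */
};

/* Statements taken from the dump; the flags are assumptions.  */
static struct stmt example[] = {
  { "i_15 = PHI <i_12(6), 0(2)>",         {  5, -1 }, true,  false },
  { "_3 = vect_a[i_15]",                  {  0, -1 }, true,  false },
  { "_4 = _3 * 2",                        {  1, -1 }, true,  false },
  { "patt_6 = _4 != x_9(D)",              {  2, -1 }, true,  false },
  { "if (patt_6 != 0)",                   {  3, -1 }, false, false },
  { "i_12 = i_15 + 1",                    {  0, -1 }, true,  false },
  { "ivtmp_7 = PHI <ivtmp_14(6), 800(2)>", {  7, -1 }, false, false },
  { "ivtmp_14 = ivtmp_7 - 1",             {  6, -1 }, false, false },
  { "if (ivtmp_14 != 0)",                 {  7, -1 }, false, false },
};
#define NSTMTS ((int) (sizeof example / sizeof example[0]))

/* Mark statement IDX and everything it depends on as hybrid.  The hybrid
   flag doubles as a visited marker, so PHI cycles terminate.  */
static void
propagate (struct stmt *s, int idx)
{
  if (idx < 0 || !s[idx].pure_slp || s[idx].hybrid)
    return;
  s[idx].hybrid = true;
  for (int k = 0; k < MAX_OPS; k++)
    propagate (s, s[idx].ops[k]);
}

/* Walk from every non-SLP statement except SKIP (-1 skips nothing) and
   return how many statements ended up hybrid.  */
static int
mark_hybrid (struct stmt *s, int n, int skip)
{
  int count = 0;
  assert (skip < n);
  for (int i = 0; i < n; i++)
    s[i].hybrid = false;
  for (int i = 0; i < n; i++)
    if (!s[i].pure_slp && i != skip)
      for (int k = 0; k < MAX_OPS; k++)
        propagate (s, s[i].ops[k]);
  for (int i = 0; i < n; i++)
    count += s[i].hybrid;
  return count;
}
```

With nothing skipped, mark_hybrid (example, NSTMTS, -1) marks the same five
statements as the dump: the compare, the multiply, the load and both halves
of the i_15/i_12 cycle.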

3. I believe I also need to analyse roots when determining the VF, i.e.
   vect_determine_vectorization_factor shows:

note:   ==> examining statement: if (_4 != x_9(D))
note:   skip.
note:   ==> examining pattern def stmt: patt_17 = _4 != x_9(D);
note:   precomputed vectype: vector(2) <signed-boolean:32>
note:   nunits = 2

which does not seem right.

Thanks,
Tamar
