Hi, I've been working on a prototype of moving early break to SLP.

As we've discussed on IRC, I've decided to first try adding the gconds as roots and starting SLP discovery from them. This works great and doesn't require any changes to build_slp. It also has the additional benefit that we can easily (as a follow-up) add groups of gconds, try to SLP the roots together if the operations are the same, and decompose the tree based on the roots if not. So it looks like using roots is the best approach.

However, I've hit some issues that I could solve, but doing so would require me to modify large chunks of code, so I would like your input before I start.

1. Roots are currently not analyzed or code-generated through vectorizable_*. This looks to be because the only things used as roots so far are things that all targets support (like constructors) or that will be lowered by veclower later. This is easy to fix: I can work roots into the analysis part in vect_slp_analyze_node_operations and pass enough information to vectorize_slp_instance_root_stmt to be able to use vectorizable_early_exit. I have a prototype of this currently working, but it's a hack and I'd need to do it properly if that's the way you'd like to go.

2.
Consider the loop:

#ifndef N
#define N 800
#endif
unsigned vect_a[N];
unsigned vect_b[N];

unsigned test4(unsigned x)
{
  unsigned ret = 0;
  for (int i = 0; i < N; i++)
    {
      vect_b[i] = x + i;
      if (vect_a[i]*2 != x)
        break;
      vect_a[i] = x;
    }
  return ret;
}

The build part looks like:

note: === vect_analyze_slp ===
note: Analyzing vectorizable control flow: if (patt_6 != 0)
note: Starting SLP discovery for
note: patt_6 = _4 != x_9(D);
note: starting SLP discovery for node 0x5141280
note: Build SLP for patt_6 = _4 != x_9(D);
note: precomputed vectype: vector(4) <signed-boolean:32>
note: nunits = 4
note: vect_is_simple_use: operand x_9(D), type of def: external
note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xfffffffe VALUE 0x0 _3 * 2, type of def: internal
note: starting SLP discovery for node 0x51413a0
note: Build SLP for _4 = _3 * 2;
note: precomputed vectype: vector(4) unsigned int
note: nunits = 4
note: vect_is_simple_use: operand # VUSE <.MEM_10> vect_aD.4416[i_15], type of def: internal
note: vect_is_simple_use: operand 2, type of def: constant
note: vect_is_simple_use: operand # VUSE <.MEM_10> vect_aD.4416[i_15], type of def: internal
note: vect_is_simple_use: operand 2, type of def: constant
note: starting SLP discovery for node 0x5141430
note: Build SLP for _3 = vect_a[i_15];
note: precomputed vectype: vector(4) unsigned int
note: nunits = 4
note: SLP discovery for node 0x5141430 succeeded
note: SLP discovery for node 0x51413a0 succeeded
note: SLP discovery for node 0x5141280 succeeded
note: SLP size 3 vs. limit 10.
note: Final SLP tree for instance 0x5208e30:
note: node 0x5141280 (max_nunits=4, refcnt=2) vector(4) <signed-boolean:32>
note: op template: patt_6 = _4 != x_9(D);
note: stmt 0 patt_6 = _4 != x_9(D);
note: children 0x5141310 0x51413a0
note: node (external) 0x5141310 (max_nunits=1, refcnt=1)
note: { x_9(D) }
note: node 0x51413a0 (max_nunits=4, refcnt=2) vector(4) unsigned int
note: op template: _4 = _3 * 2;
note: stmt 0 _4 = _3 * 2;
note: children 0x5141430 0x51414c0
note: node 0x5141430 (max_nunits=4, refcnt=2) vector(4) unsigned int
note: op template: _3 = vect_a[i_15];
note: stmt 0 _3 = vect_a[i_15];
note: load permutation { 0 }
note: node (constant) 0x51414c0 (max_nunits=1, refcnt=1)
note: { 2 }

and codegen:

note: ------>vectorizing statement: patt_6 = _4 != x_9(D);
note: transform statement.
note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xfffffffe VALUE 0x0 _3 * 2, type of def: internal
note: vect_is_simple_use: vectype vector(4) unsigned int
note: vect_is_simple_use: operand x_9(D), type of def: external
note: vect_get_vec_defs_for_operand: _4
note: vect_is_simple_use: operand # RANGE [irange] unsigned int [0, 0][2, +INF] MASK 0xfffffffe VALUE 0x0 _3 * 2, type of def: internal
note: def_stmt = _4 = _3 * 2;
note: vect_get_vec_defs_for_operand: x_9(D)
note: vect_is_simple_use: operand x_9(D), type of def: external
note: created new init_stmt: vect_cst__72 = {x_9(D), x_9(D), x_9(D), x_9(D)};
note: add new stmt: mask_patt_6.25_73 = vect__4.24_71 != vect_cst__72;
note: ------>vectorizing statement: if (patt_6 != 0)
note: transform statement.
note: === vectorizable_early_exit ===
note: vect_is_simple_use: operand _4 != x_9(D), type of def: internal
note: vect_is_simple_use: vectype vector(4) <signed-boolean:32>
note: transform early-exit.
note: vect_is_simple_use: operand _4 != x_9(D), type of def: internal
note: vect_is_simple_use: vectype vector(4) <signed-boolean:32>
note: vect_is_simple_use: operand 0, type of def: constant
note: vect_get_vec_defs_for_operand: patt_6
note: vect_is_simple_use: operand _4 != x_9(D), type of def: internal
note: def_stmt = patt_6 = _4 != x_9(D);
note: vect_get_vec_defs_for_operand: 0
note: vect_is_simple_use: operand 0, type of def: constant
note: created new init_stmt: vect_cst__74 = { 0, 0, 0, 0 };
note: add new stmt: cmp_75 = mask_patt_6.25_73 ^ vect_cst__74;

So far so good. However, things go wrong during the SLP vect_detect_hybrid_slp analysis:

note: === vect_update_vf_for_slp ===
note: Loop contains SLP and non-SLP stmts
note: Updating vectorization factor to 4.
note: vectorization_factor = 4, niters = 800

This has a couple of reasons:

1. The stores are non-grouped stores and so are never considered for SLP. For now I've worked around this by doing the following during vect_analyze_slp:

  /* Find SLP sequences starting from non-grouped stores.  */
  for (auto dr : LOOP_VINFO_DATAREFS (vinfo))
    if (DR_IS_WRITE (dr))
      {
        stmt_vec_info dr_info = vinfo->lookup_stmt (DR_STMT (dr));
        if (!dr_info)
          continue;
        vect_analyze_slp_instance (vinfo, bst_map, dr_info,
                                   slp_inst_kind_store, max_tree_size,
                                   &limit);
      }

So it follows single-lane stores. But I'm not sure I understand why this is needed. I thought that your earlier work to transition to SLP-only would have already covered single-stream stores. The above works, but I am unsure if that's the best solution, or if I'm missing something.

2.
The second part that goes wrong is that, due to the same IV being used by the early exit and the main exit, the main exit is now pulled into analysis:

note: === vect_detect_hybrid_slp ===
note: Processing hybrid candidate : ivtmp_14 = ivtmp_7 - 1;
note: Found loop_vect use: if (ivtmp_14 != 0)
note: Processing hybrid candidate : i_12 = i_15 + 1;
note: Marked SLP consumed stmt pure: i_12 = i_15 + 1;
note: Processing hybrid candidate : ivtmp_7 = PHI <ivtmp_14(6), 800(2)>
note: Found loop_vect use: ivtmp_14 = ivtmp_7 - 1;
note: Processing hybrid candidate : if (patt_6 != 0)
note: Found loop_vect sink: if (patt_6 != 0)
note: marking hybrid: patt_6 = _4 != x_9(D);
note: marking hybrid: _4 = _3 * 2;
note: marking hybrid: _3 = vect_a[i_15];
note: marking hybrid: i_15 = PHI <i_12(6), 0(2)>
note: marking hybrid: i_12 = i_15 + 1;

Is the solution here that I treat LOOP_VINFO_IV_EXIT as a sink as well, and forcibly ignore it? I think this would match what the analysis code later does:

note: ==> examining statement: if (ivtmp_14 != 0)
note: irrelevant.

This is the part I'm having the most trouble with. Today I believe we never analyse the main loop exit because nothing pulls it into the analysis.

3. I believe I also need to analyse roots during VF determination, i.e. vect_determine_vectorization_factor shows:

note: ==> examining statement: if (_4 != x_9(D))
note: skip.
note: ==> examining pattern def stmt: patt_17 = _4 != x_9(D);
note: precomputed vectype: vector(2) <signed-boolean:32>
note: nunits = 2

which does not seem right.

Thanks,
Tamar
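
P.S. In case it helps with reproducing this: the property the vectorized loop has to preserve is that the store to vect_b still happens in the iteration that breaks, while the store to vect_a does not. Below is a rough standalone check of that, written as a sketch only. check_break_at, the choice of x = 8 and the fill values are made up for illustration and are not part of the prototype or the testsuite.

```c
#include <string.h>

#ifndef N
#define N 800
#endif

unsigned vect_a[N];
unsigned vect_b[N];

/* The loop from the mail, unchanged.  */
unsigned test4 (unsigned x)
{
  unsigned ret = 0;
  for (int i = 0; i < N; i++)
    {
      vect_b[i] = x + i;
      if (vect_a[i] * 2 != x)
        break;
      vect_a[i] = x;
    }
  return ret;
}

/* Hypothetical helper, not part of the prototype: arrange for the break
   to fire at index STOP (or never, if STOP >= N), run test4 and return 1
   if both arrays hold what the scalar loop must produce.  STOP is assumed
   to be non-negative.  */
int check_break_at (int stop)
{
  const unsigned x = 8;

  memset (vect_b, 0, sizeof vect_b);
  for (int i = 0; i < N; i++)
    vect_a[i] = x / 2;          /* vect_a[i] * 2 == x, so no break.  */
  if (stop < N)
    vect_a[stop] = x / 2 + 1;   /* vect_a[stop] * 2 != x, so break here.  */

  if (test4 (x) != 0)
    return 0;

  for (int i = 0; i < N; i++)
    {
      /* The store to vect_b precedes the break, so it also happens in the
         breaking iteration; the store to vect_a follows it and does not.  */
      unsigned exp_b = i <= stop ? x + i : 0;
      unsigned exp_a = i < stop ? x : (i == stop ? x / 2 + 1 : x / 2);
      if (vect_b[i] != exp_b || vect_a[i] != exp_a)
        return 0;
    }
  return 1;
}
```

Calling check_break_at with 0, N - 1 and N (no break at all) from a small main covers the first-iteration, last-iteration and main-exit cases.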