On Fri, 30 Oct 2020, Richard Sandiford wrote: > Richard Biener <rguent...@suse.de> writes: > > On Wed, 28 Oct 2020, Christophe Lyon wrote: > > > >> On Wed, 28 Oct 2020 at 11:27, Christophe Lyon > >> <christophe.l...@linaro.org> wrote: > >> > > >> > On Tue, 27 Oct 2020 at 13:18, Richard Biener <rguent...@suse.de> wrote: > >> > > > >> > > This makes SLP discovery detect backedges by seeding the bst_map with > >> > > the node to be analyzed so it can be picked up from recursive calls. > >> > > This removes the need to discover backedges in a separate walk. > >> > > > >> > > This enables SLP build to handle PHI nodes in full, continuing > >> > > the SLP build to non-backedges. For loop vectorization this > >> > > enables outer loop vectorization of nested SLP cycles and for > >> > > BB vectorization this enables vectorization of PHIs at CFG merges. > >> > > > >> > > It also turns code generation into a SCC discovery walk to handle > >> > > irreducible regions and nodes only reachable via backedges where > >> > > we now also fill in vectorized backedge defs. > >> > > > >> > > This requires sanitizing the SLP tree for SLP reduction chains even > >> > > more, manually filling the backedge SLP def. > >> > > > >> > > This also exposes the fact that CFG copying (and edge splitting > >> > > until I fixed that) ends up with different edge order in the > >> > > copy which doesn't play well with the desired 1:1 mapping of > >> > > SLP PHI node children and edges for epilogue vectorization. > >> > > I've tried to fixup CFG copying here but this really looks > >> > > like a dead (or expensive) end there so I've done fixup in > >> > > slpeel_tree_duplicate_loop_to_edge_cfg instead for the cases > >> > > we can run into. > >> > > > >> > > There's still NULLs in the SLP_TREE_CHILDREN vectors and I'm > >> > > not sure it's possible to eliminate them all this stage1 so the > >> > > patch has quite some checks for this case all over the place. > >> > > > >> > > Bootstrapped and tested on x86_64-unknown-linux-gnu. SPEC CPU 2017 > >> > > and SPEC CPU 2006 successfully built and tested. > >> > > > >> > > Will push soon. > >> > > > >> > > Richard. > >> > > > >> > > 2020-10-27 Richard Biener <rguent...@suse.de> > >> > > > >> > > * gimple.h (gimple_expr_type): For PHIs return the type > >> > > of the result. > >> > > * tree-vect-loop-manip.c > >> > > (slpeel_tree_duplicate_loop_to_edge_cfg): > >> > > Make sure edge order into copied loop headers line up with the > >> > > originals. > >> > > * tree-vect-loop.c (vect_transform_cycle_phi): Handle nested > >> > > loops with SLP. > >> > > (vectorizable_phi): New function. > >> > > (vectorizable_live_operation): For BB vectorization compute > >> > > insert > >> > > location here. > >> > > * tree-vect-slp.c (vect_free_slp_tree): Deal with NULL > >> > > SLP_TREE_CHILDREN entries. > >> > > (vect_create_new_slp_node): Add overloads with pre-existing > >> > > node > >> > > argument. > >> > > (vect_print_slp_graph): Likewise. > >> > > (vect_mark_slp_stmts): Likewise. > >> > > (vect_mark_slp_stmts_relevant): Likewise. > >> > > (vect_gather_slp_loads): Likewise. > >> > > (vect_optimize_slp): Likewise. > >> > > (vect_slp_analyze_node_operations): Likewise. > >> > > (vect_bb_slp_scalar_cost): Likewise. > >> > > (vect_remove_slp_scalar_calls): Likewise. > >> > > (vect_get_and_check_slp_defs): Handle PHIs. > >> > > (vect_build_slp_tree_1): Handle PHIs. > >> > > (vect_build_slp_tree_2): Continue SLP build, following PHI > >> > > arguments. Fix memory leak. > >> > > (vect_build_slp_tree): Put stub node into the hash-map so > >> > > we can discover cycles directly. > >> > > (vect_build_slp_instance): Set the backedge SLP def for > >> > > reduction chains. > >> > > (vect_analyze_slp_backedges): Remove. > >> > > (vect_analyze_slp): Do not call it. > >> > > (vect_slp_convert_to_external): Release > >> > > SLP_TREE_LOAD_PERMUTATION. > >> > > (vect_slp_analyze_node_operations): Handle stray failed > >> > > backedge defs by failing. > >> > > (vect_slp_build_vertices): Adjust leaf condition. > >> > > (vect_bb_slp_mark_live_stmts): Handle PHIs, use visited > >> > > hash-set to handle cycles. > >> > > (vect_slp_analyze_operations): Adjust. > >> > > (vect_bb_partition_graph_r): Likewise. > >> > > (vect_slp_function): Adjust split condition to allow CFG > >> > > merges. > >> > > (vect_schedule_slp_instance): Rename to ... > >> > > (vect_schedule_slp_node): ... this. Move DFS walk to ... > >> > > (vect_schedule_scc): ... this new function. > >> > > (vect_schedule_slp): Call it. Remove ad-hoc vectorized > >> > > backedge fill code. > >> > > * tree-vect-stmts.c (vect_analyze_stmt): Call > >> > > vectorizable_phi. > >> > > (vect_transform_stmt): Likewise. > >> > > (vect_is_simple_use): Handle vect_backedge_def. > >> > > * tree-vectorizer.c (vec_info::new_stmt_vec_info): Only > >> > > set loop header PHIs to vect_unknown_def_type for loop > >> > > vectorization. > >> > > * tree-vectorizer.h (enum vect_def_type): Add > >> > > vect_backedge_def. > >> > > (enum stmt_vec_info_type): Add phi_info_type. > >> > > (vectorizable_phi): Declare. > >> > > > >> > > * gcc.dg/vect/bb-slp-54.c: New test. > >> > > * gcc.dg/vect/bb-slp-55.c: Likewise. > >> > > * gcc.dg/vect/bb-slp-56.c: Likewise. > >> > > * gcc.dg/vect/bb-slp-57.c: Likewise. > >> > > * gcc.dg/vect/bb-slp-58.c: Likewise. > >> > > * gcc.dg/vect/bb-slp-59.c: Likewise. > >> > > * gcc.dg/vect/bb-slp-60.c: Likewise. > >> > > * gcc.dg/vect/bb-slp-61.c: Likewise. > >> > > * gcc.dg/vect/bb-slp-62.c: Likewise. > >> > > * gcc.dg/vect/bb-slp-63.c: Likewise. > >> > > * gcc.dg/vect/bb-slp-64.c: Likewise. > >> > > * gcc.dg/vect/bb-slp-65.c: Likewise. > >> > > * gcc.dg/vect/bb-slp-66.c: Likewise. > >> > > * gcc.dg/vect/vect-outer-slp-1.c: Likewise. > >> > > * gfortran.dg/vect/O3-bb-slp-1.f: Likewise. > >> > > * gfortran.dg/vect/O3-bb-slp-2.f: Likewise. > >> > > * g++.dg/vect/simd-11.cc: Likewise. > >> > > >> > > >> > I've filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97616 > >> > because bb-slp-58.c and bb-slp-59.c fail on arm. > >> > > >> > >> And it causes a regression on aarch64: > >> FAIL: gcc.target/aarch64/sve/reduc_strict_4.c -march=armv8.2-a+sve > >> scan-assembler-times \\tfadda\\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\\.d > >> 4 > > > > I'm testing a fix for this one. > > > >> FAIL: gcc.target/aarch64/sve/reduc_strict_5.c -march=armv8.2-a+sve > >> scan-assembler-times \\tfadda\\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\\.d > >> 6 > > > > That one is more odd. For some reason we arrive at > > > > /home/rguenther/src/trunk/gcc/testsuite/gcc.target/aarch64/sve/reduc_strict_5.c:10:21: > > > > note: vect_is_simple_use: operand 0.0, type of def: constant > > /home/rguenther/src/trunk/gcc/testsuite/gcc.target/aarch64/sve/reduc_strict_5.c:10:21: > > > > missed: Build SLP failed: invalid type of def for variable-length SLP > > 0.0 > > > > we previously didn't analyze the initial value (we will later not > > build a SLP node for it but the defs are still analyzed by > > vect_get_and_check_slp_defs which I can fix). Sill I want to > > know why we cannot create a vector of all zeros. The check is > > > > if ((dt == vect_constant_def > > || dt == vect_external_def) > > && !GET_MODE_SIZE (vinfo->vector_mode).is_constant () > > && (TREE_CODE (type) == BOOLEAN_TYPE > > || !can_duplicate_and_interleave_p (vinfo, stmts.length > > (), > > type))) > > { > > if (dump_enabled_p ()) > > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > > "Build SLP failed: invalid type of def " > > "for variable-length SLP %T\n", oprnd); > > return -1; > > } > > > > so the reason must be the group size (12)? It succeeds with 8. > > But then the check looks totally premature at the above point > > because for sure an all-same constant can be always duplicated? > > > > So - what breaks if we simply remove the above? Richard? > > Well, I think there are two issues: > > - Should all-same constants be OK? Agree they should, but it would > involve extending the check and making a promise that the values > won't change later (and so invalidate the analysis). > > - Do we need to check a condition at all? If we don't, > duplicate_and_interleave will ICE when trying to build invalid > variable-length constant or external definitions. So we need > to check it somewhere, even if it's not here.
OK, I assumed it was present elsewhere already (for non-SLP vect). Doing the check at its current place makes sense if the "error" can be mitigated by swapping [parent] operands or by splitting the SLP graph at other lanes. Possibly the latter could help here, but it's a conflict with a more precise check. A more precise check would be where we call vectorizable_* which I admit doesn't yet exist for constants / invariants which we simply assume can always generate. There's similar "premature" checking for vect vs. vect and vect vs. scalar shift capabilities in vect_build_slp_tree_1 which there helps to eventually shuffle the correct stmts into better matching operand positions ... So - it would probably help to "duplicate" the check to vect_slp_analyze_node_operations (where we call vectorizable_*) because we do generate externals in other places where we do _not_ have this kind of check (vect_slp_convert_to_external and several places in vect_build_slp_tree_2 where we create the node from scalar pieces). Richard. > Thanks, > Richard > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany; GF: Felix Imend