On Tue, Aug 15, 2023 at 4:44 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>
> on 2023/8/14 22:16, Richard Sandiford wrote:
> > "Kewen.Lin" <li...@linux.ibm.com> writes:
> >> Hi Richard,
> >>
> >> on 2023/8/14 20:20, Richard Sandiford wrote:
> >>> Thanks for the clean-ups.  But...
> >>>
> >>> "Kewen.Lin" <li...@linux.ibm.com> writes:
> >>>> Hi,
> >>>>
> >>>> Following Richi's suggestion [1], this patch is to move the
> >>>> handlings on VMAT_GATHER_SCATTER in the final loop nest
> >>>> of function vectorizable_load to its own loop.  Basically
> >>>> it duplicates the final loop nest, clean up some useless
> >>>> set up code for the case of VMAT_GATHER_SCATTER, remove some
> >>>> unreachable code.  Also remove the corresponding handlings
> >>>> in the final loop nest.
> >>>>
> >>>> Bootstrapped and regtested on x86_64-redhat-linux,
> >>>> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
> >>>>
> >>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html
> >>>>
> >>>> Is it ok for trunk?
> >>>>
> >>>> BR,
> >>>> Kewen
> >>>> -----
> >>>>
> >>>> gcc/ChangeLog:
> >>>>
> >>>>   * tree-vect-stmts.cc (vectorizable_load): Move the handlings on
> >>>>   VMAT_GATHER_SCATTER in the final loop nest to its own loop,
> >>>>   and update the final nest accordingly.
> >>>> ---
> >>>>  gcc/tree-vect-stmts.cc | 361 +++++++++++++++++++++++++----------------
> >>>>  1 file changed, 219 insertions(+), 142 deletions(-)
> >>>
> >>> ...that seems like quite a lot of +s.  Is there nothing we can do to
> >>> avoid the cut-&-paste?
> >>
> >> Thanks for the comments! I'm not sure if I get your question, if we
> >> want to move out the handlings of VMAT_GATHER_SCATTER, the new +s seem
> >> inevitable? Your concern is mainly about git blame history?
> >
> > No, it was more that 219-142=77, so it seems like a lot of lines
> > are being duplicated rather than simply being moved.  (Unlike for
> > VMAT_LOAD_STORE_LANES, which was even a slight LOC saving, and so
> > was a clear improvement.)
> >
> > So I was just wondering if there was any obvious factoring-out that
> > could be done to reduce the duplication.
>
> ah, thanks for the clarification!
>
> I think the main duplication is in the loop body's beginning and end;
> let's take a look at them in detail:
>
> +  if (memory_access_type == VMAT_GATHER_SCATTER)
> +    {
> +      gcc_assert (alignment_support_scheme == dr_aligned
> +                  || alignment_support_scheme == dr_unaligned_supported);
> +      gcc_assert (!grouped_load && !slp_perm);
> +
> +      unsigned int inside_cost = 0, prologue_cost = 0;
>
> // These above are newly added.
>
> +      for (j = 0; j < ncopies; j++)
> +        {
> +          /* 1. Create the vector or array pointer update chain.  */
> +          if (j == 0 && !costing_p)
> +            {
> +              if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> +                vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
> +                                             slp_node, &gs_info, &dataref_ptr,
> +                                             &vec_offsets);
> +              else
> +                dataref_ptr
> +                  = vect_create_data_ref_ptr (vinfo, first_stmt_info, aggr_type,
> +                                              at_loop, offset, &dummy, gsi,
> +                                              &ptr_incr, false, bump);
> +            }
> +          else if (!costing_p)
> +            {
> +              gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
> +              if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> +                dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
> +                                               gsi, stmt_info, bump);
> +            }
>
> // These are for dataref_ptr; in the final loop nest we deal with more cases
> on simd_lane_access_p and diff_first_stmt_info, but don't handle
> STMT_VINFO_GATHER_SCATTER_P any more.  Very little (one case) can be shared
> between them, so IMHO factoring it out seems like overkill.
>
> +
> +          if (mask && !costing_p)
> +            vec_mask = vec_masks[j];
>
> // This is merged out from the j == 0 and j != 0 cases.
>
> +
> +          gimple *new_stmt = NULL;
> +          for (i = 0; i < vec_num; i++)
> +            {
> +              tree final_mask = NULL_TREE;
> +              tree final_len = NULL_TREE;
> +              tree bias = NULL_TREE;
> +              if (!costing_p)
> +                {
> +                  if (loop_masks)
> +                    final_mask
> +                      = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
> +                                            vec_num * ncopies, vectype,
> +                                            vec_num * j + i);
> +                  if (vec_mask)
> +                    final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
> +                                                   final_mask, vec_mask, gsi);
> +
> +                  if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
> +                    dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
> +                                                   gsi, stmt_info, bump);
> +                }
>
> // This part is directly copied from the original; the original gets updated
> by removing && !STMT_VINFO_GATHER_SCATTER_P.  Due to its size, I didn't
> consider this before; do you prefer me to factor this part out?
>
> +              if (gs_info.ifn != IFN_LAST)
> +                {
> ...
> +                }
> +              else
> +                {
> +                  /* Emulated gather-scatter.  */
> ...
>
> // This part is just moved from the original.
>
> +              vec_dest = vect_create_destination_var (scalar_dest, vectype);
> +              /* DATA_REF is null if we've already built the statement.  */
> +              if (data_ref)
> +                {
> +                  vect_copy_ref_info (data_ref, DR_REF (first_dr_info->dr));
> +                  new_stmt = gimple_build_assign (vec_dest, data_ref);
> +                }
> +              new_temp = make_ssa_name (vec_dest, new_stmt);
> +              gimple_set_lhs (new_stmt, new_temp);
> +              vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> +
> +              /* Store vector loads in the corresponding SLP_NODE.  */
> +              if (slp)
> +                slp_node->push_vec_def (new_stmt);
> +
> +              if (!slp && !costing_p)
> +                STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
> +            }
> +
> +          if (!slp && !costing_p)
> +            *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>
> // This part is some subsequent handling; it's duplicated from the original
> but removes some more useless code.  I guess this part is not worth being
> factored out?
>
> +      if (costing_p)
> +        {
> +          if (dump_enabled_p ())
> +            dump_printf_loc (MSG_NOTE, vect_location,
> +                             "vect_model_load_cost: inside_cost = %u, "
> +                             "prologue_cost = %u .\n",
> +                             inside_cost, prologue_cost);
> +        }
> +      return true;
> +    }
>
> // This duplicates the dumping; I guess it's unnecessary to factor it out.
>
> oh, I just noticed that this should be shortened to
> "if (costing_p && dump_enabled_p ())" instead, just the same as what's
> adopted for the VMAT_LOAD_STORE_LANES dumping.
Just to mention, the original motivating idea was that even though we
duplicate some code, we make it overall more readable and thus more
maintainable.  In the end we might have vectorizable_load () for analysis
but have not only load_vec_info_type but one type for each VMAT_*, which
means multiple separate vect_transform_load () functions.  Currently
vectorizable_load is structured very inconsistently; having the transforms
all hang off a single switch (vmat-kind) {} would be an improvement IMHO.

But sure, some of our internal APIs are verbose and maybe badly factored;
any improvement there is welcome.  Inventing new random APIs just to save a
few lines of code without actually making the code more readable is IMHO
bad.  But if we can, for example, enhance prepare_vec_mask to handle both
the loop mask and the conditional mask, and to handle querying the loop
mask itself, that would be fine (of course you need to check all uses to
see whether that makes sense).

Richard.

> BR,
> Kewen
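As a rough illustration of that last suggestion, a combined helper could
fold the loop-mask query and the conditional-mask merge, which the quoted
hunk performs as two back-to-back steps at every use site, into one call.
This is only a sketch: the name vect_get_final_mask and its exact signature
are invented here for illustration, while the vect_get_loop_mask and
prepare_vec_mask calls follow the argument patterns visible in the hunk
above.

  /* Hypothetical helper, not part of the patch: query the loop mask (if
     loop masking is in effect) and combine it with the statement-level
     conditional mask, if any.  */
  static tree
  vect_get_final_mask (loop_vec_info loop_vinfo, gimple_stmt_iterator *gsi,
                       vec_loop_masks *loop_masks, unsigned int nvectors,
                       tree vectype, tree mask_vectype, tree vec_mask,
                       unsigned int index)
  {
    tree final_mask = NULL_TREE;
    /* Fully-masked loop: fetch the mask for this vector copy.  */
    if (loop_masks)
      final_mask = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
                                       nvectors, vectype, index);
    /* Conditional operation: merge in the statement-level mask.  */
    if (vec_mask)
      final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
                                     final_mask, vec_mask, gsi);
    return final_mask;
  }

With something along those lines, the sequence duplicated in both loop
nests would shrink to a single call, e.g.:

  final_mask = vect_get_final_mask (loop_vinfo, gsi, loop_masks,
                                    vec_num * ncopies, vectype, mask_vectype,
                                    vec_mask, vec_num * j + i);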