on 2023/8/15 15:53, Richard Biener wrote:
> On Tue, Aug 15, 2023 at 4:44 AM Kewen.Lin <li...@linux.ibm.com> wrote:
>>
>> on 2023/8/14 22:16, Richard Sandiford wrote:
>>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>>> Hi Richard,
>>>>
>>>> on 2023/8/14 20:20, Richard Sandiford wrote:
>>>>> Thanks for the clean-ups.  But...
>>>>>
>>>>> "Kewen.Lin" <li...@linux.ibm.com> writes:
>>>>>> Hi,
>>>>>>
>>>>>> Following Richi's suggestion [1], this patch is to move the
>>>>>> handling of VMAT_GATHER_SCATTER in the final loop nest of
>>>>>> function vectorizable_load to its own loop.  Basically it
>>>>>> duplicates the final loop nest, cleans up some useless set-up
>>>>>> code for the VMAT_GATHER_SCATTER case, and removes some
>>>>>> unreachable code.  It also removes the corresponding handling
>>>>>> in the final loop nest.
>>>>>>
>>>>>> Bootstrapped and regtested on x86_64-redhat-linux,
>>>>>> aarch64-linux-gnu and powerpc64{,le}-linux-gnu.
>>>>>>
>>>>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-June/623329.html
>>>>>>
>>>>>> Is it ok for trunk?
>>>>>>
>>>>>> BR,
>>>>>> Kewen
>>>>>> -----
>>>>>>
>>>>>> gcc/ChangeLog:
>>>>>>
>>>>>>   * tree-vect-stmts.cc (vectorizable_load): Move the handling of
>>>>>>   VMAT_GATHER_SCATTER in the final loop nest to its own loop,
>>>>>>   and update the final nest accordingly.
>>>>>> ---
>>>>>>  gcc/tree-vect-stmts.cc | 361 +++++++++++++++++++++++++----------------
>>>>>>  1 file changed, 219 insertions(+), 142 deletions(-)
>>>>>
>>>>> ...that seems like quite a lot of +s.  Is there nothing we can do to
>>>>> avoid the cut-&-paste?
>>>>
>>>> Thanks for the comments!  I'm not sure I get your question: if we
>>>> want to move out the handling of VMAT_GATHER_SCATTER, aren't the
>>>> new +s inevitable?  Is your concern mainly about git blame history?
>>>
>>> No, it was more that 219-142=77, so it seems like a lot of lines
>>> are being duplicated rather than simply being moved.  (Unlike for
>>> VMAT_LOAD_STORE_LANES, which was even a slight LOC saving, and so
>>> was a clear improvement.)
>>>
>>> So I was just wondering if there was any obvious factoring-out that
>>> could be done to reduce the duplication.
>>
>> ah, thanks for the clarification!
>>
>> I think the main duplication is at the beginning and end of the loop
>> body; let's take a look at them in detail:
>>
>> +  if (memory_access_type == VMAT_GATHER_SCATTER)
>> +    {
>> +      gcc_assert (alignment_support_scheme == dr_aligned
>> +                  || alignment_support_scheme == dr_unaligned_supported);
>> +      gcc_assert (!grouped_load && !slp_perm);
>> +
>> +      unsigned int inside_cost = 0, prologue_cost = 0;
>>
>> // These above are newly added.
>>
>> +      for (j = 0; j < ncopies; j++)
>> +        {
>> +          /* 1. Create the vector or array pointer update chain.  */
>> +          if (j == 0 && !costing_p)
>> +            {
>> +              if (STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +                vect_get_gather_scatter_ops (loop_vinfo, loop, stmt_info,
>> +                                             slp_node, &gs_info,
>> +                                             &dataref_ptr, &vec_offsets);
>> +              else
>> +                dataref_ptr
>> +                  = vect_create_data_ref_ptr (vinfo, first_stmt_info,
>> +                                              aggr_type, at_loop, offset,
>> +                                              &dummy, gsi, &ptr_incr,
>> +                                              false, bump);
>> +            }
>> +          else if (!costing_p)
>> +            {
>> +              gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
>> +              if (!STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +                dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr,
>> +                                               gsi, stmt_info, bump);
>> +            }
>>
>> // These are for dataref_ptr; in the final loop nest we deal with more
>> // cases on simd_lane_access_p and diff_first_stmt_info, but don't handle
>> // STMT_VINFO_GATHER_SCATTER_P any more.  Very little (one case) can be
>> // shared between them, so IMHO factoring it out seems like overkill.
>>
>> +
>> +          if (mask && !costing_p)
>> +            vec_mask = vec_masks[j];
>>
>> // This is merged from the j == 0 and j != 0 arms.
>>
>> +
>> +          gimple *new_stmt = NULL;
>> +          for (i = 0; i < vec_num; i++)
>> +            {
>> +              tree final_mask = NULL_TREE;
>> +              tree final_len = NULL_TREE;
>> +              tree bias = NULL_TREE;
>> +              if (!costing_p)
>> +                {
>> +                  if (loop_masks)
>> +                    final_mask
>> +                      = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
>> +                                            vec_num * ncopies, vectype,
>> +                                            vec_num * j + i);
>> +                  if (vec_mask)
>> +                    final_mask = prepare_vec_mask (loop_vinfo, mask_vectype,
>> +                                                   final_mask, vec_mask,
>> +                                                   gsi);
>> +
>> +                  if (i > 0 && !STMT_VINFO_GATHER_SCATTER_P (stmt_info))
>> +                    dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr,
>> +                                                   ptr_incr, gsi,
>> +                                                   stmt_info, bump);
>> +                }
>>
>> // This part is directly copied from the original; the original gets
>> // updated by removing && !STMT_VINFO_GATHER_SCATTER_P.  Due to its size,
>> // I didn't consider this before; do you prefer me to factor this part out?
>>
>> +              if (gs_info.ifn != IFN_LAST)
>> +                {
>> ...
>> +                }
>> +              else
>> +                {
>> +                  /* Emulated gather-scatter.  */
>> ...
>>
>> // This part is just moved from the original.
>>
>> +              vec_dest = vect_create_destination_var (scalar_dest, vectype);
>> +              /* DATA_REF is null if we've already built the statement.  */
>> +              if (data_ref)
>> +                {
>> +                  vect_copy_ref_info (data_ref, DR_REF (first_dr_info->dr));
>> +                  new_stmt = gimple_build_assign (vec_dest, data_ref);
>> +                }
>> +              new_temp = make_ssa_name (vec_dest, new_stmt);
>> +              gimple_set_lhs (new_stmt, new_temp);
>> +              vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
>> +
>> +              /* Store vector loads in the corresponding SLP_NODE.  */
>> +              if (slp)
>> +                slp_node->push_vec_def (new_stmt);
>> +
>> +              if (!slp && !costing_p)
>> +                STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
>> +            }
>> +
>> +          if (!slp && !costing_p)
>> +            *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
>>
>> // This part is some subsequent handling; it's duplicated from the
>> // original, with some more useless code removed.  I guess this part is
>> // not worth factoring out?
>>
>> +      if (costing_p)
>> +        {
>> +          if (dump_enabled_p ())
>> +            dump_printf_loc (MSG_NOTE, vect_location,
>> +                             "vect_model_load_cost: inside_cost = %u, "
>> +                             "prologue_cost = %u .\n",
>> +                             inside_cost, prologue_cost);
>> +        }
>> +      return true;
>> +    }
>>
>> // This duplicates the dumping; I guess it's unnecessary to factor it out.
>>
>> oh, I just noticed that this should be shortened to
>> "if (costing_p && dump_enabled_p ())" instead, just the same as what's
>> adopted for the VMAT_LOAD_STORE_LANES dumping.
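(Concretely, the combined guard would be just:

  if (costing_p && dump_enabled_p ())
    dump_printf_loc (MSG_NOTE, vect_location,
                     "vect_model_load_cost: inside_cost = %u, "
                     "prologue_cost = %u .\n",
                     inside_cost, prologue_cost);

the same shape as the VMAT_LOAD_STORE_LANES dumping.)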
>
> Just to mention, the original motivating idea was that even though we
> duplicate some code, we make it overall more readable and thus more
> maintainable.  In the end we might have vectorizable_load () for the
> analysis but have not only load_vec_info_type but one for each VMAT_*,
> which means multiple separate vect_transform_load () functions.
> Currently vectorizable_load is structured very inconsistently; having
> the transforms all hang off a single switch (vmat-kind) {} would be an
> improvement IMHO.
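Thanks for the comments!  Just to check I understand the direction: on
the transform side we would eventually dispatch on the access kind with
something like the sketch below?  (A rough sketch only; these
vect_transform_load_* helpers don't exist today, the names are made up
for illustration.)

  switch (memory_access_type)
    {
    case VMAT_LOAD_STORE_LANES:
      return vect_transform_load_lanes (vinfo, stmt_info, gsi, vec_stmt,
                                        slp_node, cost_vec);
    case VMAT_GATHER_SCATTER:
      return vect_transform_load_gather_scatter (vinfo, stmt_info, gsi,
                                                 vec_stmt, slp_node,
                                                 cost_vec);
    case VMAT_CONTIGUOUS:
    case VMAT_CONTIGUOUS_REVERSE:
    case VMAT_CONTIGUOUS_PERMUTE:
      return vect_transform_load_contiguous (vinfo, stmt_info, gsi,
                                             vec_stmt, slp_node, cost_vec);
    /* ... one case (or helper) per remaining VMAT_* kind ...  */
    default:
      gcc_unreachable ();
    }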
With these two patches, the final loop nest now handles only
VMAT_CONTIGUOUS, VMAT_CONTIGUOUS_REVERSE and VMAT_CONTIGUOUS_PERMUTE.
IMHO their handling is tightly coupled; restructuring them could result
in more duplicated code and the risk of incomplete bug fixes that
Richard pointed out.  But if I read the above comments right, our final
goal seems to be to separate all of them?  I wonder if you both prefer
to separate them further?

> But sure some of our internal APIs are verbose and maybe badly
> factored, any improvement there is welcome.  Inventing new random APIs
> just to save a few lines of code without actually making the code more
> readable is IMHO bad.
>
> But, if we can for example enhance prepare_vec_mask to handle both the
> loop mask and the conditional mask, and to handle querying the mask,
> that would be fine (of course you need to check all uses to see if
> that makes sense).

OK, will keep that in mind, and I've added this example to my TODO
list. :)
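To capture the idea for that TODO, the direction might look roughly
like the sketch below (only a sketch, not actual GCC code; the final
signature would of course need checking against all existing callers):

  /* Let prepare_vec_mask query the loop mask itself and combine it with
     the conditional mask, so callers don't repeat the
     vect_get_loop_mask + prepare_vec_mask sequence.  */
  static tree
  prepare_vec_mask (loop_vec_info loop_vinfo, tree mask_type,
                    vec_loop_masks *loop_masks, unsigned int nvectors,
                    tree vectype, unsigned int index, tree vec_mask,
                    gimple_stmt_iterator *gsi)
  {
    tree loop_mask = NULL_TREE;
    if (loop_masks)
      loop_mask = vect_get_loop_mask (loop_vinfo, gsi, loop_masks,
                                      nvectors, vectype, index);
    if (!loop_mask)
      return vec_mask;
    if (!vec_mask)
      return loop_mask;
    /* Both masks are present: AND them together, as the current
       implementation already does for LOOP_MASK & VEC_MASK.  */
    tree and_res = make_ssa_name (mask_type);
    gimple *and_stmt = gimple_build_assign (and_res, BIT_AND_EXPR,
                                            loop_mask, vec_mask);
    gsi_insert_before (gsi, and_stmt, GSI_SAME_STMT);
    return and_res;
  }

A caller like the gather/scatter loop above would then collapse the
loop_masks/vec_mask handling into a single call:

  final_mask = prepare_vec_mask (loop_vinfo, mask_vectype, loop_masks,
                                 vec_num * ncopies, vectype,
                                 vec_num * j + i, vec_mask, gsi);

BR,
Kewen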