Hi Richard,

Did you have a chance to look at this updated patch?
Thanks.
Yuri.

2015-12-15 19:41 GMT+03:00 Yuri Rumyantsev <ysrum...@gmail.com>:
> Hi Richard,
>
> I re-designed the patch to determine the ability to mask the loop on the
> fly during vectorization analysis and to invoke the masking after the loop
> transformation.  A test case is also provided.
>
> What is your opinion?
>
> Thanks.
> Yuri.
>
> ChangeLog:
> 2015-12-15  Yuri Rumyantsev  <ysrum...@gmail.com>
>
> * config/i386/i386.c (ix86_builtin_vectorization_cost): Add handling
> of new cases.
> * config/i386/i386.h (TARGET_INCREASE_MASK_STORE_COST): New target
> macro.
> * config/i386/x86-tune.def (X86_TUNE_INCREASE_MASK_STORE_COST): New
> tuning option.
> * params.def (PARAM_VECT_COST_INCREASE_THRESHOLD): New parameter.
> * target.h (enum vect_cost_for_stmt): Add new elements.
> * targhooks.c (default_builtin_vectorization_cost): Extend switch for
> new enum elements.
> * tree-vect-loop.c: Include 3 header files.
> (vect_analyze_loop_operations): Add initialization and resetting of
> new fields, add computation of the profitability of masking the loop
> for the epilogue.
> (vectorizable_reduction): Determine the ability to mask the reduction
> and compute its cost.
> (vect_can_build_vector_iv): New function.
> (vect_generate_tmps_on_preheader): Adjust computation of ratio
> depending on epilogue generation.
> (gen_vec_iv_for_masking): New function.
> (gen_vec_mask_for_loop): Likewise.
> (mask_vect_load_store): Likewise.
> (convert_reductions_for_masking): Likewise.
> (fix_mask_for_masked_ld_st): Likewise.
> (mask_loop_for_epilogue): Likewise.
> (vect_transform_loop): Do not perform loop masking if it requires
> peeling for gaps, add a check on the ability to mask the loop, turn
> off loop peeling if loop masking is performed, save the recomputed
> NITERS to the corresponding field of loop_vec_info, invoke
> mask_loop_for_epilogue after vectorization if masking is possible.
> * tree-vect-stmts.c: Include tree-ssa-loop-ivopts.h.
> (can_mask_load_store): New function.
> (vectorizable_mask_load_store): Determine the ability to mask the
> load/store and compute its cost.
> (vectorizable_load): Likewise.
> * tree-vectorizer.h (additional_loop_body_cost): New field of
> loop_vec_info.
> (mask_main_loop_for_epilogue): Likewise.
> (LOOP_VINFO_ADDITIONAL_LOOP_BODY_COST): New macro.
> (LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE): Likewise.
>
> gcc/testsuite/ChangeLog:
> * gcc.target/i386/vect-mask-loop_for_epilogue1.c: New test.
>
> 2015-11-30 18:03 GMT+03:00 Yuri Rumyantsev <ysrum...@gmail.com>:
>> Richard,
>>
>> Thanks a lot for your detailed comments!
>>
>> A few words about the 436.cactusADM gain.  The loop that was transformed
>> for avx2 is very large, and it is the last innermost loop in routine
>> Bench_StaggeredLeapfrog2 (StaggeredLeapfrog2.F #366).  If you don't
>> have the sources, let me know.
>>
>> Yuri.
>>
>> 2015-11-27 16:45 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>>> On Fri, Nov 13, 2015 at 11:35 AM, Yuri Rumyantsev <ysrum...@gmail.com>
>>> wrote:
>>>> Hi Richard,
>>>>
>>>> Here is an updated version of the patch which (1) is in sync with the
>>>> trunk compiler and (2) contains a simple cost model to estimate the
>>>> profitability of scalar epilogue elimination.  The part related to
>>>> vectorization of loops with a small trip count is still in development.
>>>> Note that the implemented cost model has not been tuned well for
>>>> HASWELL and KNL, but we got a ~6% speed-up on 436.cactusADM from the
>>>> spec2006 suite for HASWELL.
>>>
>>> Ok, so I don't know where to start with this.
>>>
>>> First of all, while I wanted to have the actual stmt processing done as
>>> post-processing on the vectorized loop body, I didn't want to have this
>>> completely separated from vectorizing.
>>>
>>> So, do combine_vect_loop_remainder () from vect_transform_loop, not by
>>> iterating over all (vectorized) loops at the end.
>>>
>>> Second, all the adjustments of the number of iterations for the vector
>>> loop should be integrated into the main vectorization scheme, as should
>>> determining the cost of the predication.  So you'll end up adding a
>>> LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE flag, determined during cost
>>> analysis, and during code generation adjust the vector iteration
>>> computation accordingly and _not_ generate the epilogue loop (or wire it
>>> up correctly in the first place).
>>>
>>> The actual stmt processing should then still happen in a similar way to
>>> what you do.
>>>
>>> So I'm going to comment on that part only, as I expect the rest will
>>> look a lot different.
>>>
>>> +/* Generate induction_vector which will be used to mask evaluation. */
>>> +
>>> +static tree
>>> +gen_vec_induction (loop_vec_info loop_vinfo, unsigned elem_size, unsigned size)
>>> +{
>>>
>>> please make use of create_iv.  Add more comments.  I reverse-engineered
>>> that you add a { { 0, ..., vf }, +, { vf, ..., vf } } IV which you use
>>> in gen_mask_for_remainder by comparing it against { niter, ..., niter }.
>>>
>>> +  gsi = gsi_after_labels (loop->header);
>>> +  niters = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
>>> +           ? LOOP_VINFO_NITERS (loop_vinfo)
>>> +           : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);
>>>
>>> that's either wrong or unnecessary.  If we are not peeling for
>>> alignment, loop-vinfo-niters is equal to loop-vinfo-niters-unchanged.
>>>
>>> +  ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
>>> +  if (!SSA_NAME_PTR_INFO (addr))
>>> +    copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
>>>
>>> vect_duplicate_ssa_name_ptr_info.
>>>
>>> +
>>> +static void
>>> +fix_mask_for_masked_ld_st (vec<gimple *> *masked_stmt, tree mask)
>>> +{
>>> +  gimple *stmt, *new_stmt;
>>> +  tree old, lhs, vectype, var, n_lhs;
>>>
>>> no comment?  what's this for?
>>>
>>> +/* Convert vectorized reductions to VEC_COND statements to preserve
>>> +   reduction semantics:
>>> +   s1 = x + s2 --> t = x + s2; s1 = (mask) ? t : s2.  */
>>> +
>>> +static void
>>> +convert_reductions (loop_vec_info loop_vinfo, tree mask)
>>> +{
>>>
>>> for reductions it looks like preserving the last iteration's x plus the
>>> mask could avoid predicating it this way and compensate in the reduction
>>> epilogue by "subtracting" x & mask?  With true predication support
>>> that'll likely be more expensive of course.
>>>
>>> +  /* Generate new VEC_COND expr.  */
>>> +  vec_cond_expr = build3 (VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
>>> +  new_stmt = gimple_build_assign (lhs, vec_cond_expr);
>>>
>>> gimple_build_assign (lhs, VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
>>>
>>> +/* Return true if MEM_REF is incremented by vector size and false
>>> +   otherwise.  */
>>> +
>>> +static bool
>>> +mem_ref_is_vec_size_incremented (loop_vec_info loop_vinfo, tree lhs)
>>> +{
>>> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>>>
>>> what?!  Just look at DR_STEP of the store?
>>>
>>>
>>> +void
>>> +combine_vect_loop_remainder (loop_vec_info loop_vinfo)
>>> +{
>>> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>>> +  auto_vec<gimple *, 10> loads;
>>> +  auto_vec<gimple *, 5> stores;
>>>
>>> so you need to re-structure this in a way that it computes
>>>
>>> a) whether it can perform the operation - and you need to do that
>>>    reliably before the operation has taken place
>>> b) its cost
>>>
>>> instead of looking at def types or gimple_assign_load/store_p
>>> predicates please look at STMT_VINFO_TYPE instead.
>>>
>>> I don't like the new target hook for the costing.  We do need some
>>> major re-structuring in the vectorizer cost model implementation; this
>>> doesn't go in the right direction.
>>>
>>> A simplistic hook following the current scheme would have used
>>> vect_cost_for_stmt as an argument and mirrored
>>> builtin_vectorization_cost.
>>>
>>> There is not a single testcase in the patch.  I would have expected one
>>> that makes sure we keep the 6% speedup for cactusADM at least.
>>>
>>> So this was a 45-minute "overall" review, not going into all the
>>> implementation details.
>>>
>>> Thanks,
>>> Richard.
>>>
>>>
>>>> 2015-11-10 17:52 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>>>>> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich <enkovich....@gmail.com>
>>>>> wrote:
>>>>>> 2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>>>>>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrum...@gmail.com>
>>>>>>> wrote:
>>>>>>>> Richard,
>>>>>>>>
>>>>>>>> It looks like a misunderstanding - we assumed that for GCCv6 the
>>>>>>>> simple scheme for the remainder would be used, through introducing
>>>>>>>> a new IV:
>>>>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>>>>>
>>>>>>>> Is that true, or did we miss something?
>>>>>>>
>>>>>>> <quote>
>>>>>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>>>>>> > for both 4b and 4c?
>>>>>>>>
>>>>>>>> Do 2a ...
>>>>>>> Okay.
>>>>>>> </quote>
>>>>>>
>>>>>> 2a was 'transform already vectorized loop as a separate
>>>>>> post-processing'.  Isn't that what this prototype patch implements?
>>>>>> The current version only masks the loop body, which in practice is
>>>>>> applicable to AVX-512 only in most cases.  With AVX-512 it's easier
>>>>>> to see how profitable masking might be, and it is the main target for
>>>>>> the first masking version.  Extending it to prologues/epilogues, and
>>>>>> thus making it more profitable for other targets, is the next step
>>>>>> and is out of the scope of this patch.
>>>>>
>>>>> Ok, technically the prototype transforms the already vectorized loop.
>>>>> Of course I meant the vectorized loop to be copied, masked and that
>>>>> result used as the epilogue...
>>>>>
>>>>> I'll queue a more detailed look into the patch for this week.
>>>>>
>>>>> Did you perform any measurements with this patch, like the number of
>>>>> masked epilogues in SPEC 2006 FP (and any speedup)?
>>>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>>> Thanks,
>>>>>> Ilya
>>>>>>
>>>>>>>
>>>>>>> Richard.
>>>>>>>
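
P.S. For readers following the thread, here is a minimal, self-contained
sketch of the masked-remainder idea under discussion - executing one extra
masked vector iteration instead of a scalar epilogue loop.  It is written
with AVX-512 intrinsics rather than with the vectorizer internals the patch
touches; the function and variable names are illustrative only and are not
taken from the patch.

#include <immintrin.h>

/* Process a[0..n-1] += b[0..n-1] with 16-lane AVX-512 vectors.  The
   tail (n % 16 elements) is handled by one masked vector iteration
   instead of a scalar epilogue loop.  */
void
add_arrays (float *a, const float *b, long n)
{
  long i = 0;

  /* Main vectorized loop: full vectors only.  */
  for (; i + 16 <= n; i += 16)
    {
      __m512 va = _mm512_loadu_ps (a + i);
      __m512 vb = _mm512_loadu_ps (b + i);
      _mm512_storeu_ps (a + i, _mm512_add_ps (va, vb));
    }

  /* Remainder: build a mask enabling only the n - i live lanes and
     execute the same vector body once under that mask.  */
  if (i < n)
    {
      __mmask16 k = (__mmask16) ((1u << (n - i)) - 1);
      __m512 va = _mm512_maskz_loadu_ps (k, a + i);
      __m512 vb = _mm512_maskz_loadu_ps (k, b + i);
      _mm512_mask_storeu_ps (a + i, k, _mm512_add_ps (va, vb));
    }
}

Compiled with -mavx512f, the masked store in the tail writes only the first
n - i lanes, which is roughly the effect the patch tries to obtain directly
on the vectorized GIMPLE, where the mask comes from comparing the
{ 0, ..., vf } induction vector against { niter, ..., niter }.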