Hi Richard,

Did you have a chance to look at this updated patch?

Thanks.
Yuri.

2015-12-15 19:41 GMT+03:00 Yuri Rumyantsev <ysrum...@gmail.com>:
> Hi Richard,
>
> I re-designed the patch to determine the feasibility of loop masking on
> the fly during vectorization analysis and to invoke masking after loop
> transformation.
> Test-case is also provided.
>
> What is your opinion?
>
> Thanks.
> Yuri.
>
> gcc/ChangeLog:
> 2015-12-15  Yuri Rumyantsev  <ysrum...@gmail.com>
>
> * config/i386/i386.c (ix86_builtin_vectorization_cost): Add handling
> of new cases.
> * config/i386/i386.h (TARGET_INCREASE_MASK_STORE_COST): Add new target
> macro.
> * config/i386/x86-tune.def (X86_TUNE_INCREASE_MASK_STORE_COST): New
> tune option.
> * params.def (PARAM_VECT_COST_INCREASE_THRESHOLD): New parameter.
> * target.h (enum vect_cost_for_stmt): Add new elements.
> * targhooks.c (default_builtin_vectorization_cost): Extend switch for
> new enum elements.
> * tree-vect-loop.c: Include three header files.
> (vect_analyze_loop_operations): Add initialization and resetting of
> new fields; add computation of the profitability of masking the loop
> for the epilogue.
> (vectorizable_reduction): Determine whether the reduction can be
> masked and compute its cost.
> (vect_can_build_vector_iv): New function.
> (vect_generate_tmps_on_preheader): Adjust computation of the ratio
> depending on epilogue generation.
> (gen_vec_iv_for_masking): New function.
> (gen_vec_mask_for_loop): Likewise.
> (mask_vect_load_store): Likewise.
> (convert_reductions_for_masking): Likewise.
> (fix_mask_for_masked_ld_st): Likewise.
> (mask_loop_for_epilogue): Likewise.
> (vect_transform_loop): Do not perform loop masking if it requires
> peeling for gaps; add a check on whether the loop can be masked; turn
> off loop peeling if loop masking is performed; save the recomputed
> NITERS to the corresponding field of loop_vec_info; invoke
> mask_loop_for_epilogue after vectorization if masking is possible.
> * tree-vect-stmts.c : Include tree-ssa-loop-ivopts.h.
> (can_mask_load_store): New function.
> (vectorizable_mask_load_store): Determine whether the load/store can
> be masked and compute its cost.
> (vectorizable_load): Likewise.
> * tree-vectorizer.h (additional_loop_body_cost): New field of
> loop_vec_info.
> (mask_main_loop_for_epilogue): Likewise.
> (LOOP_VINFO_ADDITIONAL_LOOP_BODY_COST): New macro.
> (LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE): Likewise.
>
> gcc/testsuite/ChangeLog:
> * gcc.target/i386/vect-mask-loop_for_epilogue1.c: New test.
>
> 2015-11-30 18:03 GMT+03:00 Yuri Rumyantsev <ysrum...@gmail.com>:
>> Richard,
>>
>> Thanks a lot for your detailed comments!
>>
>> A few words about the 436.cactusADM gain. The loop that was
>> transformed for AVX2 is very large; it is the last innermost loop in
>> routine Bench_StaggeredLeapfrog2 (StaggeredLeapfrog2.F #366). If you
>> don't have the sources, let me know.
>>
>> Yuri.
>>
>> 2015-11-27 16:45 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>>> On Fri, Nov 13, 2015 at 11:35 AM, Yuri Rumyantsev <ysrum...@gmail.com> 
>>> wrote:
>>>> Hi Richard,
>>>>
>>>> Here is an updated version of the patch which (1) is in sync with the
>>>> trunk compiler and (2) contains a simple cost model to estimate the
>>>> profitability of scalar epilogue elimination. The part related to
>>>> vectorization of loops with small trip counts is still in development.
>>>> Note that the implemented cost model was not tuned well for HASWELL
>>>> and KNL, but we got ~6% speed-up on 436.cactusADM from the SPEC2006
>>>> suite on HASWELL.
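
To illustrate the idea under discussion, eliminating the scalar epilogue by executing the remainder as one masked vector iteration, here is a minimal scalar emulation (an illustrative sketch, not the patch code; `VF = 4` and the function name are assumptions of mine):

```c
#include <assert.h>

#define VF 4  /* assumed vector factor, for illustration only */

/* With a masked epilogue, the n % VF remainder iterations run as one
   extra "vector" iteration whose lanes are predicated by a mask,
   instead of falling back to a scalar remainder loop.  */
static void
add_arrays_masked_tail (const int *a, const int *b, int *c, int n)
{
  int i = 0;

  /* Main vectorized loop: full vectors only.  */
  for (; i + VF <= n; i += VF)
    for (int lane = 0; lane < VF; lane++)
      c[i + lane] = a[i + lane] + b[i + lane];

  /* One masked "vector" iteration covers the remainder.  */
  if (i < n)
    for (int lane = 0; lane < VF; lane++)
      if (i + lane < n)                             /* per-lane mask bit */
        c[i + lane] = a[i + lane] + b[i + lane];    /* masked load/store */
}
```

The point of the cost model above is to decide when this extra masked iteration is cheaper than the scalar remainder loop it replaces.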
>>>
>>> Ok, so I don't know where to start with this.
>>>
>>> First of all, while I wanted the actual stmt processing to happen as a
>>> post-processing step on the vectorized loop body, I didn't want it
>>> completely separated from vectorizing.
>>>
>>> So call combine_vect_loop_remainder () from vect_transform_loop, rather
>>> than iterating over all (vectorized) loops at the end.
>>>
>>> Second, all the adjustments of the number of iterations for the vector
>>> loop should be integrated into the main vectorization scheme, as should
>>> determining the cost of the predication.  So you'll end up adding a
>>> LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE flag, determined during cost
>>> analysis; during code generation, adjust the vector iteration
>>> computation accordingly and do _not_ generate the epilogue loop (or
>>> wire it up correctly in the first place).
>>>
>>> The actual stmt processing should then still happen in a similar way
>>> to what you do now.
>>>
>>> So I'm going to comment on that part only as I expect the rest will look a 
>>> lot
>>> different.
>>>
>>> +/* Generate induction_vector which will be used to mask evaluation.  */
>>> +
>>> +static tree
>>> +gen_vec_induction (loop_vec_info loop_vinfo, unsigned elem_size, unsigned 
>>> size)
>>> +{
>>>
>>> please make use of create_iv.  Add more comments.  I reverse-engineered
>>> that you add a { { 0, ..., vf }, +, {vf, ... vf } } IV which you use
>>> in gen_mask_for_remainder
>>> by comparing it against { niter, ..., niter }.
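
The reverse-engineered description above, a { { 0, ..., VF-1 }, +, { VF, ..., VF } } IV compared lane-wise against the broadcast { niter, ..., niter }, can be emulated per lane like this (a sketch under the assumption VF = 4; the function name is mine, not the patch's):

```c
#include <assert.h>

#define VF 4  /* assumed vector factor */

/* After `iter' steps, the { { 0, ..., VF-1 }, +, { VF, ..., VF } } IV
   holds iter*VF + lane in each lane; comparing it against the broadcast
   { niter, ..., niter } yields the per-lane active mask.  */
static void
compute_lane_mask (unsigned niter, unsigned iter, int mask[VF])
{
  for (int lane = 0; lane < VF; lane++)
    {
      unsigned iv = iter * VF + lane;  /* lane value of the vector IV */
      mask[lane] = iv < niter;         /* lane active iff below niter */
    }
}
```

On the last iteration only the leading niter % VF lanes stay active, which is exactly the predicate needed for the masked remainder.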
>>>
>>> +  gsi = gsi_after_labels (loop->header);
>>> +  niters = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
>>> +          ? LOOP_VINFO_NITERS (loop_vinfo)
>>> +          : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);
>>>
>>> That's either wrong or unnecessary: if ! peeling for alignment,
>>> loop-vinfo-niters is equal to loop-vinfo-niters-unchanged.
>>>
>>> +      ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
>>> +      if (!SSA_NAME_PTR_INFO (addr))
>>> +       copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
>>>
>>> vect_duplicate_ssa_name_ptr_info.
>>>
>>> +
>>> +static void
>>> +fix_mask_for_masked_ld_st (vec<gimple *> *masked_stmt, tree mask)
>>> +{
>>> +  gimple *stmt, *new_stmt;
>>> +  tree old, lhs, vectype, var, n_lhs;
>>>
>>> no comment?  what's this for.
>>>
>>> +/* Convert vectorized reductions to VEC_COND statements to preserve
>>> +   reduction semantic:
>>> +       s1 = x + s2 --> t = x + s2; s1 = (mask)? t : s2.  */
>>> +
>>> +static void
>>> +convert_reductions (loop_vec_info loop_vinfo, tree mask)
>>> +{
>>>
>>> for reductions it looks like preserving the last iteration x plus the mask
>>> could avoid predicating it this way and compensate in the reduction
>>> epilogue by "subtracting" x & mask?  With true predication support
>>> that'll likely be more expensive of course.
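
The VEC_COND rewrite quoted above (s1 = x + s2 --> t = x + s2; s1 = mask ? t : s2) can be checked per lane with a small emulation (illustrative only; VF = 4 and the function name are assumptions):

```c
#include <assert.h>

#define VF 4  /* assumed vector factor */

/* Per-lane emulation of the reduction rewrite
     s1 = x + s2  -->  t = x + s2; s1 = mask ? t : s2;
   inactive lanes keep their previous partial sum, preserving the
   reduction semantics on the masked remainder iteration.  */
static void
masked_reduction_step (int s[VF], const int x[VF], const int mask[VF])
{
  for (int lane = 0; lane < VF; lane++)
    {
      int t = s[lane] + x[lane];            /* unpredicated add */
      s[lane] = mask[lane] ? t : s[lane];   /* VEC_COND select */
    }
}
```

This is what makes the final horizontal reduction of the partial-sum vector correct: masked-out lanes never see the out-of-range elements.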
>>>
>>> +      /* Generate new VEC_COND expr.  */
>>> +      vec_cond_expr = build3 (VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
>>> +      new_stmt = gimple_build_assign (lhs, vec_cond_expr);
>>>
>>> gimple_build_assign (lhs, VEC_COND_EXPR, mask, new_lhs, rhs);
>>>
>>> +/* Return true if MEM_REF is incremented by vector size and false
>>> otherwise.  */
>>> +
>>> +static bool
>>> +mem_ref_is_vec_size_incremented (loop_vec_info loop_vinfo, tree lhs)
>>> +{
>>> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>>>
>>> what?!  Just look at DR_STEP of the store?
>>>
>>>
>>> +void
>>> +combine_vect_loop_remainder (loop_vec_info loop_vinfo)
>>> +{
>>> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>>> +  auto_vec<gimple *, 10> loads;
>>> +  auto_vec<gimple *, 5> stores;
>>>
>>> so you need to re-structure this in a way that it computes
>>>
>>>   a) whether it can perform the operation - and you need to do that
>>>       reliably before the operation has taken place
>>>   b) its cost
>>>
>>> instead of looking at def types or gimple_assign_load/store_p predicates
>>> please look at STMT_VINFO_TYPE instead.
>>>
>>> I don't like the new target hook for the costing.  We do need some major
>>> re-structuring in the vectorizer cost model implementation, this doesn't go
>>> into the right direction.
>>>
>>> A simplistic hook following the current scheme would have taken
>>> vect_cost_for_stmt as an argument and mirrored builtin_vectorization_cost.
>>>
>>> There is not a single testcase in the patch.  I would have expected one that
>>> makes sure we keep the 6% speedup for cactusADM at least.
>>>
>>>
>>> So this was a 45-minute "overall" review, not going into all the
>>> implementation details.
>>>
>>> Thanks,
>>> Richard.
>>>
>>>
>>>> 2015-11-10 17:52 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>>>>> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich <enkovich....@gmail.com> 
>>>>> wrote:
>>>>>> 2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>>>>>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrum...@gmail.com> 
>>>>>>> wrote:
>>>>>>>> Richard,
>>>>>>>>
>>>>>>>> It looks like a misunderstanding - we assumed that for GCC 6 the
>>>>>>>> simple remainder scheme would be used, introducing a new IV:
>>>>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>>>>>
>>>>>>>> Is that true, or did we miss something?
>>>>>>>
>>>>>>> <quote>
>>>>>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>>>>>> > for both 4b and 4c?
>>>>>>>>
>>>>>>>> Do 2a ...
>>>>>>> Okay.
>>>>>>> </quote>
>>>>>>
>>>>>> 2a was 'transform already vectorized loop as a separate
>>>>>> post-processing'. Isn't that what this prototype patch implements?
>>>>>> The current version only masks the loop body, which in practice is
>>>>>> applicable to AVX-512 only in most cases.  With AVX-512 it's easier
>>>>>> to see how profitable masking might be, and it is the main target
>>>>>> for the first masking version.  Extending it to prologues/epilogues,
>>>>>> and thus making it more profitable for other targets, is the next
>>>>>> step and is out of the scope of this patch.
>>>>>
>>>>> Ok, technically the prototype transforms the already-vectorized loop.
>>>>> Of course I meant that the vectorized loop be copied, masked, and
>>>>> the result used as the epilogue...
>>>>>
>>>>> I'll queue a more detailed look into the patch for this week.
>>>>>
>>>>> Did you perform any measurements with this patch, like the number of
>>>>> masked epilogues in SPEC 2006 FP (and any speedup)?
>>>>>
>>>>> Thanks,
>>>>> Richard.
>>>>>
>>>>>> Thanks,
>>>>>> Ilya
>>>>>>
>>>>>>>
>>>>>>> Richard.
>>>>>>>
