Hi Richard,

I re-designed the patch to determine ability of loop masking on fly of
vectorization analysis and invoke it after loop transformation.
Test-case is also provided.

what is your opinion?

Thanks.
Yuri.

ChangeLog::
2015-12-15  Yuri Rumyantsev  <ysrum...@gmail.com>

* config/i386/i386.c (ix86_builtin_vectorization_cost): Add handling
of new cases.
* config/i386/i386.h (TARGET_INCREASE_MASK_STORE_COST): Add new target
macros.
* config/i386/x86-tune.def (X86_TUNE_INCREASE_MASK_STORE_COST): New
tuner.
* params.def (PARAM_VECT_COST_INCREASE_THRESHOLD): New parameter.
* target.h (enum vect_cost_for_stmt): Add new elements.
* targhooks.c (default_builtin_vectorization_cost): Extend switch for
new enum elements.
* tree-vect-loop.c : Include 3 header files.
(vect_analyze_loop_operations): Add new filelds initialization and
resetting, add computation of profitability for masking loop for
epilog.
(vectorizable_reduction): Determine ability of reduction masking
and compute its cost.
(vect_can_build_vector_iv): New function.
(vect_generate_tmps_on_preheader): Adjust compution of ration depending
on epilogue generation.
(gen_vec_iv_for_masking): New function.
(gen_vec_mask_for_loop): Likewise.
(mask_vect_load_store): Likewise.
(convert_reductions_for_masking): Likewise.
(fix_mask_for_masked_ld_st): Likewise.
(mask_loop_for_epilogue): Likewise.
(vect_transform_loop): Do not perform loop masking if it requires
peeling for gaps, add check on ability masking of loop, turn off
loop peeling if loop masking is performed, save recomputed NITERS to
correspondent filed of loop_vec_info, invoke of mask_loop_for_epilogue
after vectorization if masking is possible.
* tree-vect-stmts.c : Include tree-ssa-loop-ivopts.h.
(can_mask_load_store): New function.
(vectorizable_mask_load_store): Determine ability of load/store
masking and compute its cost.
(vectorizable_load):  Likewise.
* tree-vectorizer.h (additional_loop_body_cost): New field of
loop_vec_info.
(mask_main_loop_for_epilogue): Likewise.
(LOOP_VINFO_ADDITIONAL_LOOP_BODY_COST): New macros.
(LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE): Likewise.

gcc/testsuite/ChangeLog:
* gcc.target/i386/vect-mask-loop_for_epilogue1.c: New test.

2015-11-30 18:03 GMT+03:00 Yuri Rumyantsev <ysrum...@gmail.com>:
> Richard,
>
> Thanks a lot for your detailed comments!
>
> Few words about 436.cactusADM gain. The loop which was transformed for
> avx2 is very huge and this is the last inner-most loop in routine
> Bench_StaggeredLeapfrog2 (StaggeredLeapfrog2.F #366). If you don't
> have sources, let me know.
>
> Yuri.
>
> 2015-11-27 16:45 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>> On Fri, Nov 13, 2015 at 11:35 AM, Yuri Rumyantsev <ysrum...@gmail.com> wrote:
>>> Hi Richard,
>>>
>>> Here is updated version of the patch which 91) is in sync with trunk
>>> compiler and (2) contains simple cost model to estimate profitability
>>> of scalar epilogue elimination. The part related to vectorization of
>>> loops with small trip count is in process of developing. Note that
>>> implemented cost model was not tuned  well for HASWELL and KNL but we
>>> got  ~6% speed-up on 436.cactusADM from spec2006 suite for HASWELL.
>>
>> Ok, so I don't know where to start with this.
>>
>> First of all while I wanted to have the actual stmt processing to be
>> as post-processing
>> on the vectorized loop body I didn't want to have this competely separated 
>> from
>> vectorizing.
>>
>> So, do combine_vect_loop_remainder () from vect_transform_loop, not by 
>> iterating
>> over all (vectorized) loops at the end.
>>
>> Second, all the adjustments of the number of iterations for the vector
>> loop should
>> be integrated into the main vectorization scheme as should determining the
>> cost of the predication.  So you'll end up adding a
>> LOOP_VINFO_MASK_MAIN_LOOP_FOR_EPILOGUE flag, determined during
>> cost analysis and during code generation adjust vector iteration computation
>> accordingly and _not_ generate the epilogue loop (or wire it up correctly in
>> the first place).
>>
>> The actual stmt processing should then still happen in a similar way as you 
>> do.
>>
>> So I'm going to comment on that part only as I expect the rest will look a 
>> lot
>> different.
>>
>> +/* Generate induction_vector which will be used to mask evaluation.  */
>> +
>> +static tree
>> +gen_vec_induction (loop_vec_info loop_vinfo, unsigned elem_size, unsigned 
>> size)
>> +{
>>
>> please make use of create_iv.  Add more comments.  I reverse-engineered
>> that you add a { { 0, ..., vf }, +, {vf, ... vf } } IV which you use
>> in gen_mask_for_remainder
>> by comparing it against { niter, ..., niter }.
>>
>> +  gsi = gsi_after_labels (loop->header);
>> +  niters = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
>> +          ? LOOP_VINFO_NITERS (loop_vinfo)
>> +          : LOOP_VINFO_NITERS_UNCHANGED (loop_vinfo);
>>
>> that's either wrong or unnecessary.  if ! peeling for alignment
>> loop-vinfo-niters
>> is equal to loop-vinfo-niters-unchanged.
>>
>> +      ptr = build_int_cst (reference_alias_ptr_type (ref), 0);
>> +      if (!SSA_NAME_PTR_INFO (addr))
>> +       copy_ref_info (build2 (MEM_REF, TREE_TYPE (ref), addr, ptr), ref);
>>
>> vect_duplicate_ssa_name_ptr_info.
>>
>> +
>> +static void
>> +fix_mask_for_masked_ld_st (vec<gimple *> *masked_stmt, tree mask)
>> +{
>> +  gimple *stmt, *new_stmt;
>> +  tree old, lhs, vectype, var, n_lhs;
>>
>> no comment?  what's this for.
>>
>> +/* Convert vectorized reductions to VEC_COND statements to preserve
>> +   reduction semantic:
>> +       s1 = x + s2 --> t = x + s2; s1 = (mask)? t : s2.  */
>> +
>> +static void
>> +convert_reductions (loop_vec_info loop_vinfo, tree mask)
>> +{
>>
>> for reductions it looks like preserving the last iteration x plus the mask
>> could avoid predicating it this way and compensate in the reduction
>> epilogue by "subtracting" x & mask?  With true predication support
>> that'll likely be more expensive of course.
>>
>> +      /* Generate new VEC_COND expr.  */
>> +      vec_cond_expr = build3 (VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
>> +      new_stmt = gimple_build_assign (lhs, vec_cond_expr);
>>
>> gimple_build_assign (lhs, VEC_COND_EXPR, vectype, mask, new_lhs, rhs);
>>
>> +/* Return true if MEM_REF is incremented by vector size and false
>> otherwise.  */
>> +
>> +static bool
>> +mem_ref_is_vec_size_incremented (loop_vec_info loop_vinfo, tree lhs)
>> +{
>> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>>
>> what?!  Just look at DR_STEP of the store?
>>
>>
>> +void
>> +combine_vect_loop_remainder (loop_vec_info loop_vinfo)
>> +{
>> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
>> +  auto_vec<gimple *, 10> loads;
>> +  auto_vec<gimple *, 5> stores;
>>
>> so you need to re-structure this in a way that it computes
>>
>>   a) wheter it can perform the operation - and you need to do that
>>       reliably before the operation has taken place
>>   b) its cost
>>
>> instead of looking at def types or gimple_assign_load/store_p predicates
>> please look at STMT_VINFO_TYPE instead.
>>
>> I don't like the new target hook for the costing.  We do need some major
>> re-structuring in the vectorizer cost model implementation, this doesn't go
>> into the right direction.
>>
>> A simplistic hook following the current scheme would have used
>> the vect_cost_for_stmt as argument and mirror builtin_vectorization_cost.
>>
>> There is not a single testcase in the patch.  I would have expected one that
>> makes sure we keep the 6% speedup for cactusADM at least.
>>
>>
>> So this was a 45minute "overall" review not going into all the
>> implementation details.
>>
>> Thanks,
>> Richard.
>>
>>
>>> 2015-11-10 17:52 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>>>> On Tue, Nov 10, 2015 at 2:02 PM, Ilya Enkovich <enkovich....@gmail.com> 
>>>> wrote:
>>>>> 2015-11-10 15:30 GMT+03:00 Richard Biener <richard.guent...@gmail.com>:
>>>>>> On Tue, Nov 3, 2015 at 1:08 PM, Yuri Rumyantsev <ysrum...@gmail.com> 
>>>>>> wrote:
>>>>>>> Richard,
>>>>>>>
>>>>>>> It looks like misunderstanding - we assume that for GCCv6 the simple
>>>>>>> scheme of remainder will be used through introducing new IV :
>>>>>>> https://gcc.gnu.org/ml/gcc-patches/2015-09/msg01435.html
>>>>>>>
>>>>>>> Is it true or we missed something?
>>>>>>
>>>>>> <quote>
>>>>>>> > Do you have an idea how "masking" is better be organized to be usable
>>>>>>> > for both 4b and 4c?
>>>>>>>
>>>>>>> Do 2a ...
>>>>>> Okay.
>>>>>> </quote>
>>>>>
>>>>> 2a was 'transform already vectorized loop as a separate
>>>>> post-processing'. Isn't it what this prototype patch implements?
>>>>> Current version only masks loop body which is in practice applicable
>>>>> for AVX-512 only in the most cases.  With AVX-512 it's easier to see
>>>>> how profitable masking might be and it is a main target for the first
>>>>> masking version.  Extending it to prologues/epilogues and thus making
>>>>> it more profitable for other targets is the next step and is out of
>>>>> the scope of this patch.
>>>>
>>>> Ok, technically the prototype transforms the already vectorized loop.
>>>> Of course I meant the vectorized loop be copied, masked and that
>>>> result used as epilogue...
>>>>
>>>> I'll queue a more detailed look into the patch for this week.
>>>>
>>>> Did you perform any measurements with this patch like # of
>>>> masked epilogues in SPEC 2006 FP (and any speedup?)
>>>>
>>>> Thanks,
>>>> Richard.
>>>>
>>>>> Thanks,
>>>>> Ilya
>>>>>
>>>>>>
>>>>>> Richard.
>>>>>>

Attachment: patch
Description: Binary data

Reply via email to