Re: [RFA] Zen tuning part 9: Add support for scatter/gather in vectorizer costmodel

Richard Biener Thu, 19 Oct 2017 01:06:12 -0700

On Thu, 19 Oct 2017, Jan Hubicka wrote:

> Hi,
> this is proof of concept patch for vectorizer costs to use costs used for 
> rtx_cost
> and register_move_cost which are readily available in ix86_costs instead of 
> using
> its own set of random values.  At least until we have proof of evidence that 
> vectroizer
> costs needs to differ, I do not think we want to complicate CPU tuning by 
> having them
> twice.
> 
> This is of course quite intrusive change to what we have becuase it affects 
> all
> x86 targets.  I have finally worked out that the "random" values used by AMD 
> target
> corresponds to latencies of bdver1.
> 
> I have benchmarked them on Zen and also temporarily patches Czerny (Haswel).
> It seems to cause no regression and quite nice improvements:
>   - 27.3% for facerec on Zen
>   - 7% for mgrid on Haswel
>   - maybe 1% for galgel of Haswell
>   - 3% for facerec on Haswell
>   - maybe 1% aspi on Haswell
>   - there may be small off-noise improvement for rnflow and regression for 
> fatigue2 on Haswell
> 
> So I would say that outcome is surprisingly good (especially due to lack of
> noteworthy regressions).  I also know that vectorizer hurts performance on 
> Zen and
> Mesa/tonto benchmarks which is not cured by this patch alone.
> 
> There is testsuite fallout though.
> 
> ./testsuite/g++/g++.sum:FAIL: g++.dg/vect/slp-pr56812.cc  -std=c++11  
> scan-tree-dump-times slp1 "basic block vectorized" 1 (found 0 times)
> ./testsuite/g++/g++.sum:FAIL: g++.dg/vect/slp-pr56812.cc  -std=c++14  
> scan-tree-dump-times slp1 "basic block vectorized" 1 (found 0 times)
> ./testsuite/g++/g++.sum:FAIL: g++.dg/vect/slp-pr56812.cc  -std=c++98  
> scan-tree-dump-times slp1 "basic block vectorized" 1 (found 0 times)
> 
>   Here we vectorize the loop before first while originally we unrolled and 
> SLP vectorized next
> 
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_1.c 
> scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_2.c 
> scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_3.c 
> scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_4.c 
> scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_5.c 
> scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_double_6.c 
> scan-assembler-times vfmadd[123]+sd 56 (found 32 times)
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_1.c 
> scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_2.c 
> scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_3.c 
> scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_4.c 
> scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_5.c 
> scan-assembler-times vfmadd[123]+ss 120 (found 64 times)
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/l_fma_float_6.c 
> scan-assembler-times vfnmsub[123]+ss 120 (found 64 times)
> 
> And friends, clearly we do not vectorize all loops, I did not look into 
> details yet
> 
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/pr61403.c scan-assembler blend
> 
> Here again we vectorize loop while originally we did SLP.  I am not sure why 
> loop
> vectorizer does not use blend.
> 
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/pr79683.c scan-assembler-times 
> padd 1 (found 0 times)
> 
> Here we are supposed to vectorize two integer additions, but since generic 
> cost model now claims that
> latency of vector add is twice of integer add we don't.  I think it makes 
> sense.
> 
> ./testsuite/gcc/gcc.sum:FAIL: gcc.target/i386/pr79723.c scan-assembler 
> mov[au]p.[ \t][^,]+, %gs:
> 
> Similarly here.
> 
> If it seems to make sense, I will clean it up (remove now unused entries and 
> scale
> conditional costs by COSTS_N_INSNS) and fix the tessuite fallout.


Please look at the testsuite fallout in detail.  Note that only
testcases that do not disable the cost model should be affected
(all vect.exp testcases disable the cost model for example).

The patch itself looks mostly good, I suppose if we also have
separate costs for float vs. double you could do a bit better
by looking at the type in more detail.  I think vectype should
be non-NULL most of the time - but for example for the scalar cost
it might not be always there, likewise for SLP vectorization the
scalar cost calls will not have the type information so you'll
get a mixup between integer and FP costing scalar vs. vector.
Nothing that cannot be fixed on the vectorizer side, but ...
(we'd document that for scalar costs we pass the scalar type
for example).

Richard.

> Honza
> 
> Index: i386.c
> ===================================================================
> --- i386.c    (revision 253824)
> +++ i386.c    (working copy)
> @@ -44015,50 +44015,56 @@ static int
>  ix86_builtin_vectorization_cost (enum vect_cost_for_stmt type_of_cost,
>                                   tree vectype, int)
>  {
> +  bool fp = false;
> +  if (vectype != NULL)
> +    fp = FLOAT_TYPE_P (vectype);
> +
>    switch (type_of_cost)
>      {
>        case scalar_stmt:
> -        return ix86_cost->scalar_stmt_cost;
> +        return fp ? ix86_cost->addss : COSTS_N_INSNS (1);
>  
>        case scalar_load:
> -        return ix86_cost->scalar_load_cost;
> +        return COSTS_N_INSNS (fp ? ix86_cost->sse_load[0]
> +                           : ix86_cost->int_load [2]) / 2;
>  
>        case scalar_store:
> -        return ix86_cost->scalar_store_cost;
> +        return COSTS_N_INSNS (fp ? ix86_cost->sse_store[0]
> +                           : ix86_cost->int_store [2]) / 2;
>  
>        case vector_stmt:
> -        return ix86_cost->vec_stmt_cost;
> +        return fp ? ix86_cost->addss : ix86_cost->sse_op;
>  
>        case vector_load:
> -        return ix86_cost->vec_align_load_cost;
> +        return COSTS_N_INSNS (ix86_cost->sse_load[2]) / 2;
>  
>        case vector_store:
> -        return ix86_cost->vec_store_cost;
> +        return COSTS_N_INSNS (ix86_cost->sse_store[2]) / 2;
>  
>        case vec_to_scalar:
> -        return ix86_cost->vec_to_scalar_cost;
> -
>        case scalar_to_vec:
> -        return ix86_cost->scalar_to_vec_cost;
> +        return ix86_cost->sse_op;
>  
>        case unaligned_load:
> -      case unaligned_store:
>        case vector_gather_load:
> +        return COSTS_N_INSNS (ix86_cost->sse_load[2]) / 2;
> +
> +      case unaligned_store:
>        case vector_scatter_store:
> -        return ix86_cost->vec_unalign_load_cost;
> +        return COSTS_N_INSNS (ix86_cost->sse_store[2]) / 2;
>  
>        case cond_branch_taken:
> -        return ix86_cost->cond_taken_branch_cost;
> +        return COSTS_N_INSNS (ix86_cost->cond_taken_branch_cost);
>  
>        case cond_branch_not_taken:
> -        return ix86_cost->cond_not_taken_branch_cost;
> +        return COSTS_N_INSNS (ix86_cost->cond_not_taken_branch_cost);
>  
>        case vec_perm:
>        case vec_promote_demote:
> -        return ix86_cost->vec_stmt_cost;
> +        return ix86_cost->sse_op;
>  
>        case vec_construct:
> -     return ix86_cost->vec_stmt_cost * (TYPE_VECTOR_SUBPARTS (vectype) - 1);
> +     return ix86_cost->sse_op * (TYPE_VECTOR_SUBPARTS (vectype) - 1);
>  
>        default:
>          gcc_unreachable ();
> 
> 

-- 
Richard Biener <rguent...@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 
21284 (AG Nuernberg)

Re: [RFA] Zen tuning part 9: Add support for scatter/gather in vectorizer costmodel

Reply via email to