> > But vectorizer computes costs of vector load of off array, 4x moving vector
> > to scalar and 4x stores.  I wonder if generic code can match this better
> > and avoid the vector load of addresses when open-coding gather/scatter?
> 
> The vectorizer does not explicitly consider the lowered form of the 
> emulated scatter when costing/code generation, instead it will actually
> emit the vector load for the off array and 4 element extracts from it.
> We could compensate for this, anticipating the followup optimization
> done by forwprop (split the vector load into scalar loads again), but
> of course code generating the loads in a different way would be better.
Yep...
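
For reference, roughly what the two lowerings look like at C level for the
4-lane case discussed above (just a sketch with made-up names and generic
vector types, not the GIMPLE the vectorizer actually emits):

typedef int v4si __attribute__ ((vector_size (16)));
typedef double v4df __attribute__ ((vector_size (32)));

/* What is costed/emitted today: a vector load of the offsets plus four
   vec_to_scalar extracts feeding the scalar stores.  */
void
scatter_current (double *a, int *off, v4df tem, int i)
{
  v4si voff = *(v4si *) &off[i];
  a[voff[0]] = tem[0];
  a[voff[1]] = tem[1];
  a[voff[2]] = tem[2];
  a[voff[3]] = tem[3];
}

/* What forwprop later turns that into, and what we would rather cost (and
   ideally generate) directly: four scalar loads of the offsets.  */
void
scatter_open_coded (double *a, int *off, v4df tem, int i)
{
  a[off[i + 0]] = tem[0];
  a[off[i + 1]] = tem[1];
  a[off[i + 2]] = tem[2];
  a[off[i + 3]] = tem[3];
}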
> But then we'd cost 4 scalar loads here, with the current high load
> cost this might be worse overall (IIRC the vector extracts are costed
> quite cheap).  I see above vec_to_scalar is 20 - that looks quite
> high (possibly from our attempts to avoid those "some more"),
> scalar_load is 12 so it should be indeed a win, from 80 + 20
> to 48.  When we just "compensate" during scatter costing we'd
> replace 80 by nothing.

I think vec_to_scalar costing is a bit odd too.  Somewhere we should take
into account the sse->int move cost that we don't, but I want to look
into that incrementally (which will likely make it more expensive on
zens so even worse for this testcase).  Here indeed I think the main
problem is that we account it at all.
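
(For reference, my reading of the numbers above, per 4-lane scatter, assuming
the extra 20 is the vector load of off:

   current:     vector load + 4 * vec_to_scalar (20)  = 20 + 80 = 100
   open-coded:  4 * scalar_load (12)                  =           48

so costing and generating the scalar loads directly should indeed win, while
only compensating during scatter costing drops the 80 and keeps the 20.)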
> 
> > I ran into the same issue when trying to correctly cost the sse->int and
> > int->sse conversions.
> > 
> > Bootstrapped/regtested x86_64-linux.  OK?  I can xfail the testcase...
> 
> I think we should fix this but it would be OK to regress in the interim,
> so I'd say leave it FAILing and open a regression bugreport?  In some
> way the testcase wants to verify we are not using 32 byte vectors here,
> I did not try to measure whether the current SSE vectorization is faster
> than not vectorizing ... maybe not vectorizing this is even better.
> Can you possibly check?

I turned it into a micro-benchmark:
/* { dg-do compile } */
/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" } */

__attribute__ ((noipa))
void foo (int n, int *off, double *a)
{
  const int m = 32;

  for (int j = 0; j < n/m; ++j)
    {
      int const start = j*m;
      int const end = (j+1)*m;

#pragma GCC ivdep
      for (int i = start; i < end; ++i)
        {
          a[off[i]] = a[i] < 0 ? a[i] : 0;
        }
    }
}

int
main()
{
  double a[1000];
  int off[1000];
  for (int i = 0; i < 1000; i++)
    a[i] = i, off[i] = (i * 3) % 1000;
  for (int i = 0; i < 10000000; i++)
    foo (1000, off, a);
  return 0;
}

/* Make sure the cost model selects SSE vectors rather than AVX to avoid
   too many scalar ops for the address computes in the loop.  */
/* { dg-final { scan-tree-dump "loop vectorized using 16 byte vectors" "vect" { target { ! ia32 } } } } */

On znver5 I get:

jh@shroud:~/trunk/build2/gcc> gcc -O3 -mavx2 -mno-avx512f b.c ; perf stat ./a.out

 Performance counter stats for './a.out':

          2,184.15 msec task-clock:u                     #    1.000 CPUs utilized
     9,016,958,923      cycles:u                         #    4.128 GHz
       234,727,850      stalled-cycles-frontend:u        #    2.60% frontend cycles idle
    31,500,139,992      instructions:u                   #    3.49  insn per cycle
       350,031,235      branches:u                       #  160.260 M/sec

       2.184782094 seconds time elapsed


jh@shroud:~/trunk/build2/gcc> gcc -O3 -mavx2 -mno-avx512f -fno-tree-vectorize b.c ; perf stat ./a.out

 Performance counter stats for './a.out':

          2,978.40 msec task-clock:u                     #    1.000 CPUs utilized
    12,296,864,457      cycles:u                         #    4.129 GHz
       632,728,474      stalled-cycles-frontend:u        #    5.15% frontend cycles idle
    91,640,149,097      instructions:u                   #    7.45  insn per cycle
    10,270,032,348      branches:u                       #    3.448 G/sec

       2.979118870 seconds time elapsed

So vectorization is a win here (about 2.18s vs. 2.98s, roughly 27% fewer
cycles)...

I will xfail it and open a regression bugreport.  Indeed I think this is
quite a common case that we ought to handle better (though I do not quite
know how to plumb that into the vectorizer yet).

Thanks!
Honza
