> > But the vectorizer computes costs of a vector load of the off array,
> > 4x moving vector to scalar and 4x stores.  I wonder if generic code
> > can match this better and avoid the vector load of addresses when
> > open-coding gather/scatter?
>
> The vectorizer does not explicitly consider the lowered form of the
> emulated scatter when costing/code generating; instead it will actually
> emit the vector load for the off array and 4 element extracts from it.
> We could compensate for this, anticipating the followup optimization
> done by forwprop (split the vector load into scalar loads again), but
> of course code generating the loads in a different way would be better.

Yep...

> But then we'd cost 4 scalar loads here; with the current high load
> cost this might be worse overall (IIRC the vector extracts are costed
> quite cheap).  I see above vec_to_scalar is 20 - that looks quite
> high (possibly from our attempts to avoid those "some more"),
> scalar_load is 12 so it should indeed be a win, from 80 + 20
> to 48.  When we just "compensate" during scatter costing we'd
> replace 80 by nothing.
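To make the two lowerings concrete, here is a rough sketch in intrinsics
(illustrative only, not the actual GIMPLE; it assumes V4SI offsets as in
the testcase below, the function names are made up, and I am reading the
quoted "80 + 20" as four extracts at 20 plus the vector load at 20):

#include <immintrin.h>

/* What the vectorizer emits now for one group of four scatter
   elements: a vector load of the offsets plus four extracts.
   With the numbers quoted above that is 20 (vector load)
   + 4 * 20 (vec_to_scalar) = 100.  Needs SSE4.1 for pextrd.  */
static inline void
offsets_via_vector (const int *off, int i, int o[4])
{
  __m128i voff = _mm_loadu_si128 ((const __m128i *) &off[i]);
  o[0] = _mm_extract_epi32 (voff, 0);
  o[1] = _mm_extract_epi32 (voff, 1);
  o[2] = _mm_extract_epi32 (voff, 2);
  o[3] = _mm_extract_epi32 (voff, 3);
}

/* What forwprop rewrites that into, and what we could code generate
   directly: four scalar loads, 4 * 12 (scalar_load) = 48.  */
static inline void
offsets_via_scalar (const int *off, int i, int o[4])
{
  o[0] = off[i + 0];
  o[1] = off[i + 1];
  o[2] = off[i + 2];
  o[3] = off[i + 3];
}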
I think vec_to_scalar costing is a bit odd too.  Somewhere we should
take into account the sse->int move cost, which we currently don't, but
I want to look into that incrementally (it will likely make vec_to_scalar
more expensive on Zens, so even worse for this testcase).  Here indeed
I think the main problem is that we account for it at all.

> > I run into the same issue when trying to cost correctly the sse->int
> > and int->sse conversions.
> >
> > Bootstrapped/regtested x86_64-linux.  OK?  I can xfail the testcase...
>
> I think we should fix this but it would be OK to regress in the interim,
> so I'd say leave it FAILing and open a regression bugreport?  In some
> way the testcase wants to verify we are not using 32 byte vectors here.
> I did not try to measure whether the current SSE vectorization is faster
> than not vectorizing ... maybe not vectorizing this is even better.
> Can you possibly check?

I turned it into a micro-benchmark:

/* { dg-do compile } */
/* { dg-options "-O3 -mavx2 -mno-avx512f -fdump-tree-vect-details" } */

__attribute__ ((noipa))
void foo (int n, int *off, double *a)
{
  const int m = 32;
  for (int j = 0; j < n/m; ++j)
    {
      int const start = j*m;
      int const end = (j+1)*m;
#pragma GCC ivdep
      for (int i = start; i < end; ++i)
	{
	  a[off[i]] = a[i] < 0 ? a[i] : 0;
	}
    }
}

int
main()
{
  double a[1000];
  int off[1000];
  for (int i = 0; i < 1000; i++)
    a[i] = i, off[i] = (i * 3) % 1000;
  for (int i = 0; i < 10000000; i++)
    foo (1000, off, a);
  return 0;
}

/* Make sure the cost model selects SSE vectors rather than AVX to avoid
   too many scalar ops for the address computes in the loop.  */
/* { dg-final { scan-tree-dump "loop vectorized using 16 byte vectors" "vect" { target { ! ia32 } } } } */

On znver5 I get:

jh@shroud:~/trunk/build2/gcc> gcc -O3 -mavx2 -mno-avx512f b.c ; perf stat ./a.out

 Performance counter stats for './a.out':

          2,184.15 msec task-clock:u              #    1.000 CPUs utilized
     9,016,958,923      cycles:u                  #    4.128 GHz
       234,727,850      stalled-cycles-frontend:u #    2.60% frontend cycles idle
    31,500,139,992      instructions:u            #    3.49  insn per cycle
       350,031,235      branches:u                #  160.260 M/sec

       2.184782094 seconds time elapsed

jh@shroud:~/trunk/build2/gcc> gcc -O3 -mavx2 -mno-avx512f -fno-tree-vectorize b.c ; perf stat ./a.out

 Performance counter stats for './a.out':

          2,978.40 msec task-clock:u              #    1.000 CPUs utilized
    12,296,864,457      cycles:u                  #    4.129 GHz
       632,728,474      stalled-cycles-frontend:u #    5.15% frontend cycles idle
    91,640,149,097      instructions:u            #    7.45  insn per cycle
    10,270,032,348      branches:u                #    3.448 G/sec

       2.979118870 seconds time elapsed

So vectorization is a win here... I will xfail the testcase and open a
regression bugreport.  Indeed I think this is quite a common case that
we ought to handle better (though I do not quite know how to plumb that
into the vectorizer).

Thanks!
Honza
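P.S. For the incremental vec_to_scalar change, what I have in mind is
roughly the following fragment inside ix86_vector_costs::add_stmt_cost
(a sketch only; the guard is simplified and untested, so take the
details with a grain of salt):

/* An integer-element vec_to_scalar materializes as a pextr*,
   i.e. an SSE->integer register move, so charge the cost table's
   sse_to_integer on top of the extract itself.  */
if (kind == vec_to_scalar
    && vectype
    && GET_MODE_CLASS (TYPE_MODE (TREE_TYPE (vectype))) == MODE_INT)
  stmt_cost += ix86_cost->sse_to_integer;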