Hi!

I'm looking into the existing vectorization functionality in GCC for the first time (yay!), and with that I'm also encountering GCC's scalar evolution (scev) machinery for the first time (yay!), along with the chains of recurrences (chrec) it uses (yay!).
Obviously, I'm right now doing my own reading and experimenting, but maybe somebody can cut that short, if my current question doesn't make much sense and is thus easily answered:

    int a[N_J][N_I];

    #pragma acc loop collapse(2)
    for (int j = 0; j < N_J; ++j)
      for (int i = 0; i < N_I; ++i)
        a[j][i] = 0;

Without "-fopenacc" (thus the pragma ignored), this does vectorize (for the x86_64 target, for example, without OpenACC code offloading). It also vectorizes with "-fopenacc" enabled if the "collapse(2)" clause is removed and another "#pragma acc loop" is added in front of the inner "i" loop instead.

But with the "collapse(2)" clause in effect, these two nested loops get, well, "collapse"d by omp-expand into one:

    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
      {
        int j = tmp / N_I;
        int i = tmp % N_I;
        a[j][i] = 0;
      }

This does not vectorize, because scalar evolution runs into unhandled (chrec_dont_know) TRUNC_DIV_EXPR and TRUNC_MOD_EXPR in gcc/tree-scalar-evolution.c:interpret_rhs_expression. Do I have a chance of teaching it to handle these without big effort?

If that's not reasonable, I shall look for other options to address the problem that vectorization currently gets pessimized by "-fopenacc", and in particular by the code rewriting for the "collapse" clause.

By the way, the problem can similarly be shown in an OpenMP example: when such a "collapse" clause is present, the inner loop's code no longer vectorizes there either. (But I've not considered that case in any more detail; Jakub CCed in case that's something to look into? Basically, I don't know how OpenMP threads' loop iterations are meant to interact with OpenMP SIMD.)

Hmm, and without any OpenACC/OpenMP etc., the same problem is actually also present when running the following code through the vectorizer:

    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
      {
        int j = tmp / N_I;
        int i = tmp % N_I;
        a[j][i] = 0;
      }

..., whereas the following variant (obviously) does vectorize:

    int a[N_J * N_I];

    for (int tmp = 0; tmp < N_J * N_I; ++tmp)
      a[tmp] = 0;

Hmm. Linearization. From a quick search, I found some 2010 work by Sebastian Pop on that topic, in the Graphite context (gcc/graphite-flattening.c), but that got pulled out again in 2012. (I have not yet looked up the history, and have not yet looked at whether that'd be relevant here at all; also, we're not using Graphite here.)

Regarding that, am I missing something obvious?


Grüße
 Thomas
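
P.S. In case somebody wants to poke at the loop shapes outside of the OpenACC context, here is a minimal standalone sketch. Everything beyond the loops quoted above is an assumption made for illustration: the concrete values of N_J and N_I, the function names, and the third variant, which carries "j" and "i" along and wraps them by hand instead of recomputing them with division and modulo. Each variant stores the linear index rather than 0, so that the index mapping itself can be compared. I have not checked whether scev or the vectorizer copes any better with the third form; the sketch only shows that all three touch the same elements in the same order.

```c
#include <string.h>

/* Hypothetical sizes for illustration; the mail above leaves them open.  */
#define N_J 7
#define N_I 13

/* Reference: the original nested loops.  */
void fill_nested(int a[N_J][N_I])
{
  for (int j = 0; j < N_J; ++j)
    for (int i = 0; i < N_I; ++i)
      a[j][i] = j * N_I + i;
}

/* Collapsed form: j and i are recomputed via division and modulo.  */
void fill_collapsed_divmod(int a[N_J][N_I])
{
  for (int tmp = 0; tmp < N_J * N_I; ++tmp)
    {
      int j = tmp / N_I;
      int i = tmp % N_I;
      a[j][i] = tmp;
    }
}

/* Collapsed form (hypothetical alternative): j and i are carried along
   and wrapped manually, so no division or modulo is needed.  */
void fill_collapsed_carried(int a[N_J][N_I])
{
  int j = 0, i = 0;
  for (int tmp = 0; tmp < N_J * N_I; ++tmp)
    {
      a[j][i] = tmp;
      if (++i == N_I)
        {
          i = 0;
          ++j;
        }
    }
}

/* Returns 1 if all three variants produce identical arrays, else 0.  */
int variants_agree(void)
{
  int x[N_J][N_I], y[N_J][N_I], z[N_J][N_I];

  /* Pre-fill with a poison pattern so untouched elements would show up.  */
  memset(x, 0xff, sizeof x);
  memset(y, 0xff, sizeof y);
  memset(z, 0xff, sizeof z);

  fill_nested(x);
  fill_collapsed_divmod(y);
  fill_collapsed_carried(z);

  return memcmp(x, y, sizeof x) == 0 && memcmp(x, z, sizeof x) == 0;
}
```

Calling variants_agree() from a small main() and compiling the file with the vectorizer's diagnostics enabled should make it easy to compare what the vectorizer reports for each of the three functions.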