On Mon, Oct 15, 2018 at 11:30:56AM +0200, Richard Biener wrote:
> But isn't _actual_ collapsing an implementation detail?

No, it is required by the standard and in many cases it is very much
observable.

  #pragma omp parallel for schedule(nonmonotonic: static, 23) collapse (2)
  for (int i = 0; i < 64; i++)
    for (int j = 0; j < 16; j++)
      a[i][j] = omp_get_thread_num ();

The standard says that from the logical iteration space 64 x 16, the first
23 iterations go to the first thread (i.e. i=0, j=0..15 and i=1, j=0..6),
then the next 23 iterations go to the second thread, etc.

In other constructs, e.g. the new loop construct, it is a request to
distribute, parallelize and vectorize as much as possible, with an optional
guarantee of no cross-iteration dependencies at all.  But even in that case
using the source loops might not always be a win, e.g. the loop nest could
be 5 loops deep and the iteration space might be diagonal or otherwise not
exactly rectangular.

> That is, can we delay the actual collapsing until after vectorization
> for example?

No.  We can come up with some way to propagate some of the original info
to the vectorizer if it helps (or teach the vectorizer to recognize whatever
we produce), but the mandatory transformation needs to be done immediately,
before optimizations make it impossible.

	Jakub
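
For illustration, the chunk assignment described above can be sketched in plain C without OpenMP. This is a minimal sketch, not what GCC emits: the helper names (logical_iteration, owner_thread, print_chunks) are hypothetical, the sizes 64, 16 and 23 come from the example, and it assumes the usual static schedule rule that chunks are handed to threads round-robin in thread-number order.

```c
#include <stdio.h>

/* Sizes from the example: a 64 x 16 loop nest, chunk size 23. */
enum { N_I = 64, N_J = 16, CHUNK = 23 };

/* Logical iteration number of (i, j) in the collapsed iteration space. */
int logical_iteration(int i, int j)
{
    return i * N_J + j;
}

/* Thread that owns iteration (i, j) under schedule(static, 23):
   chunks of CHUNK consecutive logical iterations are assigned
   round-robin to the threads (assumed rule, see lead-in). */
int owner_thread(int i, int j, int nthreads)
{
    return (logical_iteration(i, j) / CHUNK) % nthreads;
}

/* Print the first and last (i, j) pair of the first nchunks chunks. */
void print_chunks(int nchunks, int nthreads)
{
    int total = N_I * N_J;
    for (int c = 0; c < nchunks && c * CHUNK < total; c++) {
        int first = c * CHUNK;
        int last = first + CHUNK - 1;
        if (last >= total)          /* the final chunk may be short */
            last = total - 1;
        printf("chunk %d -> thread %d: (i=%d, j=%d) .. (i=%d, j=%d)\n",
               c, c % nthreads, first / N_J, first % N_J,
               last / N_J, last % N_J);
    }
}
```

With any team of at least two threads, the first chunk ends at i=1, j=6 and the second starts at i=1, j=7. A chunk boundary therefore falls in the middle of a row of the inner loop, which is why the result stored in a[i][j] makes the collapsing observable.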