On Tue, Oct 18, 2016 at 07:58:49PM +0300, Alexander Monakov wrote:
> On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> > The performance I saw was lower by a factor of 80 or so compared to
> > their CUDA version, and even lower than OpenMP on the host.
>
> The currently published OpenMP version of LULESH simply doesn't use
> openmp-simd anywhere. This should make it obvious that it won't be
> anywhere near any reasonable CUDA implementation, and also bound to be
> below host performance. Besides, it's common for such benchmark suites
> to have very different levels of hand tuning for the native-CUDA
> implementation vs OpenMP implementation, sometimes to the point of
> significant algorithmic differences. So you're making an invalid
> comparison here.
This is related to the discussions about an independent clause/construct
(or whatever other name it ends up with). The problem with LULESH using
#pragma omp distribute parallel for rather than
#pragma omp distribute parallel for simd is that the loop bodies usually
call (inline) functions, and distribute parallel for, even with the
implementation-defined default for the schedule() clause, doesn't just
let the implementation choose whatever distribution between
teams/threads/SIMD it likes. For loops that don't call any functions we
can scan the loop body and figure out whether it could observe anything,
e.g. through various omp_* calls, that would reveal how the iterations
are distributed among teams/threads/SIMD; but for loops that can call
other functions that is hard to do, especially as early as during omp
lowering/expansion.

OpenMP 5.0 is likely going to have some clause or construct that simply
states the loop iterations are completely independent, but until then
the programmer uses more prescriptive pragmas and needs to be careful
about what exactly they ask for.

But certainly we should collect some OpenMP/OpenACC offloading
benchmarks, or write our own, and use them to compare GCC with other
compilers.
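To make the distinction concrete, here is a minimal sketch (not taken
from LULESH; update (), daxpy_offload () and N are made up for
illustration, using the OpenMP 4.5 combined constructs):

#include <omp.h>

#define N 100000

/* Hypothetical out-of-line helper: the compiler can't see its body
   during omp lowering/expansion, so it has to assume the call could
   observe the worksharing, e.g. via omp_get_thread_num ().  */
#pragma omp declare target
extern double update (double a, double b);
#pragma omp end declare target

void
daxpy_offload (double *restrict x, const double *restrict y, double alpha)
{
  /* Prescriptive: iterations are distributed across teams and threads
     only; the implementation may not re-map them onto SIMD lanes.  */
  #pragma omp target teams distribute parallel for \
	  map(tofrom: x[0:N]) map(to: y[0:N])
  for (int i = 0; i < N; i++)
    x[i] = update (x[i], alpha * y[i]);

  /* With an explicit simd the programmer additionally grants the use
     of SIMD lanes (vector lanes on the host, e.g. warp lanes on a
     GPU), so this loop can get much closer to a hand-written CUDA
     kernel.  */
  #pragma omp target teams distribute parallel for simd \
	  map(tofrom: x[0:N]) map(to: y[0:N])
  for (int i = 0; i < N; i++)
    x[i] = x[i] + alpha * y[i];
}

The first loop can't use SIMD lanes at all, and because update () isn't
visible, the compiler can't prove the distribution is unobservable and
relax that on its own.

Jakub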