On Thu, Nov 10, 2016 at 08:09:51PM +0300, Alexander Monakov wrote:
> I'd like to provide an overview of the gomp-nvptx branch status.  In
> response to this message I'll send two more emails, with libgomp and
> middle-end changes on the branch.  Some of the changes to libgomp such as
> build machinery adaptations have already received substantial comments in
> 2015, but the middle-end stuff is mostly unreviewed I believe.
>
> Middle-end changes mostly amount to adding SIMD-to-SIMT transforms in
> omp-low.c, as shown on the Cauldron.  SIMT outlining via gimplifier abuse
> is not there, and neither is cloning of SIMD/SIMT loops.  Outlining is
> required for correctness, and cloning is useful as it allows to avoid
> intermixing SIMD+SIMT and thus be sure that SIMT lowering does not 'dirty'
> SIMD loops and regress host/MIC vectorization.  I could argue that it's
> possible to improve my SIMT lowering to avoid some dirtying (like moving
> loop-invariant calls to GOMP_SIMT_VF()), but the need for outlining makes
> that moot anyway, I think.

Approved with small nits; only very few require immediate action, and the
rest can be handled incrementally once the changes are in.
Please work with Bernd on the config/nvptx bits.

> To get great performance this will need further changes everywhere,
> including in target-independent code, due to accidents like this bug
> (which I'd like to ping given the topic):
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68706

Do you or anyone else have suggestions on how to find the threshold at
which using just a global lock becomes preferable to many separate atomics?
In any case, we'd need to analyze all the operations to see whether we can
use atomics for them; if we need to use the lock for any of them, then using
it for all of them is probably better than many atomics + one GOMP_atomic_*
pair.  Then there is the case of user-defined reductions; we should try
harder to use atomics for them.

> With OpenMP/PTX offloading there are 5 additional failures in
> check-target-libgomp:
>
> Two due to tests using 'usleep' in a target region:
> FAIL: libgomp.c/target-32.c (test for excess errors)
> FAIL: libgomp.c/thread-limit-2.c (test for excess errors)

Could these be "solved" say by something like:

--- libgomp/testsuite/libgomp.c/target-32.c.jj	2015-11-14 19:38:31.000000000 +0100
+++ libgomp/testsuite/libgomp.c/target-32.c	2016-11-11 09:29:50.411072865 +0100
@@ -1,7 +1,20 @@
 #include <stdlib.h>
 #include <unistd.h>
+#include <omp.h>
 
-int main ()
+static inline void
+do_sleep (int cnt)
+{
+  int i;
+  if (omp_is_initial_device ())
+    usleep (cnt);
+  else
+    for (i = 0; i < 10 * cnt; i++)
+      asm volatile ("" : : : "memory");
+}
+
+int
+main ()
 {
   int a = 0, b = 0, c = 0, d[7];
 

plus folding omp_is_initial_device as a builtin in the offloading compiler
(which we want to do anyway, and a similar builtin is already folded for
OpenACC)?

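To make the lock-versus-atomics question above concrete, here is a minimal
sketch in plain OpenMP C, not what the compiler emits today.  The names
merge_atomic, merge_locked and sum_and_count are made up for illustration,
and the named critical section only stands in for a global lock; it is not
meant to describe how the GOMP_atomic_* pair is implemented.  Without
-fopenmp the pragmas are ignored and the code runs serially.

```c
/* Variant A: one atomic update per reduction variable.  */
static void
merge_atomic (long *sum, long *count, long psum, long pcount)
{
  #pragma omp atomic
  *sum += psum;
  #pragma omp atomic
  *count += pcount;
}

/* Variant B: a single lock (here a named critical section, as a
   stand-in for a global lock) around all the updates.  */
static void
merge_locked (long *sum, long *count, long psum, long pcount)
{
  #pragma omp critical (merge_totals)
  {
    *sum += psum;
    *count += pcount;
  }
}

/* Sum 1..n and count the terms.  Each thread accumulates privately and
   merges once; use_lock selects which merge variant is used.  */
static long
sum_and_count (int n, int use_lock, long *count_out)
{
  long sum = 0, count = 0;
  #pragma omp parallel
  {
    long psum = 0, pcount = 0;
    int i;
    #pragma omp for
    for (i = 1; i <= n; i++)
      {
	psum += i;
	pcount++;
      }
    if (use_lock)
      merge_locked (&sum, &count, psum, pcount);
    else
      merge_atomic (&sum, &count, psum, pcount);
  }
  *count_out = count;
  return sum;
}
```

With two variables, variant A costs two atomic operations per thread and
variant B one lock/unlock pair; the open question is at how many variables
B starts to win, and B becomes the only option as soon as one of the
updates cannot be expressed as an atomic.
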
> Two with 'target nowait' (not implemented)
> FAIL: libgomp.c/target-33.c execution test
> FAIL: libgomp.c/target-34.c execution test
>
> One with 'target link' (not implemented)
> FAIL: libgomp.c/target-link-1.c (test for excess errors)

Can you work on implementing these during stage3?

	Jakub