Hi Tom,

I see that you're reviewing the libgomp changes. Please disregard the following hunk:
On 07/11/2018 12:13 PM, Cesar Philippidis wrote:

> @@ -1199,12 +1202,59 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>  		       default_dims[GOMP_DIM_VECTOR]);
>      }
>    pthread_mutex_unlock (&ptx_dev_lock);
> +  int vectors = default_dims[GOMP_DIM_VECTOR];
> +  int workers = default_dims[GOMP_DIM_WORKER];
> +  int gangs = default_dims[GOMP_DIM_GANG];
> +
> +  if (nvptx_thread()->ptx_dev->driver_version > 6050)
> +    {
> +      int grids, blocks;
> +      CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
> +                        &blocks, function, NULL, 0,
> +                        dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
> +      GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
> +                         "grid = %d, block = %d\n", grids, blocks);
> +
> +      gangs = grids * dev_size;
> +      workers = blocks / vectors;
> +    }

I revisited this change yesterday and noticed that it was setting gangs incorrectly. Basically, gangs should be set as follows:

  gangs = grids * (blocks / warp_size);

or, to stay closer to og8:

  gangs = 2 * grids * (blocks / warp_size);

That magic constant 2 is there to prevent thread starvation; it's the same idea as make -j<2*#threads>.

Anyway, I'm still experimenting with that change. There are still some discrepancies between the way I select num_workers and the way the driver does. The driver appears to be a little more conservative, but according to the thread occupancy calculator, my selection should yield greater performance on GPUs. I just wanted to give you a heads up, because you seem to be working on this.

Thanks for all of your reviews! By the way, are you now the maintainer of the libgomp nvptx plugin?

Cesar
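
P.S. To make the gangs/workers selection above concrete, here's a rough, untested sketch of what I have in mind. choose_launch_dims is just an illustrative helper, not actual plugin code; I'm assuming warp_size and the kernel's CUfunction get passed in from the caller, and a blockSizeLimit of 0 simply means "no limit".

static void
choose_launch_dims (CUfunction function, int warp_size,
                    int *gangs, int *workers, int *vectors)
{
  int grids, blocks;

  /* Let the driver suggest a block size that maximizes occupancy.  */
  CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids, &blocks,
                    function, NULL, 0, 0);

  *vectors = warp_size;             /* One vector per warp.  */
  *workers = blocks / warp_size;    /* Warps per block.  */

  /* Oversubscribe gangs by 2x to avoid thread starvation, like
     make -j<2*#threads>.  */
  *gangs = 2 * grids * (blocks / warp_size);
}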