On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > ... to determine an optimal number of threads per block given the number
> > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > would do that already?).
>
> I have implemented that for OpenMP offloading, but also since CUDA 6.0 there's
> cuOcc* (occupancy query) interface that allows to simply ask the driver about
> the per-function launch limit.

Sorry, I should have mentioned that CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
is indeed sufficient for limiting threads per block, which is trivially
translatable into workers per gang in OpenACC.  IMO it's also a cleaner
approach in this case, compared to iterative backoff (if, again, the
implementation is free to do that).

When mentioning cuOcc* I was thinking about finding an optimal number of
blocks per device, which is a different story.

Alexander