Hi!

On Tue, 19 Jan 2016 17:07:17 +0300, Alexander Monakov <amona...@ispras.ru> wrote:
> On Tue, 19 Jan 2016, Alexander Monakov wrote:
> > > ... to determine an optimal number of threads per block given the number
> > > of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> > > would do that already?).
> >
> > I have implemented that for OpenMP offloading, but also since CUDA 6.0
> > there's cuOcc* (occupancy query) interface that allows to simply ask the
> > driver about the per-function launch limit.

You mean you already have implemented something along the lines I proposed?

> Sorry, I should have mentioned that CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK is
> indeed sufficient for limiting threads per block, which is trivially
> translatable into workers per gang in OpenACC.

That's good to know, thanks!

> IMO it's also a cleaner
> approach in this case, compared to iterative backoff (if, again, the
> implementation is free to do that).

It is not explicitly spelled out in OpenACC 2.0a, but it got clarified in
OpenACC 2.5.  See "2.5.7. num workers clause": "[...] The implementation
may use a different value than specified based on limitations imposed by
the target architecture".

> When mentioning cuOcc* I was thinking about finding an optimal number of
> blocks per device, which is a different story. :-)


Grüße
 Thomas
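
PS: To make the "limit workers per gang" idea concrete, here is a minimal
sketch of the clamping step.  It assumes the per-function limit has already
been obtained from the driver (for example with cuFuncGetAttribute and
CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK), and that threads per block equals
workers times vector length.  The helper name clamp_num_workers and that
mapping are assumptions for illustration only, not necessarily what the
plugin does.

```c
#include <assert.h>

/* Hypothetical helper (not part of any existing plugin): clamp the requested
   number of workers per gang so that workers * vector_length stays within
   the per-function thread limit reported by the driver.

   ASSUMPTION: threads per block == workers * vector_length.

   requested <= 0 means "no num_workers clause"; in that case the largest
   permitted value is returned.  Returns 0 if not even one worker fits.  */
static int
clamp_num_workers (int requested, int vector_length,
                   int max_threads_per_block)
{
  assert (vector_length > 0);

  /* Largest worker count that still fits into one thread block.  */
  int limit = max_threads_per_block / vector_length;

  if (limit <= 0)
    return 0;
  if (requested <= 0)
    return limit;
  return requested < limit ? requested : limit;
}
```

With a vector length of 32, a request for 32 workers is kept when the limit
is 1024 threads, and reduced to 20 when the limit is 640 threads.  That is
the kind of adjustment the OpenACC 2.5 wording quoted above permits.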