On Tue, 19 Jan 2016, Thomas Schwinge wrote:

> Hi!
>
> With nvptx offloading, in one OpenACC test case, we're running into the
> following fatal error (GOMP_DEBUG=1 output):
>
>     [...]
>     info    : Function properties for 'LBM_performStreamCollide$_omp_fn$0':
>     info    : used 87 registers, 0 stack, 8 bytes smem, 328 bytes cmem[0],
>               80 bytes cmem[2], 0 bytes lmem
>     [...]
>     nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32,
>                 workers=32, vectors=32
>
>     libgomp: cuLaunchKernel error: too many resources requested for launch
>
> Very likely this means that the number of registers used in this function
> ("used 87 registers"), multiplied by the thread block size (workers *
> vectors, "workers=32, vectors=32"), exceeds the hardware maximum.
Yes, today most CUDA GPUs allow 64K registers per block, and some allow
32K, so 87*32*32 definitely overflows that limit.  A reference is
available in the CUDA C Programming Guide, appendix G, table 13:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

> (One problem certainly might be that we're currently not doing any
> register allocation for nvptx, as far as I remember based on the idea
> that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
> this up" for us -- which I'm not sure it actually is doing?)

(Well, if you want, I can point out that:

  1) GCC never emits launch bounds, so the PTX JIT has to guess the
     limits -- that's something I'd like to play with in the future,
     time permitting;
  2) OpenACC register copying at forks increases (pseudo-)register
     pressure;
  3) I think if you inspect the PTX code you'll see it uses way more
     than 87 registers.)

As for the proposed patch, does the OpenACC spec leave the implementation
the freedom to spawn a different number of workers than requested?
(Honest question -- I didn't look at the spec that closely.)

> Alternatively/additionally, we could try experimenting with using the
> following of enum CUjit_option "Online compiler and linker options":

[snip]

> ..., to have the PTX JIT reduce the number of live registers (if
> possible; I don't know), and/or could try experimenting with querying
> the active device, enum CUdevice_attribute "Device properties":
>
>     [...]
>     CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK = 12
>         Maximum number of 32-bit registers available per block
>     [...]
>
> ..., and use that in combination with each function's enum
> CUfunction_attribute "Function properties":

[snip]

> ... to determine an optimal number of threads per block given the number
> of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> would do that already?).
I have implemented that for OpenMP offloading; but also, since CUDA 6.0
there's the cuOcc* (occupancy query) interface, which allows one to
simply ask the driver about the per-function launch limit.

Thanks.
Alexander