On Tue, 19 Jan 2016, Thomas Schwinge wrote:

> Hi!
> 
> With nvptx offloading, in one OpenACC test case, we're running into the
> following fatal error (GOMP_DEBUG=1 output):
> 
>     [...]
>     info    : Function properties for 'LBM_performStreamCollide$_omp_fn$0':
>     info    : used 87 registers, 0 stack, 8 bytes smem, 328 bytes cmem[0], 80 bytes cmem[2], 0 bytes lmem
>     [...]
>       nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=32, vectors=32
>     
>     libgomp: cuLaunchKernel error: too many resources requested for launch
> 
> Very likely this means that the number of registers used in this function
> ("used 87 registers"), multiplied by the thread block size (workers *
> vectors, "workers=32, vectors=32"), exceeds the hardware maximum.

Yes, today most CUDA GPUs allow 64K registers per block, and some allow only
32K, so 87*32*32 definitely overflows that limit.  A reference is available in
the CUDA C Programming Guide, appendix G, table 13:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
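
As an illustration of that budget, here's roughly how a plugin could check it
up front before launching -- an untested sketch; cuFuncGetAttribute,
cuDeviceGetAttribute and the two attributes are the real driver API, while the
check_reg_budget helper itself is made up for the example:

    #include <cuda.h>
    #include <stdio.h>

    /* Hypothetical helper (error checking omitted): does a planned block size
       fit into the per-block register budget of the active device?  */
    static int
    check_reg_budget (CUdevice dev, CUfunction fn, int threads_per_block)
    {
      int regs_per_thread, regs_per_block;

      /* Registers the PTX JIT allocated per thread ("used 87 registers").  */
      cuFuncGetAttribute (&regs_per_thread, CU_FUNC_ATTRIBUTE_NUM_REGS, fn);
      /* Hardware limit: 64K registers on most current devices, 32K on some.  */
      cuDeviceGetAttribute (&regs_per_block,
                            CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK, dev);

      /* Here: 87 * (32 * 32) = 89088 > 65536, hence the launch failure.  */
      if (regs_per_thread * threads_per_block > regs_per_block)
        {
          fprintf (stderr, "block of %d threads needs %d registers, "
                   "device allows %d\n", threads_per_block,
                   regs_per_thread * threads_per_block, regs_per_block);
          return 0;
        }
      return 1;
    }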
 
> (One problem certainly might be that we're currently not doing any
> register allocation for nvptx, as far as I remember based on the idea
> that PTX is only a "virtual ISA", and the PTX JIT compiler would "fix
> this up" for us -- which I'm not sure it actually is doing?)

(well, if you want I can point out that
 1) GCC never emits launch bounds, so the PTX JIT has to guess the limits --
 that's something I'd like to play with in the future, time permitting (see
 the sketch below this list for what such bounds look like)
 2) OpenACC register copying at forks increases (pseudo-)register pressure
 3) I think if you inspect the PTX code you'll see it uses far more than 87 regs)
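
To illustrate point 1: at the CUDA C level, launch bounds are spelled as below
(the kernel itself is made up); nvcc lowers the annotation to the .maxntid and
.minnctapersm directives on the PTX .entry, which is exactly the information
the PTX JIT otherwise has to guess:

    /* Illustration only -- not something GCC emits today.  */
    __global__ void
    __launch_bounds__ (1024, /* at most 1024 threads per block */
                       2)    /* at least 2 resident blocks per SM */
    example_kernel (float *data)
    {
      /* The annotation becomes ".maxntid 1024, 1, 1" and ".minnctapersm 2"
         in the generated PTX, letting the JIT bound its register allocation
         accordingly.  */
      data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
    }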

As for the proposed patch, does the OpenACC spec leave the implementation the
freedom to spawn a different number of workers than requested?  (honest
question -- I haven't looked at the spec that closely)

> Alternatively/additionally, we could try experimenting with using the
> following of enum CUjit_option "Online compiler and linker options":
[snip]
> ..., to have the PTX JIT reduce the number of live registers (if
> possible; I don't know), and/or could try experimenting with querying the
> active device, enum CUdevice_attribute "Device properties":
> 
>     [...]
>     CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK = 12
>         Maximum number of 32-bit registers available per block 
>     [...]
> 
> ..., and use that in combination with each function's enum
> CUfunction_attribute "Function properties":
[snip]
> ... to determine an optimal number of threads per block given the number
> of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
> would do that already?).

I have implemented that for OpenMP offloading, but note that since CUDA 6.0
there's also the cuOccupancy* (occupancy query) interface, which lets you
simply ask the driver about the per-function launch limit.
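
Roughly along these lines (untested sketch: cuOccupancyMaxPotentialBlockSize
and CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK are the real driver-API pieces,
the choose_block_size wrapper and its capping policy are just illustration):

    #include <cuda.h>

    /* Hypothetical helper: pick a block size the driver says is actually
       launchable for this function, capping the requested workers*vectors
       product.  */
    static int
    choose_block_size (CUfunction fn, int requested_threads)
    {
      int min_grid_size, suggested, hard_limit;

      /* Occupancy-based suggestion; accounts for the function's register and
         shared memory usage on the active device.  */
      cuOccupancyMaxPotentialBlockSize (&min_grid_size, &suggested, fn,
                                        NULL /* no dynamic smem callback */,
                                        0    /* dynamic smem per block */,
                                        0    /* no upper bound on block size */);

      /* The hard per-function limit can also be queried directly.  */
      cuFuncGetAttribute (&hard_limit,
                          CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK, fn);
      if (suggested > hard_limit)
        suggested = hard_limit;

      return requested_threads < suggested ? requested_threads : suggested;
    }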

Thanks.
Alexander
