Hi!

With nvptx offloading, in one OpenACC test case, we're running into the
following fatal error (GOMP_DEBUG=1 output):

    [...]
    info    : Function properties for 'LBM_performStreamCollide$_omp_fn$0':
    info    : used 87 registers, 0 stack, 8 bytes smem, 328 bytes cmem[0], 80 bytes cmem[2], 0 bytes lmem
    [...]
      nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=32, vectors=32
    
    libgomp: cuLaunchKernel error: too many resources requested for launch

Very likely this means that the number of registers used in this function
("used 87 registers"), multiplied by the thread block size (workers *
vectors; "workers=32, vectors=32"), exceeds the hardware's limit of
registers per thread block.

(One problem certainly might be that we're currently not doing any
register allocation for nvptx -- as far as I remember, based on the idea
that PTX is only a "virtual ISA" and that the PTX JIT compiler would "fix
this up" for us, which I'm not sure it actually does.)

Below I'm posting a prototype patch which makes the execution run
successfully:

    [...]
      nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=32, vectors=32
        cuLaunchKernel: CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES; retrying with reduced number of workers
      nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: launch gangs=32, workers=16, vectors=32
      nvptx_exec: kernel LBM_performStreamCollide$_omp_fn$0: finished
    [...]

As -- I think -- the maximum number of registers per thread block is a
fixed property of the hardware, it would be good to remember the reduced
dims[GOMP_DIM_WORKER] for later launches of the same kernel (which my
patch doesn't do).
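
A minimal sketch of what that could look like -- assuming a new, currently
non-existent "unsigned max_workers" field in the plugin's per-function
descriptor that targ_fn points to; untested:

    /* Before the cuLaunchKernel call: clamp to a worker count that is
       already known to work for this kernel; max_workers == 0 means "not
       yet known".  */
    if (targ_fn->max_workers != 0
        && dims[GOMP_DIM_WORKER] > targ_fn->max_workers)
      dims[GOMP_DIM_WORKER] = targ_fn->max_workers;

    /* ... cuLaunchKernel, including the back-off retrying ... */

    /* After a successful launch: remember the (possibly reduced) worker
       count for subsequent launches of the same kernel.  */
    targ_fn->max_workers = dims[GOMP_DIM_WORKER];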

Alternatively/additionally, we could try experimenting with the following
enum CUjit_option "Online compiler and linker options":

    CU_JIT_MAX_REGISTERS = 0
        Max number of registers that a thread may use.
        Option type: unsigned int
        Applies to: compiler only
    CU_JIT_THREADS_PER_BLOCK
        IN: Specifies minimum number of threads per block to target
        compilation for
        OUT: Returns the number of threads the compiler actually targeted.
        This restricts the resource utilization of the compiler (e.g. max
        registers) such that a block with the given number of threads
        should be able to launch based on register limitations.  Note, this
        option does not currently take into account any other resource
        limitations, such as shared memory utilization.  Cannot be combined
        with CU_JIT_TARGET.
        Option type: unsigned int
        Applies to: compiler only
    [...]

..., to have the PTX JIT reduce the number of registers used (if that is
possible at all; I don't know), and/or we could try experimenting with
querying the active device's enum CUdevice_attribute "Device properties":

    [...]
    CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK = 12
        Maximum number of 32-bit registers available per block 
    [...]

..., and use that in combination with each function's enum
CUfunction_attribute "Function properties":

    CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK = 0
        The maximum number of threads per block, beyond which a launch of
        the function would fail.  This number depends on both the function
        and the device on which the function is currently loaded.
    [...]
    CU_FUNC_ATTRIBUTE_NUM_REGS = 4
        The number of registers used by each thread of this function. 
    [...]

... to determine an optimal number of threads per block given the number
of registers (maybe just querying CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK
would do that already?); a rough sketch of that query-based variant is
shown below.  All these options, however, are more complicated than the
simple "back-off" approach of the prototype patch at the end of this
mail.
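
To illustrate the querying approach (completely untested; cuCtxGetDevice,
cuFuncGetAttribute, and cuDeviceGetAttribute and the attributes are as
documented above, and I'm reusing nvptx_exec's locals as visible in the
patch below, but the clamping logic is just my guess at what we'd want):

    CUdevice dev;
    int regs_per_thread, regs_per_block, max_threads;

    r = cuCtxGetDevice (&dev);
    if (r != CUDA_SUCCESS)
      GOMP_PLUGIN_fatal ("cuCtxGetDevice error: %s", cuda_error (r));
    r = cuFuncGetAttribute (&regs_per_thread, CU_FUNC_ATTRIBUTE_NUM_REGS,
                            function);
    if (r != CUDA_SUCCESS)
      GOMP_PLUGIN_fatal ("cuFuncGetAttribute error: %s", cuda_error (r));
    r = cuDeviceGetAttribute (&regs_per_block,
                              CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK,
                              dev);
    if (r != CUDA_SUCCESS)
      GOMP_PLUGIN_fatal ("cuDeviceGetAttribute error: %s", cuda_error (r));
    r = cuFuncGetAttribute (&max_threads,
                            CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK,
                            function);
    if (r != CUDA_SUCCESS)
      GOMP_PLUGIN_fatal ("cuFuncGetAttribute error: %s", cuda_error (r));

    /* Threads per block as limited by the per-block register budget...  */
    if (regs_per_thread > 0
        && max_threads > regs_per_block / regs_per_thread)
      max_threads = regs_per_block / regs_per_thread;
    /* ... and reduce the number of workers accordingly, keeping the
       vector length as it is.  */
    while (dims[GOMP_DIM_WORKER] > 1
           && dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR] > max_threads)
      dims[GOMP_DIM_WORKER] /= 2;

(If CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK indeed already accounts for
the function's register usage, the CU_FUNC_ATTRIBUTE_NUM_REGS /
CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK part would be redundant.)

Anyway, here is the simple "back-off" prototype patch: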

commit bb0bf9e50026feabe877c9d8174e78c021b002a4
Author: Thomas Schwinge <tho...@codesourcery.com>
Date:   Tue Jan 19 12:31:27 2016 +0100

    [nvptx] Try to cope with cuLaunchKernel returning CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
---
 gcc/gimple-fold.c             |    7 +++++++
 gcc/tree-vrp.c                |    1 +
 libgomp/plugin/plugin-nvptx.c |   28 ++++++++++++++++++++--------
 3 files changed, 28 insertions(+), 8 deletions(-)

diff --git gcc/gimple-fold.c gcc/gimple-fold.c
index a0e7b7e..e75c58e 100644
--- gcc/gimple-fold.c
+++ gcc/gimple-fold.c
@@ -2935,6 +2935,13 @@ fold_internal_goacc_dim (const gimple *call)
     return NULL_TREE;
 
   int axis = get_oacc_ifn_dim_arg (call);
+  if (axis == GOMP_DIM_WORKER)
+    {
+      /* libgomp's nvptx plugin might potentially modify
+        dims[GOMP_DIM_WORKER].  */
+      return NULL_TREE;
+    }
+
   int size = get_oacc_fn_dim_size (current_function_decl, axis);
   bool is_pos = gimple_call_internal_fn (call) == IFN_GOACC_DIM_POS;
   tree result = NULL_TREE;
diff --git gcc/tree-vrp.c gcc/tree-vrp.c
index e6c11e0..a0a78d2 100644
--- gcc/tree-vrp.c
+++ gcc/tree-vrp.c
@@ -3980,6 +3980,7 @@ extract_range_basic (value_range *vr, gimple *stmt)
          break;
        case CFN_GOACC_DIM_SIZE:
        case CFN_GOACC_DIM_POS:
+         //TODO: is this kosher regarding libgomp's nvptx plugin potentially modifying dims[GOMP_DIM_WORKER]?
          /* Optimizing these two internal functions helps the loop
             optimizer eliminate outer comparisons.  Size is [1,N]
             and pos is [0,N-1].  */
diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c
index eea74d4..54fd5cb 100644
--- libgomp/plugin/plugin-nvptx.c
+++ libgomp/plugin/plugin-nvptx.c
@@ -974,24 +974,36 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   r = cuMemcpy ((CUdeviceptr)dp, (CUdeviceptr)hp, mapnum * sizeof (void *));
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuMemcpy failed: %s", cuda_error (r));
+  kargs[0] = &dp;
 
-  GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
-                    " gangs=%u, workers=%u, vectors=%u\n",
-                    __FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG],
-                    dims[GOMP_DIM_WORKER], dims[GOMP_DIM_VECTOR]);
-
+ launch:
   // OpenACC           CUDA
   //
   // num_gangs         nctaid.x
   // num_workers       ntid.y
   // vector length     ntid.x
-
-  kargs[0] = &dp;
+  GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
+                    " gangs=%u, workers=%u, vectors=%u\n",
+                    __FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG],
+                    dims[GOMP_DIM_WORKER], dims[GOMP_DIM_VECTOR]);
   r = cuLaunchKernel (function,
                      dims[GOMP_DIM_GANG], 1, 1,
                      dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
                      0, dev_str->stream, kargs, 0);
-  if (r != CUDA_SUCCESS)
+  if (r == CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES)
+    {
+      /* Don't give up just yet; possibly too many threads for the kernel's
+        register count.  */
+      if (dims[GOMP_DIM_WORKER] > 1)
+       {
+         dims[GOMP_DIM_WORKER] /= 2;
+         GOMP_PLUGIN_debug (0, "    cuLaunchKernel: "
+                            "CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES; retrying "
+                            "with reduced number of workers\n");
+         goto launch;
+       }
+    }
+  if (r != CUDA_SUCCESS) //CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
 
 #ifndef DISABLE_ASYNC


Regards
 Thomas
