On 08/01/2018 03:18 AM, Tom de Vries wrote:
> On 07/31/2018 04:58 PM, Cesar Philippidis wrote:
>> The attached patch teaches libgomp how to use the CUDA thread occupancy
>> calculator built into the CUDA driver. Despite both being based off the
>> CUDA thread occupancy spreadsheet distributed with CUDA, the built in
>> occupancy calculator differs from the occupancy calculator in og8 in two
>> key ways. First, og8 launches twice the number of gangs as the driver
>> thread occupancy calculator. This was my attempt at preventing threads
>> from idling, and it operating on a similar principle of running 'make
>> -jN', where N is twice the number of CPU threads.
>
> You're saying the two methods are different, and that the difference
> between the two methods is a factor two, which is a heuristic you added
> yourself on top of one of the methods, which implies that in fact the
> two methods are identical. Is my understanding correct here?

With the exception being that og8 multiplies num_gangs by a factor of
two, those two algorithms are identical, at least with respect to gangs.
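
For reference, the driver entry point in question looks roughly like this
at the call level (an illustration only, not code from either patch; the
plugin itself goes through its CUDA_CALL wrappers):

  #include <stdbool.h>
  #include <cuda.h>

  /* Ask the driver for the smallest grid, and matching block size, that
     still reaches maximum occupancy for FN, assuming no dynamic shared
     memory and no upper limit on the block size.  */
  static bool
  query_occupancy (CUfunction fn, int *min_grid_size, int *block_size)
  {
    return (cuOccupancyMaxPotentialBlockSize (min_grid_size, block_size,
                                              fn, NULL, 0, 0)
            == CUDA_SUCCESS);
  }

og8 computes essentially the same figure itself from the occupancy
spreadsheet and then doubles it; that is the 'make -jN' style heuristic
referred to above.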
>> Second, whereas og8
>> always attempts to maximize the CUDA block size, the driver may select a
>> smaller block, which effectively decreases num_workers.
>>
>
> So, do I understand it correctly that using the function
> cuOccupancyMaxPotentialBlockSize gives us "minimum block size that can
> achieve the maximum occupancy" or some such and og8 gives us "maximum
> block size"?

Correct.

>> In terms of performance, there really isn't that much of a difference
>> between the CUDA driver's occupancy calculator and og8's. However, on
>> the tests that are impacted, they are generally within a factor of two
>> from one another, with some tests running faster with the driver
>> occupancy calculator and others with og8's.
>>
>
> Ack. Well, until we understand that in more detail, going with the
> driver's occupancy calculator seems the right thing to do.
>
>> Unfortunately, support for the CUDA driver API isn't universal; it's
>> only available in CUDA version 6.5 (or 6050) and newer. In this patch,
>> I'm exploiting the fact that init_cuda_lib only checks for errors on the
>> last library function initialized.
>
> That sounds incorrect to me. In init_cuda_lib I see:
> ...
> # define CUDA_ONE_CALL(call) CUDA_ONE_CALL_1 (call)
> # define CUDA_ONE_CALL_1(call) \
>   cuda_lib.call = dlsym (h, #call); \
>   if (cuda_lib.call == NULL) \
>     return false;
>   CUDA_CALLS
> ...
> so in fact every library function is checked. Have you tested this with
> pre 6-5 cuda?

I misread that. You're correct. So far, I've only tested this out with
CUDA 9.

> I think we need to add and handle:
> ...
>   CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize)
> ...
>
>> Therefore it guards the usage of
>>
>>   cuOccupancyMaxPotentialBlockSizeWithFlags
>>
>> by checking driver_version.
>
> If we allow the cuOccupancyMaxPotentialBlockSize field to be NULL, we
> can test for NULL, which seems a simpler solution than testing the version.
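
A rough sketch of how that could look next to the existing macro
(illustrative only, and glossing over how the plugin is built against
libcuda):

  /* Like CUDA_ONE_CALL, but tolerate a missing symbol instead of
     failing init_cuda_lib.  */
  # define CUDA_ONE_CALL_MAYBE_NULL(call) \
    cuda_lib.call = dlsym (h, #call);

with the use site in nvptx_exec then testing the function pointer rather
than driver_version:

  if (cuda_lib.cuOccupancyMaxPotentialBlockSize != NULL)
    {
      /* Use the driver occupancy calculator.  */
    }
  else
    {
      /* Fall back to the existing default dimensions.  */
    }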
>> If the driver occupancy calculator isn't
>> available, it falls back to the existing defaults. Maybe the og8 thread
>> occupancy would make a better default for older versions of CUDA, but
>> that's a patch for another day.
>>
>
> Agreed.
>
>> Is this patch OK for trunk?
>
> The patch doesn't build in a setup with
> --enable-offload-targets=nvptx-none and without cuda, that enables usage
> of plugin/cuda/cuda.h:
> ...
> /data/offload-nvptx/src/libgomp/plugin/plugin-nvptx.c:98:16: error:
> ‘cuOccupancyMaxPotentialBlockSize’ undeclared here (not in a function);
> did you mean ‘cuOccupancyMaxPotentialBlockSizeWithFlags’?
>  CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize)	\
> ...
>
>> @@ -1220,11 +1227,39 @@ nvptx_exec (void (*fn), size_t mapnum, void
>> **hostaddrs, void **devaddrs,
>>
>>      {
>>        bool default_dim_p[GOMP_DIM_MAX];
>> +      int vectors = nvthd->ptx_dev->default_dims[GOMP_DIM_VECTOR];
>> +      int workers = nvthd->ptx_dev->default_dims[GOMP_DIM_WORKER];
>> +      int gangs = nvthd->ptx_dev->default_dims[GOMP_DIM_GANG];
>> +
>> +      /* The CUDA driver occupancy calculator is only available on
>> +         CUDA version 6.5 (6050) and newer.  */
>> +      if (nvthd->ptx_dev->driver_version > 6050)
>> +        {
>> +          int grids, blocks;
>> +          CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids,
>> +                            &blocks, function, NULL, 0,
>> +                            dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]);
>> +          GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: "
>> +                             "grid = %d, block = %d\n", grids, blocks);
>> +
>
>
>> +          if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_GANG) == 0)
>
> You should use gomp_openacc_dims[0].
>
>> +            gangs = grids * (blocks / warp_size);
>
> So, we launch with gangs == grids * workers ? Is that intentional?

Yes. At least that's what I've been using in og8. Setting num_gangs =
grids alone caused significant slowdowns.

>> +
>> +          if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_WORKER) == 0)
>> +            workers = blocks / vectors;
>
> Also, the new default calculation is not nicely separated from the
> fallback default calculation. I've updated the patch with a cleaner
> separation, attached and build without cuda but untested.

That looks nicer. Did you want to take over this patch?

Cesar
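
P.S. To put made-up numbers on the gangs calculation above: if the driver
suggested grids = 8 and blocks = 256, then with warp_size = 32 and
vectors = 32 the defaults come out as workers = 256 / 32 = 8 and
gangs = 8 * (256 / 32) = 64, i.e. gangs == grids * workers.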