On 08/01/2018 04:01 PM, Cesar Philippidis wrote: > On 08/01/2018 03:18 AM, Tom de Vries wrote: >> On 07/31/2018 04:58 PM, Cesar Philippidis wrote: >>> The attached patch teaches libgomp how to use the CUDA thread occupancy >>> calculator built into the CUDA driver. Despite both being based off the >>> CUDA thread occupancy spreadsheet distributed with CUDA, the built in >>> occupancy calculator differs from the occupancy calculator in og8 in two >>> key ways. First, og8 launches twice the number of gangs as the driver >>> thread occupancy calculator. This was my attempt at preventing threads >>> from idling, and it operating on a similar principle of running 'make >>> -jN', where N is twice the number of CPU threads. >> >> You're saying the two methods are different, and that the difference >> between the two methods is a factor two, which is a heuristic you added >> yourself on top of one of the methods, which implies that in fact the >> two methods are identical. Is my understanding correct here? > > With the exception being that og8 multiples num_gangs by a factor of > two, those two algorithms are identical, at least with respect to gangs. > >>> Second, whereas og8 >>> always attempts to maximize the CUDA block size, the driver may select a >>> smaller block, which effectively decreases num_workers. >>> >> >> So, do I understand it correctly that using the function >> cuOccupancyMaxPotentialBlockSize gives us "minimum block size that can >> achieve the maximum occupancy" or some such and og8 gives us "maximum >> block size"? > > Correct. > >>> In terms of performance, there really isn't that much of a difference >>> between the CUDA driver's occupancy calculator and og8's. However, on >>> the tests that are impacted, they are generally within a factor of two >>> from one another, with some tests running faster with the driver >>> occupancy calculator and others with og8's. >>> >> >> Ack. Well, until we understand that in more detail, going with the >> driver's occupancy calculator seems the right thing to do. >> >>> Unfortunately, support for the CUDA driver API isn't universal; it's >>> only available in CUDA version 6.5 (or 6050) and newer. In this patch, >>> I'm exploiting the fact that init_cuda_lib only checks for errors on the >>> last library function initialized. >> >> That sounds incorrect to me. In init_cuda_lib I see: >> ... >> # define CUDA_ONE_CALL(call) CUDA_ONE_CALL_1 (call) >> # define CUDA_ONE_CALL_1(call) \ >> cuda_lib.call = dlsym (h, #call); \ >> if (cuda_lib.call == NULL) \ >> return false; >> CUDA_CALLS >> ... >> so in fact every library function is checked. Have you tested this with >> pre 6-5 cuda? > > I misread that. You're correct. So far, I've only tested this out with > CUDA 9. > >> I think we need to add and handle: >> ... >> CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize) >> ... >> >>> Therefore it guards the usage of >>> >>> cuOccupancyMaxPotentialBlockSizeWithFlags >>> >>> by checking driver_version. >> >> If we allow the cuOccupancyMaxPotentialBlockSize field to be NULL, we >> can test for NULL, which seems a simpler solution than testing the version. >> >>> If the driver occupancy calculator isn't >>> available, it falls back to the existing defaults. Maybe the og8 thread >>> occupancy would make a better default for older versions of CUDA, but >>> that's a patch for another day. >>> >> >> Agreed. >> >>> Is this patch OK for trunk? >> >> The patch doesn't build in a setup with >> --enable-offload-targets=nvptx-none and without cuda, that enables usage >> of plugin/cuda/cuda.h: >> ... >> /data/offload-nvptx/src/libgomp/plugin/plugin-nvptx.c:98:16: error: >> ‘cuOccupancyMaxPotentialBlockSize’ undeclared here (not in a function); >> did you mean ‘cuOccupancyMaxPotentialBlockSizeWithFlags’? >> CUDA_ONE_CALL (cuOccupancyMaxPotentialBlockSize) \ >> ... >> >>> @@ -1220,11 +1227,39 @@ nvptx_exec (void (*fn), size_t mapnum, void >>> **hostaddrs, void **devaddrs, >>> >>> { >>> bool default_dim_p[GOMP_DIM_MAX]; >>> + int vectors = nvthd->ptx_dev->default_dims[GOMP_DIM_VECTOR]; >>> + int workers = nvthd->ptx_dev->default_dims[GOMP_DIM_WORKER]; >>> + int gangs = nvthd->ptx_dev->default_dims[GOMP_DIM_GANG]; >>> + >>> + /* The CUDA driver occupancy calculator is only available on >>> + CUDA version 6.5 (6050) and newer. */ >>> + if (nvthd->ptx_dev->driver_version > 6050) >>> + { >>> + int grids, blocks; >>> + CUDA_CALL_ASSERT (cuOccupancyMaxPotentialBlockSize, &grids, >>> + &blocks, function, NULL, 0, >>> + dims[GOMP_DIM_WORKER] * dims[GOMP_DIM_VECTOR]); >>> + GOMP_PLUGIN_debug (0, "cuOccupancyMaxPotentialBlockSize: " >>> + "grid = %d, block = %d\n", grids, blocks); >>> + >> >> >>> + if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_GANG) == 0) >> >> You should use gomp_openacc_dims[0]. >> >>> + gangs = grids * (blocks / warp_size); >> >> So, we launch with gangs == grids * workers ? Is that intentional? > > Yes. At least that's what I've been using in og8. Setting num_gangs = > grids alone caused significant slow downs. >
Well, what you're saying here is: increasing num_gangs increases performance. You don't explain why you multiply with workers specifically. >>> + >>> + if (GOMP_PLUGIN_acc_default_dim (GOMP_DIM_WORKER) == 0) >>> + workers = blocks / vectors; >> >> Also, the new default calculation is not nicely separated from the >> fallback default calculation. I've updated the patch with a cleaner >> separation, attached and build without cuda but untested. > > That looks nicer. Did you want to take over this patch? No. Thanks, - Tom