On 06/29/2018 10:12 AM, Cesar Philippidis wrote:
> Ping.

While porting the vector length patches to trunk, I realized that I
mistakenly removed support for the environment variable GOMP_OPENACC_DIM
in this patch (thanks for adding those test cases, Tom!). I'll post an
updated version of this patch once I've got the vector length patches
working with it.
Cesar

> On 06/20/2018 02:59 PM, Cesar Philippidis wrote:
>> At present, the nvptx libgomp plugin does not take into account the
>> amount of shared resources on GPUs (mostly shared-memory and register
>> usage) when selecting the default num_gangs and num_workers. In
>> certain situations, an OpenACC offloaded function can fail to launch
>> if the GPU does not have sufficient shared resources to accommodate
>> all of the threads in a CUDA block. This typically manifests when a
>> PTX function uses a lot of registers and num_workers is set too
>> large, although it can also happen if the shared-memory has been
>> exhausted by the threads in a vector.
>>
>> This patch resolves that issue by adjusting num_workers based on the
>> amount of shared resources used by each thread. If worker parallelism
>> has been requested, libgomp will spawn as many workers as possible up
>> to 32. Without this patch, libgomp would always default to launching
>> 32 workers when worker parallelism is used.
>>
>> Besides the worker parallelism, this patch also includes some
>> heuristics for selecting num_gangs. Before, the plugin would launch
>> two gangs per GPU multiprocessor. Now it follows the formula
>> contained in the "CUDA Occupancy Calculator" spreadsheet that's
>> distributed with CUDA.
>>
>> Is this patch OK for trunk?
>>
>> Thanks,
>> Cesar