On 06/29/2018 10:12 AM, Cesar Philippidis wrote:
> Ping.

While porting the vector length patches to trunk, I realized that I
mistakenly removed support for the environment variable GOMP_OPENACC_DIM
in this patch (thanks for adding those test cases, Tom!). I'll post an
updated version of this patch once I've got the vector length patches
working with it.
Cesar

> On 06/20/2018 02:59 PM, Cesar Philippidis wrote:
>> At present, the nvptx libgomp plugin does not take into account the
>> amount of shared resources on GPUs (mostly shared-memory and register
>> usage) when selecting the default num_gangs and num_workers. In
>> certain situations, an OpenACC offloaded function can fail to launch
>> if the GPU does not have sufficient shared resources to accommodate
>> all of the threads in a CUDA block. This typically manifests when a
>> PTX function uses a lot of registers and num_workers is set too
>> large, although it can also happen if the shared-memory has been
>> exhausted by the threads in a vector.
>>
>> This patch resolves that issue by adjusting num_workers based on the
>> amount of shared resources used by each thread. If worker parallelism
>> has been requested, libgomp will spawn as many workers as possible up
>> to 32. Without this patch, libgomp would always default to launching
>> 32 workers when worker parallelism is used.
>>
>> Besides the worker parallelism, this patch also includes some
>> heuristics for selecting num_gangs. Before, the plugin would launch
>> two gangs per GPU multiprocessor. Now it follows the formula
>> contained in the "CUDA Occupancy Calculator" spreadsheet that's
>> distributed with CUDA.
>>
>> Is this patch OK for trunk?
>>
>> Thanks,
>> Cesar