On 15/04/2021 18:26, Thomas Schwinge wrote:
>> and optimisation, since shared memory might be faster than
>> the main memory on a GPU.
>
> Do we potentially have a problem that making more use of (scarce)
> gang-private memory may negatively affect performance, because potentially
> fewer OpenACC gangs may then be launched to the GPU hardware in parallel?
> (Of course, OpenACC semantics conformance firstly is more important than
> performance, but there may be ways to be conformant and performant;
> "quality of implementation".)  Have you run any such performance testing
> with the benchmarking codes that we've got set up?
>
> (As I'm more familiar with that, I'm using nvptx offloading examples in
> the following, whilst assuming that similar discussion may apply for GCN
> offloading, which uses similar hardware concepts, as far as I remember.)
Yes, that could happen. However, there's space for quite a lot of
scalars before performance is affected: 64KB of LDS memory shared by a
hardware-defined maximum of 40 threads gives about 1.5KB of space for
worker-reduction variables and gang-private variables. We might have a
problem if there are large private arrays.
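To make the budget concrete, here's a minimal sketch of the usual case (an
illustrative example, not code from the patch; the kernel and names are made
up): a gang-private scalar used as a worker-reduction accumulator needs only
a few bytes of the roughly 1.5KB of LDS available per gang.

  void
  gang_sum (const double *a, double *out, int n, int m)
  {
    #pragma acc parallel loop gang copyin(a[0:n*m]) copyout(out[0:n])
    for (int g = 0; g < n; g++)
      {
        /* Gang-private scalar; with this patch it would live in LDS
           (gang-shared memory) rather than main memory, costing only a
           few bytes per gang.  */
        double sum = 0.0;

        #pragma acc loop worker reduction(+:sum)
        for (int w = 0; w < m; w++)
          sum += a[g * m + w];

        out[g] = sum;
      }
  }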
I believe we have a "good enough" solution for the usual case, and a
v2.0 full solution is going to be big and hairy enough for a whole patch
of its own (requiring per-gang dynamic allocation, a different memory
address space and possibly different instruction selection too).
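For contrast, here's a sketch (again purely illustrative) of the kind of case
that would need that fuller solution: a large gang-private array blows
through the per-gang LDS budget, since a 32KB array per gang clearly cannot
fit when 40 gangs share 64KB.

  void
  big_private (double *out, int n)
  {
    #pragma acc parallel loop gang copyout(out[0:n])
    for (int g = 0; g < n; g++)
      {
        /* 32KB of gang-private data: far more than the ~1.5KB LDS budget,
           so this would need per-gang dynamic allocation in some other
           memory (the "v2.0" solution above).  */
        double big[4096];

        #pragma acc loop worker
        for (int i = 0; i < 4096; i++)
          big[i] = (double) (g + i);

        out[g] = big[0] + big[4095];
      }
  }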
Andrew