On 15/04/2021 18:26, Thomas Schwinge wrote:
>> and optimisation, since shared memory might be faster than
>> the main memory on a GPU.
>
> Do we potentially have a problem that making more use of (scarce)
> gang-private memory may negatively affect performance, because potentially
> fewer OpenACC gangs may then be launched to the GPU hardware in parallel?
> (Of course, OpenACC semantics conformance firstly is more important than
> performance, but there may be ways to be conformant and performant;
> "quality of implementation".)  Have you run any such performance testing
> with the benchmarking codes that we've got set up?
>
> (As I'm more familiar with that, I'm using nvptx offloading examples in
> the following, whilst assuming that similar discussion may apply for GCN
> offloading, which uses similar hardware concepts, as far as I remember.)
Yes, that could happen. However, there's space for quite a lot of
scalars before performance is affected: 64KB of LDS memory shared by a
hardware-defined maximum of 40 threads gives about 1.5KB of space for
worker-reduction variables and gang-private variables. We might have a
problem if there are large private arrays.
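To make the budget concrete, here's a minimal sketch of the usual case (an
illustrative example, not code from the patch; the kernel and names are made
up): a gang-private scalar used as a worker-reduction accumulator needs only
a few bytes of the roughly 1.5KB of LDS available per gang.

  void
  gang_sum (const double *a, double *out, int n, int m)
  {
    #pragma acc parallel loop gang copyin(a[0:n*m]) copyout(out[0:n])
    for (int g = 0; g < n; g++)
      {
        /* Gang-private scalar; with this patch it would live in LDS
           (gang-shared memory) rather than main memory, costing only a
           few bytes per gang.  */
        double sum = 0.0;

        #pragma acc loop worker reduction(+:sum)
        for (int w = 0; w < m; w++)
          sum += a[g * m + w];

        out[g] = sum;
      }
  }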
I believe we have a "good enough" solution for the usual case, and a
v2.0 full solution is going to be big and hairy enough for a whole patch
of its own (requiring per-gang dynamic allocation, a different memory
address space and possibly different instruction selection too).
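For contrast, here's a sketch (again purely illustrative) of the kind of case
that would need that fuller solution: a large gang-private array blows
through the per-gang LDS budget, since a 32KB array per gang clearly cannot
fit when 40 gangs share 64KB.

  void
  big_private (double *out, int n)
  {
    #pragma acc parallel loop gang copyout(out[0:n])
    for (int g = 0; g < n; g++)
      {
        /* 32KB of gang-private data: far more than the ~1.5KB LDS budget,
           so this would need per-gang dynamic allocation in some other
           memory (the "v2.0" solution above).  */
        double big[4096];

        #pragma acc loop worker
        for (int i = 0; i < 4096; i++)
          big[i] = (double) (g + i);

        out[g] = big[0] + big[4095];
      }
  }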
Andrew