Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX

Alexander Monakov Wed, 21 Oct 2015 02:07:52 -0700


On Wed, 21 Oct 2015, Jakub Jelinek wrote:

> On Tue, Oct 20, 2015 at 09:34:28PM +0300, Alexander Monakov wrote:
> > (This patch serves as a straw man proposal to have something concrete for
> > discussion and further patches)
> > 
> > On PTX, stack memory is private to each thread.  When master thread 
> > constructs
> > 'omp_data_o' on its own stack and passes it to other threads via
> > GOMP_parallel by reference, other threads cannot use the resulting pointer.
> > We need to arrange structures passed between threads be in global, or 
> > better,
> > in PTX __shared__ memory (private to each CUDA thread block).
> 
> Can you please clarify on what exactly doesn't work and what works and if it
> is just a performance issue or some other?

Sadly it's not just performance.

In PTX, stack storage is in .local address space -- and that memory is
thread-private.  A thread can make a pointer to its own stack memory and
successfully dereference it, but dereferencing that pointer from other threads
does not work (I observed it returning garbage values).

The reason for .local addresses being private like that, I think, is that
references to .local memory undergo address translation to make simultaneous
accesses to stack slots from threads in a warp form a coalesced memory
transaction.  So .local memory looking consecutive from an individual thread's
point of view are actually strided in physical memory.

So yes, when omp_data_o needs to hold a pointer to stack memory, it still won't
work.  For simple cases the compiler could notice it and provide a diagnostic
message, but in general I don't see what can be done, apart from documenting
it as a fundamental limitation.

(exposing shared memory to users might alleviate the issue slightly, but
non-trivial in itself)

Alexander

Re: [gomp4 06/14] omp-low: copy omp_data_o to shared memory on NVPTX

Reply via email to