On Wed, 21 Oct 2015, Jakub Jelinek wrote:
> On Tue, Oct 20, 2015 at 09:34:28PM +0300, Alexander Monakov wrote: > > (This patch serves as a straw man proposal to have something concrete for > > discussion and further patches) > > > > On PTX, stack memory is private to each thread. When master thread > > constructs > > 'omp_data_o' on its own stack and passes it to other threads via > > GOMP_parallel by reference, other threads cannot use the resulting pointer. > > We need to arrange structures passed between threads be in global, or > > better, > > in PTX __shared__ memory (private to each CUDA thread block). > > Can you please clarify on what exactly doesn't work and what works and if it > is just a performance issue or some other? Sadly it's not just performance. In PTX, stack storage is in .local address space -- and that memory is thread-private. A thread can make a pointer to its own stack memory and successfully dereference it, but dereferencing that pointer from other threads does not work (I observed it returning garbage values). The reason for .local addresses being private like that, I think, is that references to .local memory undergo address translation to make simultaneous accesses to stack slots from threads in a warp form a coalesced memory transaction. So .local memory looking consecutive from an individual thread's point of view are actually strided in physical memory. So yes, when omp_data_o needs to hold a pointer to stack memory, it still won't work. For simple cases the compiler could notice it and provide a diagnostic message, but in general I don't see what can be done, apart from documenting it as a fundamental limitation. (exposing shared memory to users might alleviate the issue slightly, but non-trivial in itself) Alexander