On Wed, 21 Oct 2015, Bernd Schmidt wrote:
> On 10/21/2015 11:07 AM, Alexander Monakov wrote:
> > In PTX, stack storage is in .local address space -- and that memory is
> > thread-private.  A thread can make a pointer to its own stack memory and
> > successfully dereference it, but dereferencing that pointer from other
> > threads does not work (I observed it returning garbage values).
> >
> > The reason for .local addresses being private like that, I think, is that
> > references to .local memory undergo address translation to make simultaneous
> > accesses to stack slots from threads in a warp form a coalesced memory
> > transaction.  So .local memory looking consecutive from an individual
> > thread's point of view are actually strided in physical memory.
>
> This sounds a little odd. You can convert a .local pointer to a generic one
> and dereference the latter. Do you think there is such behind-the-scenes magic
> going on for accesses through generic pointers?

Yes.  It's fun: if you retrieve a generic pointer for a stack slot in different
threads, you get the same pointer value in each of them.  If you dump the cubin,
you'll see that local->generic conversion is a bitwise OR with a value in
constant memory, and generic->local conversion is a bitwise AND with immediate
0xffffff.

The CUDA Programming Guide says on this matter: "Local memory is however
organized such that consecutive 32-bit words are accessed by consecutive thread
IDs", confirming the presence of address scrambling.

Alexander
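
To make the conversions above concrete, here is a toy host-side model in C.
It is only a sketch under stated assumptions: the 0xffffff mask is the one
seen in the cubin, but the window base value (HYPOTHETICAL_LOCAL_WINDOW) is
made up, since the real value is read from constant memory, and
physical_offset() is just one guess at a word-interleaved layout that would
match the Programming Guide's description.  None of these names come from
the CUDA toolkit.

```c
#include <assert.h>
#include <stdint.h>

/* Mask observed in the cubin for generic->local conversion. */
#define LOCAL_OFFSET_MASK UINT64_C(0xffffff)

/* HYPOTHETICAL: stands in for the value the hardware reads from constant
   memory; chosen only so that it does not overlap the mask bits. */
#define HYPOTHETICAL_LOCAL_WINDOW UINT64_C(0x01000000)

/* Number of threads in a warp. */
#define WARP_SIZE 32u

/* local->generic: bitwise OR with the window base.  The result does not
   depend on the thread, so every thread gets the same generic pointer
   for the same stack slot. */
uint64_t local_to_generic(uint64_t local_off)
{
    assert((local_off & ~LOCAL_OFFSET_MASK) == 0);
    return local_off | HYPOTHETICAL_LOCAL_WINDOW;
}

/* generic->local: bitwise AND with the immediate mask. */
uint64_t generic_to_local(uint64_t generic)
{
    return generic & LOCAL_OFFSET_MASK;
}

/* ASSUMED layout: 32-bit word number `word` of lane `lane` is stored at
   physical word index word * WARP_SIZE + lane, so the same stack slot
   in consecutive lanes occupies consecutive 32-bit words. */
uint64_t physical_offset(uint64_t local_off, unsigned lane)
{
    assert(lane < WARP_SIZE);
    uint64_t word = local_off / 4;
    uint64_t byte = local_off % 4;
    return (word * WARP_SIZE + lane) * 4 + byte;
}
```

In this model, the same .local offset in two neighbouring lanes lands 4 bytes
apart, while two consecutive words of one thread's stack are 128 bytes apart.
That is the "strided in physical memory" picture, and it also shows why a
generic pointer to a stack slot cannot identify which thread's slot is meant.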