On Wed, 21 Oct 2015, Bernd Schmidt wrote:

> On 10/21/2015 11:07 AM, Alexander Monakov wrote:
> 
> > In PTX, stack storage is in .local address space -- and that memory is
> > thread-private.  A thread can make a pointer to its own stack memory and
> > successfully dereference it, but dereferencing that pointer from other
> > threads does not work (I observed it returning garbage values).
> >
> > The reason for .local addresses being private like that, I think, is that
> > references to .local memory undergo address translation to make simultaneous
> > accesses to stack slots from threads in a warp form a coalesced memory
> > transaction.  So .local addresses that look consecutive from an individual
> > thread's point of view are actually strided in physical memory.
> 
> This sounds a little odd. You can convert a .local pointer to a generic one
> and dereference the latter. Do you think there is such behind-the-scenes magic
> going on for accesses through generic pointers?

Yes.  It's fun: if you retrieve a generic pointer for a stack slot in
different threads, you get the same pointer value in each of them.  If you
dump the cubin, you'll see that the local->generic conversion is a bitwise OR
with a value in constant memory, and the generic->local conversion is a
bitwise AND with the immediate 0xffffff.
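
To make the two conversions concrete, here is a minimal sketch in plain C
that models them on the host.  The window base is a made-up placeholder (the
real value is whatever sits in constant memory), and treating generic
pointers as 64-bit is an assumption too; only the OR/AND structure and the
0xffffff mask come from the disassembly described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical window base: stands in for the value the hardware keeps in
   constant memory.  Chosen only so that its low 24 bits are clear. */
#define LOCAL_WINDOW_BASE UINT64_C(0x7f0000000000)

/* Mask applied by the generic->local conversion (from the disassembly). */
#define LOCAL_OFFSET_MASK UINT64_C(0xffffff)

/* local -> generic: bitwise OR with the window base. */
static uint64_t local_to_generic(uint64_t local_off)
{
    /* A .local offset must fit in the 24 bits that the mask keeps. */
    assert((local_off & ~LOCAL_OFFSET_MASK) == 0);
    return local_off | LOCAL_WINDOW_BASE;
}

/* generic -> local: bitwise AND with the immediate 0xffffff. */
static uint64_t generic_to_local(uint64_t generic)
{
    return generic & LOCAL_OFFSET_MASK;
}
```

Neither function takes a thread ID, so in this model the same stack offset
always yields the same generic pointer, whichever thread asks for it.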

The CUDA Programming Guide says on this matter: "Local memory is however
organized such that consecutive 32-bit words are accessed by consecutive
thread IDs", which confirms the presence of address scrambling.
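
One way to picture that organization is the sketch below, again in plain C.
The exact layout is an assumption: it interleaves 32-bit words across the 32
lanes of one warp, which is the simplest mapping consistent with the quoted
sentence, and the function name is hypothetical.  The real hardware mapping
may well differ.

```c
#include <stdint.h>

#define WARP_SIZE 32u   /* threads per warp */

/* Map a thread's private .local byte offset to a byte offset in the warp's
   backing store.  Assumed layout: word W of lane T lives at word index
   W * WARP_SIZE + T, so the lanes of a warp touching the same stack slot
   hit 32 adjacent words. */
static uint32_t local_to_backing(uint32_t local_off, uint32_t lane)
{
    uint32_t word = local_off / 4;   /* which 32-bit word of the stack */
    uint32_t byte = local_off % 4;   /* byte within that word */
    return (word * WARP_SIZE + lane) * 4 + byte;
}
```

Under this model, neighbouring lanes accessing the same slot are 4 bytes
apart, while consecutive words of a single thread's stack are 128 bytes
apart, which is the striding mentioned earlier in the thread.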

Alexander
