On Fri, Nov 14, 2014 at 07:37:49AM -0800, Cesar Philippidis wrote:
> > Hmm. It's worthwhile to keep in mind that GPU threads really behave
> > somewhat differently from CPUs (they don't really execute
> > independently); the OMP model may just be a poor match for the
> > architecture in general.
> > One could busywait on a spinlock, but AFAIK there isn't really a way
> > to put a thread to sleep. By not executing independently, I mean
> > this: I believe if one thread in a warp is waiting on the spinlock,
> > all the other ones are also busywaiting. There may be other effects
> > that seem odd if one approaches it from a CPU perspective - for
> > example you probably want only one thread in a warp to try to take
> > the spinlock.
> 
> Thread synchronization in CUDA is different from conventional CPUs.
> Using the gang/thread terminology, there's no way to synchronize two
> threads in two different gangs in PTX without invoking separate
> kernels. Basically, after a kernel is invoked, the host/accelerator
> (the latter using dynamic parallelism) waits for the kernel to finish,
> and that effectively creates a barrier.
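
Just to make sure I read the "only one thread in a warp takes the
spinlock" suggestion correctly, here is a minimal CUDA C sketch of that
pattern (lockvar, per_warp_critical and the hard-coded warp size of 32
are purely illustrative):

  __device__ int lockvar = 0;   /* 0 = free, 1 = held */

  __global__ void per_warp_critical (int *counter)
  {
    /* Only lane 0 of each warp contends for the lock; the rest of the
       warp effectively busywaits along with it, as described above.  */
    if ((threadIdx.x & 31) == 0)
      {
        while (atomicCAS (&lockvar, 0, 1) != 0)
          ;                         /* spin until the lock is free */
        __threadfence ();           /* order accesses after acquiring */
        *counter += 1;              /* critical section */
        __threadfence ();           /* flush the store before releasing */
        atomicExch (&lockvar, 0);   /* release */
      }
    /* The other 31 lanes simply reconverge after the if, so only one
       CAS per warp ever hits the lock.  */
  }
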
I believe in OpenMP terminology a gang is a team, and inter-team
barriers are not supposed to work etc. (though I think locks and atomic
instructions still are supposed to, as are critical regions, so I
really hope atomics are atomic even inter-gang).  So for
synchronization (mutexes and semaphores, from which barriers are
implemented; though perhaps bar.arrive and bar.sync could also be used)
we mainly need synchronization within the gang.

> Also, keep in mind that PTX doesn't have a global TID. The user needs
> to calculate it using ctaid/tid and friends.

Ok.  Is %gridid needed for that combo too?

	Jakub
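
P.S.: To make "that combo" concrete, the calculation I have in mind, in
CUDA C terms (1-D launch assumed; flat_tid and out are made-up names),
is the usual one; my understanding is that %gridid merely identifies
the grid/launch itself, but please correct me if it has to be folded in:

  __global__ void flat_tid (int *out)
  {
    /* blockIdx.x, blockDim.x and threadIdx.x correspond to the
       %ctaid.x, %ntid.x and %tid.x special registers.  */
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gtid] = gtid;
  }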