On Fri, Nov 14, 2014 at 07:37:49AM -0800, Cesar Philippidis wrote:
> > Hmm. It's worthwhile to keep in mind that GPU threads really behave
> > somewhat differently from CPUs (they don't really execute
> > independently); the OMP model may just be a poor match for the
> > architecture in general.
> > One could busywait on a spinlock, but AFAIK there isn't really a way
> > to put a thread to sleep. By not executing independently, I mean
> > this: I believe if one thread in a warp is waiting on the spinlock,
> > all the other ones are also busywaiting. There may be other effects
> > that seem odd if one approaches it from a CPU perspective - for
> > example you probably want only one thread in a warp to try to take
> > the spinlock.
> 
> Thread synchronization in CUDA is different from conventional CPUs.
> Using the gang/thread terminology, there's no way to synchronize two
> threads in two different gangs in PTX without invoking separate
> kernels. Basically, after a kernel is invoked, the host/accelerator
> (the latter using dynamic parallelism) waits for the kernel to finish,
> and that effectively creates a barrier.
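
Just to make sure I read the "only one thread in a warp takes the
spinlock" suggestion correctly, here is a minimal CUDA C sketch of that
pattern (lockvar, per_warp_critical and the hard-coded warp size of 32
are purely illustrative):

  __device__ int lockvar = 0;   /* 0 = free, 1 = held */

  __global__ void per_warp_critical (int *counter)
  {
    /* Only lane 0 of each warp contends for the lock; the rest of the
       warp effectively busywaits along with it, as described above.  */
    if ((threadIdx.x & 31) == 0)
      {
        while (atomicCAS (&lockvar, 0, 1) != 0)
          ;                         /* spin until the lock is free */
        __threadfence ();           /* order accesses after acquiring */
        *counter += 1;              /* critical section */
        __threadfence ();           /* flush the store before releasing */
        atomicExch (&lockvar, 0);   /* release */
      }
    /* The other 31 lanes simply reconverge after the if, so only one
       CAS per warp ever hits the lock.  */
  }
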
I believe in OpenMP terminology a gang is a team, and inter-team
barriers are not supposed to work etc. (though I think locks and atomic
instructions still are supposed to, as are critical regions, so I
really hope atomics are atomic even inter-gang).  So for
synchronization (mutexes and semaphores, from which barriers are
implemented; though perhaps bar.arrive and bar.sync could also be used)
we mainly need synchronization within the gang.

> Also, keep in mind that PTX doesn't have a global TID. The user needs
> to calculate it using ctaid/tid and friends.

Ok.  Is %gridid needed for that combo too?

	Jakub
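
P.S.: To make "that combo" concrete, the calculation I have in mind, in
CUDA C terms (1-D launch assumed; flat_tid and out are made-up names),
is the usual one; my understanding is that %gridid merely identifies
the grid/launch itself, but please correct me if it has to be folded in:

  __global__ void flat_tid (int *out)
  {
    /* blockIdx.x, blockDim.x and threadIdx.x correspond to the
       %ctaid.x, %ntid.x and %tid.x special registers.  */
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gtid] = gtid;
  }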