On Thu, 22 Oct 2015 19:41:51 +0300 Alexander Monakov <amona...@ispras.ru> wrote:
> On Thu, 22 Oct 2015, Jakub Jelinek wrote:
> > Does that apply also to threads within a warp? I.e. is .local
> > local to each thread in the warp, or to the whole warp, and if the
> > former, how can say at the start of a SIMD region or at its end the
> > local vars be broadcast to other threads and collected back? One
> > thing is scalar vars, another pointers, or references to various
> > types, or even bigger indirection.
>
> .local is indeed local to each warp member, not the warp as a whole.
> What OpenACC/PTX implementation does is to copy the whole stack
> frame, plus live registers: the implementation is in
> nvptx.c:nvptx_propagate.
>
> I see two possible alternative approaches for OpenMP/PTX.
> The second approach is to run all threads in the warp all the time,
> making sure they execute the same code with the same data, and thus
> build up the same local state. In this case we'd need to ensure this
> invariant: if threads in the warp have the same state prior to
> executing an instruction, they also have the same state after
> executing that instruction (plus global state changes as if only one
> thread executed that instruction).
>
> Most instructions are safe w.r.t this invariant.
> Was something like this considered (and rejected?) for OpenACC?

I'm not sure we understood the "global state changes as if only one
thread executed that instruction" bit (do you have a citation?). But
anyway, even if that works for threads within a warp, it doesn't work
for warps within a CTA, so we'd still need some broadcast mechanism for
those.

Julian