On Tue, Jul 13, 2021 at 05:48:51PM +0200, Thomas Schwinge wrote:
> Starting with the Volta family (sm_70+), Nvidia GPUs introduced
> Independent Thread Scheduling for the 32 threads ("32 SIMD lanes") that
> constitute a warp, which means "execution state per thread, including a
> program counter", succeeding the previous "warp-synchronous" abstraction
> where "warps used a single program counter shared amongst all 32 threads
> in the warp together with an active mask specifying the active threads of
> the warp".  See
> <https://docs.nvidia.com/cuda/parallel-thread-execution/#independent-thread-scheduling>,
> <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture>,
> <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#independent-thread-scheduling-7-x>,
> etc.
> 
> Per PR96005 commit 2a1586401a21dcd43e0f904bb6eec26c8b2f366b
> "[nvptx] Add -mptx=3.1/6.3", Tom has already implemented the
> necessary 'shfl' -> 'shfl.sync' and 'vote' -> 'vote.sync' changes,
> hard-coding a 'membermask' of '0xffffffff' (all threads participate).
> This I understand to be the direct translation, avoiding the
> deprecated/removed non-'.sync' variants of these instructions, but
> otherwise maintaining the pre-Independent Thread Scheduling semantics (as
> well as performance level, supposedly).  (Unless there are further
> changes relevant to GCC/nvptx that I'm not currently seeing?)  This means
> that we now comply with the sm_70+ Independent Thread Scheduling
> requirements -- but don't actually use its capabilities.
> 
> Now, I haven't spent much thought on it yet, but it would seem to me (gut
> feeling?) that continuing to maintain "warp-synchronicity" (that is,
> avoid using Independent Thread Scheduling) should still yield best
> performance?  Or, given the GCC/nvptx offloading context via
> OpenACC/OpenMP, has anyone already given any thought to how actually
> using Independent Thread Scheduling would be beneficial?  Can it be
> exploited via OpenACC/OpenMP directly?  Can it somehow be used to
> increase performance?  Can it be used to simplify parts of the GCC/nvptx
> back end implementation (without sacrificing performance -- is it a
> zero-cost abstraction, compared to "warp-synchronicity")?

Is it always enabled on sm_70 and later hardware, or does a PTX program
have to ask for Independent Thread Scheduling?
If threads in the warp no longer execute in lockstep, then I think the
model we have for OpenMP, with warps being OpenMP threads and the threads
in a warp being SIMD lanes, is no longer compliant.  We'd need to switch to
each thread in a warp being an OpenMP thread (so 32 times more threads
than before), with only a single SIMD lane in each thread (i.e. SIMD not
used).
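
To make the "warps as SIMD lanes" model concrete, here is a rough host-side
sketch (plain C++, not GPU code and not what the nvptx back end emits) of
the butterfly exchange that a 'shfl.sync' with a 'membermask' of
'0xffffffff' provides.  The names shfl_xor_model and warp_sum_model are
made up for illustration.  Leaving a lane's value unchanged when its
partner is outside the mask is an assumption of this model only; on real
hardware that case should not be relied on.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;
using Warp = std::array<int, kWarpSize>;

// Hypothetical host-side model of one butterfly shuffle step: every lane
// named in member_mask reads the value held by lane (lane ^ lane_mask).
// All reads see the values from before the step, which models the
// participating lanes converging at the instruction.
Warp shfl_xor_model(const Warp& in, int lane_mask, std::uint32_t member_mask) {
    Warp out = in;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const int src = lane ^ lane_mask;
        const bool lane_in = ((member_mask >> lane) & 1u) != 0;
        const bool src_in = ((member_mask >> src) & 1u) != 0;
        // Model-only assumption: keep the old value if either lane is
        // not a member.
        if (lane_in && src_in) out[lane] = in[src];
    }
    return out;
}

// All-lanes sum using log2(32) = 5 butterfly steps with the full mask,
// i.e. the "all threads participate" case discussed above.
Warp warp_sum_model(Warp v) {
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
        const Warp peer = shfl_xor_model(v, offset, 0xffffffffu);
        for (int lane = 0; lane < kWarpSize; ++lane) v[lane] += peer[lane];
    }
    return v;
}
```

With the full mask every lane ends up holding the same total, which is the
property a SIMD-style reduction across the 32 lanes depends on.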

        Jakub
