Nvidia GPU Volta+ (sm_70+) Independent Thread Scheduling

Thomas Schwinge Tue, 13 Jul 2021 08:49:58 -0700

Hi!

Starting with the Volta family (sm_70+), Nvidia GPUs introduced
Independent Thread Scheduling for the 32 threads ("32 SIMD lanes") that
constitute a warp, which means "execution state per thread, including a
program counter", succeeding the previous "warp-synchronous" abstraction
where "warps used a single program counter shared amongst all 32 threads
in the warp together with an active mask specifying the active threads of
the warp".  See
<https://docs.nvidia.com/cuda/parallel-thread-execution/#independent-thread-scheduling>,
<https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture>,
<https://docs.nvidia.com/cuda/cuda-c-programming-guide/#independent-thread-scheduling-7-x>,
etc.
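
To illustrate the difference at the CUDA source level, here is a minimal sketch. It is not taken from the GCC/nvptx sources; the kernel 'rotate_left' and the host helper 'run_rotate_left' are hypothetical names. Code written for the warp-synchronous model often exchanged data between lanes through shared memory without any barrier. With per-thread program counters, the lanes of a warp are no longer guaranteed to run in lockstep, so an explicit '__syncwarp()' is needed:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, meant to be launched with exactly one warp (32 threads).
// Each lane publishes one value in shared memory, then reads its neighbour's.
__global__ void rotate_left(const int *in, int *out)
{
    __shared__ int buf[32];
    unsigned lane = threadIdx.x;

    buf[lane] = in[lane];
    // Warp-synchronous code frequently had no barrier here, assuming that all
    // 32 lanes had executed the store before any lane reached the load.
    // Under Independent Thread Scheduling that assumption no longer holds,
    // so the lanes are explicitly re-converged and memory is ordered.
    __syncwarp();
    out[lane] = buf[(lane + 1) % 32];
}

// Hypothetical host helper: runs the kernel on 32 ints; returns 0 on success.
int run_rotate_left(const int *h_in, int *h_out)
{
    int *d_in = nullptr, *d_out = nullptr;
    const size_t bytes = 32 * sizeof(int);

    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess)
        return 1;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return 1;
    }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    rotate_left<<<1, 32>>>(d_in, d_out);
    // This blocking copy also reports errors from the kernel launch.
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

The default '__syncwarp()' mask is all 32 lanes, which matches the assumption that the whole warp takes part.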


Per PR96005 commit 2a1586401a21dcd43e0f904bb6eec26c8b2f366b
"[nvptx] Add -mptx=3.1/6.3", Tom has already implemented the
necessary 'shfl' -> 'shfl.sync' and 'vote' -> 'vote.sync' changes,
hard-coding a 'membermask' of '0xffffffff' (all threads participate).
This I understand to be the direct translation, avoiding the
deprecated/removed non-'.sync' variants of these instructions, but
otherwise maintaining the pre-Independent Thread Scheduling semantics (as
well as performance level, supposedly).  Unless there are further
changes relevant to GCC/nvptx that I'm not currently seeing, this means
that we now comply with the sm_70+ Independent Thread Scheduling
requirements -- but don't actually use its capabilities.
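
For readers more familiar with CUDA than with PTX, the following sketch shows what an all-threads 'membermask' looks like at the source level. It is an illustration only, not the code GCC emits. The names 'warp_sum_and_ballot', 'run_warp_sum_and_ballot' and 'FULL_MASK' are hypothetical, and I am assuming that the '__shfl_down_sync' and '__ballot_sync' intrinsics are the CUDA counterparts of the 'shfl.sync' and 'vote.sync' instructions:

```cuda
#include <cuda_runtime.h>

// All 32 lanes of the warp participate, as with the hard-coded membermask.
#define FULL_MASK 0xffffffffu

// Hypothetical kernel, meant to be launched with exactly one warp (32 threads).
__global__ void warp_sum_and_ballot(const int *in, int *sum, unsigned *odd_lanes)
{
    int v = in[threadIdx.x];

    // Vote: bit i of the result is set if lane i holds an odd value.
    unsigned ballot = __ballot_sync(FULL_MASK, v & 1);

    // Shuffle: lane i adds the value held by lane i + offset.  Lanes whose
    // source would be out of range get their own value back; their partial
    // results are not used.  After the loop, lane 0 holds the warp-wide sum.
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(FULL_MASK, v, offset);

    if (threadIdx.x == 0) {
        *sum = v;
        *odd_lanes = ballot;
    }
}

// Hypothetical host helper: runs the kernel on 32 ints; returns 0 on success.
int run_warp_sum_and_ballot(const int *h_in, int *h_sum, unsigned *h_odd)
{
    int *d_in = nullptr, *d_sum = nullptr;
    unsigned *d_odd = nullptr;
    int rc = 1;

    if (cudaMalloc((void **)&d_in, 32 * sizeof(int)) == cudaSuccess
        && cudaMalloc((void **)&d_sum, sizeof(int)) == cudaSuccess
        && cudaMalloc((void **)&d_odd, sizeof(unsigned)) == cudaSuccess) {
        cudaMemcpy(d_in, h_in, 32 * sizeof(int), cudaMemcpyHostToDevice);
        warp_sum_and_ballot<<<1, 32>>>(d_in, d_sum, d_odd);
        // These blocking copies also report errors from the kernel launch.
        if (cudaMemcpy(h_sum, d_sum, sizeof(int),
                       cudaMemcpyDeviceToHost) == cudaSuccess
            && cudaMemcpy(h_odd, d_odd, sizeof(unsigned),
                          cudaMemcpyDeviceToHost) == cudaSuccess)
            rc = 0;
    }
    cudaFree(d_in);   // cudaFree accepts a null pointer
    cudaFree(d_sum);
    cudaFree(d_odd);
    return rc;
}
```

Passing the full mask keeps the code warp-synchronous in spirit: every lane is required to reach each '.sync' operation, so nothing here exploits the ability of lanes to diverge and make progress independently.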

Now, I haven't spent much thought on it yet, but it would seem to me (gut
feeling?) that continuing to maintain "warp-synchronicity" (that is,
not using Independent Thread Scheduling) should still yield best
performance?  Or, given the GCC/nvptx offloading context via
OpenACC/OpenMP, has anyone already given any thought to how actually
using Independent Thread Scheduling would be beneficial?  Can it be
exploited via OpenACC/OpenMP directly?  Can it somehow be used to
increase performance?  Can it be used to simplify parts of the GCC/nvptx
back end implementation (without sacrificing performance -- is it a
zero-cost abstraction, compared to "warp-synchronicity")?


Regards
 Thomas
