Hi!

Starting with the Volta family (sm_70+), NVIDIA GPUs introduced Independent Thread Scheduling for the 32 threads ("32 SIMD lanes") that constitute a warp. This means "execution state per thread, including a program counter", replacing the previous "warp-synchronous" abstraction where "warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp". See
<https://docs.nvidia.com/cuda/parallel-thread-execution/#independent-thread-scheduling>,
<https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture>,
<https://docs.nvidia.com/cuda/cuda-c-programming-guide/#independent-thread-scheduling-7-x>, etc.
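
As a CUDA-level illustration of what staying "warp-synchronous" under Independent Thread Scheduling can look like, here is a minimal sketch of a warp-wide sum using '__shfl_down_sync' with a full member mask, so that all 32 lanes are named as participants in every shuffle. This is not taken from GCC; the names 'warp_sum', 'lane_sum_kernel', and 'run_lane_sum' are made up for the example, and I'd expect the intrinsic to lower to a 'shfl.sync.down.b32' in PTX.

```cuda
#include <cuda_runtime.h>

// Full member mask: all 32 lanes of the warp take part in each shuffle.
constexpr unsigned FULL_MASK = 0xffffffffu;

// Sum 'val' across the 32 lanes of a warp; lane 0 ends up with the total.
// (Other lanes hold partial sums and should be ignored.)
__device__ int warp_sum(int val)
{
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(FULL_MASK, val, offset);
    return val;
}

// Each lane contributes its own lane index; lane 0 stores the result.
__global__ void lane_sum_kernel(int *out)
{
    int lane = threadIdx.x % 32;
    int total = warp_sum(lane);
    if (lane == 0)
        *out = total;
}

// Hypothetical host helper: launch a single warp and return the sum,
// or -1 if any CUDA call fails.
int run_lane_sum()
{
    int *d_out = nullptr;
    int h_out = -1;

    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess)
        return -1;

    lane_sum_kernel<<<1, 32>>>(d_out);
    if (cudaGetLastError() != cudaSuccess) {
        cudaFree(d_out);
        return -1;
    }

    // cudaMemcpy on the default stream waits for the kernel to finish.
    if (cudaMemcpy(&h_out, d_out, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        h_out = -1;

    cudaFree(d_out);
    return h_out;
}
```

Called from a 'main' on a working device, 'run_lane_sum()' should return 496 (0 + 1 + ... + 31). The point of the sketch is only that the mask is constant and complete: no lane is ever left out of the exchange, so the code relies on none of the per-thread scheduling freedom that sm_70+ offers.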

Per PR96005 commit 2a1586401a21dcd43e0f904bb6eec26c8b2f366b "[nvptx] Add -mptx=3.1/6.3", Tom has already implemented the necessary 'shfl' -> 'shfl.sync' and 'vote' -> 'vote.sync' changes, hard-coding a 'membermask' of '0xffffffff' (all threads participate). I understand this to be the direct translation: it avoids the deprecated/removed non-'.sync' variants of these instructions, but otherwise maintains the pre-Independent Thread Scheduling semantics (and, presumably, the same performance level).

(Unless there are further changes relevant to GCC/nvptx that I'm not currently seeing?) This means that we now comply with the sm_70+ Independent Thread Scheduling requirements -- but don't actually use its capabilities.

Now, I haven't given it much thought yet, but my gut feeling is that continuing to maintain "warp-synchronicity" (that is, avoiding the use of Independent Thread Scheduling) should still yield the best performance?

Or, given the GCC/nvptx offloading context via OpenACC/OpenMP, has anyone already given any thought to how actually using Independent Thread Scheduling would be beneficial? Can it be exploited via OpenACC/OpenMP directly? Can it somehow be used to increase performance? Can it be used to simplify parts of the GCC/nvptx back end implementation (without sacrificing performance -- is it a zero-cost abstraction, compared to "warp-synchronicity")?

Regards,
Thomas