Hi!

On 2021-07-13T17:59:43+0200, Jakub Jelinek <ja...@redhat.com> wrote:
> On Tue, Jul 13, 2021 at 05:48:51PM +0200, Thomas Schwinge wrote:
>> Starting with the Volta family (sm_70+), Nvidia GPUs introduced
>> Independent Thread Scheduling for the 32 threads ("32 SIMD lanes") that
>> constitute a warp, which means "execution state per thread, including a
>> program counter", succeeding the previous "warp-synchronous" abstraction
>> where "warps used a single program counter shared amongst all 32 threads
>> in the warp together with an active mask specifying the active threads
>> of the warp".  See
>> <https://docs.nvidia.com/cuda/parallel-thread-execution/#independent-thread-scheduling>,
>> <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture>,
>> <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#independent-thread-scheduling-7-x>,
>> etc.
>>
>> Per PR96005 commit 2a1586401a21dcd43e0f904bb6eec26c8b2f366b
>> "[nvptx] Add -mptx=3.1/6.3", Tom has already implemented the
>> necessary 'shfl' -> 'shfl.sync' and 'vote' -> 'vote.sync' changes,
>> hard-coding a 'membermask' of '0xffffffff' (all threads participate).
>> This I understand to be the direct translation, avoiding the
>> deprecated/removed non-'.sync' variants of these instructions, but
>> otherwise maintaining the pre-Independent Thread Scheduling semantics
>> (as well as performance level, supposedly).  (Unless there are further
>> changes relevant to GCC/nvptx that I'm not currently seeing?)  This
>> means that we now comply with the sm_70+ Independent Thread Scheduling
>> requirements -- but don't actually use its capabilities.
>>
>> Now, I haven't given this much thought yet, but it would seem to me (gut
>> feeling?) that continuing to maintain "warp-synchronicity" (that is,
>> avoiding use of Independent Thread Scheduling) should still yield the
>> best performance?  Or, given the GCC/nvptx offloading context via
>> OpenACC/OpenMP, has anyone already given any thought to how actually
>> using Independent Thread Scheduling would be beneficial?  Can it be
>> exploited via OpenACC/OpenMP directly?  Can it somehow be used to
>> increase performance?  Can it be used to simplify parts of the GCC/nvptx
>> back end implementation (without sacrificing performance -- is it a
>> zero-cost abstraction, compared to "warp-synchronicity")?
>
> Is it something that is always enabled on sm_70 and later hw or does a PTX
> program ask for independent thread scheduling?
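(Before answering: to make the 'shfl' -> 'shfl.sync' translation from my
recap above concrete, here is the equivalent at the CUDA intrinsics level.
This is a hand-written sketch of a warp-level sum reduction, for
illustration only -- it is not what the nvptx back end actually emits,
which works via RTL/PTX code templates.)

    /* Warp-level sum reduction; hypothetical illustration code.  */
    __device__ float warp_sum (float v)
    {
      for (unsigned delta = 16; delta > 0; delta /= 2)
        /* Pre-sm_70 style (deprecated/removed):
             v += __shfl_down (v, delta);
           ... implicitly involving the whole warp.  sm_70+ style: the
           same operation, but with an explicit 'membermask';
           '0xffffffff' means all 32 threads participate, reproducing
           the pre-Independent Thread Scheduling semantics.  */
        v += __shfl_down_sync (0xffffffff, v, delta);
      /* Lane 0 now holds the sum of all 32 lanes' inputs.  */
      return v;
    }

As far as I understand, the '.sync' variants with a full 'membermask' also
(re-)converge the participating threads at each call, which is what makes
this a drop-in replacement semantics-wise.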
As I understand it: always enabled; basically kind of a "hardware change".
In quotes, because: for the time being, you may avoid it by not compiling
for sm_7x (for example, compile for sm_6x, which does load on sm_7x
hardware), but that also keeps you from using other sm_7x features (not
relevant right now, but eventually), and eventually support for sm_6x and
earlier will be removed.  So we'll have to get this addressed at some
point.

> If threads in the warp no longer execute in lockstep, then I think it is
> not compliant to use the model we have for OpenMP with warps being OpenMP
> threads and threads in warp being SIMD lanes and we'd need to switch to
> have each thread in a warp being an OpenMP thread (so have 32 times more
> threads than before) and only a single SIMD lane in each thread (i.e.
> SIMD not used).

Maybe I do understand your concern -- or maybe I don't.  Will you please
provide an example?

If there is direct PTX thread-level communication (for example, "shuffle"
instructions; PTX old: 'shfl', new: 'shfl.sync'), that includes the
corresponding synchronization implicitly, and always has (old: all threads
of a warp; new: 'membermask' to specify the participating threads).  So
that continues to work as before, with unchanged user-visible semantics,
and a 'membermask' of '0xffffffff' simply prohibits Independent Thread
Scheduling (again, at least as far as user-visible, via communication
instructions).

The concern I'm having is whether there are bits in the nvptx back end
where we use communication *without* the implicitly synchronizing PTX
instructions, via device global or CTA shared memory, and currently rely
on "warp-synchronicity" (meaning: there may be divergent control flow, but
it has been guaranteed that individual PTX threads don't advance their PCs
individually).  This would then result in erroneous behavior with sm_70+,
and we'd need to insert explicit PTX synchronization instructions (I
suppose: 'bar.warp.sync': "Barrier synchronization for threads in a warp")
(... which the PTX JIT would optimize out for pre-sm_70, due to implicit
"warp-synchronicity").  For example, see section "Warp Synchronization" in
<https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/>, or
code pattern 2 in
<https://docs.nvidia.com/cuda/cuda-c-programming-guide/#independent-thread-scheduling-7-x>.
(CUDA '__syncwarp' maps to PTX 'bar.warp.sync'.)

So this concern would mostly (only?) relate to avoiding "Implicit
Warp-Synchronous Programming" (see above, Google, etc.) in the PTX code
synthesized by the nvptx back end, via RTL or PTX code templates.


Grüße
 Thomas
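P.S. To make the "Implicit Warp-Synchronous Programming" concern concrete,
here is roughly the shape of "code pattern 2" from the programming guide:
the warp-level tail of a CTA shared memory tree reduction, with the
'__syncwarp' ('bar.warp.sync') synchronization that sm_70+ requires.  (A
hand-written, hypothetical example, not actual nvptx back end output;
'blockDim.x == 64' assumed.)

    __global__ void reduce64 (const int *in, int *out)
    {
      __shared__ int shmem[64];
      unsigned tid = threadIdx.x;
      shmem[tid] = in[tid];
      __syncthreads ();
      if (tid < 32)
        {
          int v = shmem[tid] + shmem[tid + 32];
          /* Classic "warp-synchronous" code would now just do
             'shmem[tid] += shmem[tid + 16];', etc., relying on lockstep
             execution.  With sm_70+ Independent Thread Scheduling, each
             shared memory read has to be separated from the neighboring
             lanes' writes by '__syncwarp' (PTX 'bar.warp.sync'), or a
             lane may observe a stale or already-overwritten value.  */
          shmem[tid] = v;       __syncwarp ();
          v += shmem[tid + 16]; __syncwarp ();
          shmem[tid] = v;       __syncwarp ();
          v += shmem[tid + 8];  __syncwarp ();
          shmem[tid] = v;       __syncwarp ();
          v += shmem[tid + 4];  __syncwarp ();
          shmem[tid] = v;       __syncwarp ();
          v += shmem[tid + 2];  __syncwarp ();
          shmem[tid] = v;       __syncwarp ();
          v += shmem[tid + 1];
          if (tid == 0)
            *out = v;
        }
    }

For pre-sm_70, where warp-synchronicity is implicit, the PTX JIT should be
able to optimize the 'bar.warp.sync's away, so this ought to be (close to)
zero-cost there.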