Hi!

On 2021-07-13T17:59:43+0200, Jakub Jelinek <ja...@redhat.com> wrote:
> On Tue, Jul 13, 2021 at 05:48:51PM +0200, Thomas Schwinge wrote:
>> Starting with the Volta family (sm_70+), Nvidia GPUs introduced
>> Independent Thread Scheduling for the 32 threads ("32 SIMD lanes") that
>> constitute a warp, which means "execution state per thread, including a
>> program counter", succeeding the previous "warp-synchronous" abstraction
>> where "warps used a single program counter shared amongst all 32 threads
>> in the warp together with an active mask specifying the active threads of
>> the warp".  See
>> <https://docs.nvidia.com/cuda/parallel-thread-execution/#independent-thread-scheduling>,
>> <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture>,
>> <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#independent-thread-scheduling-7-x>,
>> etc.
>>
>> Per PR96005 commit 2a1586401a21dcd43e0f904bb6eec26c8b2f366b
>> "[nvptx] Add -mptx=3.1/6.3", Tom has already added implemented the
>> necessary 'shfl' -> 'shfl.sync' and 'vote' -> 'vote.sync' changes,
>> hard-coding a 'membermask' of '0xffffffff' (all threads participate).
>> This I understand to be the direct translation, avoiding the
>> deprecated/removed non-'.sync' variants of these instructions, but
>> otherwise maintaining the pre-Independent Thread Scheduling semantics (as
>> well as performance level, supposedly).  (Unless there are further
>> changes relevant to GCC/nvptx that I'm not currently seeing?) this means
>> that we now comply with the sm_70+ Independent Thread Scheduling
>> requirements -- but don't actually use its capabilities.
>>
>> Now, I haven't spent much thought on it yet, but it would seem to me (gut
>> feeling?) that continuing to maintain "warp-synchronicity" (that is,
>> avoid using Independent Thread Scheduling) should still yield best
>> performance?  Or, given the GCC/nvptx offloading context via
>> OpenACC/OpenMP, has anyone already given thought to how actually
>> using Independent Thread Scheduling would be beneficial?  Can it be
>> exploited via OpenACC/OpenMP directly?  Can it somehow be used to
>> increase performance?  Can it be used to simplify parts of the GCC/nvptx
>> back end implementation (without sacrificing performance -- is it a
>> zero-cost abstraction, compared to "warp-synchronicity")?
>
> Is it something that is always enabled on sm_70 and later hw or does a PTX
> program ask for independent thread scheduling?

As I understand it: always enabled; basically a kind of "hardware
change".  In quotes, because for the time being you might avoid it by
not compiling for sm_7x (for example, compile for sm_6x, which does load
on sm_7x hardware), but that will also prohibit you from using other
sm_7x features (not relevant right now, but eventually), and eventually
support for sm_6x and earlier will be removed.  So we'll have to get
this addressed at some point.


> If threads in the warp no longer execute in lockstep, then I think it is not
> compliant to use the model we have for OpenMP with warps being OpenMP
> threads and threads in warp being SIMD lanes and we'd need to switch to
> have each thread in a warp being an OpenMP thread (so have 32 times more
> threads than before) and only a single SIMD lane in each thread (i.e. SIMD
> not used).

Maybe I do understand your concern -- or maybe I don't.  Will you
please provide an example?

If there is direct PTX thread-level communication (for example,
"shuffle" instructions; PTX old: 'shfl', new: 'shfl.sync'), that already
includes -- and continues to include -- the corresponding
synchronization, implicitly (old: all threads of a warp; new:
'membermask' specifying the participating threads).  So that continues
to work as before, with unchanged user-visible semantics, and a
'membermask' of '0xffffffff' simply prohibits Independent Thread
Scheduling (again, at least as far as user-visible behavior via
communication instructions is concerned).
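
To make that concrete, here's a minimal CUDA sketch of my understanding
(illustrative only; kernel and variable names are mine, nothing from the
GCC/nvptx back end): CUDA '__shfl_sync' maps to PTX 'shfl.sync', with
the first argument becoming the 'membermask'.

    #include <cstdio>

    __global__ void broadcast_lane0 (int *out)
    {
      int lane = threadIdx.x % 32;
      /* Old style would have been '__shfl (lane, 0)', mapping to the
         deprecated non-'.sync' PTX 'shfl'.  New style: '0xffffffff'
         becomes the PTX 'membermask'; all 32 threads of the warp
         participate, and the instruction implicitly synchronizes them
         -- user-visibly, Independent Thread Scheduling is inhibited at
         this point.  */
      int b = __shfl_sync (0xffffffff, lane, 0);
      out[threadIdx.x] = b;
    }

    int main ()
    {
      int *out;
      cudaMallocManaged (&out, 32 * sizeof *out);
      broadcast_lane0<<<1, 32>>> (out);
      cudaDeviceSynchronize ();
      printf ("lane 5 got %d\n", out[5]);  /* Expect 0: lane 0's value.  */
      cudaFree (out);
      return 0;
    }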


The concern I do have is whether there are bits in the nvptx back end
where we use communication *without* the implicitly synchronizing PTX
instructions -- via device global or CTA shared memory -- and currently
rely on "warp-synchronicity" (that is: there may be divergent control
flow, but it used to be guaranteed that individual PTX threads don't
advance their PCs individually).  This would then result in erroneous
behavior on sm_70+, and we'd need to insert explicit PTX synchronization
instructions (I suppose: 'bar.warp.sync', "Barrier synchronization for
threads in a warp") -- which the PTX JIT would optimize out for
pre-sm_70, due to implicit "warp-synchronicity".

For example, see section "Warp Synchronization" in
<https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/>, or
code pattern 2. in
<https://docs.nvidia.com/cuda/cuda-c-programming-guide/#independent-thread-scheduling-7-x>.
(CUDA '__syncwarp' maps to PTX 'bar.warp.sync'.)
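
To illustrate that code pattern (again, my own sketch, not code from the
nvptx back end): a warp-level tree reduction through CTA shared memory
that pre-sm_70 implicitly relied on lock-step execution, made correct
for sm_70+ by explicit '__syncwarp':

    #include <cstdio>

    __global__ void warp_sum (const int *in, int *out)
    {
      __shared__ int s[32];          /* CTA shared memory.  */
      unsigned lane = threadIdx.x;   /* Launched with one warp: 32 threads.  */
      s[lane] = in[lane];
      __syncwarp ();                 /* PTX 'bar.warp.sync 0xffffffff'.  */
      for (unsigned off = 16; off > 0; off /= 2)
        {
          if (lane < off)            /* Divergent control flow.  */
            s[lane] += s[lane + off];
          /* Pre-sm_70, "warp-synchronicity" made this barrier implicit;
             on sm_70+, omitting it is a data race on 's' across loop
             iterations.  For pre-sm_70 targets, the PTX JIT can
             optimize 'bar.warp.sync' away.  */
          __syncwarp ();
        }
      if (lane == 0)
        *out = s[0];
    }

    int main ()
    {
      int *in, *out;
      cudaMallocManaged (&in, 32 * sizeof *in);
      cudaMallocManaged (&out, sizeof *out);
      for (int i = 0; i < 32; ++i)
        in[i] = i;
      warp_sum<<<1, 32>>> (in, out);
      cudaDeviceSynchronize ();
      printf ("sum = %d\n", *out);   /* Expect 496 = 0 + 1 + ... + 31.  */
      cudaFree (in);
      cudaFree (out);
      return 0;
    }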

So this concern would mostly (only?) relate to avoiding "Implicit
Warp-Synchronous Programming" (see above, Google, etc.) in the PTX code
synthesized by the nvptx back end, via RTL or PTX code templates.


Regards
 Thomas
