================
@@ -962,6 +962,109 @@ The ``griddepcontrol`` intrinsics allows the dependent 
grids and prerequisite gr
 For more information, refer 
 `PTX ISA 
<https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-griddepcontrol>`__.
 
+TCGEN05 family of Intrinsics
+----------------------------
+
+The llvm.nvvm.tcgen05.* intrinsics model the TCGEN05 family of instructions
+exposed by PTX. These intrinsics use 'Tensor Memory' (henceforth ``tmem``).
+NVPTX represents this memory using ``addrspace(6)`` and is always 32-bits.
+
+For more information, refer PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory>`_.
+
+The tensor-memory pointers may only be used with the tcgen05 intrinsics.
+There are specialized load/store instructions provided (tcgen05.ld/st) to
+work with tensor-memory.
+
+For more information on tensor-memory load/store instructions, refer
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory-and-register-load-store-instructions>`_.
+
+All tcgen05 intrinsics use a ``null`` pointer in tmem address
+space as their last operand. This helps to preserve ordering among the tcgen05
+operations especially when the intrinsic lacks any tmem operands. This
+last operand is dropped during Codegen.
----------------
Artem-B wrote:

After reading PTX docs here's my understanding of the situation.
- there's a new kind of memory, so creating a separate AS for tmem is 
reasonable.
- tcgen05.alloc returns allocation result indirectly, by storing it in a shared 
memory. So LLVM has no direct indication that the intrinsic operates on tmem 
and affects both shared memory and tmem
- it's not clear from PTX docs what's the input for tcgen05.dealloc. It just 
says "The operand taddr must point to a previous [Tensor 
Memory](https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-memory) 
allocation" but I can't tell if that means the previous location in the shared 
memory where it stored a tmem pointer, or the tmem pointer itself. Judging by 
the proposed intrinsic signature, it's the latter. In this case LLVM knows that 
we're touching tmem.
- relinquish_alloc_permit blocks subsequent allocations, so it must not be 
reordered vs allocs.

So, the only odd thing is the allocation returning the result indirectly.

Proposed design adds artificial tmem pointer to let LLVM know that all tcgen05 
intrinsics operate on tmem and we can give LLVM sufficient hints on how they 
should be ordered. However, the dummy argument is a crutch. 

The gist of the problem here is that LLVM's existing intrinsic annotation is 
not flexible enough to describe what we have here, exactly. I.e. there's no way 
to tell LLVM that alloc and relinquish_alloc_permit operate on tmem. 
Our current options are to either make all intrinsics conservatively with 
`HasSideEffects` or, with a more relaxed "IntrInaccessibleMemOnly". I think the 
latter would be a reasonable trade-off for the time being.

A longer-term approach would be to add a new intrinsic property allowing to 
specify specific AS accessed by the intrinsic. E.g. we may extend existing 
`IntrWriteMem` and `IntrWriteMem`  to allow narrowing the scope to particular 
AS, and allow specifying more than one. E.g. alloc would indicate that it 
writes both shared and tmem.
I think that would be a useful addition to a handful of other intrinsics we 
already have, not just in NVPTX, but in the other back-ends that need to deal 
with multiple AS.


https://github.com/llvm/llvm-project/pull/124961
_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to