Thank you for the comments, @vinx13.

> For example, another way is to use a mapping function: (n, c, h, w) -> (n, 
> tir.floordiv(c, 4), h, w, tir.floormod(c, 4)). This would allow arbitrary 
> mapping (we can add more restrictions like requiring affine mapping though, 
> to make analysis easier).

~~Having an arbitrary mapping function was something that I considered as an 
alternative, and would like to use as part of the internal representation.  
However, for the user-facing API, I would prefer a representation that cannot 
express an illegal internal state, rather than relying on validation.  An 
arbitrary mapping function could map more than one logical index to the same 
physical index (e.g. `(i,j) -> (i+j,)`), which would need to be caught after 
the fact.~~
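
The collision in that struck-through example can be checked directly; a 
minimal Python sanity check (illustration only, not RFC code):

```python
# The arbitrary mapping (i, j) -> (i + j,) is not one-to-one: two distinct
# logical indices can land on the same physical index.
def mapping(i, j):
    return (i + j,)

assert mapping(0, 3) == mapping(1, 2) == (3,)  # two logical -> one physical
```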

Never mind, I'm convinced after going through your point below about CUDA's 
permuted layout.

> A possible use cases of more complex mapping is [permuted 
> layout](https://github.com/NVIDIA/cutlass/blob/master/media/docs/implicit_gemm_convolution.md#shared-memory-layouts)
>  for shared memory on CUDA.

Thank you for the example; this is exactly the sort of case I was hoping to 
see in order to determine whether the representation of reorders/splits is 
sufficiently flexible.  If I'm reading the link correctly, I agree that the 
physical layout in shared memory cannot be expressed in the proposed notation.  
If the global memory is 2-d memory `(i,j)`, the closest approximation of the 
shared-memory physical layout would be `((i,2), (j,4), (j,8), SEP, (i,4))`.  
This would maintain the 128-bit vectors `(j,8)` and transpose because `(j,8)` 
appears in the first physical coordinate, but it couldn't handle the XOR 
needed for conflict-free access.  It also couldn't specify that the `(i,2)` in 
the first physical coordinate would need to take precedence over the `(i,4)` 
in the second physical coordinate.  In the functional notation, this would be 
`(i,j) -> (floormod(i,2), floormod(row,8) ^ floordiv(j,8), floormod(j,8), 
SEP, row)`, where `row = floordiv(i, 2)`.
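
For concreteness, here is that functional mapping written out in plain Python 
(a sketch only; `permuted_layout` is my own name, not anything from the RFC):

```python
# Logical index (i, j) -> physical coordinates
# (floormod(i, 2), floormod(row, 8) ^ floordiv(j, 8), floormod(j, 8), row),
# where row = floordiv(i, 2).
def permuted_layout(i, j):
    row = i // 2
    return (i % 2, (row % 8) ^ (j // 8), j % 8, row)

# Unlike (i, j) -> (i + j,), this mapping is one-to-one: every logical index
# in a 16x16 tile gets a distinct physical coordinate tuple.
seen = {permuted_layout(i, j) for i in range(16) for j in range(16)}
assert len(seen) == 16 * 16
```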

I am convinced that, while the earlier notation may be useful as a more 
readable input in some cases, it doesn't cover all use cases.  At most, it 
could be an alternative form of input, but the function mapping from logical 
indices to physical indices should be the primary representation.

In terms of how to represent this in the TIR graph, I'll have to think a bit on 
it.  I'm picturing either `BufferNode` holding `Array<Var> index_vars` and 
`Array<IterSumExpr> physical_layout`, or making a TIR variant of 
`tvm::relay::FunctionNode`.
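
As a toy illustration of the first option (not real TVM API; 
`BufferNodeSketch` and the string-valued expressions are stand-ins for the 
actual `Var` and `IterSumExpr` objects):

```python
from dataclasses import dataclass
from typing import List

# Toy model only: strings stand in for tir.Var and arith.IterSumExpr.
@dataclass
class BufferNodeSketch:
    index_vars: List[str]        # logical index variables, e.g. ["n", "c"]
    physical_layout: List[str]   # one expression per physical coordinate

# NCHW -> NCHW4c, as in the (n, c, h, w) example quoted above.
buf = BufferNodeSketch(
    index_vars=["n", "c", "h", "w"],
    physical_layout=["n", "floordiv(c, 4)", "h", "w", "floormod(c, 4)"],
)
assert len(buf.physical_layout) == 5
```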

> Also, there are related affine analysis infrastructure available, it would be 
> great if we can reuse it for loop analysis and rewriting.

Thank you for mentioning these.  I intend to use these utilities in the 
implementation, and will add them to the detail section.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/39#issuecomment-936560130
