Thanks for your comments @tqchen, much appreciated! I'd like to ask for some 
clarifications and expand on some of the points you made, based on my 
understanding. 

TL;DR:
- We need to be able to express `vscale`-dependent `extent`s in TIR `For` 
nodes
- Aside from predication, SVE vectors are not much different from fixed-length 
vectors, especially in terms of how they are represented in LLVM. The existing 
TVM infrastructure lends itself quite well to scalable vector support.

Here's a small LLVM example using scalable vectors that adds two arrays 
elementwise (without the cleanup loop):
```
entry:
  tail call void @llvm.memset.p0.i64(ptr noundef nonnull align 4 
dereferenceable(4000) %agg.result, i8 0, i64 4000, i1 false)
  %0 = tail call i64 @llvm.vscale.i64()
  %.neg = mul nuw nsw i64 %0, 1016
  %n.vec = and i64 %.neg, 1000
  %1 = tail call i64 @llvm.vscale.i64()
  %2 = shl nuw nsw i64 %1, 2
  %3 = tail call i64 @llvm.vscale.i64()
  %4 = shl nuw nsw i64 %3, 2
  %5 = tail call i64 @llvm.vscale.i64()
  %6 = shl nuw nsw i64 %5, 2
  %7 = tail call i64 @llvm.vscale.i64()
  %8 = shl nuw nsw i64 %7, 3
  br label %vector.body

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %9 = getelementptr inbounds i32, ptr %arr0, i64 %index
  %wide.load = load <vscale x 4 x i32>, ptr %9, align 4
  %10 = getelementptr inbounds i32, ptr %9, i64 %2
  %wide.load9 = load <vscale x 4 x i32>, ptr %10, align 4
  %11 = getelementptr inbounds i32, ptr %arr1, i64 %index
  %wide.load10 = load <vscale x 4 x i32>, ptr %11, align 4
  %12 = getelementptr inbounds i32, ptr %11, i64 %4
  %wide.load11 = load <vscale x 4 x i32>, ptr %12, align 4
  %13 = add nsw <vscale x 4 x i32> %wide.load10, %wide.load
  %14 = add nsw <vscale x 4 x i32> %wide.load11, %wide.load9
  %15 = getelementptr inbounds [1000 x i32], ptr %agg.result, i64 0, i64 %index
  store <vscale x 4 x i32> %13, ptr %15, align 4
  %16 = getelementptr inbounds i32, ptr %15, i64 %6
  store <vscale x 4 x i32> %14, ptr %16, align 4
  %index.next = add nuw i64 %index, %8
  %17 = icmp eq i64 %index.next, %n.vec
  br i1 %17, label %middle.block, label %vector.body
```
This is similar to the LLVM IR we need to lower to. 


> I think we should use tir intrinsics(as opposed to a new node, which would 
> add extra burdens in the IR)

I'll assume you meant intrinsics like the ones defined in 
https://github.com/apache/tvm/blob/main/include/tvm/tir/builtin.h - I could see 
`vscale` or something similar being defined as an intrinsic, since it just 
needs to be matched to `llvm.vscale` in the codegen. However, from a bit of 
experimentation, the main problem I see there is expressing `vscale`-dependent 
arithmetic. When we map fixed-shape tensors to scalable vectors, the `extent` 
of the `For` node becomes an expression involving `vscale`, so we need to be 
able to include `vscale` in arithmetic expressions. It looks like the 
intrinsics are passed around as `Call` or `Op` nodes, which don't mix well 
with `PrimExpr`. In that sense, a new node seems like a much less invasive 
change. Let me know if I have missed something there. 
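
To make this concrete, here is a rough sketch of the kind of `For` node we 
want to be able to construct, where the extent is an arithmetic expression 
involving `vscale`. It uses a plain `tir.Var` as a stand-in for a hypothetical 
`vscale` node/intrinsic so that it runs against current TVM; the names and 
shapes are just illustrative:
```
import tvm
from tvm import tir

# Stand-in for a hypothetical vscale node/intrinsic - the point is only that
# it needs to compose with ordinary PrimExpr arithmetic.
vscale = tir.Var("vscale", "int32")
i = tir.Var("i", "int32")

A = tir.decl_buffer((1024,), "int32", name="A")
B = tir.decl_buffer((1024,), "int32", name="B")

# One scalable iteration covers 4 * vscale lanes, so the serial extent of the
# loop becomes 1024 / (4 * vscale).
extent = tir.floordiv(1024, 4 * vscale)

# Scalar placeholder body; the real body would hold the scalable
# loads/adds/stores.
body = tir.BufferStore(B, tir.BufferLoad(A, [i]), [i])
loop = tir.For(i, 0, extent, tir.ForKind.SERIAL, body)
```
If `vscale` (however it ends up being represented) behaves like any other 
`PrimExpr`, expressions like the extent above flow through the existing 
arithmetic rewrites unchanged.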


> In general, it might be useful to know the information that a value is 
> multiple of something (e.g. 128), so having something like x * 128 might help

I'll assume you are referring to whether it's better to use 
`vfactor` or something like `4 * vscale`. I'm happy to go with `vscale`; I 
don't have a strong preference there. 
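
For what it's worth, writing it as `4 * vscale` does keep the "multiple of" 
information visible to the existing arithmetic machinery. A tiny illustration, 
again with a plain `Var` standing in for `vscale`:
```
import tvm
from tvm import tir

vscale = tir.Var("vscale", "int32")  # stand-in for a vscale node/intrinsic
analyzer = tvm.arith.Analyzer()

# The simplifier can already prove that 4 * vscale is divisible by 4.
print(analyzer.simplify(tir.floormod(4 * vscale, 4)))  # prints 0
```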

> I would still love us think about tensorization support in the codegen with 
> some form of loop annotation (without explicit vector dtypes), as they will 
> generalize across to more complex operations.

Do you mean lowering loops into something like 
```
@T.prim_func
def main(A: T.Buffer((50, 50), "float32"), B: T.Buffer((50, 50), "float32")):
    for i in T.VectorizeScalable(50):
        for j in range(50):
            B_1 = T.Buffer((2500,), data=B.data)
            A_1 = T.Buffer((2500,), data=A.data)
            B_1[i * 50 + j] = A_1[i * 50 + j]
```
out of which we can create the SVE vectors in the codegen? It is something we 
can think about; however, it is not clear to me why we would want to treat 
vectorizing for SVE differently from Neon. The decision to vectorize would 
still need to be made in scheduling, and during the TIR passes we would have 
an awkward situation where some vector operations are represented as ramps and 
others as hypothetical vectors that only come into existence during codegen. 
We'd miss out on the optimisations and simplifications in the lowering 
pipeline. Could you give an example of the more complex operations you are 
referring to? 
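
To sketch what I mean by keeping the two cases uniform (an illustration of the 
idea, not existing TVM behaviour - today `Ramp`/`Broadcast` lanes must be a 
constant integer): fixed-length vectorization already produces ramps, and the 
scalable case would ideally only differ in the lanes expression.
```
import tvm
from tvm import tir

i = tir.Var("i", "int32")

# Roughly what the existing VectorizeLoop pass produces for a fixed-length
# (e.g. Neon) vector access:
fixed_ramp = tir.Ramp(i * 4, 1, 4)

# The scalable counterpart would conceptually be
#     tir.Ramp(i * 4 * vscale, 1, 4 * vscale)
# (not constructible today, since lanes must be a constant int), after which
# the same simplification and lowering passes could keep applying.
```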

> One possible way to think about SVE is perhaps drawing inspiration from CUDA 
> programming,

I am not familiar with CUDA programming - can you point me to a relevant 
reference? 


