Thanks for your comments @tqchen, much appreciated! I'd like to ask for some clarifications and to expand on some of the points you made, based on my understanding.
TL;DR:

- We need to be able to express `vscale`-dependent `extent`s in the TIR `For` nodes.
- Aside from predication, SVE vectors are not much different from fixed-length vectors, especially in terms of how they are represented in LLVM. The existing TVM infrastructure lends itself quite well to scalable vector support.

Here's a small LLVM example with scalable vectors that adds two vectors (the cleanup loop is omitted):

```
entry:
  tail call void @llvm.memset.p0.i64(ptr noundef nonnull align 4 dereferenceable(4000) %agg.result, i8 0, i64 4000, i1 false)
  %0 = tail call i64 @llvm.vscale.i64()
  %.neg = mul nuw nsw i64 %0, 1016
  %n.vec = and i64 %.neg, 1000
  %1 = tail call i64 @llvm.vscale.i64()
  %2 = shl nuw nsw i64 %1, 2
  %3 = tail call i64 @llvm.vscale.i64()
  %4 = shl nuw nsw i64 %3, 2
  %5 = tail call i64 @llvm.vscale.i64()
  %6 = shl nuw nsw i64 %5, 2
  %7 = tail call i64 @llvm.vscale.i64()
  %8 = shl nuw nsw i64 %7, 3
  br label %vector.body

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %9 = getelementptr inbounds i32, ptr %arr0, i64 %index
  %wide.load = load <vscale x 4 x i32>, ptr %9, align 4
  %10 = getelementptr inbounds i32, ptr %9, i64 %2
  %wide.load9 = load <vscale x 4 x i32>, ptr %10, align 4
  %11 = getelementptr inbounds i32, ptr %arr1, i64 %index
  %wide.load10 = load <vscale x 4 x i32>, ptr %11, align 4
  %12 = getelementptr inbounds i32, ptr %11, i64 %4
  %wide.load11 = load <vscale x 4 x i32>, ptr %12, align 4
  %13 = add nsw <vscale x 4 x i32> %wide.load10, %wide.load
  %14 = add nsw <vscale x 4 x i32> %wide.load11, %wide.load9
  %15 = getelementptr inbounds [1000 x i32], ptr %agg.result, i64 0, i64 %index
  store <vscale x 4 x i32> %13, ptr %15, align 4
  %16 = getelementptr inbounds i32, ptr %15, i64 %6
  store <vscale x 4 x i32> %14, ptr %16, align 4
  %index.next = add nuw i64 %index, %8
  %17 = icmp eq i64 %index.next, %n.vec
  br i1 %17, label %middle.block, label %vector.body
```

That is similar to the LLVM we need to lower to.

> I think we should use tir intrinsics (as opposed to a new node, which would add extra burdens in the IR)

I'll assume you meant intrinsics like the ones defined in https://github.com/apache/tvm/blob/main/include/tvm/tir/builtin.h - I could see `vscale` or similar being defined as an intrinsic, since it just needs to be matched to `llvm.vscale` in the codegen. However, from a bit of experimentation, the main problem I see there is expressing `vscale`-dependent arithmetic. When we map fixed-shape tensors to scalable vectors, the `extent` of the `For` node becomes an expression involving `vscale`, so we need to be able to include `vscale` in arithmetic expressions (there is a sketch of what this would need to look like below). It looks like the intrinsics are passed around as `Call` or `Op` nodes, which don't mix well with `PrimExpr`. In that sense, another node seems like a much less invasive change. Let me know if I have missed something there.

> In general, it might be useful to know the information that a value is multiple of something (e.g. 128), so having something like x * 128 might help

I'll assume you are referring to whether it's better to use `vfactor` or something like `4 * vscale`. Happy to go with the `4 * vscale` style; I don't have a strong preference there.
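To make the arithmetic concern concrete, here is a rough sketch of the kind of TIR we would need to express. `T.vscale()` is a hypothetical intrinsic here (not an existing TVM builtin; the exact spelling is my assumption), the point being that whatever represents it has to compose with ordinary `PrimExpr` arithmetic:

```
from tvm.script import tir as T

@T.prim_func
def main(A: T.Buffer((128,), "float32"), B: T.Buffer((128,), "float32")):
    # The outer extent is an arithmetic expression over the hypothetical
    # T.vscale() intrinsic (Mul, FloorDiv, ...), so it has to behave like
    # any other PrimExpr throughout the lowering pipeline.
    for i in range(128 // (4 * T.vscale())):
        for j in range(4 * T.vscale()):
            B[i * 4 * T.vscale() + j] = A[i * 4 * T.vscale() + j]
```

Whichever representation we pick ends up inside `For` extents, buffer indices and vector lanes alike, which is why it needs to participate in ordinary arithmetic expressions.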
> I would still love us think about tensorization support in the codegen with some form of loop annotation (without explicit vector dtypes), as they will generalize across to more complex operations.

Do you mean lowering loops into something like

```
@T.prim_func
def main(A: T.Buffer((50, 50), "float32"), B: T.Buffer((50, 50), "float32")):
    for i in T.VectorizeScalable(50):
        for j in range(50):
            B_1 = T.Buffer((2500,), data=B.data)
            A_1 = T.Buffer((2500,), data=A.data)
            B_1[i * 50 + j] = A_1[i * 50 + j]
```

out of which we can create the SVE vectors in the codegen? It is something we can think about; however, it is not clear to me why we would want to treat vectorizing for SVE differently from Neon. The decision to vectorize would still need to be made during scheduling, and in the TIR passes we would have an awkward situation where some vector operations are represented as ramps while others are hypothetical vectors that only come into existence during codegen (see the sketch at the end of this reply). We'd miss out on the optimisations and simplifications in the lowering pipeline. Could you give an example of the more complex operations you are referring to?

> One possible way to think about SVE is perhaps drawing inspiration from CUDA programming,

I am not familiar with CUDA programming - could you point me to a relevant reference?
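Coming back to the ramp point above, here is a minimal sketch (built directly with the `tvm.tir` Python API; buffer shape and names are just for illustration) of how a fixed-width vectorized access exists as an ordinary `PrimExpr` during lowering:

```
import tvm
from tvm import tir

# A 4-lane vectorized load as vectorization represents it today: the Ramp
# index is a plain PrimExpr, visible to the arith/simplification passes.
A = tir.decl_buffer((2500,), "float32", name="A")
i = tir.Var("i", "int32")
vec_load = tir.BufferLoad(A, [tir.Ramp(i * 50, 1, 4)])
```

Under a codegen-only scheme, the scalable equivalent of this expression would never materialize in TIR, so the simplifications that currently run over such expressions during lowering could not apply to it.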