@tqchen Thanks for elaborating on the GPU programming model; I see the parallels between programming for a variable number of threads and for vectors of unknown length. The S1 option looks quite similar to what is described in this RFC, except that it uses scoping instead of marking the variable with `T.Vectorized`. What do you see as the benefits of the scoping approach?
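To make sure I'm picturing S1 correctly, here is roughly how I'd contrast the two styles in TVMScript-flavoured pseudocode. Both spellings below (`T.vscale`, `T.scalable_scope`) are hypothetical on my part, not settled syntax:

```python
# Hypothetical sketches only; neither form exists in TVM today.

# (a) This RFC: the scalable length appears in the vectorized loop extent.
for i in T.vectorized(4 * T.vscale()):
    C[i] = A[i] + B[i]

# (b) S1-style scoping: a scope binds the scalable length and everything
#     inside it is lowered with that vector length.
with T.scalable_scope() as vl:  # hypothetical construct
    for i in T.vectorized(vl):
        C[i] = A[i] + B[i]
```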
I should mention some of the technical goals we want to achieve that I have not spelled out before:

* the ability to mix fixed-length and scalable vectors in the same `PrimFunc`
* making scalable vectors a natural part of TVM's powerful scheduling, namely the various combinations of splitting loops, reordering them and vectorizing
* there are cases where we want to use scalable vectors and cases where we don't, and it depends on the details of the specific hardware - so getting to the point where the tuner can take part in that decision would be great

Not really a technical goal, but it would be nice to reuse as much of the current TVM infrastructure as possible. For example, all the arith rewrite rules still apply (except the ones that use the vector length as part of the simplification), and with the addition of mapping `vscale` to `llvm.vscale`, the LLVM codegen already supports simple contiguous unpredicated loads, stores and binary operations pretty much out of the box.

Speaking about reuse...

> ```n = T.let(call(tvm.builtin.vscale(), ()))```

Thanks for pointing this out! I'll do some further experimentation, but that combination of `call` and `let` seems to be sufficient to realize our goals. I don't want to introduce a new node if there is a decent existing alternative.

> Generalizing things a bit, say we are looking into higher dimensional instructions (e.g. SME), likely we need two or more variables (instead of a single vscale).

In SME we target the outer product engine by addressing the same SVE vectors, so there is still just one `vscale` in the program. (Technically there is the streaming mode, which implies a different scalable vector length, but that is controlled by a processor state, so the different lengths are not exposed to the software.) In general though, I think it is good to have a way to express different scalable vector lengths in the same code; it would make the implementation more general.

Maybe a few more words on SME, processor states etc. Our thinking so far has been influenced by the support for these extensions in LLVM. For SVE, all generic LLVM intrinsics are supported, various optimisations exist, and it is pretty much treated as just another set of vector registers, whereas SME is going to be targeted through AArch64-specific intrinsics only. So for SVE we'd like to keep relying on the optimisations at the LLVM stage and deal in TVM with the things LLVM can't do, like high-level loop reordering and tuning support. For SME, however, the plan is to use tensorize with a microkernel-style approach. The SME code would also need to execute in streaming mode, so using the context infra there is definitely something to consider.

I'll be away next week, but after that I'll look into updating the current proposal with the points we have agreed on so far. Also cc @neildhickey and his more substantial GPU experience.
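P.S. For concreteness, here is a minimal sketch of what the `let` + `call(vscale)` combination quoted above could look like when building the TIR expression directly. It assumes an intrinsic named `tir.vscale` is registered (the op name is a placeholder of mine), so treat this as an illustration rather than existing API:

```python
import tvm
from tvm import tir

# Assumption: a no-argument op "tir.vscale" returning the runtime
# vector-length multiplier is registered. The name is a placeholder.
vscale = tir.call_intrin("int32", "tir.vscale")

# Bind the call once with a Let so a scalable extent such as 4 * vscale
# can be reused, e.g. as the extent of a vectorized loop.
n = tir.Var("n", "int32")
extent = tir.Let(n, vscale, 4 * n)
```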