@tqchen Thanks for elaborating on the GPU programming model, I see the 
parallels between programming for a variable number of threads and vectors of 
unknown length. The S1 option looks quite similar to what is described in this 
RFC, except that it uses scoping instead of marking the variable with 
`T.Vectorized`. What do you see as the benefits of using the scoping?

I should mention some of the technical goals we want to achieve that I have not 
mentioned much before:
* Ability to mix fixed-length and scalable vectors in the same `PrimFunc`
* To make scalable vectors a natural part of TVM's powerful scheduling, namely 
the various combinations of splitting, reordering and vectorizing loops (see 
the sketch after this list)
* There are cases where we want to use scalable vectors and cases where we 
don't, and the choice depends on the details of the specific hardware - so 
getting to the point where we can use the tuner in that decision making would 
be great
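
To make the second goal concrete, here is a minimal sketch of how a scalable 
split could compose with the existing schedule primitives. The `T.vscale()` 
spelling of the scalable factor is an assumption from this RFC, not settled 
API:

```python
import tvm
from tvm.script import tir as T


@T.prim_func
def add(A: T.Buffer((128,), "float32"), B: T.Buffer((128,), "float32")):
    for i in range(128):
        with T.block("add"):
            vi = T.axis.remap("S", [i])
            B[vi] = A[vi] + 1.0


sch = tvm.tir.Schedule(add)
(i,) = sch.get_loops(sch.get_block("add"))
# Hypothetical scalable split factor: the inner loop gets 4 * vscale lanes.
_, i_inner = sch.split(i, factors=[None, 4 * T.vscale()])
sch.vectorize(i_inner)
```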

Not really a technical goal, but it would be nice to reuse as much of the 
current TVM infrastructure as possible. For example, all the arith rewrite 
rules still apply (except the ones that use the vector length as part of the 
simplification), and with the addition of mapping `vscale` to `llvm.vscale`, 
the LLVM codegen currently supports simple contiguous unpredicated loads, 
stores and binary operations pretty much out of the box.
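
As a small illustration of that reuse, treating `vscale` as an opaque symbolic 
value lets the stock rewrite rules fire unchanged; below a plain `tir.Var` 
stands in for the intrinsic's result:

```python
import tvm
from tvm import tir

analyzer = tvm.arith.Analyzer()
vscale = tir.Var("vscale", "int32")
# floordiv(vscale * 4, 4) folds back to vscale via the existing rewrite
# rules, with no scalable-vector-specific logic involved.
print(analyzer.simplify(tir.floordiv(vscale * 4, 4)))
```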


Speaking of reuse...

> ```n = T.let(call(tvm.builtin.vscale(), ()))```

Thanks for pointing this out! I'll do some further experimentation, but that 
combination of `call` and `let` seems to be sufficient to realize our goals. I 
don't want to introduce a new node if there is a decent existing alternative.
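
For reference, a minimal sketch of that combination at the expression level; 
the builtin name `tir.vscale` is an assumption on my part:

```python
import tvm
from tvm import tir

# Bind the result of a vscale builtin with a Let and reuse it in the body;
# "tir.vscale" is an assumed intrinsic name.
n = tir.Var("n", "int32")
vscale = tir.call_intrin("int32", "tir.vscale")
lanes = tir.Let(n, vscale, n * 4)  # e.g. a vector of 4 * vscale lanes
print(lanes)
```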

> Generalizing things a bit, say we are looking into higher dimensional 
> instructions(e.g. SME), likely we need two or more variables (instead of a 
> single vscale).

In SME we target the outer product engine by addressing the same SVE vectors, 
so there is still just one `vscale` in the program. (Technically there is the 
streaming mode, which implies a different scalable vector length, but that is 
controlled by a processor state, so the different lengths are not exposed to 
the software.) In general, though, I think it is good to have a way to express 
different scalable vector lengths in the same code, as it would make the 
implementation more general.
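
Purely hypothetically (neither symbol below is a registered builtin), two 
independent scalable lengths could be bound side by side with nested lets, 
using `call_extern` as a stand-in:

```python
import tvm
from tvm import tir

# Hypothetical: two independent scalable-length variables in one expression.
sve_len = tir.Var("sve_vscale", "int32")
sme_len = tir.Var("sme_vscale", "int32")
expr = tir.Let(
    sve_len, tir.call_extern("int32", "vscale"),                # stand-in
    tir.Let(
        sme_len, tir.call_extern("int32", "streaming_vscale"),  # hypothetical
        sve_len * 4 + sme_len * 16,
    ),
)
print(expr)
```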

A few more words on SME, processor states etc. Our thinking so far has been 
influenced by the support of these extensions in LLVM. While for SVE all the 
generic LLVM intrinsics are supported, various optimisations exist and it is 
pretty much treated like just another set of vector registers, SME is going to 
be targeted through AArch64-specific intrinsics only. So for SVE we'd like to 
continue using the optimisations at the LLVM stage and deal in TVM with the 
things LLVM can't do, like high level loop reordering and tuning support. For 
SME, however, the plan is to use tensorize with a microkernel-style approach 
(a rough sketch follows below). The SME code would also need to execute in the 
streaming mode, so using the context infra there is definitely something to 
consider.
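
To make the microkernel idea concrete, here is roughly how such a tensorize 
target could be registered; the fixed 4x4 shape, the `sme.outer_product` name 
and the `sme_fmopa_microkernel` extern symbol are all placeholders, not 
existing APIs:

```python
import tvm
from tvm.script import tir as T
from tvm.tir import TensorIntrin


# Description: the computation pattern tensorize should match
# (a fixed 4x4 outer-product accumulation as a stand-in).
@T.prim_func
def outer_product_desc(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (4,), "float32", offset_factor=1)
    B = T.match_buffer(b, (4,), "float32", offset_factor=1)
    C = T.match_buffer(c, (4, 4), "float32", offset_factor=1)
    with T.block("root"):
        T.reads(C[0:4, 0:4], A[0:4], B[0:4])
        T.writes(C[0:4, 0:4])
        for i, j in T.grid(4, 4):
            with T.block("update"):
                vi, vj = T.axis.remap("SS", [i, j])
                C[vi, vj] = C[vi, vj] + A[vi] * B[vj]


# Implementation: hand the matched region to a hand-written microkernel
# ("sme_fmopa_microkernel" is a hypothetical extern symbol).
@T.prim_func
def outer_product_impl(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (4,), "float32", offset_factor=1)
    B = T.match_buffer(b, (4,), "float32", offset_factor=1)
    C = T.match_buffer(c, (4, 4), "float32", offset_factor=1)
    with T.block("root"):
        T.reads(C[0:4, 0:4], A[0:4], B[0:4])
        T.writes(C[0:4, 0:4])
        T.evaluate(T.call_extern("int32", "sme_fmopa_microkernel",
                                 A.data, B.data, C.data))


TensorIntrin.register("sme.outer_product", outer_product_desc, outer_product_impl)
# Later, in a schedule: sch.tensorize(inner_loop, "sme.outer_product")
```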

I'll be away next week, but after that I will look into making changes to the 
current proposal based on the points we have agreed on so far. Also cc 
@neildhickey, who has more substantial GPU experience.
