> What I'm aiming at is to be able to lower the TIR to a generic CPU, that is 
> to an architecture that does not support SVE. The TIR will need to have some 
> default lowering in CodeGenLLVM/CodeGenCPU, so being able to do that is 
> important. For that, we should be able to assume that vscale is 1. The vscale 
> would simply be an indicator to the codegen (in TVM) that the code may be 
> lowered to SVE.

Right, I see... Would we get any benefit from mapping the scalable TIR vectors 
to fixed-length LLVM vectors for targets that don't support scalable vectors? 
At least for Arm's SVE implementations, all use of scalable vectors should be 
intentional, in this RFC proposal directed by target-dependent schedules 
(SVE is not preferable over fixed-length vectors in all cases). I think if I'm 
compiling code with scalable vectors for a target that doesn't support them, I'd 
rather it errored out, since something has gone wrong somewhere.

I was wondering if there is a case for schedules that would apply to all 
scalable architectures? My intuition would say no, since the implementations are 
sufficiently different, but it would be interesting to hear what others think. 

>  If you're sticking with vscale, then it may seem like we don't need it, but 
> the issue is with using "x * vscale" as an idiom: if you have several 
> occurrences of "4 * vscale" in an expression, it may end up being rearranged 
> to something like "(4*vi + 4) * vscale", or "ceildiv(128, 4 * vscale)" may 
> end up being "ceildiv(32, vscale)". So, instead of "x * vscale", I suggest 
> "vscale(x)".

Yes, that's a good point. I'll have to think about it a bit more, but I tend to 
agree. Besides the case you mentioned, I can think of some additional upsides - 
it will help with reliably handling scalable vectors in the TVM passes, 
since checking whether something is `vscale` is easier than checking whether it 
is an expression involving `vscale`. It also makes it easier to enforce that if 
`lanes` in the ramp is not an integer, it is `vscale` and not just an arbitrary 
expression. It shouldn't create significantly more complexity for the codegen 
either (we just need to emit an extra multiply when encountering `vscale`), so I 
think it would give us a more robust implementation.
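
To make the rearrangement point concrete, here is a toy illustration. It uses 
sympy as a stand-in for TVM's arithmetic rewriting (this is not TVM code, and 
`vscale_of` is just a made-up uninterpreted function): a bare `vscale` symbol 
gets folded into the surrounding constants, while an opaque call keeps the 
marker recognisable.

```python
import sympy as sp

# Bare symbol: constant folding erases the "4 * vscale" idiom.
vscale = sp.Symbol("vscale", positive=True, integer=True)
bare = sp.ceiling(128 / (4 * vscale))
print(bare)  # ceiling(32/vscale) -- the factor of 4 has been folded away

# Opaque call: the marker survives rewriting, so passes can pattern-match it.
vscale_of = sp.Function("vscale_of")
wrapped = sp.ceiling(128 / vscale_of(4))
print(wrapped)  # ceiling(128/vscale_of(4))
```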

> Could it instead be in a target-dependent lowering pass? That is, since a 
> lowering pass after BindTarget (here in driver_api.cc) would know whether the 
> target CPU supports SVE or not, we could make a pass that either returns the 
> IRModule unmodified for CPUs that support SVE, or converts it to non-SVE 
> instructions otherwise.

I suppose this is also related to whether we want to implicitly convert to/from 
scalable vectors. I think it is a cool idea - maybe an optional (command-line 
triggered) `IRModule` -> `IRModule` pass to turn the fixed-length vectors into 
scalable vectors (or vice versa), which users could experiment with without 
having to write schedules. I think this would be a future improvement; the goal 
of this RFC is to add the tools to the toolbox, give TVM users access to 
scalable vectors, and unblock SME (which will bring very significant 
performance improvements).
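
As a very rough sketch of what such an optional pass could look like (the pass 
name, the SVE feature check and the scalable-to-fixed rewrite are all 
hypothetical placeholders; only the pass-registration plumbing is existing TVM 
API):

```python
import tvm

def target_supports_sve(target):
    # Placeholder check: a real implementation would query the target's
    # architecture and feature flags (e.g. an AArch64 target with +sve).
    return "sve" in str(target)

def rewrite_scalable_to_fixed(mod):
    # Placeholder: a real pass would walk the TIR and replace scalable
    # vectors with fixed-length ones; here the module is returned untouched.
    return mod

@tvm.transform.module_pass(opt_level=0, name="HypotheticalLegalizeScalableVectors")
def legalize_scalable_vectors(mod, ctx):
    # Keep the module as-is for SVE-capable targets, otherwise hand it to
    # the (hypothetical) scalable-to-fixed rewrite.
    target = tvm.target.Target.current(allow_none=True)
    if target is not None and target_supports_sve(target):
        return mod
    return rewrite_scalable_to_fixed(mod)
```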

Regarding predication... In my mind the changes to support predication are 
necessary for SVE, but in terms of the code changes they are tangential. So we'd 
change the `BufferLoad` and `BufferStore` nodes to accept a predicate and change 
the `LoopVectorizer` so that, instead of scalarising a loop it can't exactly 
vectorize, it creates predicated buffer operations. I haven't implemented it 
and I'm not much of a `BufferLoad` expert, so I might be missing something 
there, but to me it looks like predication could be used without any SVE infra.
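
Purely to illustrate the shape of the idea, here is a numpy model of a 
predicated store (this is not a proposal for the actual `BufferStore` API; 
`predicated_store` is just an illustrative helper):

```python
import numpy as np

def predicated_store(buf, base, values, predicate):
    # Store values[i] to buf[base + i] only where predicate[i] is True;
    # masked-off lanes leave the buffer untouched.
    idx = base + np.arange(len(values))
    buf[idx[predicate]] = values[predicate]

# Vectorising a loop of extent 10 with 4 lanes: the last lane group covers
# indices 8..11, so the predicate masks off the two out-of-bounds lanes
# instead of falling back to a scalarised tail loop.
buf = np.zeros(10)
predicated_store(buf, 8, np.array([1.0, 2.0, 3.0, 4.0]),
                 np.array([True, True, False, False]))
```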
