> What I'm aiming at is to be able to lower the TIR to a generic CPU, that is to an architecture that does not support SVE. The TIR will need to have some default lowering in CodeGenLLVM/CodeGenCPU, so being able to do that is important. For that, we should be able to assume that vscale is 1. The vscale would simply be an indicator to the codegen (in TVM) that the code may be lowered to SVE.
Right, I see... Would we get any benefit from mapping the scalable TIR vectors to fixed length LLVM vectors for targets that don't support scalable vectors? At least for Arm's SVE implementations, all access to scalable vectors should be intentional - in this RFC proposal it is directed by target-dependent schedules (SVE is not preferable over fixed length vectors in all cases). I think if I'm compiling code with scalable vectors to a target that doesn't support it, I'd rather it errored out, since something has gone wrong somewhere (a rough sketch of a pass that does that is at the end of this comment).

I was wondering if there is a case for schedules that would apply to all scalable architectures? My intuition would say no, since the implementations are sufficiently different, but it would be interesting to hear what others think.

> If you're sticking with vscale, then it may seem like we don't need it, but the issue is with using "x * vscale" as an idiom: if you have several occurrences of "4 * vscale" in an expression, it may end up being rearranged to something like "(4*vi + 4) * vscale", or "ceildiv(128, 4 * vscale)" may end up being "ceildiv(32, vscale)". So, instead of "x * vscale", I suggest "vscale(x)".

Yes, that's a good point. I'll have to think about it a bit more, but I tend to agree. Besides the case you mentioned, I can think of some additional upsides: it will help with reliably handling the scalable vectors in the TVM passes, since checking if something is `vscale` is easier than checking if it is an expression involving `vscale`. It also makes it easier to enforce that if `lanes` in the ramp is not an integer, it is `vscale` and not just anything (there is a rough sketch of that check at the end of this comment too). It shouldn't create significantly more complexity for the codegen either (we just need to emit an extra multiply when encountering the `vscale`). So I think it would give us a more robust implementation.

> Could it instead be in a target-dependent lowering pass? That is, since a lowering pass after BindTarget (here in driver_api.cc) would know whether the target CPU supports SVE or not, we could make a pass that either returns the IRModule unmodified for CPUs that support SVE, or converts it to non-SVE instructions otherwise.

I suppose this is also related to whether we want to implicitly convert to/from scalable vectors. I think it is a cool idea - maybe an optional (command line triggered) `IRModule` -> `IRModule` pass to turn the fixed length vectors into scalable vectors (or vice versa), so that users can experiment with them without having to write schedules. I think this would be a future improvement, though; the goal of this RFC is to add the tools to the toolbox, give TVM users access to scalable vectors, and unblock SME (which will bring very significant performance improvements).

Regarding predication... In my mind the changes to support predication are necessary for SVE, but in terms of the code changes they are tangential. The idea is to change the `BufferLoad` and `BufferStore` nodes to accept a predicate, and to change the `LoopVectorizer` so that instead of scalarising a loop it can't exactly vectorize, it creates predicated buffer operations. I haven't implemented it and I'm not much of a `BufferLoad` expert, so I might be missing something there, but to me it looks like predication could be used without any SVE infra.
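To illustrate the semantics I have in mind for the predicated buffer operations, here is a plain numpy sketch (not TIR; `predicated_store` is a made-up helper for the example): a loop whose extent isn't a multiple of the vector length gets its tail handled by masking lanes rather than by falling back to a scalar loop.

```python
import numpy as np

def predicated_store(buf, base, values, predicate):
    # Made-up helper: store values[i] to buf[base + i] only where predicate[i]
    # is true. This is the behaviour a BufferStore with a predicate argument
    # would express.
    for i in range(len(values)):
        if predicate[i]:
            buf[base + i] = values[i]

n, lanes = 10, 4
a = np.zeros(n)
for base in range(0, n, lanes):
    # The last iteration nominally covers elements 8..11, but only 8 and 9 are
    # in bounds; the predicate masks off the out-of-bounds lanes.
    predicate = np.arange(base, base + lanes) < n
    predicated_store(a, base, np.ones(lanes), predicate)
```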
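Going back to the `vscale(x)` point above, a rough sketch of why the dedicated node makes pass code simpler. This assumes `vscale` would be represented as a `tir.Call` to a (not yet existing) `tir.vscale` intrinsic and that `Ramp`/`Broadcast` lanes can carry such an expression, as proposed here:

```python
from tvm import tir

def lanes_are_scalable(lanes) -> bool:
    # With a dedicated vscale node, "is this a scalable number of lanes?" is a
    # single node check; with a free-form "x * vscale" expression we would have
    # to pattern-match whatever the arithmetic simplifier left behind.
    return isinstance(lanes, tir.Call) and getattr(lanes.op, "name", "") == "tir.vscale"
```

Enforcing "if `lanes` is not a plain integer, it must be a `vscale` call" would then be a one-line check in the node constructors as well.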
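And on the erroring-out / target-dependent pass point, a very rough sketch of what a post-BindTarget check could look like, assuming the same hypothetical `tir.vscale` intrinsic; how exactly SVE support would be queried from the `Target` is glossed over here:

```python
import tvm
from tvm import tir

@tvm.tir.transform.prim_func_pass(opt_level=0)
def verify_no_scalable_vectors(func, mod, ctx):
    # Raise if any scalable vector (i.e. any use of the hypothetical tir.vscale
    # intrinsic) is left in a function compiled for a target without SVE.
    # Deciding whether the target supports SVE is left out of this sketch.
    def fvisit(node):
        if isinstance(node, tir.Call) and getattr(node.op, "name", "") == "tir.vscale":
            raise RuntimeError("Scalable vectors used, but the target does not support SVE")

    tir.stmt_functor.post_order_visit(func.body, fvisit)
    return func
```

The same shape of pass could instead rewrite the `vscale` calls to a fixed constant if we ever wanted the implicit-conversion variant from the quote above.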