To reiterate: my original concern was that the first RFC proposed changes to the target-independent part of TVM to add support for a very target-specific feature. However, I do think that we can move this forward in a way that would be overall useful.
Here is the outline of my thoughts on this. Let me know what you think.

First, a couple of observations:

1. Architectures that support vectors can be assumed to also support vector predication. I'm talking specifically about masked operations, and in particular about predicated loads and stores.
2. For ARM/AArch64, it may be beneficial to distinguish vectorization via fixed-length vectors from vectorization via scalable vectors. If this choice is to be made by auto-scheduling, it should be expressible in TIR.

What this RFC proposes is very close to allowing vectorization of countable loops with variable iteration count (VIC), and I insist that we keep this in mind as a goal.

The way that vectorization works right now is that a loop like
```
for (i : [0, 130)) {
  C[i] = A[i] + B[i]
  D[i] = A[i] * B[i]
}
```
will be replaced with statements
```
C[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] + B[Ramp(0, 1, 130)]
D[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] * B[Ramp(0, 1, 130)]
```
The expressions within these statements are all `PrimExpr`, whose type must be expressible by `DataType`. All parameters in `DataType` are compile-time integers, which means that a single statement can only represent vectors with a known number of lanes. In other words, neither VIC nor vector-length-agnostic (VLA) vectorization can be implemented without some changes. These changes may be in how types are represented in `DataType`, or in how vectorization is done (or a combination of the two). We are already considering a special value for `DataType::lanes` that would represent the yet-unknown vector length (VL).

Following Halide's approach to vectorization, I propose that we change vectorization to take an explicit vector length as a parameter. As a special case for SVE, the scalable VL could be represented by the same constant we choose for `DataType::lanes`. For compatibility with existing code, `stage.vectorize()` would be equivalent to `stage.vectorize(vector_length=iter_count)`, since currently only loops with a known iteration count can be vectorized.
The argument value `vector_length=VL` would indicate using SVE.

With `vectorize(vector_length=32)`, the loop above would be turned into
```
for (i : [0, (130+31)/32)) {
  // i-th vector is [32*i..32*(i+1))
  C[Ramp(32*i, 1, 32), pred=(Ramp(32*i, 1, 32) < Broadcast(130, 32))] = A[Ramp..., pred=...] + ...
  ...
}
```
If the loop iteration count changed from a known integer `130` to some expression `N`, the generated code would remain mostly the same: the structure does not depend on the fact that `130` is a compile-time constant. Similarly, the `32` indicating the vector length could be replaced with the predefined value for "scalable vector length"; the only potential issue is calculating the iteration count of the `for` loop above. If we were to allow an explicit "stride" in `For`, the issue would go away (the RFC proposes something like that).

To summarize:

1. Introduce `kScalableVectorLaneMark` (as suggested by @tqchen).
2. Make vector length a parameter to `stage.vectorize`.
3. Introduce "predicate" to `BufferLoad` and `BufferStore`.
4. Allow non-unit strides in `For` loops (as per the RFC).
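To make the predicated form concrete, here is a small plain-Python emulation of the strip-mined loop above. It is only a sketch of the intended semantics, not TVM code: `ramp`, `predicated_add` and `active_lane_counts` are hypothetical helpers written for this illustration, and the "masked" load/store is modelled by simply skipping inactive lanes.

```python
def ramp(base, stride, lanes):
    """Lane indices base, base+stride, ...; mirrors Ramp(base, stride, lanes)."""
    return [base + stride * k for k in range(lanes)]


def predicated_add(a, b, n, vector_length):
    """Emulate the strip-mined, predicated form of `C[i] = A[i] + B[i]`.

    `n` is an ordinary runtime value here; nothing in the loop structure
    requires it to be a compile-time constant.
    """
    c = [None] * n
    num_vectors = (n + vector_length - 1) // vector_length  # ceil(n / VL)
    for i in range(num_vectors):
        idx = ramp(vector_length * i, 1, vector_length)
        pred = [j < n for j in idx]  # Ramp(VL*i, 1, VL) < Broadcast(n, VL)
        for j, active in zip(idx, pred):
            if active:  # masked-off lanes neither load nor store
                c[j] = a[j] + b[j]
    return c


def active_lane_counts(n, vector_length):
    """Number of active lanes in each vector iteration (for illustration)."""
    num_vectors = (n + vector_length - 1) // vector_length
    return [
        sum(j < n for j in ramp(vector_length * i, 1, vector_length))
        for i in range(num_vectors)
    ]


n = 130
a = list(range(n))
b = [2 * x for x in a]
expected = [x + y for x, y in zip(a, b)]  # the original scalar loop
result = predicated_add(a, b, n, 32)
lane_counts = active_lane_counts(n, 32)
```

With `n = 130` and a vector length of 32, the loop runs five vector iterations and only the last one is partially masked (two active lanes). Calling `predicated_add` with a different `n` exercises the same code path, which is the point made above about replacing `130` with an expression `N`.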