To reiterate---my original concern was that the first RFC was proposing changes 
to the target-independent part of TVM to add support for a very target-specific 
feature.  However, I do think that we can move this forward in a way that would 
be overall useful.

Here is the outline of my thoughts on this.  Let me know what you think.

First, a couple of observations:
1. Architectures that support vectors can be assumed to also support vector 
predication.  I'm talking specifically about masked operations, and in 
particular about predicated loads and stores.
2. For ARM/AArch64, it may be beneficial to distinguish vectorization via 
fixed-length vectors from one via scalable vectors.  If this choice is to be 
made by auto-scheduling, it should be expressible in TIR.

What this RFC proposes is very close to allowing vectorization of countable 
loops with variable iteration count, and I insist that we keep this in mind as 
a goal.

The way that vectorization works right now is that a loop like
```
for (i : [0, 130)) {
  C[i] = A[i] + B[i]
  D[i] = A[i] * B[i]
}
```
will be replaced with statements
```
C[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] + B[Ramp(0, 1, 130)]
D[Ramp(0, 1, 130)] = A[Ramp(0, 1, 130)] * B[Ramp(0, 1, 130)]
```
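
For reference, here is a minimal sketch of how that request is made today
through the TE schedule API (the tensor names mirror the example above; the
iteration count 130 has to be a compile-time constant for this to work):
```python
import tvm
from tvm import te

n = 130
A = te.placeholder((n,), name="A")
B = te.placeholder((n,), name="B")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")
D = te.compute((n,), lambda i: A[i] * B[i], name="D")

s = te.create_schedule([C.op, D.op])
# Vectorizing the whole constant-extent axis turns it into a single Ramp statement.
s[C].vectorize(C.op.axis[0])
s[D].vectorize(D.op.axis[0])
print(tvm.lower(s, [A, B, C, D], simple_mode=True))
```
The lowered TIR printed at the end contains the two `Ramp` statements shown above.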
The expressions within these vectorized statements are all `PrimExpr`, whose type must be 
expressible by `DataType`.  All parameters in `DataType` are compile-time 
integers, which means that a single statement can only represent vectors with a 
known number of lanes.  In other words, neither VIC (variable iteration count) nor 
VLA (vector-length agnostic) vectorization can be implemented 
without some changes.  These changes may be in how types are represented in 
`DataType`, or in how vectorization is done (or a combination of these two).
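
To make the constraint concrete, here is a small sketch against the current
Python API; the lane count is baked into the expression's `DataType` and must
be a plain integer:
```python
import tvm

# The lane count is part of the DataType of the expression and must be a
# concrete integer, so a single vector statement always has a known width.
base = tvm.tir.IntImm("int32", 0)
stride = tvm.tir.IntImm("int32", 1)
vec = tvm.tir.Ramp(base, stride, 130)
print(vec.dtype)  # "int32x130" -- 130 lanes, fixed at compile time
```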

We are already considering a special value for `DataType::lanes` that would 
represent the yet-unknown vector length (VL).  Following Halide's approach to 
vectorization, I propose that we change vectorization to take an explicit 
vector length as a parameter.  As a special case for SVE, the scalable VL could 
be represented by the same constant we chose for `DataType::lanes`.  For 
compatibility with existing code, `stage.vectorize()` would be equivalent to 
`stage.vectorize(vector_length=iter_count)`, since currently only loops with a 
known iteration count can be vectorized.  The argument value `vector_length=VL` 
would indicate using SVE.  With `vectorize(vector_length=32)`, the loop above 
would be turned into
```
for (i : [0, (130+31)/32)) {
  // i-th vector is [32*i..32*(i+1))
  C[Ramp(32*i, 1, 32), pred=(Ramp(32*i, 1, 32) < Broadcast(130, 32))] =
      A[Ramp(32*i, 1, 32), pred=...] + B[Ramp(32*i, 1, 32), pred=...]
  ...
}
```
If the loop iteration count changed from a known integer `130` to some 
expression `N`, the generated code would remain mostly the same: the structure 
does not depend on the fact that `130` is a compile-time constant.  Similarly, 
the `32` indicating the vector length could be replaced with the predefined 
value for "scalable vector length"; the only potential issue would be 
calculating the iteration count of the `for` loop above.  If we were to allow 
an explicit "stride" in `For`, that issue would go away (the RFC proposes 
something like that).
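
At the schedule level, the proposal boils down to calls along these lines (a
sketch only: neither the `vector_length` parameter nor the `VL` constant
exists in TVM today, and `VL` stands for the proposed "scalable vector length"
marker):
```python
# Proposed stage.vectorize parameter -- not implemented yet.
# `s` and `C` are the TE schedule and tensor from the earlier sketch.
s[C].vectorize(C.op.axis[0])                    # today: vector_length == iter_count
s[C].vectorize(C.op.axis[0], vector_length=32)  # proposed: fixed 32-lane vectors + predication
s[C].vectorize(C.op.axis[0], vector_length=VL)  # proposed: scalable (SVE) vectors
```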

To summarize:
1. Introduce `kScalableVectorLaneMark` (as suggested by @tqchen).
2. Make vector length a parameter to `stage.vectorize`.
3. Introduce a "predicate" to `BufferLoad` and `BufferStore` (sketched below).
4. Allow non-unit strides in `For` loops (as per the RFC; also sketched below).
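
For points 3 and 4, a rough sketch of the shapes these could take (the
`predicate` and stride arguments are hypothetical; today's `BufferLoad`,
`BufferStore` and `For` constructors do not accept them):
```python
import tvm

# Building a lane-wise predicate is already expressible with today's API:
idx = tvm.tir.Ramp(tvm.tir.IntImm("int32", 0), tvm.tir.IntImm("int32", 1), 32)
pred = idx < tvm.tir.Broadcast(tvm.tir.IntImm("int32", 130), 32)  # boolean x 32

# The proposed extensions would look roughly like this (hypothetical arguments,
# commented out because the current constructors do not take them):
# load  = tvm.tir.BufferLoad(A_buf, [idx], predicate=pred)
# store = tvm.tir.BufferStore(C_buf, load, [idx], predicate=pred)
# loop  = tvm.tir.For(i, begin, extent, tvm.tir.ForKind.SERIAL, body)  # plus a "stride" field
```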
