It might be useful to also bring some of these discussions to the forums. Here is a quick related sketch of GPU-related models.
```python
for y in range(64):
    for x in range(64):
        C[y, x] = A[y, x] * (B[y] + 1)
```

Say we are interested in the original program above. In normal GPU programming terminology, we would map the compute over `x` to "threads", where `tid` is the thread index. GPU programming also has different memory scopes (I am using CUDA terminology here):

- local: the variable is local to each thread.
- shared: the variable is "shared" across threads; concurrently writing different values to the same shared variable is somewhat undefined.
- warp shuffle: sometimes we need to exchange data (e.g. take a sum) across the threads, which is done through shuffle instructions (like `warp.all_reduce`); a small simulation appears at the end of this note.

```python
for y in range(64):
    for x in range(64 // n):
        for tid in T.scalable_vectorized_as_threads(n):
            a0: local = A[y, tid + n * x]
            b0: shared = B[y]
            b1: shared = b0 + 1
            c0: local = a0 * b1
            C[y, tid + n * x] = c0
```

The above code is a rough sketch of what it might look like. Now, it might also be possible to produce a similar, more "vector-view" version using the following rule:

- local <=> vector<vscale>
- shared <=> normal scalar register

```python
# note vscale = n
for y in range(64):
    for x in range(64 // n):
        with T.sve_scope(n):
            a0: vector<vscale> = A[y, n * x : n * (x + 1)]
            b0: scalar = B[y]
            b1: scalar = b0 + 1
            c0: vector<vscale> = a0 * b1
            C[y, n * x : n * (x + 1)] = c0
```

They are not that different. But one thing is true: we do need to be able to identify the vector dtype differently from the scalar dtype (or, in the GPU case, local from shared). Being able to mark a dtype as ScalableVectorMark seems to serve that purpose.
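As a minimal illustration of that last point, here is one possible shape such a mark could take. This is a hedged sketch in plain Python; the class and field names are assumptions for illustration, not TVM's actual `DataType` layout:

```python
from dataclasses import dataclass

# Illustrative only: a dtype that records whether its lanes are scalable,
# so vector<vscale x 4 x float32> can be told apart from plain float32.
@dataclass(frozen=True)
class DType:
    code: str               # base type, e.g. "float32"
    lanes: int = 1          # lane count (multiplied by vscale when scalable)
    scalable: bool = False  # the "ScalableVectorMark"

f32 = DType("float32")
f32_sve = DType("float32", lanes=4, scalable=True)

# The mark alone is enough to distinguish the two dtypes in checks:
assert f32 != f32_sve
```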
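And to make the warp-shuffle bullet concrete, below is a small runnable simulation of a butterfly all_reduce across warp lanes. The helper names here are made up for illustration (real CUDA code would use the `__shfl_xor_sync` intrinsic); it only shows the data-exchange pattern behind something like `warp.all_reduce`:

```python
def shfl_xor(values, lane, mask):
    # Simulates a shuffle: lane reads the register held by lane ^ mask.
    return values[lane ^ mask]

def warp_all_reduce_sum(values):
    # Butterfly reduction: after log2(width) rounds of shuffles,
    # every lane holds the sum of all lanes' original values.
    width = len(values)  # warp width, assumed to be a power of two
    offset = 1
    while offset < width:
        values = [values[lane] + shfl_xor(values, lane, offset)
                  for lane in range(width)]
        offset *= 2
    return values

lanes = warp_all_reduce_sum(list(range(8)))
assert all(v == 28 for v in lanes)  # every "thread" now sees the full sum
```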