> Usage of te.AXIS_SEPARATOR: It seems this is only used in the API side but 
> not in BufferTransform, would be good to get some clarification.

That's correct: `te.AXIS_SEPARATOR` only appears in the API for TE schedules, 
not in the TIR graph generated from the TE schedule.  I've updated the RFC 
with a paragraph on this distinction.

For the axes returned, I'd want to add a third option.

* T2: Using `te.AXIS_SEPARATOR` to separate groups of axes, e.g. `AA = 
s[A].transform_layout(lambda i, j: [j//4, te.AXIS_SEPARATOR, i, j%4])`, 
directly implies fusion of the `i` and `j%4` axes, but exposes each transformed 
axis for use in the TE schedule.  That is, `AA.op.axis` has three elements, 
whose iteration bounds are `[A.shape[1]//4, A.shape[0], 4]`.
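
To make the T2 semantics concrete, here is a minimal numpy sketch of the index 
math that example implies.  The shape values (`M = 16`, `N = 8`) are arbitrary 
illustrations, not part of the proposal, and numpy is only standing in for the 
buffer rewriting that would happen in TIR.

```python
import numpy as np

# Logical buffer A with shape (M, N); assume N is divisible by 4.
M, N = 16, 8
A = np.arange(M * N).reshape(M, N)

# The T2 mapping lambda i, j: [j//4, te.AXIS_SEPARATOR, i, j%4] produces a
# transformed view with three axes of extents [N//4, M, 4].
AA = np.empty((N // 4, M, 4), dtype=A.dtype)
for i in range(M):
    for j in range(N):
        AA[j // 4, i, j % 4] = A[i, j]

# The separator groups the axes as [j//4] and [i, j%4], so the physical
# buffer fuses the last two axes into one, giving shape [N//4, M*4].
AA_physical = AA.reshape(N // 4, M * 4)

# The physical location of the logical element A[i, j] is then
# AA_physical[j//4, i*4 + j%4].
i, j = 2, 3
assert AA_physical[j // 4, i * 4 + j % 4] == A[i, j]
```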

Then my thoughts on each option are below, with T2 as my preferred option; a 
side-by-side sketch of the index mapping each option implies follows the list.

* T0: This is easy to write in TE, but any additional manipulation of the axis 
(e.g. changing the loop ordering independently of the data layout) would require 
splitting the returned axis to match the layout transformation.  It would also 
mean that `lambda *indices: indices` isn't the identity function, because it 
would flatten the buffer as part of that transformation.
* T1: As you mentioned, this would require explicitly writing down the 
transformation to fuse the axes, which would be error-prone.  It would also 
turn the default behavior, where all buffer axes are fused into a single index, 
into more of a special case.
* T2: Doesn't require the user to specify the transformation to fuse the axes, 
and allows the corresponding loops to be easily reorganized.  The fusion in the 
default case, where no `transform_layout` is applied, then follows the same 
pattern: a `BufferTransform` node is inserted to fuse all remaining axes into a 
single axis (`num_axis_separators+1 == 1`).
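
To compare the options concretely, here is a hedged sketch of the index mapping 
a user would write for the same target layout under each option.  `M` is the 
extent of axis `i` from the example above, and `AXIS_SEPARATOR` is a local 
stand-in for the RFC-proposed `te.AXIS_SEPARATOR` sentinel, so the sketch 
doesn't depend on an API that isn't merged yet.

```python
M = 16  # extent of axis i in the running example (illustrative value)

# Stand-in for the RFC-proposed te.AXIS_SEPARATOR sentinel.
AXIS_SEPARATOR = object()

# T0: everything fused into a single returned index; re-splitting would be
# needed before the loop structure can be reorganized in the schedule.
t0 = lambda i, j: [(j // 4) * (M * 4) + i * 4 + j % 4]

# T1: the fusion of i and j%4 is written out explicitly by the user.
t1 = lambda i, j: [j // 4, i * 4 + j % 4]

# T2: the grouping is marked with the axis separator; fusion is implied,
# but each transformed axis stays visible to the TE schedule.
t2 = lambda i, j: [j // 4, AXIS_SEPARATOR, i, j % 4]
```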


> From TIR's perspective, the same primitive can be introduced to TIR as well. 
> It will be an eager transformation (without needing to create BufferTransform 
> node) that rewrite the both producer/consumer immediately upon calling it. 
> This is required if we want to further scheduling the transformed block. I 
> can follow up implementing it in TIR later

When the transformation is performed, would that include fusing axes together?  
If it does, then the `FlattenBuffer` pass would need to distinguish between a 
2-d buffer that should be flattened to a flat 1-d buffer, and a 2-d buffer that 
has already been flattened from some larger number of dimensions.  If it 
doesn't, then the `FlattenBuffer` pass would need to distinguish between an N-d 
buffer that should be flattened to a 1-d buffer, and an N-d buffer that should 
be flattened to a 2-d buffer.  In either case, `FlattenBuffer` would need some 
additional metadata to know how the buffer should be handled.
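
As a concrete illustration of the ambiguity (with hypothetical shapes, not 
taken from any particular schedule), consider two buffers that reach 
`FlattenBuffer` with the same 2-d shape but need different handling:

```python
import numpy as np

# Case 1: a logical 2-d buffer of shape (4, 16) that should still be
# flattened to a flat 1-d backing buffer, with index i*16 + j.
logical_2d = np.arange(64).reshape(4, 16)
flat_1d = logical_2d.reshape(64)
assert flat_1d[2 * 16 + 5] == logical_2d[2, 5]

# Case 2: a 2-d buffer of the same shape (4, 16) that is *already* the
# result of flattening a logical 3-d buffer (e.g. shape (4, 4, 4) with an
# axis separator after the first axis).  Its 2-d shape is the final
# physical layout and must not be flattened any further.
logical_3d = np.arange(64).reshape(4, 4, 4)
already_flat_2d = logical_3d.reshape(4, 16)
assert already_flat_2d[2, 1 * 4 + 3] == logical_3d[2, 1, 3]

# From the buffer shape alone the two cases are indistinguishable; the pass
# would need extra metadata (e.g. the number of axis separators) to know
# which flattening is still required.
```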

I like the idea of having the transformation be done eagerly, to avoid needing 
to pass through additional information, but I'm not quite seeing how it could 
be done.  (I had been trying an initial implementation that way on the TE side.)
