Thanks @Lunderberg for the update. I think we are moving in a positive direction for the overall IR design. Some additional feedback:

## Keep Schedule Decisions Local to PrimFunc then Compose

On schedule primitives, to be pragmatic, it would be helpful to have some of the cross-PrimFunc re-flowing done in two steps. Specifically, some of the functions in your `transform_layout` examples touch buffers that involve inputs. One approach is of course to trace up to the producers and rewrite the producer functions as well (or trace down to the consumer functions). However, the complication here is that:

- There can be multiple consumer/producer TIR functions.
- In certain cases the producer/consumer may not have consistent requirements.
- The producers/consumers themselves can have their own local layout preferences that need to be consolidated.

In general it is helpful to first keep schedule decisions local, e.g. by introducing a caching stage (AC, BC in the example), and then compose with another reflowing pass to bring the decision to the consumers/producers. This is mainly to reduce the overall complexity of implementing such transformations, and it also makes things more modular.

```python
@T.prim_func
def grow(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")

    # Local caching stage: pad the 14-element input into a 4x4 buffer.
    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

    for i, j in T.grid(4, 4):
        BC[i, j] = 2 * AC[i, j]

    # Local caching stage: crop the 4x4 result back to 14 elements.
    for i in T.grid(14):
        with T.block():
            T.block_attr("postproc", ["crop", 0])
            B[i] = BC[i // 4, i % 4]

@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    for i in T.grid(14):
        B[i] = A[i] + 1

@R.func
def main(A: T.Tensor[14, "int32"]):
    lv0 = call_tir(grow, [A], (14))
    # an intermediate stage to show non-local reflowing
    lv1 = call_tir(addone, [lv0], (14))
    lv2 = call_tir(grow, [lv1], (14))
    ...
```

## Use IfThenElse expression for Padding.

While it is possible to express padding with one loop that copies the data and another loop that writes the padded value, the resulting blocks are harder to schedule because the buffer then has more than one producer. Using a single loop with `T.if_then_else` expresses the pattern in one shot and makes future rewriting easier.

```python
for io, ii in T.grid(4, 4):
    with T.block():
        T.block_attr("preproc", "pad")
        AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)
```

## Propagate Padding Decisions from the End.

Some of the complications around duplicated conditions (and their simplification) are rooted in the fact that we do the layout transform of the output and the input separately, each introducing its own conditions, which then need to be simplified. It might be helpful to do a global transformation, usually driven from the output, and then "backprop" the implications of that decision to the input. Doing such a transformation in a single shot will likely alleviate the need to generate extra conditions and then simplify them.
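
To make the pad/compute/crop flow of `grow` concrete, here is a rough plain-Python emulation. It is a sketch only: it is not TVMScript and does not reflect how TVM implements any of this. The helper names (`flat_index`, `in_bounds`, `pad_to_grid`, `crop_from_grid`) are hypothetical. It assumes the 4x4 padded layout is chosen once, from the output side, so a single predicate is derived and reused rather than generated separately for input and output.

```python
# Plain-Python emulation of the `grow` example; all helper names are hypothetical.
N = 14              # logical extent of A and B
ROWS, COLS = 4, 4   # padded layout, chosen once from the output side


def flat_index(io, ii):
    """Map a padded 2-D index back to the logical 1-D index."""
    return COLS * io + ii


def in_bounds(io, ii):
    """Single predicate derived from the output layout and reused for the input."""
    return flat_index(io, ii) < N


def pad_to_grid(a, pad_value=0):
    """One loop nest with a conditional expression, mirroring T.if_then_else."""
    return [
        [a[flat_index(io, ii)] if in_bounds(io, ii) else pad_value
         for ii in range(COLS)]
        for io in range(ROWS)
    ]


def crop_from_grid(grid):
    """Read back only the N valid elements of the padded buffer."""
    return [grid[i // COLS][i % COLS] for i in range(N)]


def grow(a):
    ac = pad_to_grid(a)                        # "preproc": pad
    bc = [[2 * v for v in row] for row in ac]  # compute on the padded layout
    return crop_from_grid(bc)                  # "postproc": crop


a = list(range(1, N + 1))
assert grow(a) == [2 * v for v in a]
```

Because every padded element has exactly one producer (the conditional expression), there is no second loop writing the pad value, which is the property the single `T.if_then_else` form is meant to preserve.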