> In general it is helpful to first keep schedule decisions local, e.g. introducing a caching stage (AC, BC in the example), then compose with another reflowing pass to bring the decision to consumers/producers.

My goal with the latest update wasn't to require global decisions, but to make local changes only, which could be used in different contexts. For the auto-scheduler, since the context requires maintaining the same PrimFunc interface, local optimization would be restricted to transformations of the caching stage. For stand-alone usage, such as preparing a single PrimFunc for a unit test, the context allows the interface to change. That way, the restrictions on the transformations are imposed by the level of abstraction that requires them.
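To make the distinction concrete, here is a minimal sketch of the local-decision workflow: a cache stage is introduced so the PrimFunc interface stays untouched, and only that stage's layout is transformed. The `elementwise` function and the index map are hypothetical, and the `pad_value` argument is the one proposed in this RFC, so the exact spelling may differ from what eventually lands:

```python
import tvm
from tvm import tir
from tvm.script import tir as T

# Hypothetical 14-element elementwise PrimFunc.
@T.prim_func
def elementwise(A: T.Buffer((14,), "int32"), B: T.Buffer((14,), "int32")):
    for i in T.serial(14):
        with T.block("B"):
            vi = T.axis.remap("S", [i])
            B[vi] = A[vi] * 2

sch = tir.Schedule(elementwise)
block = sch.get_block("B")

# Local decision: cache the input.  The PrimFunc's interface (A, B) is
# unchanged, so this is usable in the auto-scheduler context.
cache = sch.cache_read(block, read_buffer_index=0, storage_scope="global")

# Transform only the cache stage's layout, padding 14 elements up to 4*4=16.
sch.transform_layout(cache, ("write", 0),
                     lambda i: [i // 4, i % 4],
                     pad_value=0)
```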
> While it is possible to express padding with a loop and another loop that writes the padded value, it is harder to schedule the resulting blocks as there are more than one producer. Having a single loop and using `T.if_then_else` will express such a pattern in a single shot and makes future rewriting easier.

I definitely agree that this makes the later analysis/rewrites easier. I had maintained them as two separate loops both to minimize the extent of the changes made in any one scheduling step, and to maintain the current behavior of `Schedule.transform_layout`, which does not alter the surrounding loop iterators ([previous conversation](https://github.com/apache/tvm/pull/10538#discussion_r826209815) with @vinx13).

I see four main options for how the loop nests could be handled (a sketch contrasting the two-stage and single-stage forms follows the list):

1. When a buffer is transformed, all loops in the producer stage over the buffer's pre-transformation axes are replaced with loops over the buffer's post-transformation spatial dimensions. It is an error if this replacement cannot be done (e.g. the pre-transformation loops have been fused/split/reordered).

   - Pro: Allows the `T.if_then_else` to be inserted at the time of the transformation.
   - Pro: Removes the need for a follow-up padding stage.
   - Con: May restrict the search space, since earlier manipulation of the loop iterators would prevent later buffer transformations.
   - Con: Doesn't help consumers of a transformed buffer. In a reduction, it may be desirable to iterate over the input buffer, but this couldn't be expressed in terms of an output.
   - Con: For buffers whose padding is not written to, must either insert a conditional statement or maintain the pre-transformation loop structure.

2. When a buffer is transformed, an attempt is made to replace all loops in the producer stage over the buffer's pre-transformation axes with loops over the buffer's post-transformation spatial dimensions. If this replacement cannot be done (e.g. the pre-transformation loops have been fused/split/reordered), and if `pad_value` is not `None`, then an error is raised.

   - Pro: Always valid to apply a transform.
   - Pro: Avoids undoing scheduling benefits from previous changes to the iterators.
   - Pro: Later calls to `reduce_branching_through_overcompute` could still introduce a value for the padding, if the full life cycle of the buffer is known.
   - Con: Allowing the follow-up stage at all requires just as much analysis to identify as if it were always present.

3. When a buffer is transformed, all loops in the producer stage over the buffer's pre-transformation axes are replaced with loops over the buffer's post-transformation spatial dimensions. If this replacement cannot be done (e.g. the pre-transformation loops have been fused/split/reordered), then the follow-up stage is inserted.

   - Pro: Always valid to apply a transform.
   - Pro: Avoids undoing scheduling benefits from previous changes to the iterators.
   - Con: Allowing the follow-up stage at all requires just as much analysis to identify as if it were always present.

4. When a buffer is transformed, all loops over spatial dimensions in the producer are replaced with loops over the post-transformation buffer axes.

   - Pro: Always valid to apply a transform.
   - Con: May undo scheduling that has previously provided useful performance improvements.
   - Con: Loop iterators over pre-transformation indices may have been fused with reduction axes. Would need to undo the fusion to apply.
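As a concrete point of reference for the options above, here is a hedged sketch of the two forms under discussion, using a hypothetical 14-element buffer padded to a 4x4 layout. The exact TVMScript spelling (e.g. the `dtype` argument of `T.if_then_else`) may vary by TVM version:

```python
from tvm.script import tir as T

# Two-producer form: the transformed loop, plus a separate loop that writes
# the padded value.  Harder to schedule, since AC has two producers.
@T.prim_func
def two_stage(A: T.Buffer((14,), "int32"), AC: T.Buffer((4, 4), "int32")):
    for i in T.serial(14):
        AC[i // 4, i % 4] = A[i]
    for io, ii in T.grid(4, 4):
        if io * 4 + ii >= 14:
            AC[io, ii] = 0

# Single-producer form (options 1-3): one loop nest over the
# post-transformation axes, with the pad value expressed inline.
@T.prim_func
def single_stage(A: T.Buffer((14,), "int32"), AC: T.Buffer((4, 4), "int32")):
    for io, ii in T.grid(4, 4):
        AC[io, ii] = T.if_then_else(
            io * 4 + ii < 14, A[io * 4 + ii], 0, dtype="int32")
```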
The current proposed version is option 4, but I think I'd prefer option 2, in order to reduce the number of follow-up simplifications required.

> Some of the complications of duplicated conditions (and their simplification) root from the fact that we do layout transform of output and input separately (each introducing their own conditions which then need to be simplified). It might be helpful to do a global transformation, usually driven from the output, then "backprop" the implication of that decision to the input. Doing such a transformation in a single shot will likely alleviate the need of generating extra conditions and then simplifying them.

At the TIR level, I suppose I'm unclear on what "'backprop' the implication of that decision to the input" would mean, since changing the layout of one buffer doesn't strictly require changing the layout of other buffers. Intuitively, I can picture how it would apply to some operators (e.g. perform analogous transformations on the inputs of element-wise functions) and how those could be identified (e.g. track which indices are used to access each buffer, and identify corresponding shapes from the indices), but I'm unclear as to how a similar intuition would be applied to more complicated functions. (I'm also not sure whether this would require a similarly difficult sequence of proofs as the proposed transforms, just with the goal of proving a preferred layout rather than proving a possible simplification.) We could allow the user to specify transformations of all buffers simultaneously, but this wouldn't really solve the problem, as the simplifications made would still need to be based on the information provided (a sketch of this simultaneous interface is at the end of this comment).

At the graph level, I don't think a single direction of constraint propagation is sufficient. Backward propagation, starting with the output values returned to the user, could track which indices contribute to that final output, which could be exposed to producers. Forward propagation, starting with the input values provided by the user, could track which indices of intermediate buffers contain known values, which could be exposed to consumers.

With these uncertainties, I'm starting to think of `layout_transform` and `pad_value` not as a complete end-to-end handling in themselves, but as providing a platform on which the graph-level reasoning can be built. That is, they don't themselves perform the graph-level reasoning, but can accept the layout/padding requirements given by graph-level reasoning.
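For reference, a minimal sketch of the simultaneous-transform interface mentioned above, reusing the hypothetical `elementwise` example from earlier; passing `pad_value` here is the RFC's proposal rather than existing behavior:

```python
from tvm import tir

sch = tir.Schedule(elementwise)
block = sch.get_block("B")

# Drive the transform from the output, then apply the analogous transform to
# the input, so the producer- and consumer-side conditions line up.
index_map = lambda i: [i // 4, i % 4]
sch.transform_layout(block, ("write", 0), index_map, pad_value=0)
sch.transform_layout(block, ("read", 0), index_map, pad_value=0)
```

Even with both transforms specified up front, the simplifier would still need the same buffer-lifetime information to cancel the paired conditions, which is the concern raised above.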