> In general it is helpful to first keep schedule decisions local, e.g. 
> introducing a caching stage (AC, BC in the example), then compose with another 
> reflowing pass to bring the decision to consumers/producers.

My goal with the latest update wasn't to require global decisions, but to make 
local changes only, which could be used in different contexts.  For the 
auto-scheduler, since the context requires maintaining the same PrimFunc 
interface, local optimization would be restricted to transformations of the 
caching stage.  For stand-alone usage, such as preparing a single PrimFunc for 
a unit test, the context allows the interface to change.  That way, the 
restrictions on the transformations are imposed by the level of abstraction 
that requires them.

> While it is possible to express padding with a loop and another loop that 
> writes the padded value, it is harder to schedule the resulting blocks as 
> there is more than one producer. Having a single loop and using 
> `T.if_then_else` expresses such a pattern in a single shot and makes future 
> rewriting easier.

I definitely agree that this makes the later analysis/rewrites easier. I had 
maintained them as two separate loops, both to minimize the extent of the 
changes made by any single scheduling call, and to preserve the current 
behavior of `Schedule.transform_layout`, which does not alter the surrounding 
loop iterators ([previous 
conversation](https://github.com/apache/tvm/pull/10538#discussion_r826209815) 
with @vinx13).
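
For concreteness, here is a rough TVMScript sketch of the two forms (the shapes
and pad extent are invented for illustration, and the exact script syntax may
vary between TVM versions):

```python
from tvm.script import tir as T

# Two-loop form: one loop copies the data, a second writes the padding.
# Each loop yields its own block, so B has two producers.
@T.prim_func
def two_loops(a: T.handle, b: T.handle):
    A = T.match_buffer(a, (14,), "int32")
    B = T.match_buffer(b, (16,), "int32")
    for i in T.serial(14):
        with T.block("copy"):
            vi = T.axis.spatial(14, i)
            T.reads(A[vi])
            T.writes(B[vi])
            B[vi] = A[vi]
    for i in T.serial(2):
        with T.block("pad"):
            vi = T.axis.spatial(2, i)
            T.writes(B[vi + 14])
            B[vi + 14] = 0


# Single-loop form: one block selects between data and pad value with
# T.if_then_else, leaving B with a single producer for later rewrites.
@T.prim_func
def single_loop(a: T.handle, b: T.handle):
    A = T.match_buffer(a, (14,), "int32")
    B = T.match_buffer(b, (16,), "int32")
    for i in T.serial(16):
        with T.block("copy_and_pad"):
            vi = T.axis.spatial(16, i)
            T.reads(A[0:14])
            T.writes(B[vi])
            B[vi] = T.if_then_else(vi < 14, A[vi], 0, dtype="int32")
```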

I see four main options for how the loop nests could be handled:

1. When a buffer is transformed, all loops in the producer stage over the 
buffer's pre-transformation axes are replaced with loops over the buffer's 
post-transformation spatial dimensions.  It is an error if this replacement 
cannot be done (e.g. the pre-transformation loops have been 
fused/split/reordered).

   - Pro: Allows the `T.if_then_else` to be inserted at the time of the 
transformation.
   - Pro: Removes the need for the follow-up stage.
   - Con: May restrict search space, since earlier manipulation of the loop 
iterators would prevent later buffer transformations.
   - Con: Doesn't help consumers of a transformed buffer.  In a reduction, it 
may be desirable to iterate over the input buffer, but this couldn't be 
expressed in terms of an output.
   - Con: For buffers whose padding is not written to, one must either insert a 
conditional statement or maintain the pre-transformation loop structure.

2. When a buffer is transformed, an attempt is made to replace all loops in the 
producer stage over the buffer's pre-transformation axes with loops over the 
buffer's post-transformation spatial dimensions.  If this replacement cannot be 
done (e.g. the pre-transformation loops have been fused/split/reordered) and 
`pad_value` is not `None`, then an error is raised.

   - Pro: Always valid to apply a transform.
   - Pro: Avoids undoing scheduling benefits from previous changes to iterators.
   - Pro: Later calls to `reduce_branching_through_overcompute` could still 
introduce a value for the padding, if the full life cycle of the buffer is 
known.
   - Con: Determining whether the follow-up stage is needed requires just as 
much analysis as if it were always present.

3. When a buffer is transformed, all loops in the producer stage over the 
buffer's pre-transformation axes are replaced with loops over the buffer's 
post-transformation spatial dimensions.  If this replacement cannot be done 
(e.g. the pre-transformation loops have been fused/split/reordered), then the 
follow-up stage is inserted.

   - Pro: Always valid to apply a transform.
   - Pro: Avoids undoing scheduling benefits from previous changes to iterators.
   - Con: Determining whether the follow-up stage is needed requires just as 
much analysis as if it were always present.

4. When a buffer is transformed, all loops over spatial dimensions in the 
producer are replaced with loops over the post-transformation buffer axes.

   - Pro: Always valid to apply a transform.
   - Con: May undo scheduling that has previously provided useful performance 
improvements.
   - Con: Loop iterators over pre-transformation indices may have been fused 
with reduction axes; the fusion would need to be undone before the transform 
could be applied.
     
The current proposed version would be option 4, but I think I'd prefer option 2 
in order to reduce the number of follow-up simplifications required.  A rough 
sketch of option 2's intended behavior is below.
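
To make the preference concrete, here is a minimal sketch of option 2's
dispatch.  None of these helpers exist in TVM; they are hypothetical stand-ins
for the loop analysis and rewrites described above:

```python
def transform_layout(sch, block, buffer, index_map, pad_value=None):
    """Hypothetical sketch of option 2.  Every helper called here is an
    invented placeholder for the analysis described above, not TVM API."""
    apply_buffer_transform(sch, block, buffer, index_map)

    if loops_match_pre_transformation_axes(sch, block, buffer):
        # The producer loops still mirror the buffer's original axes, so
        # they can be rewritten over the post-transformation spatial
        # dimensions, and padding can be written with a T.if_then_else.
        replace_loops_with_transformed_axes(sch, block, index_map)
        if pad_value is not None:
            write_padding_with_if_then_else(sch, block, pad_value)
    elif pad_value is not None:
        # The loops were fused/split/reordered, so the single-loop rewrite
        # is unavailable, and a requested pad_value is an error.
        raise ValueError(
            "pad_value requires producer loops that still match the "
            "buffer's pre-transformation axes"
        )
    # With pad_value=None the transform always succeeds, and a later
    # reduce_branching_through_overcompute could still assign the padding.
```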

> Some of the complications of duplicated conditions (and their simplification) 
> stem from the fact that we do layout transforms of the output and input 
> separately (each introducing its own conditions, which then need to be 
> simplified). It might be helpful to do a global transformation, usually 
> driven from the output, then "backprop" the implications of that decision to 
> the input. Doing such a transformation in a single shot will likely alleviate 
> the need to generate extra conditions and then simplify them.

At the TIR level, I suppose I'm unclear on what "'backprop' the implications of 
that decision to the input" would mean, since changing the layout of one buffer 
doesn't strictly require changing the layout of other buffers.  Intuitively, I 
can picture how it would apply to some operators (e.g. performing analogous 
transformations on the inputs of element-wise functions) and how those could be 
identified (e.g. tracking which indices are used to access each buffer, and 
identifying corresponding shapes from the indices), but I'm unclear on how a 
similar intuition would apply to more complicated functions.  
(I'm also not sure if this would require a similarly difficult sequence of 
proofs as the proposed transforms, just with the goal of proving a preferred 
layout rather than proving a possible simplification.)
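
As a toy example of the part I can picture: for an element-wise op, the access
map from output indices to input indices is the identity, so a layout chosen
for the output composes trivially into a layout for the input (plain Python
for illustration, not TVM code):

```python
# Layout chosen for the output B of an element-wise op B[i] = f(A[i]):
# split a flat axis of 64 into [16, 4] tiles.
output_layout = lambda i: (i // 4, i % 4)

# The producer reads A at exactly the index of the B element it computes,
# so the access map is the identity and the output's layout "backprops"
# unchanged onto the input.
access_map = lambda i: i
input_layout = lambda i: output_layout(access_map(i))

assert input_layout(37) == output_layout(37) == (9, 1)

# For a reduction such as B[i] = sum_k A[i, k], the access map is no longer
# a bijection between output and input indices, and no such direct transfer
# of the layout exists.
```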

We could allow the user to specify transformations of all buffers 
simultaneously, but this wouldn't really solve the problem, as the 
simplifications made would still need to be based on the information provided.

At the graph level, I don't think a single direction of constraint propagation 
is sufficient.  Backward propagation, starting with the output values returned 
to the user, could track which indices contribute to that final output, which 
could be exposed to producers. Forward propagation, starting with the input 
values provided by the user, could track which indices of intermediate buffers 
contain known values, which could be exposed to consumers.
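
As a toy illustration of the two directions over a straight-line pipeline
(plain Python with invented extents, not TVM code):

```python
# Pipeline of 1-D buffers: a 14-element user input is padded to 16, run
# through an element-wise stage, then cropped back to 14 for the user.
# reads(i) lists the indices of the stage's input consumed when writing
# output index i; [] marks indices written with padding.
stages = [
    ("pad_to_16",   16, lambda i: [i] if i < 14 else []),
    ("elementwise", 16, lambda i: [i]),
    ("crop_to_14",  14, lambda i: [i]),
]

# Backward propagation: which indices of each stage's input contribute to
# the values returned to the user (information exposed to producers).
needed = set(range(14))  # the user receives all 14 output elements
for name, extent, reads in reversed(stages):
    needed = {j for i in needed for j in reads(i)}
    print(f"needed at the input of {name}: {sorted(needed)}")
# The padded indices 14-15 never feed the output, so their values are free.

# Forward propagation: which indices of each stage's output hold values
# derived from the user's input rather than from padding (exposed to
# consumers).
known = set(range(14))  # the user provides all 14 input elements
for name, extent, reads in stages:
    known = {i for i in range(extent)
             if reads(i) and all(j in known for j in reads(i))}
    print(f"known data at the output of {name}: {sorted(known)}")
# Indices 14-15 of the intermediate buffers hold only padding values.
```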

With these uncertainties, I'm starting to think of `layout_transform` and 
`pad_value` not as a complete end-to-end handling in itself, but as a platform 
on which the graph-level reasoning can be built.  That is, it doesn't itself 
perform the graph-level reasoning, but can accept the layout/padding 
requirements produced by that reasoning.

