Thanks for sharing the contextual pointers for the community @vinx13. Agreed, the approaches discussed are both valid. I would actually like to argue the stronger point that they are complementary, and only appear to be contradictory because we are considering too narrow a scope.
It may be helpful to share an overview of how layout transformations are commonly handled in ML compilers. Most of my arguments against A0 (local cancellation in the graph only) being sufficient stem from prior experience with graph layout optimizations. The graph approaches for optimizing layouts that I've seen have evolved along the following trajectory:

1) The first graph approach usually taken is what has been argued so far in A0: local back-to-back cancellation. It works reasonably well when the data flow and operator variety in a model are simple. Local-only cancellation tends to fail in models that still have simple data flow but more variety in the sequence of operators, each with different valid implementations. Consider:

```
-> transformX -> conv2d -> (inv_transform_X) -> pool -> (transformX) -> conv2d -> inv_transform_X
```

Here `pool` can be replaced by any sequence of operations that are layout agnostic, or for which multiple implementations exist, so the choice of layout there is unconstrained. These operators are layout unconstrained, whereas the convolutions are layout constrained. As you can see, even for a simple model the approach discussed in A0 already needs to be modified to support non-local layout analysis and folding.

2) The typical second approach is to still utilize A0, but to first apply a pre-processing pass that sinks layout-transforming operations in the graph along the path of data flow [[1](https://github.com/pytorch/glow/blob/56249602c9ec93fa586cea2ce8ab315003478eed/lib/Optimizer/GraphOptimizerPipeline/FunctionPassPipeline.cpp#L69), [2](https://github.com/NervanaSystems/ngraph/blob/f677a119765ca30636cf407009dabd118664951f/src/ngraph/pass/reshape_sinking.cpp#L542)]. The above case then becomes:

```
-> transformX -> conv2d -> pool -> (inv_transform_X -> transformX) -> conv2d -> inv_transform_X
```

after which the method discussed in A0 performs local cancellation. This works well for models with relatively simple data flow, but for models with more branching it has limitations. A simple consideration is sinking a transform through an operation with multiple inputs: doing so requires materializing the inverse transform on the other operands. For the sake of simplicity, consider matrix multiplication: $A^{T}B = (B^{T}A)^{T}$. Here the end state of sinking the transpose on A is to materialize two transposes rather than one: one on B and one on the result of the matmul (a toy sketch of this rewrite is included at the end of this comment). Sinking alone isn't sufficient to guarantee a globally optimal layout because it still treats the propagation of transforms only locally/greedily.

3) A modification to sinking (downward along data flow) is to introduce upward flowing [[3](https://github.com/NervanaSystems/ngraph/blob/f677a119765ca30636cf407009dabd118664951f/src/ngraph/pass/reshape_sinking.cpp#L156)]. It can help by flowing transforms along the poisoned operands (e.g. B in the above matrix multiply), propagating the transform up as far as possible, hopefully to a constant where it can be folded. For inference graphs I've seen this approach work well, but it is still greedy and suboptimal choices can occur. For training graphs it works less well due to the data flow complexity introduced by branching from the forward to the backward graph and by the optimizer's in-place updates of weights.
I omit a specific example here for brevity, but I encourage reviewing the graphs from @t-vi's application of TVM to PyTorch training for BERT, and the long chains of transposes and reshapes that occur within the forward and backward multi-head attention layers [[4](https://github.com/apache/tvm-site/blob/85c7e4ebf6d9ed221075e38e5e5e1a0052693acc/_posts/2020-07-14-bert-pytorch-tvm.md)].

4) Finally, to arrive closer to a globally optimal layout, different constraint-based approaches are considered. Constraints from layout-constrained operations can be flowed across unconstrained parts of the graph until an approximate global optimum is reached. One example implementation I have seen included layout sources (e.g. operators like conv2d on an NPU with distinct layout constraints) and layout sinks (e.g. operations involving data movement by DMA engines or in-memory compute, which allow zero-cost data layout rearrangement during the store). A constraint solver then flows layout constraints from sources toward sinks that can absorb the aggregated/merged layout transform constraints.

____

Coming back to the present discussion, I believe our design should focus on ensuring that one or more of the non-local approaches discussed above in 2-4 are achievable. All of these cases require the following components:

C0) The ability to track constraints on a buffer.
C1) The ability to round-trip between an IR representation and the producer/consumer constraint representations.
C2) The ability to merge/fold constraints: flowing is just merging a constraint with an unconstrained one (a toy sketch is included at the end of this comment).

Even for the purely local (back-to-back) case discussed in A0, components C1 and C2 are helpful, with the caveat that the constraints inferred from the IR only exist within the local context of a single producer/consumer pair in a pass. Thus both A0 and A1 can benefit from these components, and the delta between A0 and A1 becomes clearer:

* Delta 1: In A1, buffer constraints are maintained per buffer in the graph globally (non-locally), and can therefore be optimized by any of the methods 2-4 discussed above.
* Delta 2: In addition to inferring buffer constraints from the IR (one half of C1), A1 proposes that constraints expressed about the memory during scheduling be maintained for some time.
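To make the matmul example from (2) a bit more concrete, here is a toy sketch of the rewrite that sinking performs. This is not TVM code and all identifiers are made up; it only illustrates how sinking the transpose on A materializes an inverse transform on the other operand and on the result.

```python
# Toy illustration (not TVM): sinking a transpose through matmul rewrites
#   matmul(transpose(A), B)  ->  transpose(matmul(transpose(B), A))
# i.e. A^T B == (B^T A)^T, trading one transpose for two.
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    inputs: tuple

def sink_transpose_through_matmul(expr: Op) -> Op:
    """Rewrite matmul(transpose(A), B) into transpose(matmul(transpose(B), A))."""
    if expr.name == "matmul":
        lhs, rhs = expr.inputs
        if isinstance(lhs, Op) and lhs.name == "transpose":
            a = lhs.inputs[0]
            # One transpose is materialized on B and another on the matmul result.
            return Op("transpose", (Op("matmul", (Op("transpose", (rhs,)), a)),))
    return expr

A, B = Op("A", ()), Op("B", ())
print(sink_transpose_through_matmul(Op("matmul", (Op("transpose", (A,)), B))))
```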
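Similarly, to make C2 concrete, below is a minimal sketch of what merging layout constraints could look like. This is not an existing TVM API; the representation (a layout string or `None` for unconstrained) and the names are assumptions purely for illustration.

```python
# Toy sketch of C2 (constraint merging), not an existing TVM API.
# A layout constraint on a buffer is either a concrete layout string or
# None (unconstrained); "flowing" a constraint across a layout-agnostic op
# is just merging it with the unconstrained side.
from typing import Optional

def merge_constraints(a: Optional[str], b: Optional[str]) -> Optional[str]:
    """Merge two layout constraints on the same buffer.

    None means unconstrained; two concrete constraints must agree,
    otherwise a layout transform has to be materialized between them.
    """
    if a is None:
        return b
    if b is None or a == b:
        return a
    raise ValueError(f"conflicting layouts {a} vs {b}: materialize a transform here")

# Flowing a conv2d's preferred layout across a layout-agnostic pool:
producer = "NCHW16c"   # constraint from a layout-constrained source (e.g. conv2d)
consumer = None        # pool is layout-agnostic, i.e. unconstrained
print(merge_constraints(producer, consumer))  # -> "NCHW16c"
```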