Thanks for sharing the contextual pointers for the community @vinx13. Agreed, 
the approaches discussed are both valid. I would actually like to argue the 
stronger point that they are complementary, and only appear to be 
contradictory because we are considering too narrow a scope. 

It may help to share an overview of how layout transformations are commonly 
handled in ML compilers. Most of my arguments against A0 (local cancellation 
in the graph only) being sufficient stem from prior experience with graph-level 
layout optimization. The graph approaches I've seen for optimizing layouts 
evolved along the following trajectory:
 
1) The first graph approach usually taken is what has been argued so far in 
A0: local back-to-back cancellation. It works reasonably well when the data 
flow and operator variety in a model are simple.

Local-only cancellation tends to fail in models which still have simple data 
flow but more variety in the sequence of operators, each with several valid 
implementations. Consider, 

```
-> transform_X -> conv2d -> (inv_transform_X) -> pool -> (transform_X) -> conv2d 
-> inv_transform_X
```
Here `pool` can be replaced by any sequence of operations that are layout 
agnostic, or for which multiple implementations exist, so their choice of 
layout is unconstrained; the convolutions, by contrast, are layout constrained. 
As you can see, even for a simple model the approach discussed in A0 already 
needs to be modified to support non-local layout analysis and folding, since 
nothing cancels until the transforms become adjacent. 
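
To make the failure mode concrete, here is a minimal sketch of the A0-style 
local rule on a linear chain of operator names (illustrative Python only, not 
TVM IR): only directly adjacent forward/inverse pairs cancel, so the `pool` 
above blocks everything.

```python
# Minimal sketch of A0-style back-to-back cancellation on a chain of op
# names (illustrative only, not TVM IR): only directly adjacent
# forward/inverse pairs are removed.
INVERSES = {("transform_X", "inv_transform_X"),
            ("inv_transform_X", "transform_X")}

def cancel_back_to_back(chain):
    out = []
    for op in chain:
        if out and (out[-1], op) in INVERSES:
            out.pop()          # adjacent producer/consumer transforms cancel
        else:
            out.append(op)
    return out

chain = ["transform_X", "conv2d", "inv_transform_X", "pool",
         "transform_X", "conv2d", "inv_transform_X"]
print(cancel_back_to_back(chain))  # unchanged: `pool` separates the pair
```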

2) The typical second approach is to still use A0, but to first apply 
a pre-processing pass that sinks layout-transforming operations along the path 
of data flow 
[[1](https://github.com/pytorch/glow/blob/56249602c9ec93fa586cea2ce8ab315003478eed/lib/Optimizer/GraphOptimizerPipeline/FunctionPassPipeline.cpp#L69), 
[2](https://github.com/NervanaSystems/ngraph/blob/f677a119765ca30636cf407009dabd118664951f/src/ngraph/pass/reshape_sinking.cpp#L542)].
 The above case then becomes, 

```
-> transform_X -> conv2d -> pool -> (inv_transform_X -> transform_X) -> conv2d 
-> inv_transform_X
```
Applying the method discussed in A0 then performs the local cancellation. 
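
A rough sketch of what such a sinking pre-pass buys us, reusing the chain 
representation from the previous sketch (illustrative only, not the Glow or 
nGraph passes linked above):

```python
# Rough sketch of a sinking pre-pass (illustrative only): push a transform
# below any layout-agnostic op so forward/inverse pairs become adjacent,
# then cancel them with the local rule (cancel_back_to_back from the
# previous sketch).
LAYOUT_AGNOSTIC = {"pool", "relu"}                      # illustrative set
TRANSFORMS = {"transform_X", "inv_transform_X"}

def sink_transforms(chain):
    chain = list(chain)
    changed = True
    while changed:
        changed = False
        for i in range(len(chain) - 1):
            if chain[i] in TRANSFORMS and chain[i + 1] in LAYOUT_AGNOSTIC:
                # t(x) -> pool  becomes  pool(x) -> t
                chain[i], chain[i + 1] = chain[i + 1], chain[i]
                changed = True
    return chain

chain = ["transform_X", "conv2d", "inv_transform_X", "pool",
         "transform_X", "conv2d", "inv_transform_X"]
print(cancel_back_to_back(sink_transforms(chain)))
# ['transform_X', 'conv2d', 'pool', 'conv2d', 'inv_transform_X']
```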

The above method works well for models with relatively simple data flow, but 
for models with more branching it has limitations. A simple case to consider 
is sinking a transform through an operation with multiple inputs: doing so 
requires materializing the inverse transform on the other operands. 

For the sake of simplicity consider matrix multiplication: $A^{T}B = 
(B^{T}A)^{T}$. Here the end state of sinking the transpose on A is to 
materialize two transposes rather than one: one on B and one on the matmul 
result. Sinking alone isn't sufficient to guarantee a globally optimal layout 
because it still treats the propagation of transforms locally/greedily.
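
As a concrete sanity check of that identity (plain NumPy, purely illustrative), 
note where the transposes end up:

```python
# Sanity check of the identity above (plain NumPy, purely illustrative):
# sinking the transpose on A through the matmul trades one transpose (on A)
# for two materialized transposes: one on B and one on the matmul result.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 5))

before = A.T @ B        # one transpose, on A
after = (B.T @ A).T     # two transposes: on B and on the result
assert np.allclose(before, after)
```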

3) A modification to sinking (downward along data flow) is to introduce upward 
flowing 
[[3](https://github.com/NervanaSystems/ngraph/blob/f677a119765ca30636cf407009dabd118664951f/src/ngraph/pass/reshape_sinking.cpp#L156)].
 This helps with the poisoned operands (e.g. B in the above matrix multiply) 
by propagating the transform upward as far as possible, ideally to a constant 
where it can be folded away. 
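
A hedged sketch of what that upward flow is trying to achieve (illustrative 
structures and helper names, not the nGraph pass linked above): the transpose 
on the poisoned operand is hoisted until it reaches a constant, where it can 
be folded at compile time.

```python
# Hedged sketch of upward flowing (illustrative only): hoist a transpose
# upward along the poisoned operand's producer chain; if it reaches a
# constant, fold it into the constant so no runtime transpose remains.
import numpy as np

def hoist_transpose(producers, tensor):
    """producers maps tensor name -> ("const", ndarray) | ("op", op_name, input)."""
    kind, *rest = producers[tensor]
    if kind == "const":
        producers[tensor] = ("const", rest[0].T)      # fold into the weights now
        return True
    if kind == "op" and rest[0] in {"relu", "cast"}:  # commutes with the transform
        return hoist_transpose(producers, rest[1])    # keep walking upward
    return False                                      # stuck: transpose stays at runtime

producers = {
    "B": ("op", "cast", "W"),          # B = cast(W)
    "W": ("const", np.ones((3, 5))),   # W is a constant weight
}
assert hoist_transpose(producers, "B")
```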

For inference graphs I've seen this approach work well, but it is still greedy 
and suboptimal choices can occur. For training graphs it works less well due 
to the data-flow complexity introduced by branching from the forward to the 
backward graph and by the optimizer's in-place updates of weights. I omit a 
specific example here for brevity, but encourage a review of the graphs from 
@t-vi's application of TVM to PyTorch training for BERT, and the long chains 
of transposes and reshapes that occur within the forward and backward 
multi-head attention layers 
[[4](https://github.com/apache/tvm-site/blob/85c7e4ebf6d9ed221075e38e5e5e1a0052693acc/_posts/2020-07-14-bert-pytorch-tvm.md)].
 
 

4) Finally, to arrive at a solution closer to the global optimum for layout, 
various constraint-based approaches are considered. Constraints from 
layout-constrained operations can be flowed across unconstrained parts of the 
graph until an approximate global optimum is reached. 

An example implementation I have seen includes layout sources (e.g. operators 
like conv2d on an NPU with distinct layout constraints) and layout sinks (e.g. 
operations involving data movement by DMA engines or in-memory compute, which 
allow zero-cost data layout rearrangement during the store). A constraint 
solver then flows layout constraints from sources toward sinks that can 
absorb the aggregated/merged layout-transform constraints. 
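
A hedged sketch of the flavor of that constraint flow (illustrative names and 
structures, not an existing pass): per-edge propagation from sources toward 
sinks, with sinks absorbing whatever layout arrives.

```python
# Hedged sketch of source-to-sink constraint flowing (illustrative only):
# buffers carry either a concrete layout string or None (unconstrained).
# Constraints flow from constrained ops (sources) across unconstrained ops
# until a sink (e.g. a DMA-backed store) absorbs them for free.
def merge(a, b):
    if a is None: return b           # flowing == merging with "unconstrained"
    if b is None: return a
    if a == b: return a
    raise ValueError(f"conflict {a} vs {b}: a layout transform must be materialized")

def flow(edges, sources, sinks):
    """edges: (producer_buf, consumer_buf) pairs in dataflow order.
    sources: buf -> required layout.  sinks: buffers that absorb any layout."""
    layout = dict(sources)           # per-buffer constraint tracking
    for prod, cons in edges:
        if cons in sinks:
            continue                 # absorbed during the store, zero cost
        layout[cons] = merge(layout.get(cons), layout.get(prod))
    return layout

print(flow(edges=[("a", "b"), ("b", "c")],
           sources={"a": "NCHW4c"}, sinks={"c"}))
# {'a': 'NCHW4c', 'b': 'NCHW4c'}
```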
____

Coming back to the present discussion, I believe our design should focus on 
ensuring that one or more of the non-local approaches discussed above in 2-4 
are achievable. All of these cases require the following components:

C0) The ability to track constraints on a buffer.

C1) The ability to roundtrip between an IR representation and the 
producer/consumer constraint representations.

C2) The ability to merge/fold constraints - flowing is just merging a 
constraint with an unconstrained one (a minimal sketch of these components 
follows below). 
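
As a minimal sketch of what C0-C2 could look like structurally (hypothetical 
shapes and names, not a proposed TVM API; the merge is the same idea as in the 
flowing sketch above):

```python
# Minimal sketch of C0-C2 (hypothetical shapes, not a proposed TVM API).
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BufferConstraint:                 # C0: a constraint tracked per buffer
    buffer: str
    index_map: Optional[str]            # e.g. "NCHW -> NCHW4c"; None = unconstrained

def constraint_from_ir(buffer, transform) -> BufferConstraint:
    return BufferConstraint(buffer, transform)           # C1: IR -> constraint

def constraint_to_ir(c: BufferConstraint):
    if c.index_map is None:                               # C1: constraint -> IR
        return None                                       # nothing to materialize
    return ("layout_transform", c.buffer, c.index_map)

def merge(a: BufferConstraint, b: BufferConstraint) -> BufferConstraint:
    # C2: flowing is merging with an unconstrained buffer.
    if a.index_map is None: return b
    if b.index_map is None: return a
    if a.index_map == b.index_map: return a
    raise ValueError("conflict: a transform must be materialized between layouts")
```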

Even for the purely local (back-to-back) case discussed in A0, components C1 
and C2 are helpful, with the caveat that the constraints inferred from the IR 
only exist within the local context of a single producer-consumer pair in a pass. 

Thus both A0 and A1 can benefit from these components, and the delta between 
A0 and A1 becomes clearer:

* Delta 1: In A1, buffer constraints are maintained per buffer globally across 
the graph (non-locally), and can therefore be optimized by any of the methods 
2-4 discussed above. 

* Delta 2: In addition to inferring buffer constraints from the IR (one half 
of C1), A1 proposes that layout constraints expressed on memory during 
scheduling be maintained for some time.




