Thanks for sharing the contextual pointers for the community @vinx13. Agreed, 
the approaches discussed are both valid. I would actually like to argue the 
stronger point that they are complementary, and only appear to be 
contradictory because we are considering too narrow a scope. 

It may help to share an overview of how layout transformations are commonly 
handled in ML compilers. Most of my arguments against A0 (local cancellation 
in the graph only) being sufficient stem from prior experience with graph-level 
layout optimization. The graph approaches I've seen for optimizing layouts 
evolved along the following trajectory:
 
1) The first graph approach usually taken is what has been argued so far in 
A0: local back-to-back cancellation. It works reasonably well when the data 
flow and operator variety in a model are simple.

Local-only cancellation tends to fail in models which still have simple data 
flow but more variety in the sequence of operators, each with several valid 
implementations. Consider, 

```
-> transform_X -> conv2d -> (inv_transform_X) -> pool -> (transform_X) -> conv2d 
-> inv_transform_X
```
Here `pool` can be replaced by any sequence of operations that are layout 
agnostic, or for which multiple implementations exist, so their choice of 
layout is unconstrained; the convolutions, by contrast, are layout constrained. 
As you can see, even for a simple model the approach discussed in A0 already 
needs to be modified to support non-local layout analysis and folding, since 
nothing cancels until the transforms become adjacent. 
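
To make the failure mode concrete, here is a minimal sketch of the A0-style 
local rule on a linear chain of operator names (illustrative Python only, not 
TVM IR): only directly adjacent forward/inverse pairs cancel, so the `pool` 
above blocks everything.

```python
# Minimal sketch of A0-style back-to-back cancellation on a chain of op
# names (illustrative only, not TVM IR): only directly adjacent
# forward/inverse pairs are removed.
INVERSES = {("transform_X", "inv_transform_X"),
            ("inv_transform_X", "transform_X")}

def cancel_back_to_back(chain):
    out = []
    for op in chain:
        if out and (out[-1], op) in INVERSES:
            out.pop()          # adjacent producer/consumer transforms cancel
        else:
            out.append(op)
    return out

chain = ["transform_X", "conv2d", "inv_transform_X", "pool",
         "transform_X", "conv2d", "inv_transform_X"]
print(cancel_back_to_back(chain))  # unchanged: `pool` separates the pair
```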

2) The typical second approach is to still use A0, but to first apply 
a pre-processing pass that sinks layout-transforming operations along the path 
of data flow 
[[1](https://github.com/pytorch/glow/blob/56249602c9ec93fa586cea2ce8ab315003478eed/lib/Optimizer/GraphOptimizerPipeline/FunctionPassPipeline.cpp#L69), 
[2](https://github.com/NervanaSystems/ngraph/blob/f677a119765ca30636cf407009dabd118664951f/src/ngraph/pass/reshape_sinking.cpp#L542)].
 The above case then becomes, 

```
-> transform_X -> conv2d -> pool -> (inv_transform_X -> transform_X) -> conv2d 
-> inv_transform_X
```
Applying the method discussed in A0 then performs the local cancellation. 
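
A rough sketch of what such a sinking pre-pass buys us, reusing the chain 
representation from the previous sketch (illustrative only, not the Glow or 
nGraph passes linked above):

```python
# Rough sketch of a sinking pre-pass (illustrative only): push a transform
# below any layout-agnostic op so forward/inverse pairs become adjacent,
# then cancel them with the local rule (cancel_back_to_back from the
# previous sketch).
LAYOUT_AGNOSTIC = {"pool", "relu"}                      # illustrative set
TRANSFORMS = {"transform_X", "inv_transform_X"}

def sink_transforms(chain):
    chain = list(chain)
    changed = True
    while changed:
        changed = False
        for i in range(len(chain) - 1):
            if chain[i] in TRANSFORMS and chain[i + 1] in LAYOUT_AGNOSTIC:
                # t(x) -> pool  becomes  pool(x) -> t
                chain[i], chain[i + 1] = chain[i + 1], chain[i]
                changed = True
    return chain

chain = ["transform_X", "conv2d", "inv_transform_X", "pool",
         "transform_X", "conv2d", "inv_transform_X"]
print(cancel_back_to_back(sink_transforms(chain)))
# ['transform_X', 'conv2d', 'pool', 'conv2d', 'inv_transform_X']
```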

The above method works well for models with relatively simple data flow, but 
for models with more branching it has limitations. A simple case to consider 
is sinking a transform through an operation with multiple inputs: doing so 
requires materializing the inverse transform on the other operands. 

For the sake of simplicity consider matrix multiplication: $A^{T}B = 
(B^{T}A)^{T}$. Here the end state of sinking the transpose on A is to 
materialize two transposes rather than one: one on B and one on the matmul 
result. Sinking alone isn't sufficient to guarantee a globally optimal layout 
because it still treats the propagation of transforms locally/greedily.
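
As a concrete sanity check of that identity (plain NumPy, purely illustrative), 
note where the transposes end up:

```python
# Sanity check of the identity above (plain NumPy, purely illustrative):
# sinking the transpose on A through the matmul trades one transpose (on A)
# for two materialized transposes: one on B and one on the matmul result.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 5))

before = A.T @ B        # one transpose, on A
after = (B.T @ A).T     # two transposes: on B and on the result
assert np.allclose(before, after)
```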

3) A modification to sinking (downward along data flow) is to introduce upward 
flowing 
[[3](https://github.com/NervanaSystems/ngraph/blob/f677a119765ca30636cf407009dabd118664951f/src/ngraph/pass/reshape_sinking.cpp#L156)].
 This helps with the poisoned operands (e.g. B in the above matrix multiply) 
by propagating the transform upward as far as possible, ideally to a constant 
where it can be folded away. 
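
A hedged sketch of what that upward flow is trying to achieve (illustrative 
structures and helper names, not the nGraph pass linked above): the transpose 
on the poisoned operand is hoisted until it reaches a constant, where it can 
be folded at compile time.

```python
# Hedged sketch of upward flowing (illustrative only): hoist a transpose
# upward along the poisoned operand's producer chain; if it reaches a
# constant, fold it into the constant so no runtime transpose remains.
import numpy as np

def hoist_transpose(producers, tensor):
    """producers maps tensor name -> ("const", ndarray) | ("op", op_name, input)."""
    kind, *rest = producers[tensor]
    if kind == "const":
        producers[tensor] = ("const", rest[0].T)      # fold into the weights now
        return True
    if kind == "op" and rest[0] in {"relu", "cast"}:  # commutes with the transform
        return hoist_transpose(producers, rest[1])    # keep walking upward
    return False                                      # stuck: transpose stays at runtime

producers = {
    "B": ("op", "cast", "W"),          # B = cast(W)
    "W": ("const", np.ones((3, 5))),   # W is a constant weight
}
assert hoist_transpose(producers, "B")
```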

For inference graphs I've seen this approach work well, but it is still greedy 
and suboptimal choices can occur. For training graphs it works less well due 
to the data-flow complexity introduced by branching from the forward to the 
backward graph and by the optimizer's in-place updates of weights. I omit a 
specific example here for brevity, but encourage a review of the graphs from 
@t-vi's application of TVM to PyTorch training for BERT, and the long chains 
of transposes and reshapes that occur within the forward and backward 
multi-head attention layers 
[[4](https://github.com/apache/tvm-site/blob/85c7e4ebf6d9ed221075e38e5e5e1a0052693acc/_posts/2020-07-14-bert-pytorch-tvm.md)].
 
 

4) Finally, to arrive at a solution closer to the global optimum for layout, 
various constraint-based approaches are considered. Constraints from 
layout-constrained operations can be flowed across unconstrained parts of the 
graph until an approximate global optimum is reached. 

An example implementation I have seen includes layout sources (e.g. operators 
like conv2d on an NPU with distinct layout constraints) and layout sinks (e.g. 
operations involving data movement by DMA engines or in-memory compute, which 
allow zero-cost data layout rearrangement during the store). A constraint 
solver then flows layout constraints from sources toward sinks that can 
absorb the aggregated/merged layout-transform constraints. 
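
A hedged sketch of the flavor of that constraint flow (illustrative names and 
structures, not an existing pass): per-edge propagation from sources toward 
sinks, with sinks absorbing whatever layout arrives.

```python
# Hedged sketch of source-to-sink constraint flowing (illustrative only):
# buffers carry either a concrete layout string or None (unconstrained).
# Constraints flow from constrained ops (sources) across unconstrained ops
# until a sink (e.g. a DMA-backed store) absorbs them for free.
def merge(a, b):
    if a is None: return b           # flowing == merging with "unconstrained"
    if b is None: return a
    if a == b: return a
    raise ValueError(f"conflict {a} vs {b}: a layout transform must be materialized")

def flow(edges, sources, sinks):
    """edges: (producer_buf, consumer_buf) pairs in dataflow order.
    sources: buf -> required layout.  sinks: buffers that absorb any layout."""
    layout = dict(sources)           # per-buffer constraint tracking
    for prod, cons in edges:
        if cons in sinks:
            continue                 # absorbed during the store, zero cost
        layout[cons] = merge(layout.get(cons), layout.get(prod))
    return layout

print(flow(edges=[("a", "b"), ("b", "c")],
           sources={"a": "NCHW4c"}, sinks={"c"}))
# {'a': 'NCHW4c', 'b': 'NCHW4c'}
```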
____

Coming back to the present discussion, I believe our design should focus on 
ensuring that one or more of the non-local approaches discussed above in 2-4 
are achievable. All of these cases require the following components:

C0) The ability to track constraints on a buffer.

C1) The ability to roundtrip between an IR representation and the 
producer/consumer constraint representations.

C2) The ability to merge/fold constraints - flowing is just merging a 
constraint with an unconstrained one (a minimal sketch of these components 
follows below). 
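
As a minimal sketch of what C0-C2 could look like structurally (hypothetical 
shapes and names, not a proposed TVM API; the merge is the same idea as in the 
flowing sketch above):

```python
# Minimal sketch of C0-C2 (hypothetical shapes, not a proposed TVM API).
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BufferConstraint:                 # C0: a constraint tracked per buffer
    buffer: str
    index_map: Optional[str]            # e.g. "NCHW -> NCHW4c"; None = unconstrained

def constraint_from_ir(buffer, transform) -> BufferConstraint:
    return BufferConstraint(buffer, transform)           # C1: IR -> constraint

def constraint_to_ir(c: BufferConstraint):
    if c.index_map is None:                               # C1: constraint -> IR
        return None                                       # nothing to materialize
    return ("layout_transform", c.buffer, c.index_map)

def merge(a: BufferConstraint, b: BufferConstraint) -> BufferConstraint:
    # C2: flowing is merging with an unconstrained buffer.
    if a.index_map is None: return b
    if b.index_map is None: return a
    if a.index_map == b.index_map: return a
    raise ValueError("conflict: a transform must be materialized between layouts")
```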

Even for the purely local (back-to-back) case discussed in A0, components C1 
and C2 are helpful, with the caveat that the constraints inferred from the IR 
only exist within the local context of a single producer-consumer pair in a pass. 

Thus both A0 and A1 can benefit from these components, and the delta between 
A0 and A1 becomes clearer:

* Delta 1: In A1, buffer constraints are maintained per buffer globally across 
the graph (non-locally), and can therefore be optimized by any of the methods 
2-4 discussed above. 

* Delta 2: In addition to inferring buffer constraints from the IR (one half 
of C1), A1 proposes that layout constraints expressed on memory during 
scheduling be maintained for some time.




