Thanks @csullivan for providing the overview. I agree that the non-local approaches 2-4 are necessary. From the examples in this RFC I can also see how the components C0-C2 can be used to support these non-local approaches. C0 + C1 allow constraints to be specified during scheduling and propagated back to the graph. Besides them, I would also like to mention another component:

* C3: the ability to specify constraints for each operator.
It seems to me that C0, C1, and C3 are actually implementation choices, as there are multiple combinations of them that can achieve the goal of constraint flowing.

* C0 + C1 (which imply C3 is satisfied) suggest implementing the constraints at the TIR level using `BufferConstraint`. To propagate the constraints back to the graph, which is `Tensor`-centric, the graph-level counterpart of `BufferConstraint` is not clear, as @wrongtest mentioned.
* C3 is also feasible purely at the graph level, which requires some mechanism to register per-operator constraints. One example I came up with: each operator can declare a list of supported layouts, and a constraint solver can choose a layout for each operator so as to approximate the global optimum for the graph. This satisfies the need for non-local approaches but doesn't require TIR-level constraints. Padding, instead of explicitly inserting `transform` / `inv_transform`, is also achievable via graph-level constraint flowing.

Back to the discussion of this RFC: I think the main concerns about the proposed method are the IR changes required (which may have a significant impact on the existing TIR and scheduling) and the complexity involved in using the new schedule primitives to reach the final desired state. From my understanding, the intention of these new primitives is to allow arithmetic simplification to perform graph rewriting such as over-computation. If this can be achieved as a graph-level rewriting rule (perhaps more simply, since it doesn't need arithmetic manipulation), I personally think that would still be preferred for better maintainability.

I'd also like to mention that modeling such rewriting in the graph doesn't necessarily tie the TIR operator to a specific graph IR implementation. As we are moving to S-TIR scheduling, it is easy to apply some preprocessing steps to derive the PrimFunc in a specific layout from a standard `te.compute` definition.

Finally, I would like to encourage us to focus on the e2e goals.
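To make the C3 idea concrete, here is a minimal sketch of what per-operator layout registration plus a graph-level solver could look like. This is purely hypothetical pseudocode for discussion, not the actual TVM API: `register_layouts`, `solve_layouts`, and the layout names are all made up for illustration, and a real solver would search globally rather than greedily.

```python
# Hypothetical sketch (not the actual TVM API): graph-level constraint
# flowing via per-operator layout registration (component C3).
# Each operator registers the layouts it supports; a simple solver then
# picks one layout per node, preferring to reuse the producer's layout
# so as to minimize the number of layout-transform ops inserted.

SUPPORTED_LAYOUTS = {}  # operator name -> list of supported layouts


def register_layouts(op_name, layouts):
    """Register the layouts an operator implementation supports."""
    SUPPORTED_LAYOUTS[op_name] = layouts


def solve_layouts(graph):
    """Greedily assign a layout to each node of a linear graph.

    `graph` is a list of operator names in topological order. This
    greedy pass only illustrates how constraint flowing can stay
    entirely at the graph level, with no TIR-level constraints.
    """
    assignment = []
    prev_layout = None
    num_transforms = 0
    for op in graph:
        choices = SUPPORTED_LAYOUTS[op]
        if prev_layout in choices:
            layout = prev_layout  # reuse producer's layout: no transform needed
        else:
            layout = choices[0]   # fall back to the op's preferred layout
            if prev_layout is not None:
                num_transforms += 1  # a layout_transform op must be inserted
        assignment.append((op, layout))
        prev_layout = layout
    return assignment, num_transforms


register_layouts("conv2d", ["NCHWc", "NCHW"])
register_layouts("relu", ["NCHW", "NCHWc"])  # elementwise: layout-agnostic

assignment, n = solve_layouts(["conv2d", "relu", "conv2d"])
print(assignment, n)  # all three nodes agree on NCHWc; zero transforms
```

The same registration mechanism could carry padding constraints (e.g. "this operator requires the inner dimension padded to a multiple of 4"), letting the solver trade off padding against transform insertion without any arithmetic manipulation at the TIR level.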
It seems the current approaches, whether implemented as A0 or A1 at the graph level, should suffice for the use cases in the inference graph. Though the training graph is probably not an immediate need, if we would like to consider its use cases, having some concrete examples with the desired results would help guide us to a better decision.