Thanks @csullivan for providing the overview. I agree that the non-local 
approaches 2-4 are necessary. From the examples in this RFC I can also see how 
the components C0-C2 can be used to support these non-local approaches. C0 + C1 
allow specifying the constraints during scheduling and propagating them back to 
the graph. Besides these, I would also like to mention another component:
* C3: ability to specify constraints for each operator.
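
To make C3 concrete, here is a minimal Python sketch of what per-operator constraint registration could look like. All names here (`register_layouts`, `OP_SUPPORTED_LAYOUTS`) are hypothetical illustrations, not existing TVM APIs:

```python
# Hypothetical sketch of C3: each operator registers the layouts it supports.
# None of these names are real TVM APIs; they only illustrate the idea.

OP_SUPPORTED_LAYOUTS = {}

def register_layouts(op_name, layouts):
    """Register the list of layouts an operator implementation supports."""
    OP_SUPPORTED_LAYOUTS[op_name] = list(layouts)

def supported_layouts(op_name):
    """Look up the registered layouts; unknown ops have no constraints listed."""
    return OP_SUPPORTED_LAYOUTS.get(op_name, [])

register_layouts("conv2d", ["NHWC", "NCHW", "NCHW4c"])
register_layouts("relu", ["*"])  # element-wise: any layout works
```

A constraint solver could then query this registry when flowing constraints through the graph.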

It seems to me that C0, C1, and C3 are really implementation choices: there are 
multiple ways to combine them to achieve the goal of constraint flowing.
* C0 + C1 (which together imply C3 is satisfied) suggest implementing the 
constraints at the TIR level using `BufferConstraint`. To propagate the 
constraints back to the graph, which is `Tensor`-centric, the graph-level 
counterpart of `BufferConstraint` is not yet clear, as @wrongtest mentioned.
* C3 is also feasible purely at the graph level, given some mechanism to 
register per-operator constraints. One example I came up with: each operator 
declares a list of supported layouts, and a constraint solver chooses a layout 
for each operator to approximate the global optimum for the graph. This 
satisfies the need of the non-local approaches without requiring TIR-level 
constraints. Padding, instead of explicitly inserting `transform` / 
`inv_transform`, is also achievable via graph-level constraint flowing.
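
To illustrate the solver idea above, here is a toy sketch that picks one layout per operator on a linear chain so as to minimize the number of layout transforms inserted between adjacent ops (a crude proxy for the global optimum). It is a self-contained dynamic program over hypothetical op/layout names, not real TVM code:

```python
# Toy sketch of graph-level constraint flowing: each op declares its supported
# layouts, and a solver assigns one layout per op, minimizing the number of
# edges whose endpoints disagree (i.e. where a layout transform is needed).
# All names are illustrative, not real TVM APIs.

def solve_layouts(chain):
    """chain: list of (op_name, supported_layouts) along a linear graph.

    Returns (cost, assignment): the number of layout transforms needed and
    the chosen layout per op, via dynamic programming over the chain.
    """
    # best[layout] = (cost, assignment) for the prefix ending in `layout`
    best = {layout: (0, [layout]) for layout in chain[0][1]}
    for _, layouts in chain[1:]:
        new_best = {}
        for layout in layouts:
            # extend each prefix; crossing a mismatched edge costs 1 transform
            new_best[layout] = min(
                (cost + (prev != layout), path + [layout])
                for prev, (cost, path) in best.items()
            )
        best = new_best
    return min(best.values())

cost, assignment = solve_layouts([
    ("conv2d", ["NHWC", "NCHW"]),
    ("relu",   ["NHWC", "NCHW"]),  # element-wise, supports both
    ("conv2d", ["NHWC"]),
])
```

In this example a globally consistent choice (`NHWC` everywhere) exists, so the solver finds an assignment with zero transforms; a real solver would of course operate on a DAG and weigh transform costs, but the shape of the problem is the same.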

Back to the discussion of this RFC, I think the main concerns about the 
proposed methods are the IR changes required (which may have a greater impact 
on the existing TIR and scheduling) and the complexity of using the new 
schedule primitives to reach the final desired state. From my understanding, 
the intention of these new primitives is to allow arithmetic simplification to 
perform graph rewriting such as over-computation. If this can be achieved as a 
graph-level rewriting rule (perhaps more simply, since it doesn't need 
arithmetic manipulations), I would personally still prefer that for better 
maintainability. I'd also like to mention that modeling such rewriting in the 
graph doesn't necessarily tie the TIR operator to a specific graph IR 
implementation. As we are moving to S-TIR scheduling, it is easy to apply some 
preprocessing steps to derive the PrimFunc in a specific layout from a standard 
`te.compute` definition.
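
As a sketch of what padding as a graph-level rewriting rule could look like, the snippet below wraps an element-wise op whose extent is not a multiple of the tile size in an explicit pad → compute → crop sequence. The `Node` class and the rule itself are hypothetical stand-ins for a graph IR and a rewrite pass, under the assumption that the op is element-wise so over-computation on the padded region is safe:

```python
# Sketch of padding expressed as a graph-level rewrite (no TIR arithmetic):
# replace `op(x)` with `crop(op(pad(x)))` when the first dimension is not a
# multiple of the tile size. Node/rule names are illustrative, not real TVM.

from dataclasses import dataclass, field

@dataclass
class Node:
    op: str
    shape: tuple
    inputs: list = field(default_factory=list)

def pad_to_multiple(node, tile=4):
    """Rewrite an element-wise `node` so its first dim is a multiple of `tile`."""
    n = node.shape[0]
    if n % tile == 0:
        return node  # already aligned; the rule does not fire
    padded = ((n + tile - 1) // tile) * tile
    padded_shape = (padded,) + node.shape[1:]
    # pad every input, over-compute on the padded extent, then crop back so
    # the rewrite is semantics-preserving (valid only for element-wise ops)
    pads = [Node("pad", padded_shape, [inp]) for inp in node.inputs]
    compute = Node(node.op, padded_shape, pads)
    return Node("crop", node.shape, [compute])

x = Node("input", (10,))
out = pad_to_multiple(Node("exp", (10,), [x]), tile=4)
# out is crop(exp(pad(input))), with shapes (10,) <- (12,) <- (12,) <- (10,)
```

The point is that the legality reasoning ("the padded region may be over-computed") lives in the rewrite rule itself, so no arithmetic simplification machinery is needed at the TIR level.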

Finally, I would like to encourage us to focus on the end-to-end goals. It 
seems the current approaches, whether implemented as A0 or A1 at the graph 
level, should suffice for the use cases in the inference graph. Though the 
training graph is probably not an immediate need, if we would like to consider 
its use cases, having some concrete examples with the desired results would 
guide us to a better decision.


-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/tvm-rfcs/pull/77#issuecomment-1155766862