> Talking about "constraints", it is also useful to talk about categories of
> them, roughly we can divide them into three categories.

I like this breakdown, and agree. In this categorization, what I've been calling "constraints" would be "assumptions". Double-checking in `builtin.h`, it looks like we don't currently have a TIR equivalent of `__builtin_assume`. For usage of assumptions, I think the key would be to insert an assumption whenever the information that could otherwise prove it is hoisted out of the PrimFunc. That would provide non-local information that could be used by the PrimFunc to allow local simplifications.
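As a sketch of how that might look (hypothetical: neither a `T.assume` intrinsic nor this exact packed form of `addone` currently exists, and the names are made up for illustration), after a zero-padding stage has been hoisted out of a packed PrimFunc:

```python
from tvm.script import tir as T

# Hypothetical sketch. `T.assume` is assumed here to behave like
# `__builtin_assume`: it generates no code, but records a fact for the
# static prover. It preserves the non-local fact that the hoisted
# padding stage wrote zeros into A's padding.
@T.prim_func
def addone_packed(A: T.Buffer[(4, 4), "int32"], B: T.Buffer[(4, 4), "int32"]):
    for io, ii in T.grid(4, 4):
        T.assume(4 * io + ii < 14 or A[io, ii] == 0)

    for io, ii in T.grid(4, 4):
        with T.block("compute"):
            # A local simplifier may now rely on A's padding being zero,
            # e.g. to elide a select that would otherwise guard the padding.
            B[io, ii] = A[io, ii] + 1
```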
> transformation of PrimFunc do not change the PrimFunc interface: this is
> really important so we can transform a PrimFunc without worrying about how
> the graph interacts with it(as the interface remains the same, we can lift
> out the blocks earlier)

I don't think we can make this strong of a statement, as it would also forbid fusing operators together or hoisting a stage out of a PrimFunc. In both cases, the signature of the resulting PrimFunc may be different than it was before. This shows up in the example, as the interface of `grow` is different from the transformed `grow_packed`.

As a slightly less general statement, I would say that transformations of a PrimFunc *in isolation* may not change the PrimFunc's interface. So an optimization search to improve the performance of a single subgraph may not change the layout of its own arguments, nor may it change assumptions of what is present in the padding, as those would change its interface. However, a graph-level transform would be allowed to fuse subgraphs, to hoist stages out of a PrimFunc, to alter the layout of a PrimFunc's input, or to alter the assumptions provided about the inputs. In general, a PrimFunc's interface could only be changed when calls into the PrimFunc are also modified to remain compatible.
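To make that kind of interface change concrete, a minimal sketch (the bodies are stand-ins rather than the RFC's actual definitions; only the change in buffer shapes between the two signatures matters here):

```python
from tvm.script import tir as T

# Stand-in for `grow`, operating on the logical shape [14].
@T.prim_func
def grow(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    for i in T.serial(14):
        with T.block("compute"):
            B[i] = 2 * A[i]

# Stand-in for `grow_packed`. After a graph-level pass hoists the pad/crop
# stages, the interface exposes the packed shape [4, 4] instead, so every
# call site must be rewritten as part of the same transformation.
@T.prim_func
def grow_packed(AC: T.Buffer[(4, 4), "int32"], BC: T.Buffer[(4, 4), "int32"]):
    for io, ii in T.grid(4, 4):
        with T.block("compute"):
            BC[io, ii] = 2 * AC[io, ii]
```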
Is there a better term than "scheduling primitive" to describe layout transformations that impact input/output buffers? I think the difference is between context-independent transformations that may be performed on a PrimFunc without changing its interface, as opposed to context-dependent transformations that may only be performed as part of a graph-level transformation.

> Each function needs to have its own TIR analysis of how it flows things back,
> for example, in the case of `addone`, we can safely flow PadMapping back,
> changing `addone` to `addone_packed` by analyzing the TIR. If the addone is
> elemwise exp however, we need to insert a select operator(because `exp(0)=1`
> ) the message to input becomes `PadMapping(constraint, pad_value=undef)`.

Would this handle cases where there are multiple different options for how an operator could be implemented? I'm not sure how it would apply when multiple different sets of layouts/constraints could be inferred from different TIR-level schedules of the same operator. As examples, the drop-down below has 6 different implementations of `addone`, each of which would allow different hoistable pad/crop operations.

<details>
<summary>Click to expand</summary>
<br>

```python
# Implementation 1, no preproc/postproc are present.
#
# No hoistable layout transformations. Could be fused with a layout
# transformation, but doesn't otherwise provide any constraints.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    for i in T.serial(14):
        with T.block("compute"):
            B[i] = A[i] + 1


# Implementation 2, pad input/output, but never access the padding of
# either input or output.
#
# In back-propagation of constraints, the T.undef() that is cropped
# from BC could be narrowed to a known value provided from the
# successor. However, AC's padding is never written to, so could
# propagate T.undef() back to preceding function.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")

    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            if 4 * io + ii < 14:
                AC[io, ii] = A[4 * io + ii]

    for i in T.serial(14):
        with T.block("compute"):
            BC[i // 4, i % 4] = AC[i // 4, i % 4] + 1

    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", T.undef()])
            B[i] = BC[i // 4, i % 4]


# Implementation 3, pad input with known value, but never access
# padding of output.
#
# In back-propagation of constraints, the T.undef() that is cropped
# from BC could be narrowed to a known value provided from the
# successor. AC's padding is written to, so this would propagate
# `PadMapping(predicate, pad_value=0)` to the previous operator.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")

    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

    for i in T.serial(14):
        with T.block("compute"):
            BC[i // 4, i % 4] = AC[i // 4, i % 4] + 1

    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", T.undef()])
            B[i] = BC[i // 4, i % 4]


# Implementation 4, pad input with arbitrary value, provide no
# guarantees in output.
#
# In back-propagation of constraints, the T.undef() that is cropped
# from BC could be narrowed to a known value provided from the
# successor. AC's padding is written to, so this would propagate
# `PadMapping(predicate, pad_value=BC_pad_value - 1)` to the
# previous operator.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")

    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], T.undef())

    for io, ii in T.grid(4, 4):
        with T.block("compute"):
            BC[io, ii] = AC[io, ii] + 1

    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", T.undef()])
            B[i] = BC[i // 4, i % 4]


# Implementation 5, pad input with known value, analysis of TIR
# successfully propagates pad value through to provide assumption when
# cropping.
#
# In back-propagation of constraints, the output assumption is fixed.
# Unless the operator following addone has included the constraint 1
# as the required value in its padding, the crop/pad pair wouldn't be
# able to be removed. AC's padding is written to, and would propagate
# `PadMapping(predicate, pad_value=0)` to the previous operator.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")

    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

    for io, ii in T.grid(4, 4):
        with T.block("compute"):
            BC[io, ii] = AC[io, ii] + 1

    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", 1])
            B[i] = BC[i // 4, i % 4]


# Implementation 6, pad input with known value, analysis of TIR can't
# successfully propagate pad value through to the output.
#
# In back-propagation of constraints, the output assumption is fixed.
# Since we don't provide an assumption of what will be returned, the
# graph-level pair of `crop(T.undef())` followed by `pad(x)` could
# only be canceled out if `x` is `T.undef()`. AC's padding is written
# to, and would propagate `PadMapping(predicate, pad_value=0)` to
# the previous operator.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")

    for io, ii in T.grid(4, 4):
        with T.block():
            T.block_attr("preproc", "pad")
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

    for io, ii in T.grid(4, 4):
        with T.block("compute"):
            BC[io, ii] = AC[io, ii] + 1

    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", T.undef()])
            B[i] = BC[i // 4, i % 4]
```

</details>

I think the main change is that the temporary stages with annotation will need to allow multiple possibilities, rather than a single definitive layout. These options could then be searched at the graph level to decide on the appropriate layout. After that is decided, the temporary stage could be selected and the transformations hoisted.
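As a rough sketch of what such an annotation might look like (everything here is hypothetical, in particular the `candidate_pad_values` attribute; the point is only that the temporary stage records several options instead of committing to one):

```python
from tvm.script import tir as T

# Hypothetical sketch: the temporary pad stage records the set of pad
# values it could be implemented with. A graph-level search would pick
# one entry, after which the stage is rewritten to that choice and the
# transformation hoisted out of the PrimFunc.
@T.prim_func
def addone(A: T.Buffer[14, "int32"], B: T.Buffer[14, "int32"]):
    AC = T.alloc_buffer([4, 4], "int32")
    BC = T.alloc_buffer([4, 4], "int32")

    for io, ii in T.grid(4, 4):
        with T.block():
            # Invented attribute: either a known zero pad (implementations
            # 3/5 above) or an arbitrary pad value (implementations 2/4).
            T.block_attr("candidate_pad_values", [0, T.undef()])
            # Shown writing the first candidate, 0.
            AC[io, ii] = T.if_then_else(4 * io + ii < 14, A[4 * io + ii], 0)

    for io, ii in T.grid(4, 4):
        with T.block("compute"):
            BC[io, ii] = AC[io, ii] + 1

    for i in T.serial(14):
        with T.block():
            T.block_attr("postproc", ["crop", T.undef()])
            B[i] = BC[i // 4, i % 4]
```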
> But extra amount of care is needed when we attempt to move
> `crop_with_pad_assume`, as it really depends on the value property of its
> input.

Completely agreed. I think this is true at both the TIR and graph levels: allowing assumptions means ensuring that an assumption isn't changed after it has been used for simplifications. The advantage of writing the assumptions at the graph level is that specific pairs of functions (such as `crop_with_pad_assume(pad_value)` followed by `pad_with_value(pad_value)`) can be identified as no-ops, without needing a full proof of it.

I think the main rules that would need to be followed when handling assumptions would be the following three (rule 3 is illustrated in the sketch at the end of this reply).

1. An assumption may be inserted wherever it can be statically proven, or asserted by a user about user-supplied input.

2. An assumption may be removed only if it can be statically proven. Assertions from a user about user-supplied input may never be removed, as they may have already been used to perform irreversible simplifications.

3. Static provers must reset all assumptions about a variable when `T.undef()` is assigned to it, even though these assignments are removed during lowering.

The restriction against changing a PrimFunc's interface falls out directly from rule #1. Since an assumption that restricts the values of an input cannot be statically proven, these assumptions may not be modified.
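To make rule 3 concrete, a minimal sketch, reusing the hypothetical `T.assume` intrinsic from earlier (neither the intrinsic nor this exact prover behavior exists in TIR today):

```python
from tvm.script import tir as T

# Hypothetical sketch of rule 3. A store of T.undef() is stripped out
# during lowering, but a static prover must still treat it as clobbering
# whatever was previously assumed about the stored-to region.
@T.prim_func
def reset_example(A: T.Buffer[(4, 4), "int32"]):
    for io, ii in T.grid(4, 4):
        T.assume(4 * io + ii < 14 or A[io, ii] == 0)

    # ...simplifications may rely on A's padding being zero here...

    for io, ii in T.grid(4, 4):
        if 4 * io + ii >= 14:
            # Rule 3: this assignment invalidates the earlier assumption,
            # even though it produces no code after lowering.
            A[io, ii] = T.undef()

    # ...from this point on, nothing may rely on A's padding being zero...
```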