The implementation of the hardware is not of interest to the high-level Relay program; all tensor-to-tensor functions are black boxes. They can be implemented any way you want: in C++, in TVM, or as a hardware accelerator primitive. If you want to map a subset of the program down to such hardware, you will have to unroll it, which most fixed-function hardware requires anyway. You can then package the unrolled program as a new operation and rewrite the program to use it instead.
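To make that concrete, here is a minimal conceptual sketch in plain Python (not the actual Relay pass or operator-registration API): a loop with a statically known trip count is unrolled, and the unrolled region is wrapped as a single opaque operation with a hypothetical name (`accel.recurrence4`) that the rewriter could substitute into the program.

```python
# Conceptual sketch only: unroll a computation with a fixed trip count,
# then expose the unrolled program as one opaque "operation" that a
# fixed-function backend is free to implement however it likes.

def unroll(body, trip_count, state):
    """Replay `body` trip_count times, threading the loop state through."""
    for _ in range(trip_count):
        state = body(state)
    return state

def as_black_box(unrolled_fn, name):
    """Wrap the unrolled program as a named opaque operation."""
    def op(*args):
        # Dispatch to whatever backs `name`: C++, TVM-generated code,
        # or a hardware primitive. Here we just call the reference version.
        return unrolled_fn(*args)
    op.__name__ = name
    return op

# Example: a 4-step recurrence becomes one fused op the rewriter can target.
step = lambda x: x * 2 + 1
fused = as_black_box(lambda x: unroll(step, 4, x), "accel.recurrence4")
print(fused(3))  # 63
```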
Hardware that does not support arbitrary and dynamic program sizes cannot execute all models of interest; such models fundamentally don't fit into Halide/TVM-style DSLs. The deep learning community has focused on optimizing a small subset of models with very regular behavior, but the next wave of models invalidates many assumptions, such as the statically known dimensions or static control flow required by polyhedral optimizers. The point of this VM is to coordinate at the higher level, where you need iteration, dynamic allocation, and communication.

I have thought further about a register-based VM and see no strong argument for why registers are better than stacks. Most of the research on dynamic VMs focuses on this distinction in order to reduce memory movement and dispatch overhead while executing the application. Packed functions *will* dominate execution time, so optimizing for dispatch is an incredibly premature optimization. The other argument for register-based VMs is instruction-level parallelism. Again, instructions don't matter much here; meaningful parallelism happens at data dependencies between operators and inside the operators themselves (i.e. parallel matrix multiplication).

The point of the parallel monad approach is not to add it to the source language, but that the execution technique is a valid way for us to get parallelism between operator invocations. We can view the future graph as the data dependency graph and do graph reduction over it. For example, if I depend on a sequence of function calls, it is valid to evaluate them in parallel while evaluating a future computation that may depend on their results. The amount of synchronization needed here is very small, and again the real opportunity for parallelism is inside the tensor-to-tensor operators. We don't need to worry about where the results are stored; we essentially give a result a register name when we push a future into stack position `n`. In a sense we already have an infinite register file, because we can address any stack position, so we can easily refer to a future result by referencing position `n`. The only difference is the location where operations look for their results.

We need a call stack for functions, and function calls are the primary operation based on observations of current workloads. Furthermore, the current approach makes the VMCompiler far simpler and easier to extend. I personally value simplicity, and we have *zero* evidence that the current approach is slow; in fact we have evidence to the contrary. The initial prototype is already faster than MXNet's executor, which is used in production at AWS.
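Here is a minimal sketch of that "stack slot as register name" idea, again hypothetical rather than the actual VM implementation: independent packed-function calls are launched asynchronously, their futures live in stack slots, and an instruction that needs slot `n` simply forces the future stored there. The names `StackVM`, `invoke_packed`, and `force` are illustrative only.

```python
# Hypothetical sketch (not the TVM VM): futures stored at stack positions
# act like registers, and synchronization happens only when a consumer
# actually reads a slot.

from concurrent.futures import ThreadPoolExecutor

class StackVM:
    def __init__(self):
        self.stack = []                      # each slot: a value or a future
        self.pool = ThreadPoolExecutor()

    def invoke_packed(self, fn, *arg_slots):
        """Launch a black-box operator; its result lives at a new slot."""
        args = [self.force(i) for i in arg_slots]
        self.stack.append(self.pool.submit(fn, *args))
        return len(self.stack) - 1           # "register name" = slot index

    def force(self, slot):
        """Synchronize only when the value at `slot` is actually needed."""
        v = self.stack[slot]
        return v.result() if hasattr(v, "result") else v

vm = StackVM()
vm.stack.append(3.0)                          # slot 0: stand-in for an input tensor
a = vm.invoke_packed(lambda x: x * x, 0)      # these two calls have no data
b = vm.invoke_packed(lambda x: x + 1, 0)      #   dependency and may overlap
c = vm.invoke_packed(lambda x, y: x + y, a, b)
print(vm.force(c))                            # 13.0
```

The point of the sketch is only that dispatch and bookkeeping stay trivial while the real work (and the real parallelism) happens inside the black-box operator bodies.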