The implementation of the hardware is not of interest to the high-level Relay 
program: all Tensor-to-Tensor functions are black boxes. They can be implemented 
any way you want, in C++, in TVM, or as a hardware accelerator primitive. If you 
want to map a subset of the program down to such hardware you will have to 
unroll it, as most fixed-function hardware requires. You can then treat the 
unrolled subprogram as a new operation and rewrite the program to use it 
instead. 
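
As a rough, self-contained sketch of that rewrite (toy data structures, not the actual Relay IR or pass APIs; `accel.fused_conv_stack` is a hypothetical target op name): a data-dependent loop over a black-box operator is unrolled to a fixed trip count, and the unrolled region is then outlined into a single opaque operation a fixed-function backend could implement.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Call:                       # one black-box Tensor-to-Tensor invocation
    op: str
    args: List[str]

def unroll(op: str, arg: str, trip_count: int) -> List[Call]:
    """Unroll `for i in range(trip_count): x = op(x)` into straight-line calls."""
    body = []
    for i in range(trip_count):
        body.append(Call(op, [arg if i == 0 else f"%t{i - 1}"]))
    return body

def outline(body: List[Call], new_op: str) -> Call:
    """Replace the unrolled region with a single fused op the target implements."""
    region_inputs = list(body[0].args)      # values flowing into the region
    return Call(new_op, region_inputs)

loop_body = unroll("conv2d_relu", "%x", trip_count=3)
fused = outline(loop_body, "accel.fused_conv_stack")   # hypothetical target op
print(loop_body)
print(fused)
```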

Hardware that does not support arbitrary and dynamic program sizes cannot 
execute all models of interest; such models fundamentally don't fit into 
Halide/TVM-style DSLs. The deep learning community has focused on optimizing a 
small subset of models with very regular behavior, but the next wave of models 
invalidates many assumptions, such as the statically known dimensions or static 
control flow required by polyhedral optimizers. The point of this VM is to 
coordinate at the higher level, where you need iteration, dynamic allocation, 
and communication. 
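
To make that kind of dynamism concrete, here is a hedged toy example (the decoder step is a random stand-in, not a real model): a greedy decoding loop whose trip count and output length depend on the data, which is exactly what a statically scheduled, fixed-shape pipeline cannot express but a VM can coordinate.

```python
import numpy as np

def greedy_decode(step, state, start_token, eos_token, max_len=128):
    """Data-dependent loop: iteration count and output size depend on the input."""
    tokens = [start_token]
    while tokens[-1] != eos_token and len(tokens) < max_len:
        logits, state = step(tokens[-1], state)   # black-box tensor-to-tensor call
        tokens.append(int(np.argmax(logits)))
    return tokens                                 # dynamically sized result

rng = np.random.default_rng(0)

def toy_step(token, state):
    # Stand-in for a real decoder step (e.g. an RNN cell plus a projection).
    state = np.tanh(state + token * 0.01)
    return rng.normal(size=32), state

print(greedy_decode(toy_step, np.zeros(8), start_token=1, eos_token=0))
```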

I have thought further about a register-based VM and see no strong argument for 
why registers are better than stacks. Most of the research on dynamic VMs 
focuses on this distinction in order to reduce memory movement and dispatch 
overhead while executing the application. Packed functions *will* dominate 
execution time, and optimizing for dispatch is an incredibly premature 
optimization. 
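
A rough back-of-envelope (machine-dependent numbers, with a NumPy matmul standing in for a packed function) illustrates why: a dispatch-sized operation costs on the order of nanoseconds, while one medium kernel costs orders of magnitude more.

```python
import timeit
import numpy as np

a = np.random.rand(512, 512)
b = np.random.rand(512, 512)
ops = {"matmul": np.matmul}       # stand-in operator table

# Cost of "dispatch" alone: a table lookup plus a Python call.
dispatch = timeit.timeit(lambda: ops["matmul"], number=100_000) / 100_000
# Cost of actually running one packed-function-sized kernel.
kernel = timeit.timeit(lambda: ops["matmul"](a, b), number=100) / 100

print(f"dispatch ~{dispatch * 1e9:.0f} ns per op, kernel ~{kernel * 1e6:.0f} us per op")
```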

The other argument for register-based VMs is instruction-level parallelism. 
Again, instructions don't matter much here: the meaningful parallelism happens 
at the data dependencies between operators, and inside the operators themselves 
(e.g. a parallel matrix multiply).

The point of the parallel monad approach is not to add it to the source 
language, but that the execution technique is a valid way for us to get 
parallelism between operator invocations. We can view the future graph as the 
data-dependency graph and do graph reduction over it. 
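
A minimal sketch of that execution technique, using Python futures as a stand-in for whatever the VM would actually use: each operator invocation becomes a future that blocks only on its own inputs, so independent branches of the dependency graph run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

pool = ThreadPoolExecutor()

def invoke(op, *dep_futures):
    """Schedule a black-box operator; it blocks only on its own inputs."""
    return pool.submit(lambda: op(*[f.result() for f in dep_futures]))

x = pool.submit(lambda: np.random.rand(256, 256))
left  = invoke(lambda t: t @ t, x)   # these two branches have no dependency
right = invoke(np.tanh, x)           # on each other, so they can overlap
out   = invoke(np.add, left, right)  # the join synchronizes only on its inputs

print(out.result().shape)
pool.shutdown()
```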

For example, if I depend on a sequence of function calls it is valid to 
evaluate them in parallel while evaluating a future computation that may depend 
on their results. The amount of synchronization needed here is very small, and 
again the real opportunity for parallelism is inside the operators, in the 
Tensor-to-Tensor functions. We don't need to worry about where the results are 
stored; we essentially give a result a register name when we push its future 
into stack position `n`. 

In a sense we already have an infinite register file, because we can address 
any stack position. In this case we can easily address a future's result by 
referencing position `n`. The only difference is the location where operations 
look for their results. We do need a call stack for functions, and function 
calls are the primary operation based on observations of current workloads.
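
For concreteness, here is a hedged toy sketch of that addressing scheme (an illustration of the idea, not the actual VM or VMCompiler): values, possibly futures, live in stack slots, operands are named by slot index relative to the current call frame, and the stack behaves like an unbounded register file.

```python
class Frame:
    def __init__(self, base):
        self.base = base          # first stack slot owned by this call

class MiniVM:
    def __init__(self):
        self.stack = []           # effectively an unbounded register file
        self.frames = [Frame(0)]

    def push(self, value):
        """Give a (possibly not-yet-ready) result a 'register name': its slot index."""
        self.stack.append(value)
        return len(self.stack) - 1

    def load(self, n):
        """Operands are addressed by stack position, relative to the current frame."""
        return self.stack[self.frames[-1].base + n]

    def call(self):
        self.frames.append(Frame(len(self.stack)))

    def ret(self):
        frame = self.frames.pop()
        result = self.stack[-1]
        del self.stack[frame.base:]   # pop the callee's slots
        return self.push(result)      # leave the result in the caller's region

vm = MiniVM()
a = vm.push(2.0)
b = vm.push(3.0)
c = vm.push(vm.load(a) * vm.load(b))
print(vm.load(c))                     # 6.0
```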

Furthermore, the current approach makes the VMCompiler far simpler and easier 
to extend. 

I personally value simplicity, and we have *zero* evidence that the current 
approach is slow; in fact we have evidence to the contrary. The initial 
prototype is already faster than MXNet's executor, which is used in production 
at AWS.
