> In particular, referring to the current quantization pass: every value sits 
> in a domain, which could be fixed point with an implied scale, or floating 
> point. Conversion between domains might be necessary and should be kept to a 
> minimum. The default approach always converts the integer domain back to f32 
> and uses f32 to exchange values between layers, which may not be the most 
> efficient way.
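
To make the quoted point concrete, here is a small numpy sketch (not TVM code) of what exchanging values between layers through f32 looks like for an affine-quantized tensor, where real = scale * (q - zero_point). The scales and zero points below are made up purely for illustration:

```python
import numpy as np

def dequantize(q, scale, zero_point):
    # integer domain -> f32 domain
    return scale * (q.astype(np.float32) - zero_point)

def quantize(x, scale, zero_point, dtype=np.int8):
    # f32 domain -> integer domain with an implied scale / zero point
    q = np.round(x / scale) + zero_point
    info = np.iinfo(dtype)
    return np.clip(q, info.min, info.max).astype(dtype)

# Layer A produces int8 in one domain; layer B expects int8 in another domain.
q_a = np.array([-12, 0, 37, 90], dtype=np.int8)
x_f32 = dequantize(q_a, scale=0.05, zero_point=3)   # round trip through f32 ...
q_b = quantize(x_f32, scale=0.1, zero_point=0)      # ... which a direct integer
print(q_b)                                          # requantize could avoid.
```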

So, I think we are trying to make two things work together here that are very 
difficult to merge.
The first is to perform the quantization in the framework and then convert the 
result to a Relay graph; this is what this issue is focused on. The other is 
to perform the quantization in TVM itself. Your comment that the conversion 
between domains should be minimal applies to the entity that quantizes the 
network. For example, relu, bias_add etc. are all fused into the TFLite Conv2d 
for the same reason.
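
To illustrate that fusion point: both bias_add and relu can be applied without leaving the integer domain, so only a single requantize is needed at the end of the fused Conv2d. A rough numpy sketch with made-up parameter names and values (not the actual TFLite kernel):

```python
import numpy as np

def fused_conv_output(acc_int32, bias_int32, requant_scale, out_zero_point):
    acc = acc_int32 + bias_int32                        # bias_add on the int32 accumulator
    q = np.round(acc * requant_scale) + out_zero_point  # one requantize to the int8 domain
    # relu becomes a clamp of the output range (real 0 maps to out_zero_point),
    # so no extra dequantize/quantize pair is ever materialized.
    return np.clip(q, out_zero_point, 127).astype(np.int8)

print(fused_conv_output(np.array([5000, -300, 120000]),
                        np.array([100, 100, 100]),
                        requant_scale=0.001, out_zero_point=-5))
```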

If we are converting a framework-quantized model to a Relay graph, then I think 
we should perform the same computation as defined by the framework's quantized 
graph. If the original graph has domain conversions, then we will have to 
respect those as well. We can still perform some graph optimizations - like 
removing a dequantize followed by a quantize when they have the same 
quantization parameters. I think even with all these inefficiencies, our fusion 
algorithms and fast kernels should be able to provide better performance than 
the framework's execution of the quantized graph.
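
For that dequantize-followed-by-quantize cleanup: the pair is the identity on the stored integer values whenever the scale and zero point match, so it can simply be dropped. A toy Python sketch of the fold; the node and field names here are illustrative, not Relay's actual IR:

```python
from dataclasses import dataclass
from typing import Optional

# Toy IR node, only for illustration; field names are not Relay's.
@dataclass
class Node:
    op: str
    input: Optional["Node"] = None
    scale: float = 1.0
    zero_point: int = 0

def fold_dequantize_quantize(node: Node) -> Node:
    # dequantize -> quantize with identical parameters is the identity on the
    # stored int8 values, so the pair can be removed from the graph.
    if (node.op == "quantize" and node.input is not None
            and node.input.op == "dequantize"
            and node.scale == node.input.scale
            and node.zero_point == node.input.zero_point):
        return node.input.input
    return node

int8_tensor = Node(op="const")
deq = Node(op="dequantize", input=int8_tensor, scale=0.05, zero_point=3)
req = Node(op="quantize", input=deq, scale=0.05, zero_point=3)
assert fold_dequantize_quantize(req) is int8_tensor
```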

Please let me know your thoughts on this.
