> In particular, referring to the current quantization pass: every value sits in a domain, which could be fixed point with an implied scale, or floating point. Conversion between domains might be necessary and should be kept to a minimum. The default approach always converts the integer domain back to f32 and uses f32 to exchange values between layers, which may not be the most efficient way.
So, I think we are trying to make two things work together here, which are very difficult to merge. The first is to perform the quantization in the framework and then convert it to a Relay graph; this is what this issue is trying to focus on. The other is to perform the quantization in TVM itself.

Your comment that the conversion between two domains should be minimal applies to the entity that quantizes the network. For example, relu, bias_add, etc. are all fused into the TFLite conv2d for the same reason. If we are converting the framework-quantized model to a Relay graph, then I think we should perform the same computation as defined by the framework quantized graph. If the original graph has domain conversions, then we will have to respect those as well.

We can still perform some graph optimizations, like removing a dequantize followed by a quantize when both use the same quantization parameters (a sketch of this folding is below). I think even with all these inefficiencies, our fusion algorithms and fast kernels should be able to provide better performance than the framework's execution of the quantized graph.

Please let me know your thoughts on this.
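To make the dequantize/quantize cancellation concrete, here is a minimal, self-contained sketch. It does not use the actual Relay/QNN API; the toy `Node` class, the op names `"quantize"`/`"dequantize"`, and the `scale`/`zero_point` fields are assumptions made only for illustration. The point is just the folding rule: a quantize fed by a dequantize with identical parameters is a no-op round trip through f32 and can be skipped.

```python
# Illustrative sketch only; not the real Relay IR or QNN pass.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    op: str                              # e.g. "conv2d", "quantize", "dequantize"
    inputs: List["Node"] = field(default_factory=list)
    scale: Optional[float] = None        # quantization scale, if any
    zero_point: Optional[int] = None     # quantization zero point, if any


def fold_quantize_of_dequantize(node: Node) -> Node:
    """Rewrite quantize(dequantize(x)) -> x when both share the same params."""
    # Rewrite the producers first (bottom-up traversal).
    node.inputs = [fold_quantize_of_dequantize(i) for i in node.inputs]

    if node.op == "quantize" and node.inputs:
        producer = node.inputs[0]
        if (producer.op == "dequantize"
                and producer.scale == node.scale
                and producer.zero_point == node.zero_point):
            # The int -> f32 -> int round trip is a no-op: drop both conversions.
            return producer.inputs[0]
    return node


# Usage: conv2d -> dequantize -> quantize -> relu collapses to conv2d -> relu
# when the dequantize/quantize parameters match.
conv = Node("conv2d")
deq = Node("dequantize", [conv], scale=0.05, zero_point=128)
q = Node("quantize", [deq], scale=0.05, zero_point=128)
relu = Node("relu", [q])

optimized = fold_quantize_of_dequantize(relu)
assert optimized.inputs[0].op == "conv2d"
```

In a real pass over the imported graph, the same check would also have to respect intentional domain conversions that the framework graph encodes; only exact, back-to-back conversions with matching parameters are safe to drop.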