> I can see that you might want the graph to represent all the operations prior to optimizing the implementation. I just want to point out that the qrelu implementation can avoid the lowered resolution and can be completely cost free by revising the downscale multiplier and zero point of a preceding quantized output operation (qconv2d in this case). It is cost free because the clipping values are required in any case to do the quantized range saturation.
Yes, you are correct, and that's exactly what TFLite does. For a fused TFLite conv2d, the conversion is different: `TFLite.conv2d (fused relu)` is converted to the following Relay graph

`qnn.conv2d -> nn.bias_add -> qnn.requantize -> clip`

Here, the cost-free ReLU is manifested in the `clip` operation. We will have to add framework parsers for each framework, and the resulting sequence of operators will most probably differ between frameworks. The example in my last comment was meant to explain the fp32 and i8 boundaries and the domain conversions of my proposal that @tqchen was pointing out.
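To make the "cost free" point concrete, here is a minimal NumPy sketch (not the TVM/Relay implementation; the function name and parameters are hypothetical) of what the `qnn.requantize -> clip` tail does. The requantized int8 values must be saturated to the quantized range anyway, so a fused ReLU only raises the lower clip bound to the output zero point rather than adding a separate op:

```python
import numpy as np

def requantize_and_clip(acc_i32, in_scale, out_scale, out_zero_point,
                        fused_relu=False):
    """Requantize int32 accumulators to int8, with an optional fused ReLU.

    Hypothetical helper for illustration only. The ReLU costs nothing
    because the clip is needed for int8 saturation regardless; fusing
    ReLU just changes the lower clipping value to the zero point.
    """
    scaled = np.round(acc_i32 * (in_scale / out_scale)) + out_zero_point
    # Plain saturation clips to the full int8 range [-128, 127]. With a
    # fused ReLU the lower bound becomes the zero point (real value 0).
    lo = out_zero_point if fused_relu else -128
    return np.clip(scaled, lo, 127).astype(np.int8)

acc = np.array([-500, -10, 0, 40, 900], dtype=np.int32)
print(requantize_and_clip(acc, 0.02, 0.1, out_zero_point=-20))
# -> [-120  -22  -20  -12  127]   (int8 saturation only)
print(requantize_and_clip(acc, 0.02, 0.1, out_zero_point=-20, fused_relu=True))
# -> [ -20  -20  -20  -12  127]   (negative real values clamped to real 0)
```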