Thanks, that makes sense. I was thinking that during calibration, you could use
different attributes for the `simulated_quantize` and `simulated_dequantize` ops.
In the callback for calibrating an operator, one can simulate the affine space
and reason about scales and zero points. But for capturing r
I apologize for the long delay.
Thanks @electriclilies and team for nicely written RFC. I support the idea.
Reading through the comments, it seems that many of us are in agreement about
AutoQ and its reliance on the QNN extension. The mentioned pain points mostly
revolve around
* The inconsi
@kevinthesun Pinging in case you have wondered about this before
Sorry for the late reply. Can you try this? `tinfo` is nothing but a `te`
placeholder.
~~~
diff --git a/python/tvm/relay/qnn/op/legalizations.py b/python/tvm/relay/qnn/op/legalizations.py
index 50e5a02f8..8add434c1 100644
--- a/python/tvm/relay/qnn/op/legalizations.py
+++ b/python/tvm/relay/qnn/op/
~~~
How about using the Relay Legalize pass to add explicit padding at the graph
level?
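For illustration, a minimal sketch of such a Legalize callback, assuming the usual `(attrs, inputs, types)` FTVMLegalize signature; the helper name is hypothetical and the registration wiring is omitted:
~~~
# Hypothetical legalize callback: pull the implicit conv2d padding out into an
# explicit nn.pad so downstream scheduling only ever sees padding == 0.
from tvm import relay

def legalize_conv2d_with_pad(attrs, inputs, types):
    data, kernel = inputs
    pt, pl, pb, pr = (int(x) for x in attrs.padding)  # assuming 4-way padding
    if pt == pl == pb == pr == 0:
        return None  # nothing to do, keep the original call
    # Explicit padding on H and W (NCHW assumed for brevity).
    data = relay.nn.pad(data, pad_width=((0, 0), (0, 0), (pt, pb), (pl, pr)))
    return relay.nn.conv2d(
        data, kernel,
        strides=attrs.strides,
        padding=(0, 0),
        dilation=attrs.dilation,
        groups=attrs.groups,
        data_layout=attrs.data_layout,
        kernel_layout=attrs.kernel_layout,
        out_dtype=attrs.out_dtype,
    )
~~~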
Hi @giuseros
You are correct that `qnn.conv2d` and `qnn.requantize` are different operators,
and both of them are lowered to a sequence of Relay operators. But here the
strength of Relay comes in: Relay fuses `nn.conv2d` followed by a large number
of elemwise ops into one operator. This can
@tqchen The problem arises because LLVM codegen is not able to use suitable
instructions. A fixed point multiply at Relay level will have to upcast the
input tensors to int64. ARM instructions that @giuseros shared take int32
tensors and perform the upcasting internally in the HW (please corre
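To make the precision point concrete, a rough numpy sketch (not TVM code, and the rounding details are only illustrative) of a Relay-level fixed point multiply that must widen to int64, whereas the ARM instructions keep int32 operands and widen internally:
~~~
import numpy as np

def fixed_point_multiply(x, multiplier, shift):
    """x: int32 tensor, multiplier: Q31 fixed point int32, shift: right shift >= 0."""
    prod = x.astype(np.int64) * np.int64(multiplier)   # must widen to int64
    prod = (prod + (1 << 30)) >> 31                    # take the rounded high half
    nudge = (1 << (shift - 1)) if shift > 0 else 0
    return ((prod + nudge) >> shift).astype(np.int32)

x = np.array([1 << 20, -(1 << 20)], dtype=np.int32)
print(fixed_point_multiply(x, multiplier=1 << 30, shift=1))  # roughly x * 0.25
~~~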
Thanks for the nice RFC.
Trying to understand if I missed anything. What will happen for non-ARM
machines? Are we going to use fixed_point_multiply relay operator for non-ARM
machines and then use injective schedule?
I think we are getting confused because of the overloaded term quantization. To
be precise, maybe we can stick to certain terms
* *QNN Dialect* - Framework (like TF/PyTorch/MXNet) performs quantization.
Relay parser reads this pre-quantized model and creates a QNN-dialect graph.
QNN ops are l
LGTM. I think we can rename to `get_calibration_data` or `get_profiling_data`
instead of `calibrate_partition_graph`. I think calibration means more than
collecting i/o tensors (for quantization, it means choosing min/max such that
quantized data representation is similar to float32 data repres
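As a reference for the distinction being drawn, a purely illustrative helper (not an existing TVM function) for the "choose min/max" part of calibration, mapping an observed range to a uint8 affine `(scale, zero_point)` pair:
~~~
import numpy as np

def choose_qparams(tensor, qmin=0, qmax=255):
    # Include 0.0 in the range so the zero point is exactly representable.
    tmin = min(float(tensor.min()), 0.0)
    tmax = max(float(tensor.max()), 0.0)
    scale = (tmax - tmin) / (qmax - qmin) or 1.0
    zero_point = int(np.clip(round(qmin - tmin / scale), qmin, qmax))
    return scale, zero_point
~~~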
I pushed an empty commit to retrigger the CI -
https://coderwall.com/p/vkdekq/git-commit-allow-empty
@FrozenGene Can you please review when you get time?
@FrozenGene @giuseros If QNN Legalization is causing issues, we can remove QNN
legalization for ARM CPUs altogether and move the logic to Alter Op Layout.
Alter Op Layout might become more complicated (for example, we might have to
handle uint8 x int8 input and kernel dtypes in Alter Op Layout now). Just
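For context, a hedged sketch of the kind of dtype handling that would move: given uint8 data and an int8 kernel, shift the kernel into uint8 and bump its zero point by 128 so both operands share a dtype. The helper name is illustrative, not the actual pass:
~~~
from tvm import relay

def make_operand_dtypes_uniform(data_u8, kernel_i8, kernel_zero_point):
    # int8 -> uint8: q_u8 = q_i8 + 128, and the zero point shifts by 128 too.
    shifted = relay.cast(kernel_i8, "int32") + relay.const(128, "int32")
    kernel_u8 = relay.cast(shifted, "uint8")
    new_kernel_zp = int(kernel_zero_point) + 128
    return data_u8, kernel_u8, new_kernel_zp
~~~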
Also cc @FrozenGene @thierry @masahi
Ping @ziheng, I was wondering if you are pursuing this direction and have any
update.
Hi @ziheng, I was wondering if you got a chance to work on this further. Any
kind of update?
Currently, Relay conv2d internally decides whether a Relay Conv2d operator is
depthwise or not. This makes the code somewhat messy - lots of if conditions and
indirections, and the quite confusing HWIO vs HWOI kernel layouts. In addition,
it is difficult to understand from the debug runtime if the conv ope
Hi @jianyuh I am getting the following error when I try to run my benchmark:
~~~
LLVM ERROR: Cannot select: 0x23809ef0: v16i32 = X86ISD::VPDPBUSD 0x210a09a8,
0x210a02c0, 0x19eb81b0
0x210a09a8: v16i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>,
Constant:i32<0>, Cons
~~~
Closed #3617.
This is solved. Closing.
@yzhliu Gave it some more thought over the last few days. I think there might
be a slightly better way to deal with layouts.
* Instead of directly using the `transpose` operators, maybe we can use some
new annotation ops like `annotate.change_layout` (or maybe use
`layout_transform`). This will hav
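A small example of the `layout_transform` option mentioned above (illustrative only): the layout intent stays explicit in the graph, so a later layout pass can fold back-to-back transforms instead of reasoning about raw transposes.
~~~
from tvm import relay

x = relay.var("x", shape=(1, 3, 224, 224), dtype="float32")  # NCHW
y = relay.layout_transform(x, "NCHW", "NHWC")                # explicit layout change
z = relay.layout_transform(y, "NHWC", "NCHW")                # foldable back to x
~~~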
I see. I missed the implementation detail point. My first preference is to
place it inside `Type` (but I guess that may not be the preferred choice as of
now, given how frameworks handle layout).
The second option that you give is pretty good too. However, how do we read the
layout, for example, i
If it's OK, I will give a couple of reasons why I think treating layouts as
first-class citizens is important (the world can do with one *more* opinion :) )
* It seems to me that layout was an afterthought for the frameworks. They
started with just one layout; as deep learning progressed, we reali
What do you guys think about having `Layout` as a member of `TensorType`?
Currently `Type` basically means dtype and shape. I think it is very useful to
have `Layout` there as well. If that's the case, the `Legalize` API will get
arg_dtypes, and thus layout, enabling transformation based on input
Thanks @jackwish and @FrozenGene I understand your points.
This can be treated as an optimization then. If the input zero point is zero,
OR if the input and output quantization params are the same, don't cast;
directly apply maxpool. Generally, we would like to keep the QNN APIs generic.
So, if MxNet for s
Thanks @jackwish for confirming the python lowering looks good.
For max pooling, we used casting because we have to subtract the zero point
from the quantized tensor. That subtraction needs to happen at a higher
precision than (u)int8. Correct me if I am wrong.
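For reference, a hedged sketch of the lowering being discussed (not the final QNN code, helper name illustrative): subtract the zero point in int32, pool, add it back, then clip and cast back down.
~~~
from tvm import relay

def qnn_max_pool2d(q_data, zero_point, pool_size=(2, 2), strides=(2, 2)):
    x = relay.cast(q_data, "int32")                       # needs > 8-bit precision
    x = x - relay.const(zero_point, "int32")
    x = relay.nn.max_pool2d(x, pool_size=pool_size, strides=strides)
    x = x + relay.const(zero_point, "int32")
    x = relay.clip(x, a_min=0, a_max=255)                 # assuming uint8 output
    return relay.cast(x, "uint8")
~~~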
@FrozenGene Can you please review #3627
@tqchen @FrozenGene Did you get a chance to take a look at this? Please let us
know your thoughts. We have some more QNN ops in the works and are following
this proposal for now. It will be good if we can get some feedback here.
@jnorwood We are using intrinsics for Skylake and there is already a PR to take
advantage of VNNI intrinsics #3388
We added a QNN max_pool2d operator to show the file changes required in this
proposal. Please share your thoughts!
Thanks @jackwish
This is a very good analysis. Everything makes sense. I vote for restricting to
`(u)int8` for now for `Quantize` and `Dequantize`.
If, in the future, we see `(u)int16`, we can tackle it then. `int32` is highly
unlikely (why not just go to `FP32`, as you say).
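For reference, the two affine maps this restriction applies to, written out as a plain numpy sketch for the uint8 case:
~~~
import numpy as np

def quantize(x_fp32, scale, zero_point, qmin=0, qmax=255):
    return np.clip(np.round(x_fp32 / scale) + zero_point, qmin, qmax).astype(np.uint8)

def dequantize(q_u8, scale, zero_point):
    return (q_u8.astype(np.int32) - zero_point).astype(np.float32) * scale
~~~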
Relevant QNN Dialect RFC - #3591
Some QNN operators, like Requantize and Conv2D, are more amenable to going
through a C++ lowering pass. A couple of cases where a C++ implementation seems
better are - when the new operator is conceptually very different from existing
operators (Requantize), or when input/
@FrozenGene I don't think `requantize` should take output_min and output_max.
We can use `requantize` before/after any operator, where `relu` might not be
applicable at all. Instead, I would suggest having two clip operators, and then
relying on Relay passes to optimize the graph - in this case c
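An illustration of the separate-clip idea (argument names follow the requantize proposal further down in this thread and should be treated as tentative): requantize stays free of output bounds, and the activation is an explicit `relay.clip` that later passes can fuse or fold.
~~~
from tvm import relay

q = relay.var("q", shape=(1, 64, 56, 56), dtype="int32")
r = relay.qnn.op.requantize(
    q,
    input_scale=relay.const(0.25, "float32"),
    input_zero_point=relay.const(0, "int32"),
    output_scale=relay.const(0.5, "float32"),
    output_zero_point=relay.const(0, "int32"),
    out_dtype="int8",
)
out = relay.clip(r, a_min=0, a_max=127)  # the fused relu, kept as its own op
~~~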
@jnorwood Yes, bias is kept outside as a separate operator. But, this can be
fused with the qnn.conv2d.
Regarding the accumulation point, if we perform fusion and add the bias in
`int32` in the accumulator at the end, is it any different than preloading the
accumulator? We need to ensure that op
@FrozenGene Updated the Conv2D API. Also, added a diagram explaining how to go
from TFLite to Relay operators.
### QNN Conv2D operator
Tensorflow
~~~
tf.nn.quantized_conv2d(
    input,
    filter,
    min_input,
    max_input,
    min_filter,
    max_filter,
    strides,
    padding,
    out_type=tf.dtypes.qint32,
    dilations=[1, 1, 1, 1],
    name=None
)
~~~
MxNet
~~~
mxnet.symbol.contrib.qua
~~~
@jnorwood Thanks for the comment. Both are good points. I will keep those
abilities, though, outside the scope of the requantize op.
Another function (not necessarily a Relay operator) can take min/max, a config
(like nudge that zero is exactly representable) and generates scale and zero
point as per
@FrozenGene @tqchen @u99127 Can you please approve the above API, so that we
can move to the next discussion? I have so many things to discuss :)
Let's start with just Requantize to keep it focused.
### QNN proposal
~~~
def requantize(data,
               input_scale,
               input_zero_point,
               output_scale,
               output_zero_point,
               rounding="AWAY_FROM_ZERO",
               out_dtype="int8"):
~~~
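For reference, a numpy sketch of what the proposed op computes; the rounding mode and saturation are exactly what the `rounding` and `out_dtype` parameters control (here `np.round`, which rounds half to even, only stands in for the real rounding choice):
~~~
import numpy as np

def requantize_ref(q_in, input_scale, input_zero_point,
                   output_scale, output_zero_point, out_dtype=np.int8):
    real = (q_in.astype(np.int64) - input_zero_point) * input_scale
    q_out = np.round(real / output_scale) + output_zero_point
    info = np.iinfo(out_dtype)
    return np.clip(q_out, info.min, info.max).astype(out_dtype)
~~~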
@tqchen Thanks for reminding. Just created one :)
We are proposing a new dialect named `QNN` that introduces quantized versions
of existing Relay operators. The goal is to support models that have been
pre-quantized in the framework.
Some important notes about QNN dialect are
* QNN operators are lowered to existing Relay operators to ens
I agree, we should move the proposal to a new thread.
Yes, I can lead the proposal discussion.
> http://ci.tvm.ai:8080/blue/organizations/jenkins/tvm/detail/PR-3388/6/pipeline/
> Not sure why “llvm.x86.avx512.pmaddubs.w.512“ (AVX512 instruction, not VNNI
> instruction) is not recognized as an LLVM intrinsic.
This is happening because the LLVM version in the CI is 6.0, as Tianqi
mentioned. You
> I will update the CI to add LLVM8 this week.
Hi @tqchen, is there any update on the LLVM8 front? We are also looking into
this and have similar test issue.
We have made good progress on the Quantization RFC, achieving clarity and
convergence on many points.
For this PR specifically, @tqchen and @FrozenGene, can you please comment on
whether this looks in line with our quantization RFC?
> slight difference in a single point(0.5) is fine and likely won’t have an
> impact on final acc
Yeah, I was planning to add a rounding param to the op. For "ceil", we could
just add a 0.5 rounding without worrying about negative values. For "round", we
can be more precise. By default, we can
> One thing to be careful about is that when using shift and normalize, right
> shift corresponds to round down as opposed to round to nearest, an additional
> 0.5 equivalence needs to be added to get the round behavior
Yes, I think it is a little more complicated. The std::round of -2.5 is -3.
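A concrete integer-only version of that point (illustrative numpy, not the eventual implementation): a plain right shift rounds toward negative infinity, adding 2^(s-1) gives round-half-up, and matching std::round's away-from-zero behaviour needs the nudge to follow the sign of the operand.
~~~
import numpy as np

def round_shift_away_from_zero(x, s):
    x = x.astype(np.int64)
    half = 1 << (s - 1)
    pos = (x + half) >> s          # round half up for non-negative values
    neg = -((-x + half) >> s)      # mirror it for negatives: -2.5 -> -3
    return np.where(x >= 0, pos, neg)

print(round_shift_away_from_zero(np.array([5, 4, -4, -5]), 1))  # [ 3  2 -2 -3]
~~~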
Thanks everybody for the fruitful discussion. I think we are gradually reaching
convergence :)
I have been prototyping qnn.conv2d and qnn.requantize at
https://github.com/dmlc/tvm/pull/3367
I still have a few loose ends to fix. I will update once I am done, and then we
can discuss if the im
> And in the case when the scale is a power of two, use shift and normalize
> might be better than float scale and round
Yes, the shift and normalize can be done completely with integer arithmetic
instead of going to floating point (even if the scales are not powers of 2). I
have been prototyping that.
> I can see that you might want the graph to represent all the operations prior
> to optimizing the implementation. I just want to point out that the qrelu
> implementation can avoid the lowered resolution and can be completely cost
> free by revising the downscale multiplier and zero point of a
Thanks @tqchen for the detailed explanation.
Actually, my proposal is simpler. My `qnn.relu` does not convert to the three
stages that you mentioned. It only performs the `relu_int_i8`.
The frameworks (at least TFLite and MxNet) do not go back to FP32 unless the
operator is not supported in `i8`.
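In other words, `relu_int_i8` stays entirely in the quantized domain; something like the following sketch (illustrative, with a hypothetical helper name):
~~~
from tvm import relay

def qnn_relu(q_data, zero_point):
    # relu(x) = max(x, 0) in float becomes max(q, zero_point) on int8 values,
    # with no dequantize/requantize round trip through FP32.
    return relay.maximum(q_data, relay.const(zero_point, "int8"))
~~~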
> In particular, refer to the current quantization pass, every value could sit
> in a domain, which could be fixed point with an implied scale, or floating
> point. Conversion between domains might be necessary and should be conducted
> in a minimum way. The default way always convert integer do
> Do we allow mix of standard ops and qnn ones?
The framework-parsed graph might have a mix (as shown in the lowering of
qconv2d). But in the `relay.build` function, my first pass would be the
quantize_rewrite pass, which will convert all the `qnn` ops to existing Relay
ops, resulting in the whole graph
@tqchen Added the case for qrelu. (I think the asymmetric lowering can be
improved further, but that's not the point.)
Similarly, for quantized avg pool2d, as @FrozenGene mentioned, we will still
need to upcast the tensor to int32 to avoid saturation. Additionally, we would
need to handle the zer
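A hedged sketch of that avg pool point (zero point handling elided, helper name illustrative): accumulate in int32 so the window sum cannot saturate 8 bits, then cast back down.
~~~
from tvm import relay

def qnn_avg_pool2d(q_data, pool_size=(2, 2), strides=(2, 2)):
    x = relay.cast(q_data, "int32")   # widen before summing the pooling window
    x = relay.nn.avg_pool2d(x, pool_size=pool_size, strides=strides)
    return relay.cast(x, "uint8")
~~~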

@tqchen What are your thoughts?
Seems like we are agreeing on the proposed design abstraction. There is a
concern about not being able to achieve the best schedule performance. We can
try to tackle it with fusion and schedule_tagging.
@jnorwood Yes, I understand your point. We can use the clip to saturate the
values even if Relu was not fused. It fits in the design and the proposed
abstractions.
@FrozenGene Thanks for the quick feedback on the design.
I understand the performance concern. Let's try to tackle them in fusion.
Fusion already performs compute_inline to bring the computation to the right
location. Hopefully, with some tagging and some arm-twisting, we can achieve
the same tens


. For example -
`relay.op.qnn.conv2d` can be lowered to
~~~
fn (%quantized_data: Tensor[(2, 1, 2, 4), uint8], %weight: Tensor[(3, 1, 2, 2),
uint8]) -> Tensor[(2
~~~
Finally, we are starting to converge :)
I am proposing them on the basis of the Resnet network for now.
`relay.op.qnn.conv2d`
`relay.op.qnn.dense`
`relay.op.qnn.relu`
`relay.op.qnn.max_pool2d`
`relay.op.qnn.avg_pool2d`
`relay.op.qnn.concat` (used in Inception)
`relay.op.qnn.quantize`
`relay.op.qnn.d
@jackwish Yes, `qnn` stands for a generic quantized nn, and not QNNPACK. I
think @tqchen also means the same thing.
I completely agree with breaking things down into primitive ops. Even the
`relay.op.qnn` ops should be broken down into primitive ops. If a primitive op
does not exist, we will discuss and maybe create one. I understand the Relay
fusion part. I am trying to make another point.
I am trying to understand
Thanks @tqchen
Of the two choices, I am inclined towards `relay.op.qnn`. My hope is that
different frameworks converge to the same `qnn` ops. The `relay.op.tflite`
option seems very specific as of now. I agree that these new ops should have a
special op_level.
I am still unclear about where to d
@tqchen @FrozenGene @ZihengJiang @zhiics @wweic @eqy
> Although the quantized conv result is held in uint8, it could be static
> casted to signed int8, or even fewer than 8 bit quantization. That would
> require both min and max saturations, as in the reference tflite quantized
> conv implementation
Ah, I see. That finally makes sense.
So, this i
> I think it is ok. If we do this way, we should insert one clamp if we have
> activation.
> Like our tflite frontend
Yes, I agree with that. That's exactly what I was thinking.
@FrozenGene For the output_min and max, isn't the out_dtype enough? If it's
uint8, we can clamp at 0 and 255. If it's int8, we can clamp at -128 and 127. I
don't see any reason the values would be different, unless you want to fuse the
quantized relu into the quantized convolution from the starti
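The point in two lines, as an illustrative helper: the saturation bounds follow from out_dtype alone, so no extra output_min/output_max arguments are needed.
~~~
import numpy as np

def saturation_range(out_dtype):
    info = np.iinfo(out_dtype)   # "uint8" -> (0, 255), "int8" -> (-128, 127)
    return info.min, info.max
~~~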
@FrozenGene Thanks for replying. I might be wrong, but I don't think it is a
good design to take one codegen backend like QNNPACK and make changes all the
way into Relay APIs to make the connection. In my opinion, APIs must be minimal.
But, your point of using QNNPACK is completely valid. I have
@tqchen @FrozenGene @jackwish
I have added a prototype patch. I think it will be helpful to use that patch to
drive the discussion further.
OK, let's try to finalize the high-level design points. Let's first discuss the
# Namespace for the tflite quantize style dialect
### Requirements
* This should support both symmetric and asymmetric quantization.
* These ops should never go through codegen. They will be lowered to low-level
Relay ops (like exi
It seems like an NHWC problem. The conv should have an arg called
data_layout="NHWC" here.
This is most probably out of the context of the issue, but is it possible for
all of the people commenting here to join a conference call for an hour and
figure out the next steps? I can take notes and document them here for
everybody else to see. I think it will be more productive.
I would suggest designing the infrastructure so that it supports both
symmetric and asymmetric quantization. We can certainly start with symmetric to
flush the flow, while keeping in mind that we can share as much infrastructure
as possible between them.
> * namespace for the tflite quantize style dialec
> For the `q_conv2d`, we will add two more arguments.
>
> ```python
> output_min=0,
> output_max=0
> ```
>
> These will be used for restrict the output range, which could be calculated
> previously.
I see what you are saying, but I am not sure if this is the right approach. In
my opinion,
> > > For the `q_conv2d`, we will add two more arguments.
> > > ```python
> > > output_min=0,
> > > output_max=0
> > > ```
> > >
> > >
> > > These will be used for restrict the output range, which could be
> > > calculated previously.
> >
> >
> > I see what you are saying, but I am not su
> Yes, I believe the MobilenetV2 relu_6 is effectively fused in by the
> downscale saturation. You might need it if you want to support their way of
> training, though.
>
> Yes Mobilenet has the q_add, but I suggest the Inceptionv3 for q_concatenate,
> since it also has concat nodes feeding int
> Hi @anijain2305 regarding the requantization, if the it is not going to put
> in conv op, the op may suppose to output FP32, otherwise the semantic is
> confusing. The requantization can convert FP32 to INT8. The multiplier/shift
> based reuantization approach introduced by TFLite is also adop
Thanks. Let's lay down the high-level API design for some of the quantized
operators. A large portion of this is coming from the following relevant
discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their
experiences with quantization, and also @shoubhik for helping design t
Adding others who might be interested in this @ajtulloch @eqy @ZihengJiang
@tqchen
To increase quantization support in TVM, it is necessary to support the
pre-quantized models, i.e., the models that have been quantized in the
framework itself (outside of Relay). In this issue, we are laying down the
high-level API design for some of the quantized operators. A large portion of
@FrozenGene I am interested in contributing to this Issue. Is it possible to
share the progress?