### RFC
This PR is based on the following RFC:
https://discuss.tvm.ai/t/rfc-improve-quantized-convolution-performance-for-armv8-architectures/6920
### High-level description of the submission
The main algorithm lives in:
* topi/python/topi/arm_cpu/conv2d_gemm.py (schedule)
* topi/python/topi/arm_
CC: @u99127 @anijain2305
Hi @FrozenGene ,
Thanks a lot for your comments. I will address general replies here, and code
comments in a separate reply.
* I indeed read your discuss
[post](https://discuss.tvm.ai/t/tflite-and-tvm-comparison-for-quantized-models/6577/4),
but I thought the work was orthogonal to this one. M
Hi @FrozenGene ,
About the code changes.
1) It will be hard to do this. The point is that the legalization is done in
Relay before picking the strategy (thus, it is unaware of the strategy picked).
To keep both legalizations I need somehow to pass information from the strategy
(e.g., the name o
Hi @FrozenGene
Just to clarify: I am enjoying the discussion, and since the optimization space
is wild, I agree that it is worth evaluating different approaches.
* About the Raspberry+mobilenet v2, good to know you are working on Armv8-A
(sorry to have assumed otherwise). However, there is still th
Hi @FrozenGene ,
The idea of adding the algorithm name to the attributes would work if the
legalization step was run after we pick the strategy. It is instead run before,
so it is unaware of the strategy picked.
Maybe we could add a new pass that runs based on the strategy? Or we can hack
in `
So I mean to add a `convert_data_type` pass that is similar to
`alter_op_layout` but converts the datatype (e.g., something like `if
topi_impl == 'spatial_nhwc': convert to int16`).
This doesn't seem possible directly in the `alter_op_layout` because only the
shapes are passed to that funct
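A minimal sketch of what I have in mind, assuming a hypothetical `topi_impl`
argument (today neither this hook nor that argument exists in TVM):
```
from tvm import relay

# Hypothetical convert_data_type hook; `topi_impl` is an assumption,
# since the current legalization API does not know the chosen strategy.
def _convert_data_type_arm_cpu(attrs, inputs, types, topi_impl):
    if topi_impl != "spatial_nhwc":
        return None  # keep the original conv2d untouched
    data, kernel = inputs
    new_attrs = {k: attrs[k] for k in attrs.keys()}
    new_attrs["out_dtype"] = "int32"  # accumulate int16 products in int32
    return relay.nn.conv2d(
        relay.cast(data, "int16"), relay.cast(kernel, "int16"), **new_attrs
    )
```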
Hi @FrozenGene ,
I agree that different strategies should be available to the auto-tuner. See if
the solution proposed is good enough for you (at least as a temporary
work-around). For Armv7-A or NCHW, nothing changes, we follow exactly the
previous path.
For Armv8-A and NHWC we don't convert
Hi @FrozenGene ,
I gave it another go, but switching the legalization based on the strategy
seems very hard (since we would need the auto-tuner to pick the best data-type
for us).
So for now, we have to be content with the `_alter_conv2d_layout` workaround and
try to think a bit more on how we can infer th
@anijain2305 , thanks for the review! About getting rid of the legalization, I
would not do that for now. It is in my backlog to go back to this issue and try
to retrieve the strategy from the legalization pass. This should give us more
optimization options. If that turns out to be not possible,
Hi @FrozenGene ,
Thanks for the review!
I applied your changes, but I get a (seemingly) unrelated test failure.
Could you double-check, please, and let me know if this has anything to do
with my changes?
Thanks
It actually seems related to:
https://github.com/apache/incubator-tvm/issues/5827
Hi @FrozenGene , @anijain2305 ,
Any update on this review?
Also, is there a way to retrigger the tests? Or should I contact someone in
particular?
Thanks
Hi @tqchen ,
I will try to comment sporadically, since this is a project I prototyped (and
enjoyed :) ) when I was in Arm.
If I understand your comment correctly, what @MeeraN7 is doing is closer to
what you are proposing. Instead of transforming a loop into a Ramp, and passing
the ramp "as i
# Motivation
In the current state, TVM float32 performance for Armv8 architectures is
comparable to frameworks like TFLite (which we will use as a reference throughout
this RFC). However, our analysis shows that pre-quantized networks (i.e., when
data and/or weights are transformed from float32
# Introduction and motivation
Mathematically, the fixed point multiplication (FPM) can be described as:
`fpm(x,m,s) = round(x*m*2^(s-31))`
In this expression:
* `x` is the quantized value to multiply, and `m` and `s` [are an integer
multiplier and a shift](https://arxiv.org/pdf/1712.05877.pdf)
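A tiny Python reference of the expression above (a sketch only: round-to-nearest
via a half-divisor nudge, ignoring saturation and the negative-rounding corner
case handled by gemmlowp):
```
def fpm(x: int, m: int, s: int) -> int:
    # round(x * m * 2^(s - 31)): for s < 31 this is a right shift by
    # (31 - s); adding half of the divisor implements round-to-nearest.
    right_shift = 31 - s
    nudge = 1 << (right_shift - 1)
    return (x * m + nudge) >> right_shift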
Hi @anijain2305,
Both Arm and non-arm machines will use the same `fixed_point_multiply` relay
operator, which will have an injective schedule associated with it, calling
into `tvm.tir.fixed_point_multiply()`.
The only difference is how the `tvm.tir.fixed_point_multiply()` is implemented.
O
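For illustration, a usage sketch, assuming the operator is exposed in Relay as
`fixed_point_multiply(data, multiplier, shift)`:
```
from tvm import relay

# Multiply an int32 tensor by ~0.5: m = 2^30, s = 0 in the notation
# of the RFC (values are illustrative only).
x = relay.var("x", shape=(1, 64), dtype="int32")
y = relay.fixed_point_multiply(x, multiplier=1 << 30, shift=0)
```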
Hi @tqchen,
Thanks a lot for your comments.
Actually, I understand the first part of your comment, but I am afraid I don't
follow the rest :slight_smile:
Just to fully understand:
- About adding 0.5 (factor) to the bias, what do you mean? The bias is added
before the requantization (as an
Hi @anijain2305,
All correct, except that the problem about fusion is more related to the fact
that `qnn.conv2d` is lowered as a `nn.conv2d` followed by a `requantize`.
The best would be to fuse the requantization before the unpacking of the output
tensor (i.e., after the main compute node
Hi @kparzysz,
Yes, pattern matching seems hard; we should mark the given set of operations in
Relay (and use the group later).
That is why a middle-layer solution, i.e., implementing the fpm in topi rather
than tir, might be the right approach.
Hi @anijain2305,
Yes, they are fused together, but at the end.
`nn.conv2d` is usually implemented as three compute nodes: `pack+core+unpack`.
The requantization operator is fused after the `unpack`, while the best would
be to fuse after `core` (unpack can be hard to vectorize).
However, thi
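To make the three-node structure concrete, a minimal TE sketch (shapes, dtypes
and the interleaving factor of 4 are illustrative, not the actual conv2d
schedule):
```
from tvm import te

M, N, K = 64, 64, 64
A = te.placeholder((M, K), dtype="int8", name="A")
B = te.placeholder((K, N), dtype="int8", name="B")

# pack: interleave rows of A in blocks of 4
packed = te.compute((M // 4, K, 4),
                    lambda mo, kk, mi: A[mo * 4 + mi, kk], name="pack")
k = te.reduce_axis((0, K), name="k")
# core: the packed GEMM, accumulating in int32
core = te.compute(
    (M // 4, N, 4),
    lambda mo, n, mi: te.sum(
        packed[mo, k, mi].astype("int32") * B[k, n].astype("int32"), axis=k),
    name="core")
# unpack: back to [M, N]; requantize currently fuses after this node
C = te.compute((M, N), lambda m, n: core[m // 4, n, m % 4], name="unpack")
```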
Hi all,
In my effort to accelerate AArch64 through tensorization, I ran into an
issue.
Basically, I am padding my input tensor to let `tensorize` work (I need rows
to be a multiple of 4 and cols a multiple of 16).
However, bound inference removes the padding (since it is not used) and
Hi Animesh,
The problem is that I need padding added in the middle of TIR on my
(transformed) data tensor.
I.e., something like
```
A1 = im2col(A)
A2 = pad(A1)
C_padded = te.compute((M, N), lambda i, j: te.sum(A2[i, k] * B[k, j], axis=k))
C = unpad(C_padded) + requantization
```
Then I tile on `C` and tensorize o
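For reference, the padding step can be expressed with `topi.nn.pad`; a sketch
with hypothetical sizes:
```
from tvm import te, topi

M, K = 62, 30  # hypothetical sizes, not yet multiples of 4 and 16
A1 = te.placeholder((M, K), dtype="int8", name="A1")
# pad rows up to a multiple of 4 and cols up to a multiple of 16
A2 = topi.nn.pad(A1, [0, 0], [(-M) % 4, (-K) % 16], name="A2")
```
The catch described above is that if no consumer ever reads the padded region,
bound inference shrinks the computation back to the original extents, defeating
the fixed shapes `tensorize` relies on.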
## Motivation
In recent RFCs we successfully boosted convolution performance on native
Armv8-A architectures. When using Armv8.2-A and above ISAs, developers are
provided with a richer set of instructions, among which the dot-product
instruction `udot` (or `sdot`) can be particularly useful
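As a quick reminder of the semantics, each `udot` lane computes a 4-way dot
product of 8-bit values accumulated into one 32-bit lane; a NumPy model of a
single lane:
```
import numpy as np

a = np.array([1, 2, 3, 4], dtype=np.uint8)
b = np.array([5, 6, 7, 8], dtype=np.uint8)
acc = np.uint32(0)
# udot: widen to 32 bits, multiply pairwise, accumulate into one lane
acc += np.dot(a.astype(np.uint32), b.astype(np.uint32))
assert acc == 1 * 5 + 2 * 6 + 3 * 7 + 4 * 8  # 70
```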
cc @anijain2305, @FrozenGene, @ramana-arm
## Motivation
Currently `tvmc` will only produce a dynamic library version of the network,
i.e., an `.so` file stored alongside the other artifacts. This library is
usually dynamically linked to other applications.
With this change we want to add a flag to `tvmc` to get an object file (i.e.,
cc: @leandron, @ramana-arm
Hi @aca88,
The object file produced by `tvmc` does not necessarily include the C runtime.
Using a `--bare-metal` flag just refers to the fact that it is mostly useful on
a bare-metal target.
Anyway, to avoid confusion, I think maybe `--object-file` might be a better
choice :slight_smile:
Hi @tqchen,
`tvmc` saves the `.so`, `.params` and `.json` directly in the `.tar` file
it generates. This happens in `tvmc/compiler.py`. I might be wrong, but
probably this is because it doesn't want to store the `.c` files in the final
artifact (@leandron, can you confirm this?).
> From what I see, in `tvmc.compiler`, `export_library()` is called with a
> `mod.so` input.
I agree we could generate the `tar` file directly, but I think this was done to
avoid storing the `.c` files (@leandron will know more than me on this).
As for storing directly in the dylib, I am not
Hi all,
I am trying to improve quantized performance for memory-bound operators (e.g.,
depthwise or 1x1 convolutions with small shapes).
### Bottom line question
Is there any way we can know the strategy picked by the autotuner during the
legalization pass of a quantized convolution (qnn.co
cc @anijain2305 @ramana-arm @FrozenGene (we had this discussion before)
Thanks for the reply, @FrozenGene!
The signatures of the two functions are:
```
def _alter_conv2d_layout(attrs, inputs, types, out_type):
```
```
def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
```
While they look similar, `inputs` in `_alter_conv2d_layout` contains actual
`Tensor`s
I got a bit confused above, sorry. It is not about the `inputs` but about the
`tinfos`.
Just to avoid any additional confusion, I tried to print the types of the
interesting variables:
**conv2d_alter_op(attrs, inputs, tinfos, out_type)**
```
print(type(inputs[0]))
# ...
print(type(tinfos[0]))
```
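In other words (my reading of the API, so treat the exact classes as an
assumption), the asymmetry is:
```
from tvm import relay, te

# Inside conv2d_alter_op(attrs, inputs, tinfos, out_type):
#   inputs[0] -> a relay expression (relay.Expr)
#   tinfos[0] -> a te.Tensor placeholder carrying shape and dtype
# Inside _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types) there is
# no te.Tensor view: only relay expressions and their types.
def check(inputs, tinfos):
    assert isinstance(inputs[0], relay.Expr)
    assert isinstance(tinfos[0], te.Tensor)
```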
Hi @FrozenGene, @anijain2305
I can confirm that this works :partying_face:! Very good! Now we can implement
algorithms like QNNPACK and let the tuner try them together! Thanks to both of you!
As for the API change, I agree with @FrozenGene that maybe it would be cleaner
to add `tinfos` to the `
Hi @FrozenGene,
I think I see why we don't want to change the layout when there is no workload
(no workload means we don't even know the strategy, I think). What I am missing is
why we don't want to change the layout when `cfg.is_fallback`. In that case,
the strategy is defined, so we know how the weigh
## Introduction and motivation
This RFC is the third set of optimizations to enhance quantized convolution on
Arm architectures. To give a brief summary:
* Basic Armv8-A convolution implementation (through gemm):
https://discuss.tvm.apache.org/t/rfc-improve-quantized-convolution-performance-for-armv8-architectures/6920
cc: @anijain2305, @FrozenGene, @matt-arm, @ramana-arm
Maybe I am wrong, but are you sure that when `cfg.is_fallback` parameters like
`cfg['tile_co']` are not defined? We usually set them to some default values (I
think). But even if we don't set them, IIUC they will get "some" value among
the possible ones. Am I missing something?
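This is the behaviour I am referring to; an illustrative snippet against the
`ConfigSpace` API (my understanding of it, so the exact output is an
assumption):
```
from tvm.autotvm.task.space import ConfigSpace

# Querying a knob on an untuned space returns its first entity rather
# than raising, which is why cfg['tile_co'] always yields "some" value.
cfg = ConfigSpace()
cfg.define_split("tile_co", cfg.axis(64), num_outputs=2)
print(cfg["tile_co"])
```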
Hi all,
I am trying to understand the role of the LLVM auto-vectorizer in TVM. Indeed,
in `llvm_codegen.cc` we explicitly set:
```
builder.LoopVectorize = true;  // enable LLVM's loop vectorizer
builder.SLPVectorize = true;   // enable the SLP (superword-level) vectorizer
```
And I am trying to determine to what extent TVM relies on LLVM
auto-vectorization.
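To make the question concrete, here is the contrast I have in mind: explicit
TVM vectorization (which lowers the inner loop to Ramp-based vector operations)
versus leaving a scalar loop for LLVM's `LoopVectorize`/`SLPVectorize` passes:
```
import tvm
from tvm import te

n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=8)
s[B].vectorize(xi)  # explicit: TIR now contains width-8 Ramp loads/stores
print(tvm.lower(s, [A, B], simple_mode=True))
# without vectorize(xi), the loop stays scalar in TIR and we rely on
# LLVM's auto-vectorizer to (maybe) vectorize it
```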
### Wh
Hi @comaniac,
May I ask how the graph ends up with a `nn.conv2d + nn.relu + nn.conv2d +
nn.relu` ? Is the graph going through a BYOC kind of partitioning (sorry if the
question is naive)?
As for S1 vs S2, could we do both? Use a heuristic like "ignore the task
without any call node" and th
Hi Andrew,
> for AOT runtime I agree we do not need JSON parsing or any of the underlying
> facilities it brings. However, given it seems like you’re planning to reuse
> the C-runtime memory allocator and interfaces in include/tvm/crt/platform.h,
> I think it would be great to continue using
Hi all,
I was finally able to get a first version of the AOT work into a PR upstream.
## PR
You can find the PR here: https://github.com/apache/tvm/pull/7785
At this stage, I gladly accept any feedback on things that can be improved in
the PR or on issues I might have overlooked. Please, help
Hi all,
I just published the AOT PR upstream: https://github.com/apache/tvm/pull/7785.
It has some conflicts probably due to the `CompileEngine` refactoring, and I
will fix that soon. I just wanted to let you guys start having a look.
@stoa I am wondering how much of your work can use the A
Also, a side comment: I will be out for Easter holidays until Tuesday (so I
will be replying to any comments as soon as I come back :slight_smile: )
FYI: I will be out for Easter holidays until Tuesday (so I will be replying
to any comments as soon as I come back :slight_smile: )
Hi all,
Thanks for the interesting discussion! So, we all agree that there are three
points here:
* Backend API
* Calling convention
* Runtime API
As things stand today, memory allocation is part of the backend API. This will
change with global memory planning, but for now I would tend to ski