[Apache TVM Discuss] [Development/RFC] [RFC] TensorIR: A schedulable IR for TVM

2020-09-22 Thread Lianmin Zheng via Apache TVM Discuss


@jcf94 @junrushao1994 Sorry, neither of you understood my question correctly.

I mean that the original TE is a declarative language, so it knows all 
transformations before it starts to generate the low-level AST, whereas the new 
schedule primitives are applied imperatively. In the original TE we can share 
some analysis results (e.g. dependency analysis), so it is expected to be faster.
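
For illustration (a toy sketch, not from the original message): in TE, primitives like `split` and `reorder` only record transformations on the schedule, and the low-level AST is generated once, at lowering time, when all transformations are known.

```python
import tvm
from tvm import te

n = 128
A = te.placeholder((n, n), name="A")
B = te.compute((n, n), lambda i, j: A[i, j] * 2, name="B")

s = te.create_schedule(B.op)
io, ii = s[B].split(B.op.axis[0], factor=32)   # only recorded on the schedule
s[B].reorder(io, B.op.axis[1], ii)             # still no AST generated
mod = tvm.lower(s, [A, B])                     # analysis and lowering happen here, in one pass
```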





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-tensorir-a-schedulable-ir-for-tvm/7872/37) to respond.


[Apache TVM Discuss] [Development/RFC] [RFC] Enable TVM QNN on RISC-V with Subword SIMD Computation

2020-09-22 Thread yrchen via Apache TVM Discuss


Hello, we're a team from NTHU (National Tsing-Hua University), Taiwan. Our team 
mainly focuses on supporting TVM on RISC-V architectures with SIMD instructions. 
In this RFC, we target the RISC-V P extension (RVP), the extension for RISC-V DSP 
and subword SIMD. Note that a preliminary version of this work was presented at 
the RISC-V Global Forum on Sep. 3, 2020, in the lightning talk session ([video 
link](https://www.youtube.com/watch?v=1nCn619cXJw&list=PL85jopFZCnbNDtFbl72oU0_8vANrljnh7&index=7)).

## Intro of RISC-V P extension (RVP)
RISC-V is an open-source ISA with multiple extensions for different application 
needs. For vector computation, RISC-V provides the "V" and "P" extensions to 
support superword SIMD and subword SIMD, respectively. Here we target RVP, which 
is designed for embedded processors and DSP-like devices. All computation uses 
general-purpose registers (32 or 64 bits) with lower-precision numeric types such 
as fixed point and integer. Our previous work presented at the TVM conference 
(linked below) describes a fixed-point flow. Since TVM already provides a QNN 
flow, we now devise a TVM QNN flow for the RISC-V P extension, which makes our 
work more compatible with the existing TVM flow.

Our previous work on the fixed-point flow for RVP in TVM is listed below.
- Supporting TVM on RISC-V Architectures with SIMD computation ([video link](https://youtu.be/7-EaUUC6QZs?list=PLTPQEx-31JXjA2ZmvYT5s0RqDXFXTSjyL&t=2078))

The specification of RISC-V P extension is as follows.
- https://github.com/riscv/riscv-p-spec/blob/master/P-ext-proposal.adoc

## Motivation
While looking for an application well suited to RVP, we found QNN to be the best 
fit for us. In particular, in the pre-quantized flow from the QNN dialect, most 
ops are in int8/uint8 or int32, which makes them suitable for subword SIMD 
computation. TVM upstream currently has no RISC-V support in the TOPI 
implementations and no RISC-V-specific schedule handling, so we want to propose 
our work, which mainly focuses on enabling tensorization for `conv2d_nchw_int8` 
and vectorization for ops in int32.

## Approach
### Outline
1. New target: `riscv_cpu`
2. Introduce an intrinsic for the dot product in convolution, and enable it via 
tensorization.
3. Vectorize ops to generate SIMD patterns (e.g. add).
4. Introduce a custom runtime to easily generate executable files for the Spike 
simulator.
5. Run Spike to get the result.

### RISC-V Target
- Register a new TVM target: `riscv_cpu`
- Use LLVM as our target backend with `--mtriple=riscv64-unknown-elf 
--system-lib` (a build sketch follows below)
- Add `codegen_riscv.cc` as a RISC-V-specific code generator
- Register `target_riscv32` and `target_riscv64`
- Register the strategy; specially handle the schedule for `conv2d_nchw_int8` 
with tensorize
- Reuse x86's compute/schedule for the other ops
- Since Spike doesn't support parallel computing, we use an empty schedule for 
`schedule_injective()`, except for ops that will be vectorized
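
For illustration, a minimal build sketch under the assumptions of this RFC (the target options come from the bullets above; the `riscv_cpu` strategy registration is part of the proposal, not upstream TVM):

```python
import tvm
from tvm import relay

# A small int8 conv2d module, standing in for a pre-quantized QNN graph.
x = relay.var("x", shape=(1, 3, 224, 224), dtype="int8")
w = relay.var("w", shape=(16, 3, 3, 3), dtype="int8")
y = relay.nn.conv2d(x, w, kernel_size=(3, 3), channels=16, out_dtype="int32")
mod = tvm.IRModule.from_expr(relay.Function([x, w], y))

# Target options taken from this RFC; how they are exposed (plain LLVM options
# vs. a registered "riscv_cpu" target) is part of the proposal.
target = tvm.target.Target("llvm --mtriple=riscv64-unknown-elf --system-lib")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target)
```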

### Intrinsic for dot product
In order to execute convolution efficiently, we propose to use the following 
instructions in RVP:
- **smaqa**: [Signed Multiply Four Bytes with 32-bit 
Adds](https://github.com/riscv/riscv-p-spec/blob/master/P-ext-proposal.adoc#5109-smaqa-signed-multiply-four-bytes-with-32-bit-adds)
- **smaqa.su**: [Signed and Unsigned Multiply Four Bytes with 32-bit 
Adds](https://github.com/riscv/riscv-p-spec/blob/master/P-ext-proposal.adoc#5110-smaqasu-signed-and-unsigned-multiply-four-bytes-with-32-bit-adds)
  - with `helper_change_dtypes_to_uint8_int8()` from the x86 legalize flow, we 
can get uint8 x int8

The instructions above accumulate the products directly into 32 bits. This saves 
the effort of storing intermediate results in 16 bits and preserves accuracy 
compared with an 8-bit SIMD multiply. `int32_lanes` is fixed at **2** since the 
maximum register length in RVP is 64 bits. The whole dot product is done in one 
instruction, with plenty of subword parallelism.
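
As a rough scalar model (sketched here from the linked spec, not code from this RFC), one `smaqa` on a 64-bit register accumulates two lanes of four-byte dot products:

```python
# Hedged scalar model of smaqa on a 64-bit register: two 32-bit lanes
# (int32_lanes = 2), each accumulating a four-way int8 dot product
# (num_int8_elements = 4).
def smaqa_model(acc, a, b):
    """acc: two int32 accumulators; a, b: eight int8 values each."""
    for lane in range(2):
        for e in range(4):
            acc[lane] += a[lane * 4 + e] * b[lane * 4 + e]
    return acc
```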

The intrinsic function is declared as:
```python
# num_int8_elements = 4
# int32_lanes = 2

def _intrin_func(ins, outs):
    def _instr(index):
        ib = tvm.tir.ir_builder.create()
        if index == 1:
            # Reset the int32x2 accumulator.
            ib.emit(outs[0].vstore(0, tvm.tir.const(0, 'int32x%d' % int32_lanes)))
            return ib.get()

        dtype_a = '%s8x%d' % (data_dtype, num_int8_elements)
        dtype_b = '%s8x%d' % (kernel_dtype, int32_lanes * num_int8_elements)
        dtype_c = 'int32x%d' % int32_lanes

        a_int8 = ins[0].vload([0], dtype_a)
        re_int32 = tvm.tir.call_intrin('int32', 'tir.reinterpret', a_int8)
        vec_ai32 = re_int32.astype(dtype_c)

        vec_a = tvm.tir.call_intrin(dtype_b, 'tir.reinterpret', vec_ai32)
        vec_b = ins[1].vload([0, 0], dtype_b)

        # Call intrinsic for RVP
        d_dtype = 's' if data_dtype == 'int' else 'u'
        k_dtype = 's' if kernel_dtype == 'int' else 'u'
        if d_dtype == 'u' and k_dtype == 's':
            inst = 'llvm.riscv.simd.%s%sdot.v%di32
```

[Apache TVM Discuss] [Development/RFC] [RFC] TensorIR: A schedulable IR for TVM

2020-09-22 Thread Bohan Hou via Apache TVM Discuss


[quote="merrymercy, post:37, topic:7872"]
I mean that the original TE is a declarative language, so it knows all 
transformations before it starts to generate the low-level AST, whereas the new 
schedule primitives are applied imperatively. In the original TE we can share 
some analysis results (e.g. dependency analysis), so it is expected to be faster.
[/quote]

@merrymercy Good question! Here's an example of TIR's schedule.
```python
s = tir.create_schedule(original_func)

update = s.get_block("C")
i, j, k = s.get_axes(update)
i_o, i_i = s.split(i, bn)
j_o, j_i = s.split(j, bn)
k_o, k_i = s.split(k, 4)
s.reorder(i_o, j_o, k_o, k_i, i_i, j_i)
```

TIR's schedule is not totally stateless. Scope info and the dependency graph are 
actively maintained in the Schedule class during the scheduling process, so we 
don't recalculate them each time we apply a new primitive. After lowering to TIR 
without blocks, we no longer maintain this info, since the function is no longer 
schedulable.

All in all, it would be good to run benchmarks to compare them in practice. I 
hope I understand your question correctly. :smile:





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-tensorir-a-schedulable-ir-for-tvm/7872/38) to respond.


[Apache TVM Discuss] [Development/RFC] [RFC] Enable TVM QNN on RISC-V with Subword SIMD Computation

2020-09-22 Thread Cody H. Yu via Apache TVM Discuss


Thanks for the RFC!
While I'm not familiar with current RISC-V applications, I'm curious about the 
purpose of running the Spike simulator and what the usual next step after it 
would be.

I also have some questions/thoughts about the implementation. In general, I'm 
wondering whether it would be better to integrate this flow via BYOC to provide 
more flexibility and opportunities for future heterogeneous execution.

1. I suppose Spike simulates a general-purpose processor, meaning that it should 
be able to execute any operator.

2. You mentioned using LLVM as the backend. How does this LLVM backend overlap 
with the current TVM LLVM backend? Will you reuse most of it, or are you 
essentially building another backend with LLVM?

3. I didn't quite get the point of "since Spike doesn't support parallel 
computing, we use an empty schedule for `schedule_injective()`, except for ops 
that will be vectorized". Does that mean you still have schedules for the ops 
that can be vectorized? If so, do we need someone to write schedules for 
RISC-V P on Spike in TOPI?

4. In terms of the runtime, the current TVM graph runtime includes several 
modules, such as the metadata module and external runtime modules (for the BYOC 
case). Where would your custom runtime fit in?

cc @zhiics





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-enable-tvm-qnn-on-risc-v-with-subword-simd-computation/7967/2) to respond.


[Apache TVM Discuss] [Development] Strassen Algorithm for Dense

2020-09-22 Thread zj via Apache TVM Discuss


Oh, I see, thanks for your kind reply.





---
[Visit Topic](https://discuss.tvm.apache.org/t/strassen-algorithm-for-dense/2661/17) to respond.


[Apache TVM Discuss] [Development/RFC] [RFC] Enable TVM QNN on RISC-V with Subword SIMD Computation

2020-09-22 Thread Andrew Reusch via Apache TVM Discuss


thanks @yrchen and colleagues for the RFC! overall it's very exciting work. a 
couple of thoughts:
- is your eventual target bare-metal devices, or does your runtime require a 
kernel?
- `riscv_cpu` target: in the past we had introduced a special `micro_dev` target 
for µTVM work. recently, we deprecated that in favor of the `llvm` and `c` 
targets. then, when creating the list of candidate schedules for a given op, we 
(for ARM) analyze the ISA supported by the CPU given in `-mcpu`. is it possible 
to do something similar with RISC-V (i.e. encode the P extension in some flag 
like `-mcpu=rv32p`)?
- LLVM support for riscv P extension, and codegen: since you will need to build 
TVM against a forked LLVM, is it possible to use the `c` backend for any tests 
in the CI, until LLVM formally supports RISC-V P? it could be possible then to 
include a forked llvm compiler in one of the CI docker images, but still 
compile TVM against mainline LLVM. you could take a look at the [GEMM 
impl](https://github.com/apache/incubator-tvm/blob/master/python/tvm/topi/arm_cpu/cortex_m7/micro_kernel/gemm.py)
 for cortex-m7 as an example of how to do that.
- RISC-V custom runtime: your sample `host.cpp` link was broken, but is it the 
one 
[here](https://github.com/nthu-pllab/RISCV-DLR/blob/master/example/pre_quant_mobilenet_v1_tflite/host.cpp)?
 I'm also beginning to look at AOT compilation, which looks somewhat similar to 
your `kernel.inc` code (but would be generated from TVM). there are some 
additional considerations such as memory planning that may depend more on the 
device layout. do you have a full example of the `kernel.inc` anywhere I could 
look at?
 - looks like the function signatures in your DLR differ from the typically 
generated signature: 
```
typedef int (*TVMBackendPackedCFunc)(TVMValue* args, int* type_codes, int num_args,
                                     TVMValue* out_ret_value, int* out_ret_tcode,
                                     void* resource_handle);
```
seems like the main difference between this func and the `DLR` func is the lack 
of the out_* and resource_handle params?
 - did you try using the new µTVM RPC server-based runtime with spike? this 
would allow you to use the graph runtime in the TVM python binary and perform 
autotuning. would it be possible to use that to submit the schedules as one PR 
and then split any runtime changes into another? we modified the 
[micro_tflite](https://tvm.apache.org/docs/tutorials/micro/micro_tflite.html) 
tutorial to demonstrate one use of that runtime.
 - I don't quite understand your evaluation numbers. are these measured over a 
fixed time period? otherwise, it seems like there should be fewer instructions 
executed using the intrinsic for one inference run, correct?
- what is your plan for upstreaming binutils and riscv-isa-sim work?
- for testing in CI, would we need to build a spike docker image?





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-enable-tvm-qnn-on-risc-v-with-subword-simd-computation/7967/3) to respond.


[Apache TVM Discuss] [Development/RFC] [RFC] Enable TVM QNN on RISC-V with Subword SIMD Computation

2020-09-22 Thread tqchen via Apache TVM Discuss


Thanks for the great discussions. I agree that it would be really nice to use 
the µTVM RPC runtime with Spike in place of the custom runtime.





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-enable-tvm-qnn-on-risc-v-with-subword-simd-computation/7967/4) to respond.


[Apache TVM Discuss] [Development/RFC] [RFC] Differentiable tensor expression (Create and verify backward op automatically)

2020-09-22 Thread Yizhi Liu via Apache TVM Discuss


[quote="wrongtest, post:3, topic:7960"]
If I have some common neural network structure such as resnet50 at hand, can I 
just use autodiff to get backward computation graph?
[/quote]
graph-wise I think you can refer to 
[relay.transform.gradient](https://github.com/apache/incubator-tvm/blob/master/python/tvm/relay/transform/transform.py#L713), 
and as you lower the differentiated graph, you may leverage the tensor-level 
autodiff 
([te.gradient](https://github.com/apache/incubator-tvm/blob/master/python/tvm/te/autodiff.py#L22)). 
Though tensor gradients are currently mostly manually written.
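
For reference, a minimal sketch of calling the tensor-level autodiff on a toy matmul (an illustrative example, not from the original post):

```python
import tvm
from tvm import te

# Differentiate a small matmul Y = A @ W with respect to W.
A = te.placeholder((8, 8), name="A")
W = te.placeholder((8, 8), name="W")
k = te.reduce_axis((0, 8), name="k")
Y = te.compute((8, 8), lambda i, j: te.sum(A[i, k] * W[k, j], axis=k), name="Y")

# te.gradient returns one adjoint tensor per requested input;
# the seed gradient (`head`) defaults to ones when omitted.
[dW] = te.gradient(Y, [W])
s = te.create_schedule(dW.op)
mod = tvm.lower(s, [A, W, dW])
```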

[quote="wrongtest, post:3, topic:7960"]
Is there some description about common ops which can be coveraged by autodiff?
[/quote]
You may refer to [test 
cases](https://github.com/apache/incubator-tvm/blob/master/tests/python/unittest/test_te_autodiff.py)

[quote="wrongtest, post:3, topic:7960"]
Can te.scan() be supported?
[/quote]
currently it is not supported.





---
[Visit Topic](https://discuss.tvm.apache.org/t/rfc-differentiable-tensor-expression-create-and-verify-backward-op-automatically/7960/5) to respond.


[Apache TVM Discuss] [Development/RFC] Linalg support for matrix determinant and inverse

2020-09-22 Thread declmal via Apache TVM Discuss


I wonder whether TVM supports the following operators:

1. matrix determinant (`linalg_det`).

2. matrix inversion (`linalg_inverse`).

I searched the TOPI and Relay libraries but failed to find these operators, and 
also failed to come up with an op-level equivalent transformation based on the 
current op inventory.

PS.
In my research field, I need to build customized computation graphs that include 
`linalg_det` and `linalg_inverse`, and I need to deploy the model on portable 
devices for online computation.





---
[Visit Topic](https://discuss.tvm.apache.org/t/linalg-support-for-matrix-determinant-and-inverse/7973/1) to respond.