Hi, all~

This RFC is to upstream support for our TY-NNP accelerator backend. We are 
from the AI accelerator toolchain team of 
[Intellifusion](https://www.intellif.com/), which focuses on developing 
vision processors that accelerate deep neural networks for visual recognition 
and search, both on endpoint devices such as IP cameras and robots and in the cloud.

TVM has become the most important component in our AI software stack, and we 
would like to upstream our work back to the community. We believe that 
participating in the open-source ecosystem will benefit both our internal 
software infrastructure and our customers!

# Overall architecture

TY-NNP refers to the neural network accelerator architecture serving a wide 
range of our edge AI scenarios. TY-NNP follows a typical NPU design, offloading 
neural network computation workloads to various kinds of domain-specific 
computing units. Generally, there are three kinds of computing units:

* NU (neural units)

    NU is designed for high-throughput computation of typical neural-network 
workloads such as Conv/Matmul. Compared to TensorCores in NVIDIA GPUs, NU works in a 
coarse-grained fashion from a software perspective. Instead of 
software programming of fine-grained M * N * K mma intrinsics, NU provides 
CISC-style instructions and a bundle of hardware configurations to developers. 
The NU components automatically load input/weight data from input buffers, 
execute fine-grained mma operations with hardware tiling control, and store 
results to output buffers.

    In TVM, we program NU with customized TIR intrinsics. Developers use 
schedules to lower the specified computation patterns to NU intrinsics, arrange 
the on-chip input/output buffers, and tune to determine the best 
hardware configurations (see the sketch after this list).

* VU (vector units)

    VU accelerates general computation workloads that do not fit NU. TY-NNP 
provides a set of on-chip VU cores, each with its own on-chip buffer (called 
VM), a set of vectorized/scalar function units, and physical registers. VU 
programming is just like general vectorized programming on CPUs.

    In TVM, to offload computation to VU, developers schedule the 
computations into a vectorizable form, arrange the on-chip input/output buffers, 
and mark the proper computation axis with `vectorize` or replace it with VU 
intrinsics.

* CU (control units)

    CU can be seen as a small on-chip core and does not provide high 
computation ability. It controls the on-chip execution flow, and the 
whole on-chip kernel execution starts from CU.
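
Below is a minimal TensorIR sketch of the NU/VU scheduling styles described above, using a toy elementwise PrimFunc. The intrinsic name `ty_nnp_nu_mma`, the tensorization granularity, and the split factor are illustrative assumptions, not the actual TY-NNP details:

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def add(A: T.Buffer((128, 128), "float32"),
        B: T.Buffer((128, 128), "float32"),
        C: T.Buffer((128, 128), "float32")):
    for i, j in T.grid(128, 128):
        with T.block("add"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = A[vi, vj] + B[vi, vj]

sch = tvm.tir.Schedule(add)
block = sch.get_block("add")
i, j = sch.get_loops(block)
# VU style: expose an innermost vectorizable axis and mark it for the vector units.
jo, ji = sch.split(j, factors=[None, 8])
sch.vectorize(ji)
# NU style (for Conv/Matmul blocks) would instead replace the computation body with
# a coarse-grained intrinsic, e.g. sch.tensorize(loop_or_block, "ty_nnp_nu_mma").
```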

TY-NNP adopts an explicitly managed memory hierarchy: each computing unit has 
its own buffer, and there is a global on-chip buffer (called DM) used to transfer 
data between units. Data transfers are done explicitly by asynchronous DMA 
operations, and explicit/implicit synchronizations are used to avoid hazards. In 
TVM, DMA and synchronization are also represented by TIR intrinsics.
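
For illustration only, here is roughly how an explicit DMA transfer plus synchronization could appear at the TIR level, written with generic extern calls; the names `ty_nnp_dma_copy`/`ty_nnp_sync` and the placement of `A_dm` are placeholders, since the actual intrinsics and on-chip storage scopes belong to the backend passes:

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def dma_example(A: T.Buffer((1024,), "float32"),
                A_dm: T.Buffer((1024,), "float32")) -> None:
    # In the real flow, A_dm would live in an on-chip scope such as DM;
    # here both buffers are plain placeholders in global memory.
    T.evaluate(T.call_extern("int32", "ty_nnp_dma_copy", A_dm.data, A.data, 1024))
    T.evaluate(T.call_extern("int32", "ty_nnp_sync"))
```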

An off-chip storage (called DDR) is used to transfer data between host and 
device; it provides much larger space than the on-chip buffers and supports dynamic 
memory allocation. In TVM, DDR storage simply corresponds to the storage 
scope `kGlobal` and is managed by the runtime.

# Implementation design

The current TVM compilation stack for TY-NNP is as follows:

### Relay level

* We use a fusion pass based on a dedicated hardware cost model. Beyond 
traditional heuristic-based fusion for `conv-bn-relu`-like patterns, it 
performs a much more aggressive strategy that merges multiple anchor ops (such as conv) 
into a single device kernel. This brings opportunities to schedule multiple 
anchor ops simultaneously, which we think is essential to saturate our NPU 
hardware (a small example follows this list).
* A schedule-aware layout rewrite mechanism is added. Our TIR schedule phase 
rewrites tensor layouts to fit hardware features, so we modify the compile 
engine to allow compatible layout updates at the Relay level.
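
To make the fusion behavior concrete, consider the hypothetical Relay snippet below with two consecutive convolutions. With the standard heuristic `FuseOps` pass, each conv becomes the anchor of its own primitive function, whereas our cost-model-based fusion may group both convs (and the elementwise op between them) into one device kernel so they can be scheduled together:

```python
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 16, 56, 56))
w1 = relay.var("w1", shape=(16, 16, 3, 3))
w2 = relay.var("w2", shape=(16, 16, 3, 3))
y = relay.nn.conv2d(data, w1, padding=(1, 1))
y = relay.nn.relu(y)
y = relay.nn.conv2d(y, w2, padding=(1, 1))
# Both convolutions (plus the relu) may be fused into a single TY-NNP kernel.
mod = tvm.IRModule.from_expr(relay.Function([data, w1, w2], y))
```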

### TIR level

A key difference from the current CPU/GPU design is that we try to 
schedule and tune blocks of multiple ops. On a GPU device it is fine to compute a 
single heavy op per kernel, but we think an NPU prefers launching a 
block of consecutive ops to avoid frequent kernel launches. The proposed 
fusion pass described above is a way to achieve this.

Also, since the main efforts of the TVM community are on CPU/GPU backends, there 
are still pain points when developing TIR support for NPU-style backends, and it 
took some effort to make everything work through the standard schedule -> lower flow.

* We use the [TensorIR schedule](https://discuss.tvm.apache.org/t/rfc-tensorir-a-schedulable-ir-for-tvm/7872) 
to schedule the computations. **As far as we know, this is the first trial of the 
TensorIR schedule on NPU infrastructure.**
* A set of new schedule primitives is added to utilize hardware features.
* A set of new TIR passes is added to utilize hardware features.
* We use the `device_scope` attribute to mark the kernel part of the code. The 
community's host-dev split mechanism works well for us.

### Target level

* For codegen, we developed `class CodeGenTYNNPLLVM : public CodeGenLLVM`.
* For runtime, we developed `class TYNNPDeviceAPI : public DeviceAPI`.

# How to run

### Dependencies

The TY-NNP backend depends on the following prebuilt binaries:

1. LLVM libraries with TY-NNP target support
2. TY-NNP assembler
3. TY-NNP driver libraries with integrated simulator

They will be made available as part of the upstreaming. We would also be more 
than glad to provide Docker environments for anyone interested in our hardware.

### Playing

All dependencies are integrated into the codegen and runtime, so users can use 
the general interfaces in the normal way, with only two extra CMake options:

```cmake
# enable TY-NNP support in config.cmake
set(USE_TYNNP ${path to TY-NNP toolchains})
set(USE_LLVM ${path to llvm-config of TY-NNP target support})
```

```python
# test from tir
with ty_nnp.build_config():  # customized pass context
    dev = tvm.ty_nnp(0)
    a = tvm.nd.array(a_np, dev) 
    b = tvm.nd.array(b_np, dev) 
    f = tvm.build(primfunc, target="ty-nnp")
    f(a, b)
```

```python
# test from relay
with ty_nnp.build_config():  # customized pass context
    dev = tvm.ty_nnp(0)
    a = tvm.nd.array(a_np, dev)
    lib = relay.build(relay_module, target="ty-nnp")
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input(0, a)
    m.run()
    b = m.get_output(0)
```

### CI Integration

Although we have full-scenario tests in our internal repositories, it 
would be great if some key features (e.g., the conv op) could be covered by the 
community CI. We could provide Docker images that enable the backend testing 
environment. Any detailed suggestions for CI integration are very welcome!

# What we want to contribute

Currently, our backend code lives in `contrib` of the corresponding code directories:

* c++: `src/contrib/ty_nnp` (except codegen/runtime)
* python: `python/tvm/contrib/ty_nnp`
* unittests: `tests/python/contrib/ty_nnp`

The contributions can be summarized in the following aspects:

### TY-NNP codegen and runtime

The runtime is in `src/runtime/contrib/ty_nnp` and the LLVM codegen is in 
`src/target/ty_nnp`.

* This will introduce a new device type `kDLTYNNP` and a new target name 
`TY-NNP`. The corresponding codegen/runtime code is incremental and does not 
affect upstream source code.
* A set of new `StorageRank` enums has to be added to specify the different 
on-chip buffer types. We would be glad to know the best way to define this kind 
of target-related information.

### TIR optimizations on TY-NNP target

TIR code is mainly in `src/contrib/ty_nnp/tir`.

* This will introduce a set of backend TIR passes for TY-NNP hardware features, 
such as DMA intrinsics, synchronization, static address allocation, etc. 
They are designed for our hardware only. Users call `ty_nnp.build_config()` to 
get the specific pass context (a sketch follows this list).
* In the `tvm.build` process, we introduce more flexible configurations, such as 
disabling standard passes that are incompatible with ours.
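
As a rough sketch, we assume `ty_nnp.build_config()` wraps a standard `PassContext` that registers the target-specific passes and disables incompatible standard ones; the disabled pass name and config dict below are illustrative placeholders, not the real pass pipeline:

```python
import tvm

def build_config_sketch(opt_level=3):
    # Hypothetical stand-in for ty_nnp.build_config().
    return tvm.transform.PassContext(
        opt_level=opt_level,
        disabled_pass=["tir.VectorizeLoop"],  # placeholder for incompatible passes
        config={},  # target-specific pass configuration would go here
    )

with build_config_sketch():
    pass  # tvm.build(...) would run under this pass context
```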

### TensorIR schedule proposal

* We would like to introduce a set of new schedule primitives
  
   * **Imperative loop partition**

     Users can either partition the loops and blocks immediately at the schedule 
phase, or lazily perform it in the `loop_partition` pass. This helps a lot in 
non-perfect tiling cases or where boundary conditions are not directly 
supported by the hardware.

      ```python
      _, _, h_axis, w_axis, _ = s.get_loops(block)

      # imperative
      partitioned = s.loop_partition([h_axis, w_axis], lazy=False)
      # partitioned is a tree-structured data structure tracing the partitioned blocks
      my_visit(partitioned)

      # lazy, only a hint tag is added
      s.loop_partition([h_axis, w_axis], lazy=True)
      ```
    
   * **Buffer/loop primitives duality**

      TVM already provides very convenient primitives for loops. However, 
it would be great to explicitly manage memory order as well as computation 
order. We believe that for many NPU scenarios it is essential to control the data 
layout of on-chip memory buffers. TensorIR can control buffer dim alignment, 
but that is not enough: on-chip buffers local to NPU-specific function 
units (imagine TensorCore) can take totally different memory layouts. This would also 
benefit any architecture with a manageable memory hierarchy.

      Just like we get nested loops by `get_loops(block)`, we design dual 
interfaces to get buffer axes, such as `get_write_buffer_axes(block, write_idx)`, and 
conduct buffer layout scheduling on these axes. Below is a table listing the 
primitive duality, where the bolded entries are the proposed new primitives (a usage 
sketch follows this list):

      | Loop schedule | Buffer schedule |
      | ------ | ------ |
      | get_loops | **get_write_buffer_axes**, **get_read_buffer_axes** |
      | split | **buffer_split** |
      | fuse | **buffer_fuse** |
      | reorder | **buffer_reorder** |
      | **loop_extent_align** | buffer_dim_align |

* Accommodated scheduling and tuning, mainly in `python/tvm/contrib/ty_nnp/topi`.

    Currently, the schedule/tuning logic is designed for our hardware features 
only. However, we are very interested in whether there are common methodologies 
for such NPU schedule designs. We would like to refine our code into more 
general schedule/tuning support in the TensorIR modules if such opportunities 
exist!
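
For illustration, here is a usage sketch of the proposed buffer-schedule primitives, mirroring the loop-schedule workflow and reusing the schedule object `s` from the example above; the API names follow the table, but the exact signatures and the `"conv"` block name are illustrative:

```python
block = s.get_block("conv")
# Dual to get_loops(): obtain schedulable axes of the block's first write buffer.
b_n, b_c, b_h, b_w = s.get_write_buffer_axes(block, 0)
# Dual to loop split: split the channel axis of the buffer layout.
b_co, b_ci = s.buffer_split(b_c, factor=16)
# Dual to loop reorder: rearrange buffer axes into an NCHWc-like on-chip layout.
s.buffer_reorder(b_n, b_co, b_h, b_w, b_ci)
```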

### Relay accommodation

Mainly in `python/tvm/contrib/ty_nnp/relay` and `src/contrib/ty_nnp/relay`.

As described in the implementation design:

* Currently, our fusion pass depends on hardware-specific cost models. We would 
like to refine our code into an auto-fusion framework with third-party cost 
models if possible.
* Schedule-aware layout rewrite transformation: we add a Relay pass that performs 
a "pre-schedule" to determine the best data/weight layouts, and then rewrites 
the Relay-level layouts according to the signature of the PrimFunc. Currently, 
we have to hack the compile engine to find the pre-scheduled PrimFunc from a 
standalone cache; we would be glad to know the best way to achieve this.
* To utilize the scheduling described above, we propose to insert a 
customization point into the compile engine, which could be different from the 
fallback schedule, auto-schedule, and meta-schedule.
* We add some customized Relay ops, such as `sum_pool2d`, and would be glad to add 
them as standard Relay ops if they are generally useful.

# Summary

* We implemented the TY-NNP runtime and codegen. They are introduced as standalone 
modules behind the `USE_TYNNP` compile option.
* We integrate TensorIR (and corresponding Relay adaptations) to perform scheduling 
and optimization for our target. This will introduce some adaptations and new 
features to the upstream code. Perhaps we should split them into standalone 
PRs/RFCs?

Thanks for your attention; any suggestions or comments would be 
appreciated. We would be proud to contribute consistently as part of the community.




