Hi, all~

This RFC is to upstream support for our TY-NNP accelerator backend. We are 
from the AI accelerator toolchain team of 
[Intellifusion](https://www.intellif.com/), which focuses on developing 
vision processors that accelerate deep neural networks for visual recognition 
and search, both on endpoint devices such as IP cameras and robots and in the cloud.

TVM has become the most important component in our AI software stack, and we 
would like to upstream our work back to the community. We believe that 
participating in the open-source ecosystem will benefit both our internal 
software infrastructure and our customers!

# Overall architecture

TY-NNP refers to the neural network accelerator architecture serving a wide 
range of our edge AI scenarios. TY-NNP follows a typical NPU design, offloading 
neural network computation workloads to various kinds of domain-specific 
computing units. Generally, there are three kinds of computing units:

* NU (neural units)

    NU is designed for high-throughput computation of typical neural-network 
workloads such as Conv/Matmul. Compared to TensorCores in NVIDIA GPUs, NU works in a 
coarse-grained fashion from a software perspective. Instead of 
software programming of fine-grained M * N * K mma intrinsics, NU provides 
CISC-style instructions and a bundle of hardware configurations to developers. 
The NU components automatically load input/weight data from input buffers, 
execute fine-grained mma operations with hardware tiling control, and store 
results to output buffers.

    In TVM, we program NU with customized TIR intrinsics. Developers use 
schedules to lower the specified computation patterns to NU intrinsics, arrange 
the on-chip input/output buffers, and tune to determine the best 
hardware configurations (see the sketch after this list).

* VU (vector units)

    VU accelerates general computation workloads that do not fit NU. TY-NNP 
provides a set of on-chip VU cores, each with its own on-chip buffer (called 
VM), a set of vectorized/scalar function units, and physical registers. VU 
programming is just like general vectorized programming on CPUs.

    In TVM, to offload computation to VU, developers schedule the 
computations into a vectorizable form, arrange the on-chip input/output buffers, 
and mark the proper computation axis with `vectorize` or replace it with VU 
intrinsics.

* CU (control units)

    CU can be seen as a small on-chip core and does not provide high 
computation ability. It controls the on-chip execution flow, and the 
whole on-chip kernel execution starts from CU.
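
Below is a minimal TensorIR sketch of the NU/VU scheduling styles described above, using a toy elementwise PrimFunc. The intrinsic name `ty_nnp_nu_mma`, the tensorization granularity, and the split factor are illustrative assumptions, not the actual TY-NNP details:

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def add(A: T.Buffer((128, 128), "float32"),
        B: T.Buffer((128, 128), "float32"),
        C: T.Buffer((128, 128), "float32")):
    for i, j in T.grid(128, 128):
        with T.block("add"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = A[vi, vj] + B[vi, vj]

sch = tvm.tir.Schedule(add)
block = sch.get_block("add")
i, j = sch.get_loops(block)
# VU style: expose an innermost vectorizable axis and mark it for the vector units.
jo, ji = sch.split(j, factors=[None, 8])
sch.vectorize(ji)
# NU style (for Conv/Matmul blocks) would instead replace the computation body with
# a coarse-grained intrinsic, e.g. sch.tensorize(loop_or_block, "ty_nnp_nu_mma").
```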

TY-NNP adopts an explicitly managed memory hierarchy: each computing unit has 
its own buffer, and there is a global on-chip buffer (called DM) used to transfer 
data between units. Data transfers are done explicitly by asynchronous DMA 
operations, and explicit/implicit synchronizations are used to avoid hazards. In 
TVM, DMA and synchronization are also represented by TIR intrinsics.
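
For illustration only, here is roughly how an explicit DMA transfer plus synchronization could appear at the TIR level, written with generic extern calls; the names `ty_nnp_dma_copy`/`ty_nnp_sync` and the placement of `A_dm` are placeholders, since the actual intrinsics and on-chip storage scopes belong to the backend passes:

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def dma_example(A: T.Buffer((1024,), "float32"),
                A_dm: T.Buffer((1024,), "float32")) -> None:
    # In the real flow, A_dm would live in an on-chip scope such as DM;
    # here both buffers are plain placeholders in global memory.
    T.evaluate(T.call_extern("int32", "ty_nnp_dma_copy", A_dm.data, A.data, 1024))
    T.evaluate(T.call_extern("int32", "ty_nnp_sync"))
```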

An off-chip storage (called DDR) is used to transfer data between host and 
device; it provides much larger space than the on-chip buffers and supports dynamic 
memory allocation. In TVM, DDR storage simply corresponds to the storage 
scope `kGlobal` and is managed by the runtime.

# Implementation design

The current TVM compilation stack for TY-NNP is as follows:

### Relay level

* We use a fusion pass based on a dedicated hardware cost model. Beyond 
traditional heuristic-based fusion for `conv-bn-relu`-like patterns, it 
performs a much more aggressive strategy that merges multiple anchor ops (such as conv) 
into a single device kernel. This brings opportunities to schedule multiple 
anchor ops simultaneously, which we think is essential to saturate our NPU 
hardware (a small example follows this list).
* A schedule-aware layout rewrite mechanism is added. Our TIR schedule phase 
rewrites tensor layouts to fit hardware features, so we modify the compile 
engine to allow compatible layout updates at the Relay level.
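
To make the fusion behavior concrete, consider the hypothetical Relay snippet below with two consecutive convolutions. With the standard heuristic `FuseOps` pass, each conv becomes the anchor of its own primitive function, whereas our cost-model-based fusion may group both convs (and the elementwise op between them) into one device kernel so they can be scheduled together:

```python
import tvm
from tvm import relay

data = relay.var("data", shape=(1, 16, 56, 56))
w1 = relay.var("w1", shape=(16, 16, 3, 3))
w2 = relay.var("w2", shape=(16, 16, 3, 3))
y = relay.nn.conv2d(data, w1, padding=(1, 1))
y = relay.nn.relu(y)
y = relay.nn.conv2d(y, w2, padding=(1, 1))
# Both convolutions (plus the relu) may be fused into a single TY-NNP kernel.
mod = tvm.IRModule.from_expr(relay.Function([data, w1, w2], y))
```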

### TIR level

A key difference from the current CPU/GPU design is that we try to 
schedule and tune blocks of multiple ops. On a GPU device it is fine to compute a 
single heavy op per kernel, but we think an NPU prefers launching a 
block of consecutive ops to avoid frequent kernel launches. The proposed 
fusion pass described above is a way to achieve this.

Also, since the main efforts of the TVM community are on CPU/GPU backends, there 
are still pain points when developing TIR support for NPU-style backends, and it 
took some effort to make everything work through the standard schedule -> lower flow.

* We use the [TensorIR schedule](https://discuss.tvm.apache.org/t/rfc-tensorir-a-schedulable-ir-for-tvm/7872) 
to schedule the computations. **As far as we know, this is the first trial of the 
TensorIR schedule on NPU infrastructure.**
* A set of new schedule primitives is added to utilize hardware features.
* A set of new TIR passes is added to utilize hardware features.
* We use the `device_scope` attribute to mark the kernel part of the code. The 
community's host-dev split mechanism works well for us.

### Target level

* For codegen, we developed `class CodeGenTYNNPLLVM : public CodeGenLLVM`.
* For runtime, we developed `class TYNNPDeviceAPI : public DeviceAPI`.

# How to run

### Dependencies

The TY-NNP backend depends on the following prebuilt binaries:

1. LLVM libraries with TY-NNP target support
2. TY-NNP assembler
3. TY-NNP driver libraries with integrated simulator

They will be made available as part of the upstreaming. We would also be more 
than glad to provide Docker environments for anyone interested in our hardware.

### Playing

All dependencies are integrated into the codegen and runtime, so users can use 
the general interfaces in the normal way, with only two extra CMake options:

```cmake
# enable TY-NNP support in config.cmake
set(USE_TYNNP ${path to TY-NNP toolchains})
set(USE_LLVM ${path to llvm-config of TY-NNP target support})
```

```python
# test from tir
with ty_nnp.build_config():  # customized pass context
    dev = tvm.ty_nnp(0)
    a = tvm.nd.array(a_np, dev) 
    b = tvm.nd.array(b_np, dev) 
    f = tvm.build(primfunc, target="ty-nnp")
    f(a, b)
```

```python
# test from relay
with ty_nnp.build_config():  # customized pass context
    dev = tvm.ty_nnp(0)
    a = tvm.nd.array(a_np, dev)
    lib = relay.build(relay_module, target="ty-nnp")
    m = graph_executor.GraphModule(lib["default"](dev))
    m.set_input(0, a)
    m.run()
    b = m.get_output(0)
```

### CI Integration

Although we have full-scenario tests in our internal repositories, it 
would be great if some key features (e.g., the conv op) could be covered by the 
community CI. We could provide Docker images that enable the backend testing 
environment. Any detailed suggestions for CI integration are very welcome!

# What we want to contribute

Currently, our backend code lives in `contrib` of the corresponding code directories:

* c++: `src/contrib/ty_nnp` (except codegen/runtime)
* python: `python/tvm/contrib/ty_nnp`
* unittests: `tests/python/contrib/ty_nnp`

The contributions can be summarized in the following aspects:

### TY-NNP codegen and runtime

The runtime is in `src/runtime/contrib/ty_nnp` and the LLVM codegen is in 
`src/target/ty_nnp`.

* This will introduce a new device type `kDLTYNNP` and a new target name 
`TY-NNP`. The corresponding codegen/runtime code is incremental and does not 
affect upstream source code.
* A set of new `StorageRank` enums has to be added to specify the different 
on-chip buffer types. We would be glad to know the best way to define this kind 
of target-related information.

### TIR optimizations on TY-NNP target

TIR code is mainly in `src/contrib/ty_nnp/tir`.

* This will introduce a set of backend TIR passes for TY-NNP hardware features, 
such as DMA intrinsics, synchronization, static address allocation, etc. 
They are designed for our hardware only. Users call `ty_nnp.build_config()` to 
get the specific pass context (a sketch follows this list).
* In the `tvm.build` process, we introduce more flexible configurations, such as 
disabling standard passes that are incompatible with ours.
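
As a rough sketch, we assume `ty_nnp.build_config()` wraps a standard `PassContext` that registers the target-specific passes and disables incompatible standard ones; the disabled pass name and config dict below are illustrative placeholders, not the real pass pipeline:

```python
import tvm

def build_config_sketch(opt_level=3):
    # Hypothetical stand-in for ty_nnp.build_config().
    return tvm.transform.PassContext(
        opt_level=opt_level,
        disabled_pass=["tir.VectorizeLoop"],  # placeholder for incompatible passes
        config={},  # target-specific pass configuration would go here
    )

with build_config_sketch():
    pass  # tvm.build(...) would run under this pass context
```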

### TensorIR schedule proposal

* We would like to introduce a set of new schedule primitives
  
   * **Imperative loop partition**

     Users can either partition the loops and blocks immediately at the schedule 
phase, or lazily perform it in the `loop_partition` pass. This helps a lot in 
non-perfect tiling cases or where boundary conditions are not directly 
supported by the hardware.

      ```python
      _, _, h_axis, w_axis, _ = s.get_loops(block)

      # imperative
      partitioned = s.loop_partition([h_axis, w_axis], lazy=False)
      # partitioned is a tree-structured data structure tracing the partitioned blocks
      my_visit(partitioned)

      # lazy, only a hint tag is added
      s.loop_partition([h_axis, w_axis], lazy=True)
      ```
    
   * **Buffer/loop primitives duality**

      TVM already provides very convenient primitives for loops. However, 
it would be great to explicitly manage memory order as well as computation 
order. We believe that for many NPU scenarios it is essential to control the data 
layout of on-chip memory buffers. TensorIR can control buffer dim alignment, 
but that is not enough: on-chip buffers local to NPU-specific function 
units (imagine TensorCore) can take totally different memory layouts. This would also 
benefit any architecture with a manageable memory hierarchy.

      Just like we get nested loops by `get_loops(block)`, we design dual 
interfaces to get buffer axes, such as `get_write_buffer_axes(block, write_idx)`, and 
conduct buffer layout scheduling on these axes. Below is a table listing the 
primitive duality, where the bolded entries are the proposed new primitives (a usage 
sketch follows this list):

      | Loop schedule | Buffer schedule |
      | ------ | ------ |
      | get_loops | **get_write_buffer_axes**, **get_read_buffer_axes** |
      | split | **buffer_split** |
      | fuse | **buffer_fuse** |
      | reorder | **buffer_reorder** |
      | **loop_extent_align** | buffer_dim_align |

* Accommodated scheduling and tuning, mainly in `python/tvm/contrib/ty_nnp/topi`.

    Currently, the schedule/tuning logic is designed for our hardware features 
only. However, we are very interested in whether there are common methodologies 
for such NPU schedule designs. We would like to refine our code into more 
general schedule/tuning support in the TensorIR modules if such opportunities 
exist!
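
For illustration, here is a usage sketch of the proposed buffer-schedule primitives, mirroring the loop-schedule workflow and reusing the schedule object `s` from the example above; the API names follow the table, but the exact signatures and the `"conv"` block name are illustrative:

```python
block = s.get_block("conv")
# Dual to get_loops(): obtain schedulable axes of the block's first write buffer.
b_n, b_c, b_h, b_w = s.get_write_buffer_axes(block, 0)
# Dual to loop split: split the channel axis of the buffer layout.
b_co, b_ci = s.buffer_split(b_c, factor=16)
# Dual to loop reorder: rearrange buffer axes into an NCHWc-like on-chip layout.
s.buffer_reorder(b_n, b_co, b_h, b_w, b_ci)
```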

### Relay accommodation

Mainly in `python/tvm/contrib/ty_nnp/relay` and `src/contrib/ty_nnp/relay`.

As described in the implementation design:

* Currently, our fusion pass depends on hardware-specific cost models. We would 
like to refine our code into an auto-fusion framework with third-party cost 
models if possible.
* Schedule-aware layout rewrite transformation: we add a Relay pass that performs 
a "pre-schedule" to determine the best data/weight layouts, and then rewrites 
the Relay-level layouts according to the signature of the PrimFunc. Currently, 
we have to hack the compile engine to find the pre-scheduled PrimFunc from a 
standalone cache; we would be glad to know the best way to achieve this.
* To utilize the scheduling described above, we propose to insert a 
customization point into the compile engine, which could be different from the 
fallback schedule, auto-schedule, and meta-schedule.
* We add some customized Relay ops, such as `sum_pool2d`, and would be glad to add 
them as standard Relay ops if they are generally useful.

# Summary

* We implemented the TY-NNP runtime and codegen. They are introduced as standalone 
modules behind the `USE_TYNNP` compile option.
* We integrate TensorIR (and corresponding Relay adaptations) to perform scheduling 
and optimization for our target. This will introduce some adaptations and new 
features to the upstream code. Perhaps we should split them into standalone 
PRs/RFCs?

Thanks for your attention; any suggestions or comments would be 
appreciated. We would be proud to contribute consistently as part of the community.




