# Summary
[summary]: #summary
I want to write an RFC to provide an API which the C runtime can use to 
abstract the variety of driver APIs across different platforms. It is aimed 
specifically at RTOS abstractions for embedded device drivers. 

# Motivation
[motivation]: #motivation

When using an accelerator, such as the [Arm® 
Ethos™-U](https://github.com/apache/tvm-rfcs/pull/11), an Embedded 
Real-Time Operating System (RTOS) will provide a device abstraction to access 
the device resource. When using these abstractions, TVM needs to understand how 
to interact with a device for a given platform.

Take the common example of a UART interface (imagining the accelerator is 
reached via this interface). In Zephyr, this would look similar to:

```c
#include <zephyr.h>
#include <device.h>

struct device *uart_dev = device_get_binding("USART0");

char data[] = "Hello World!\r\n";
uart_tx(uart_dev, data, sizeof(data), 100);
```

Whereas in CMSIS, this would look more similar to:

```c
ARM_DRIVER_USART* uart_dev = &Driver_USART0;
uart_dev->Initialize(NULL);

char data[] = "Hello World!\r\n";
uart_dev->Send(data, sizeof(data)/sizeof(data[0]));
```

This example shows the diversity of RTOS driver implementations and why a 
flexible abstraction is required to pass devices for micro targets.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

## User App
A `tvm_device_t` is implemented for each RTOS or platform required, and the 
user includes the one appropriate for their application. Notably, to avoid 
dynamic allocation, the user must provide the `tvm_device_t` struct and 
initialise it, rather than having the API create and set it up for them.

```c
#include <tvm/runtime/device.h>
#include <tvm/platform/zephyr.h>

tvm_device_t accelerator; // Opaque type for accelerator device
TVMDeviceInit(&accelerator);

// Platform specific call
TVMDevicePlatformBind(&accelerator, ...platform specific parameters);

struct tvmgen_mynetwork_devices devices = {
    .accelerator = &accelerator
};

int32_t ret = tvmgen_mynetwork_run(
    ...,
    &devices
);

TVMDeviceDestroy(&accelerator);
```

## Platform Structures
Users can take the implementations from `src/runtime/crt/platform` and the 
headers from `include/runtime/crt/platform` that map to their platform's 
device implementation. In a bare metal environment, this would default to a 
void pointer as there's no information available.

```c
typedef void* tvm_device_t;
```

For RTOS implementations, a structure can be created such as this simple Zephyr 
wrapper (`include/runtime/crt/platform/zephyr.h`):

```c
#include <device.h>

typedef struct {
    struct device* dev;
} tvm_device_t;
```

This gives the OS maximum control over the resources required and provides 
the opportunity to craft code in whichever way is most idiomatic for that 
platform, for example if an additional locking mechanism is required:

```c
#include <device.h>
#include <kernel.h>

typedef struct {
    struct device* dev;
    struct k_mutex lock;
} tvm_device_t;
```

## Generic Device API
The majority of the device API calls should be added to `c_backend_api.h`:
```c
int32_t TVMDeviceInit(tvm_device_t* tvm_dev);
int32_t TVMDeviceOpen(tvm_device_t* tvm_dev);
int32_t TVMDeviceClose(tvm_device_t* tvm_dev);
int32_t TVMDeviceDestroy(tvm_device_t* tvm_dev);
```

These can all be implemented using the user-opaque context `tvm_device_t`, 
enabling the majority of TVM code to be portable between RTOS implementations; 
importantly this applies to those called within operator functions (see below). 
`c_backend_api.h` can then include the relevant `platform/<PLATFORM>.h` file 
where appropriate using `#ifdef` - if this becomes too unruly it can be added 
to `c_device_api.h` or similar.
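
To illustrate the lifecycle these four calls imply, here is a minimal host-side sketch. The `tvm_device_t` layout (the `initialised` and `open_count` fields) and the specific error rules are assumptions made for illustration, standing in for whatever a real platform header would define. Only the four function names and signatures come from this RFC.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock platform: a real platform header would define this. */
typedef struct {
    int initialised; /* set by TVMDeviceInit, cleared by TVMDeviceDestroy */
    int open_count;  /* number of Open calls not yet matched by a Close */
} tvm_device_t;

int32_t TVMDeviceInit(tvm_device_t* tvm_dev) {
    if (tvm_dev == NULL) return -1;
    tvm_dev->initialised = 1;
    tvm_dev->open_count = 0;
    return 0;
}

int32_t TVMDeviceOpen(tvm_device_t* tvm_dev) {
    /* Assumed rule: a device must be initialised before it is opened. */
    if (tvm_dev == NULL || !tvm_dev->initialised) return -1;
    tvm_dev->open_count++;
    return 0;
}

int32_t TVMDeviceClose(tvm_device_t* tvm_dev) {
    /* Assumed rule: closing a device that is not open is an error. */
    if (tvm_dev == NULL || tvm_dev->open_count == 0) return -1;
    tvm_dev->open_count--;
    return 0;
}

int32_t TVMDeviceDestroy(tvm_device_t* tvm_dev) {
    /* Assumed rule: a device that is still open cannot be destroyed. */
    if (tvm_dev == NULL || tvm_dev->open_count != 0) return -1;
    tvm_dev->initialised = 0;
    return 0;
}
```

Because callers only ever hold a `tvm_device_t*`, swapping this mock for a Zephyr or CMSIS implementation would not change the calling code.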

## Platform Device API
To allow platform specifics to be set in the opaque struct, platform-specific 
functions should be declared in the platform header. Alongside the header, an 
additional file will provide the implementations 
(`src/runtime/crt/platform/zephyr.c`):
```c
// include/runtime/crt/platform/zephyr.h
int32_t TVMDevicePlatformBind(tvm_device_t* tvm_dev, struct device* zephyr_dev);

// src/runtime/crt/platform/zephyr.c
int32_t TVMDevicePlatformBind(tvm_device_t* tvm_dev, struct device* zephyr_dev) 
{
    tvm_dev->dev = zephyr_dev;
    return 0;
}
```
This simple wrapper enables type checking of these functions and defines a 
clear translation boundary between the underlying OS implementation and TVM.

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

## Entrypoint
The entrypoint API defined in [Embedded C Runtime 
Interface](https://discuss.tvm.apache.org/t/rfc-utvm-embedded-c-runtime-interface/9951)
 is augmented with the `devices` structure which contains implemented 
`tvm_device_t` `struct`s for each device used by the network. These are re-cast 
to `void *` when entering the AOT main function to pass it without TIR 
understanding the struct types.

```c
int32_t tvmgen_mynetwork_run(
    ...,
    struct tvmgen_mynetwork_devices* devices
) {
    return tvmgen_mynetwork_run_model(
        ...,
        devices->host,
        devices->accelerator
    );
}
```
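
As a self-contained illustration of the re-cast described above, the sketch below passes typed devices into the entrypoint and opaque handles into a stand-in for the AOT main function. The `tvm_device_t` layout and the `NULL` check are assumptions for demonstration, and the elided tensor arguments are simply left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock; a platform header would provide the real layout. */
typedef struct {
    int id;
} tvm_device_t;

struct tvmgen_mynetwork_devices {
    tvm_device_t* host;
    tvm_device_t* accelerator;
};

/* Stand-in for the generated AOT main: it only ever sees void pointers. */
int32_t tvmgen_mynetwork_run_model(void* host, void* accelerator) {
    /* Assumed behaviour: reject a missing device rather than crash later. */
    if (host == NULL || accelerator == NULL) return -1;
    return 0;
}

/* Entrypoint: typed devices in, opaque handles out. */
int32_t tvmgen_mynetwork_run(struct tvmgen_mynetwork_devices* devices) {
    return tvmgen_mynetwork_run_model((void*)devices->host,
                                      (void*)devices->accelerator);
}
```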

## Executor Function
Each operator is provided with a single device object which can be abstracted 
and passed as the `void* resource_handle`. The main function calls into the 
device API to setup and teardown resources before and after each operator call.

```c
int32_t tvmgen_mynetwork_run_model(..., void* device0, void* device1) {
    TVMDeviceOpen(device0); // Could reserve or enable certain circuitry
    operator(device0);
    TVMDeviceClose(device0);

    TVMDeviceOpen(device1);
    operator(device1);
    TVMDeviceClose(device1);

    return 0;
}
```
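
The outline above ignores return codes. One possible way for generated code to propagate failures while still releasing the device is sketched below. The mock `tvm_device_t`, the `operator_fn` type, the `run_on_device` helper and the two example operators are all hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock device used only for this sketch. */
typedef struct {
    int is_open;
} tvm_device_t;

int32_t TVMDeviceOpen(tvm_device_t* tvm_dev) {
    if (tvm_dev->is_open) return -1; /* already reserved by someone else */
    tvm_dev->is_open = 1;
    return 0;
}

int32_t TVMDeviceClose(tvm_device_t* tvm_dev) {
    tvm_dev->is_open = 0;
    return 0;
}

/* Operators receive the device as an opaque resource_handle. */
typedef int32_t (*operator_fn)(void* resource_handle);

/* Hypothetical helper: open, run one operator, and close again even if
 * the operator failed. The operator's error takes priority. */
int32_t run_on_device(operator_fn op, tvm_device_t* tvm_dev) {
    if (TVMDeviceOpen(tvm_dev) != 0) return -1;
    int32_t op_status = op(tvm_dev);
    int32_t close_status = TVMDeviceClose(tvm_dev);
    return (op_status != 0) ? op_status : close_status;
}

/* Two trivial operators for demonstration. */
int32_t ok_operator(void* resource_handle) {
    (void)resource_handle;
    return 0;
}

int32_t failing_operator(void* resource_handle) {
    (void)resource_handle;
    return -2;
}
```

Closing on the failure path matters if `Open` reserves a lock or powers up circuitry, since skipping it would leave the device held.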

## Device API Functions
In the example of Zephyr, devices are already a first-class concept, so many 
of the functions will be no-ops. Should synchronisation be required, an 
example implementation could be:

```c
#include <device.h>
#include <kernel.h>

typedef struct {
    struct device* dev;
    struct k_mutex lock;
} tvm_device_t;

int32_t TVMDeviceInit(tvm_device_t* tvm_dev) {
    k_mutex_init(&tvm_dev->lock);
    return 0;
}

// Platform-specific
int32_t TVMDevicePlatformBind(tvm_device_t* tvm_dev, struct device* zephyr_dev) 
{
    tvm_dev->dev = zephyr_dev;
    return 0;
}

// Blocks until the device is free, then reserves it
int32_t TVMDeviceOpen(tvm_device_t* tvm_dev) {
    return k_mutex_lock(&tvm_dev->lock, K_FOREVER);
}
int32_t TVMDeviceClose(tvm_device_t* tvm_dev) {
    k_mutex_unlock(&tvm_dev->lock);
    return 0;
}

int32_t TVMDeviceDestroy(tvm_device_t* tvm_dev) {
    tvm_dev->dev = NULL;
    return 0;
}
```

Whereas for CMSIS, you can use the platform-specific function to encapsulate 
the API to our imaginary UART-accessed accelerator:

```c
typedef struct {
    void* dev;
} tvm_device_t;

int32_t TVMDeviceInit(tvm_device_t* tvm_dev) { return 0; }

// Platform-specific
int32_t TVMDevicePlatformBindUart(tvm_device_t* tvm_dev, ARM_DRIVER_USART* uart_dev) {
    uart_dev->Initialize(NULL);
    tvm_dev->dev = uart_dev;
    return 0;
}

// No-ops: nothing to reserve or release in this example
int32_t TVMDeviceOpen(tvm_device_t* tvm_dev) { return 0; }
int32_t TVMDeviceClose(tvm_device_t* tvm_dev) { return 0; }
int32_t TVMDeviceDestroy(tvm_device_t* tvm_dev) { return 0; }
```

## Operator Usage
Each operator would be expected to utilise one device structure and be passed 
that as the `resource_handle` parameter, on the assumption that each operator 
or variant of an operator is only bound to one device at a time. The following 
example shows how an accelerator's interface is implemented per platform to 
take this void pointer and call the platform-specific driver code.

```c
// Operator takes opaque resource_handle
int32_t my_operator(..., void* resource_handle) {
    if (TVMMyAcceleratorInvoke(resource_handle, ...ins,outs,params...) != 0) {
        return -1;
    }
    return 0;
}

// Platform implementation: unwraps the handle and calls the driver
int32_t TVMMyAcceleratorInvoke(void* resource_handle, ...ins,outs,params...) {
    tvm_device_t* tvm_dev = (tvm_device_t*)resource_handle;
    my_accelerator_invoke(
        tvm_dev->dev,
        ...ins,outs,params...
    );
    return 0;
}
```
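
A compilable version of the same pattern, using a mock driver in place of a real accelerator, might look like the sketch below. `mock_driver_t`, `mock_driver_invoke` and the concrete `in`/`out`/`len` parameters are hypothetical stand-ins for the elided inputs, outputs and parameters. The point is only the cast from `void*` back to `tvm_device_t*` at the platform boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a platform driver handle. */
typedef struct {
    int32_t scale; /* pretend hardware configuration */
} mock_driver_t;

/* Hypothetical platform wrapper, as a platform header might define it. */
typedef struct {
    mock_driver_t* dev;
} tvm_device_t;

/* Pretend driver entry point: multiplies each input by the device scale. */
int32_t mock_driver_invoke(mock_driver_t* drv, const int32_t* in,
                           int32_t* out, size_t len) {
    if (drv == NULL) return -1;
    for (size_t i = 0; i < len; i++) out[i] = in[i] * drv->scale;
    return 0;
}

/* Platform implementation: unwraps tvm_device_t and calls the driver. */
int32_t TVMMyAcceleratorInvoke(tvm_device_t* tvm_dev, const int32_t* in,
                               int32_t* out, size_t len) {
    if (tvm_dev == NULL) return -1;
    return mock_driver_invoke(tvm_dev->dev, in, out, len);
}

/* Operator: only sees the opaque handle and forwards the status code. */
int32_t my_operator(const int32_t* in, int32_t* out, size_t len,
                    void* resource_handle) {
    return TVMMyAcceleratorInvoke((tvm_device_t*)resource_handle, in, out, len);
}
```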

## PrimFunc Resource Handle
A `tir::Var` is added to `PrimFunc` in `include/tvm/tir/function.h` which 
enables a `PrimFunc` to track and use the `resource_handle` parameter. Both the 
unpacked and packed APIs will use this to pass the resource down as a `void *` 
rather than packing it into a `TVMValue`. 

When this is packed in the lowering phase, the `resource_handle` is assumed to 
be the last argument, having been provided by the executor code generation. The 
eventual `Call` returned in `lower_tvm_builtin.cc` removes this final argument 
from the packed arguments and passes it separately as the `resource_handle`:

```cpp
auto arg_count = op->args.size() - 1;
auto resource_handle = op->args[arg_count];

// ... packing using arg_count reduced by one

return Call(
    op->dtype,
    call_cpacked_lowered(), 
    {
        op->args[0],
        scope.stack_value,
        scope.stack_tcode,
        ConstInt32(arg_stack_begin),
        ConstInt32(arg_stack_begin + op->args.size() - 1),
        resource_handle
    }
);
```
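
Seen from the C side, the effect is that the packed argument array stops one short and the handle arrives as a separate trailing parameter. The sketch below mimics that split with stand-in types. `mock_value_t`, `mock_packed_fn`, `call_packed_with_handle` and `record_arg_count` are hypothetical and only loosely modelled on a packed calling convention; they are not TVM's real definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PACKED_ARGS 8

/* Stand-in for a packed argument slot; not TVM's real TVMValue. */
typedef union {
    int64_t v_int64;
    void* v_handle;
} mock_value_t;

/* Packed callee: arguments in an array, resource_handle kept separate. */
typedef int32_t (*mock_packed_fn)(mock_value_t* args, int num_args,
                                  void* resource_handle);

/* Hypothetical caller: the last of `count` raw arguments is the handle,
 * so only the first count - 1 are packed. */
int32_t call_packed_with_handle(mock_packed_fn fn, void** raw_args, int count) {
    mock_value_t packed[MAX_PACKED_ARGS];
    if (count < 1 || count - 1 > MAX_PACKED_ARGS) return -1;
    int num_packed = count - 1;
    for (int i = 0; i < num_packed; i++) packed[i].v_handle = raw_args[i];
    return fn(packed, num_packed, raw_args[num_packed]);
}

/* Example callee: writes the number of packed args it saw through the
 * handle, which here is assumed to point at an int. */
int32_t record_arg_count(mock_value_t* args, int num_args,
                         void* resource_handle) {
    (void)args;
    *(int*)resource_handle = num_args;
    return 0;
}
```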

## Device Discovery
Initially, devices will be defined by Target name or external compiler name. 
This means if you mark an operator as needing an external `woofles` compiler it 
would result in a devices struct such as:

```c
struct tvmgen_my_model_devices {
    tvm_device_t* woofles;
};
```

This would be passed down to the relevant operators via the executor. The same 
applies to `Target` defined devices.

# Drawbacks
[drawbacks]: #drawbacks

* Current limitations with `Target` and external compilers mean that only one 
of each name can occur at once using this system. Lifting this could be future 
work.
* The initial assumption is that each operator will be mapped to a single 
device. This design choice means that fusion across devices will not be 
possible.

# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

We could leverage more code generation to generate device structures. It is the 
author's belief that being able to write small self-contained platform 
implementations will be easier to understand for both users and developers of 
TVM.

Another route to take is to treat RTOSes as entirely separate from TVM, 
requiring them to fully configure resources before passing in the `void*`. This 
removes TVM's ability to add hooks for resource management such as `open` and 
`close`, which could be used to enable/disable entire pieces of circuitry 
between operators.

# Prior art
[prior-art]: #prior-art
* Uses the existing `resource_handle` in the TVM code which isn't currently 
propagated
* Extends the C Interface API to add support for devices
* Resource management using `open`/`close` and `init`/`destroy` alongside 
opaque handles is a common pattern in C libraries

# Unresolved questions
[unresolved-questions]: #unresolved-questions

# Future possibilities
[future-possibilities]: #future-possibilities
This RFC aims to put in place the foundation of the Device API to start 
abstracting the various RTOS drivers. There are other flows that have been 
considered as extensions to this.

## Memory Copies
Memory may need to be moved between additional devices which are unable to 
communicate directly. This could take the form of simply:

```c
// Copy from/to
int32_t TVMDeviceCopyFrom(tvm_device_t* source, void* destination);
int32_t TVMDeviceCopyTo(void* source, tvm_device_t* destination);
```

And be integrated into the flow as follows:

```
TVMDeviceOpen(device1);
operator(..., device1) {
    // some work where device1 can read from memory directly
    // then the result is copied back
    TVMDeviceCopyFrom(device1, &buffer);
}
TVMDeviceClose(device1);

TVMDeviceOpen(device2);
operator(..., device2) {
    TVMDeviceCopyTo(&buffer, device2);
    // some work which only device2 can see
    TVMDeviceCopyFrom(device2, &output);
}
TVMDeviceClose(device2);
```

The additional operations here require further thought, but the `Open`/`Close` 
API wrapper demonstrated supports it as an extension. Moving some of these 
calls into the executor may also enable asynchronous memory copies from 
within TVM.
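
To make the staging through a host buffer concrete, here is a minimal host-side sketch. It deviates from the signatures above in one respect: an explicit `size` parameter is added, which is an assumption, as this RFC leaves sizing open. The `tvm_device_t` with private `mem` storage and `MOCK_DEVICE_MEM_SIZE` are likewise hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_DEVICE_MEM_SIZE 16

/* Hypothetical device with private memory other devices cannot address. */
typedef struct {
    uint8_t mem[MOCK_DEVICE_MEM_SIZE];
} tvm_device_t;

/* Copy host data into device memory. The size parameter is an assumption. */
int32_t TVMDeviceCopyTo(const void* source, tvm_device_t* destination,
                        size_t size) {
    if (size > MOCK_DEVICE_MEM_SIZE) return -1;
    memcpy(destination->mem, source, size);
    return 0;
}

/* Copy device memory back to the host. The size parameter is an assumption. */
int32_t TVMDeviceCopyFrom(const tvm_device_t* source, void* destination,
                          size_t size) {
    if (size > MOCK_DEVICE_MEM_SIZE) return -1;
    memcpy(destination, source->mem, size);
    return 0;
}
```

Data produced on one device reaches another by a `CopyFrom` into a host buffer followed by a `CopyTo` on the second device, matching the flow outlined above.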




