From: Elena Agostini <[email protected]> This is the CUDA implementation of the gpudev library. Funcitonalities implemented through CUDA Driver API are:
- Device probe and remove - Manage device memory allocations - Register/unregister external CPU memory in the device memory area Signed-off-by: Elena Agostini <[email protected]> --- doc/guides/gpus/cuda.rst | 127 +++ doc/guides/gpus/index.rst | 1 + doc/guides/rel_notes/release_21_11.rst | 2 + drivers/gpu/cuda/cuda.c | 1132 ++++++++++++++++++++++++ drivers/gpu/cuda/cuda_loader.h | 301 +++++++ drivers/gpu/cuda/meson.build | 10 + drivers/gpu/cuda/version.map | 3 + drivers/gpu/meson.build | 2 +- 8 files changed, 1577 insertions(+), 1 deletion(-) create mode 100644 doc/guides/gpus/cuda.rst create mode 100644 drivers/gpu/cuda/cuda.c create mode 100644 drivers/gpu/cuda/cuda_loader.h create mode 100644 drivers/gpu/cuda/meson.build create mode 100644 drivers/gpu/cuda/version.map diff --git a/doc/guides/gpus/cuda.rst b/doc/guides/gpus/cuda.rst new file mode 100644 index 0000000000..313fcfeffc --- /dev/null +++ b/doc/guides/gpus/cuda.rst @@ -0,0 +1,127 @@ +.. SPDX-License-Identifier: BSD-3-Clause + Copyright (c) 2021 NVIDIA Corporation & Affiliates + +CUDA GPU driver +=============== + +The CUDA GPU driver library (**librte_gpu_cuda**) provides support for NVIDIA GPUs. +Information and documentation about these devices can be found on the +`NVIDIA website <http://www.nvidia.com>`__. Help is also provided by the +`NVIDIA CUDA Toolkit developer zone <https://docs.nvidia.com/cuda>`__. + +CUDA Shared Library +------------------- + +To avoid any system configuration issue, the CUDA API **libcuda.so** shared library +is not linked at building time because of a Meson's bug that looks +for `cudart` module even if the `meson.build` file only requires default `cuda` module. + +**libcuda.so** is loaded at runtime in the ``cuda_gpu_probe`` function through ``dlopen`` +when the very first GPU is detected. +If your CUDA installation resides in a custom directory you need to set +the environment variable ``CUDA_PATH`` to specify where ``dlopen`` +can look for your **libcuda.so**. + +All CUDA API symbols are loaded at runtime as well. +For this reason, to build the CUDA driver library +you don't need to have the CUDA library installed on your system. + +Design +------ + +**librte_gpu_cuda** relies on CUDA Driver API (no need for CUDA Runtime API). + +Goal of this driver library is not to provide a wrapper for the whole CUDA Driver API. +Instead, the scope is to implement the generic features of gpudev API. +For a CUDA application, integrating the gpudev library functions using the CUDA driver library +is quite straightforward and doesn't create any compatibility problem. + +Initialization +~~~~~~~~~~~~~~ + +During initialization, CUDA driver library detects NVIDIA physical GPUs on the +system or specified via EAL device options (e.g. ``-a b6:00.0``). +The driver initializes the CUDA driver environment through ``cuInit(0)`` function. +For this reason, it's required to set any CUDA environment configuration before +calling ``rte_eal_init`` function in the DPDK application. + +If the CUDA driver environment has been already initialized, the ``cuInit(0)`` +in CUDA driver library has no effect. + +CUDA Driver sub-contexts +~~~~~~~~~~~~~~~~~~~~~~~~ + +After initialization, a CUDA application can create multiple sub-contexts on GPU +physical devices. Through gpudev library, is possible to register these sub-contexts +in the CUDA driver library as child devices having as parent a GPU physical device. + +CUDA driver library also supports `MPS <https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf>`__. + +GPU memory management +~~~~~~~~~~~~~~~~~~~~~ + +The CUDA driver library maintains a table of GPU memory addresses allocated +and CPU memory addresses registered associated to the input CUDA context. +Whenever the application tried to deallocate or deregister a memory address, +if the address is not in the table the CUDA driver library will return an error. + +Features +-------- + +- Register new child devices aka new CUDA Driver contexts +- Allocate memory on the GPU +- Register CPU memory to make it visible from GPU + +Minimal requirements +-------------------- + +Minimal requirements to enable the CUDA driver library are: + +- NVIDIA GPU Ampere or Volta +- CUDA 11.4 Driver API or newer + +`GPUDirect RDMA Technology <https://docs.nvidia.com/cuda/gpudirect-rdma/index.html>`__ +allows compatible network cards (e.g. Mellanox) to directly send and receive packets +using GPU memory instead of additional memory copies through the CPU system memory. +To enable this technology, system requirements are: + +- `nvidia-peermem <https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#nvidia-peermem>`__ module running on the system +- Mellanox Network card ConnectX-5 or newer (BlueField models included) +- DPDK mlx5 PMD enabled +- To reach the best performance, an additional PCIe switch between GPU and NIC is recommended + +Limitations +----------- + +Supported only on Linux. + +Supported GPUs +-------------- + +The following NVIDIA GPU devices are supported by this CUDA driver library: + +- NVIDIA A100 80GB PCIe +- NVIDIA A100 40GB PCIe +- NVIDIA A30 24GB +- NVIDIA A10 24GB +- NVIDIA V100 32GB PCIe +- NVIDIA V100 16GB PCIe + +External references +------------------- + +A good example of how to use the GPU CUDA driver library through the gpudev library +is the l2fwd-nv application that can be found `here <https://github.com/NVIDIA/l2fwd-nv>`__. + +The application is based on vanilla DPDK example l2fwd and it's enhanced with GPU memory +managed through gpudev library and CUDA to launch the swap of packets' MAC addresses workload +on the GPU. + +l2fwd-nv is not intended to be used for performance (testpmd is the good candidate for this). +The goal is to show different use-cases about how a CUDA application can use DPDK to: + +- allocate memory on GPU device using gpudev library +- use that memory to create an external GPU memory mempool +- receive packets directly in GPU memory +- coordinate the workload on the GPU with the network and CPU activity to receive packets +- send modified packets directly from the GPU memory diff --git a/doc/guides/gpus/index.rst b/doc/guides/gpus/index.rst index 1878423239..4b7a420556 100644 --- a/doc/guides/gpus/index.rst +++ b/doc/guides/gpus/index.rst @@ -9,3 +9,4 @@ General-Purpose Graphics Processing Unit Drivers :numbered: overview + cuda diff --git a/doc/guides/rel_notes/release_21_11.rst b/doc/guides/rel_notes/release_21_11.rst index 7d60b554d8..c628deaeea 100644 --- a/doc/guides/rel_notes/release_21_11.rst +++ b/doc/guides/rel_notes/release_21_11.rst @@ -111,6 +111,8 @@ New Features * Memory management * Communication flag & list +* **Added NVIDIA GPU driver implemented with CUDA library.** + * **Added new RSS offload types for IPv4/L4 checksum in RSS flow.** Added macros ETH_RSS_IPV4_CHKSUM and ETH_RSS_L4_CHKSUM, now IPv4 and diff --git a/drivers/gpu/cuda/cuda.c b/drivers/gpu/cuda/cuda.c new file mode 100644 index 0000000000..4f60c1932d --- /dev/null +++ b/drivers/gpu/cuda/cuda.c @@ -0,0 +1,1132 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright (c) 2021 NVIDIA Corporation & Affiliates + */ + +#include <rte_common.h> +#include <rte_log.h> +#include <rte_malloc.h> +#include <rte_errno.h> +#include <rte_pci.h> +#include <rte_bus_pci.h> +#include <rte_byteorder.h> +#include <rte_dev.h> + +#include <gpudev_driver.h> +#include "cuda_loader.h" +#include <dlfcn.h> + +#define CUDA_DRIVER_MIN_VERSION 11040 +#define CUDA_API_MIN_VERSION 3020 + +/* CUDA Driver functions loaded with dlsym() */ +enum cuError (*sym_cuInit)(unsigned int flags) = NULL; +enum cuError (*sym_cuDriverGetVersion)(int *driverVersion) = NULL; +enum cuError (*sym_cuGetProcAddress)(const char *symbol, void **pfn, int cudaVersion, uint64_t flags) = NULL; + +/* CUDA Driver functions loaded with cuGetProcAddress for versioning */ +PFN_cuGetErrorString pfn_cuGetErrorString; +PFN_cuGetErrorName pfn_cuGetErrorName; +PFN_cuPointerSetAttribute pfn_cuPointerSetAttribute; +PFN_cuDeviceGetAttribute pfn_cuDeviceGetAttribute; +PFN_cuDeviceGetByPCIBusId pfn_cuDeviceGetByPCIBusId; +PFN_cuDevicePrimaryCtxRetain pfn_cuDevicePrimaryCtxRetain; +PFN_cuDevicePrimaryCtxRelease pfn_cuDevicePrimaryCtxRelease; +PFN_cuDeviceTotalMem pfn_cuDeviceTotalMem; +PFN_cuDeviceGetName pfn_cuDeviceGetName; +PFN_cuCtxGetApiVersion pfn_cuCtxGetApiVersion; +PFN_cuCtxSetCurrent pfn_cuCtxSetCurrent; +PFN_cuCtxGetCurrent pfn_cuCtxGetCurrent; +PFN_cuCtxGetDevice pfn_cuCtxGetDevice; +PFN_cuCtxGetExecAffinity pfn_cuCtxGetExecAffinity; +PFN_cuMemAlloc pfn_cuMemAlloc; +PFN_cuMemFree pfn_cuMemFree; +PFN_cuMemHostRegister pfn_cuMemHostRegister; +PFN_cuMemHostUnregister pfn_cuMemHostUnregister; +PFN_cuMemHostGetDevicePointer pfn_cuMemHostGetDevicePointer; +PFN_cuFlushGPUDirectRDMAWrites pfn_cuFlushGPUDirectRDMAWrites; + +static void *cudalib; +static unsigned int cuda_api_version; +static int cuda_driver_version; + +/* NVIDIA GPU vendor */ +#define NVIDIA_GPU_VENDOR_ID (0x10de) + +/* NVIDIA GPU device IDs */ +#define NVIDIA_GPU_A100_40GB_DEVICE_ID (0x20f1) +#define NVIDIA_GPU_A100_80GB_DEVICE_ID (0x20b5) + +#define NVIDIA_GPU_A30_24GB_DEVICE_ID (0x20b7) +#define NVIDIA_GPU_A10_24GB_DEVICE_ID (0x2236) + +#define NVIDIA_GPU_V100_32GB_DEVICE_ID (0x1db6) +#define NVIDIA_GPU_V100_16GB_DEVICE_ID (0x1db4) + +#define CUDA_MAX_ALLOCATION_NUM 512 + +#define GPU_PAGE_SHIFT 16 +#define GPU_PAGE_SIZE (1UL << GPU_PAGE_SHIFT) + +static RTE_LOG_REGISTER_DEFAULT(cuda_logtype, NOTICE); + +/** Helper macro for logging */ +#define rte_gpu_cuda_log(level, fmt, ...) \ + rte_log(RTE_LOG_ ## level, cuda_logtype, fmt "\n", ##__VA_ARGS__) + +#define rte_gpu_cuda_log_debug(fmt, ...) \ + rte_gpu_cuda_log(DEBUG, RTE_STR(__LINE__) ":%s() " fmt, __func__, \ + ##__VA_ARGS__) + +/* NVIDIA GPU address map */ +static struct rte_pci_id pci_id_cuda_map[] = { + { + RTE_PCI_DEVICE(NVIDIA_GPU_VENDOR_ID, + NVIDIA_GPU_A100_40GB_DEVICE_ID) + }, + { + RTE_PCI_DEVICE(NVIDIA_GPU_VENDOR_ID, + NVIDIA_GPU_V100_32GB_DEVICE_ID) + }, + /* {.device_id = 0}, ?? */ +}; + +/* Device private info */ +struct cuda_info { + char gpu_name[RTE_DEV_NAME_MAX_LEN]; + cuDev cu_dev; + int gdr_supported; + int gdr_write_ordering; + int gdr_flush_type; +}; + +/* Type of memory allocated by CUDA driver */ +enum mem_type { + GPU_MEM = 0, + CPU_REGISTERED, + GPU_REGISTERED /* Not used yet */ +}; + +/* key associated to a memory address */ +typedef uintptr_t cuda_ptr_key; + +/* Single entry of the memory list */ +struct mem_entry { + cuDevPtr ptr_d; + void *ptr_h; + size_t size; + struct rte_gpu *dev; + CUcontext ctx; + cuda_ptr_key pkey; + enum mem_type mtype; + struct mem_entry *prev; + struct mem_entry *next; +}; + +static struct mem_entry *mem_alloc_list_head; +static struct mem_entry *mem_alloc_list_tail; +static uint32_t mem_alloc_list_last_elem; + +/* Load the CUDA symbols */ + +static int +cuda_loader(void) +{ + char cuda_path[1024]; + + if (!getenv("CUDA_PATH")) + snprintf(cuda_path, 1024, "%s", "libcuda.so"); + else + snprintf(cuda_path, 1024, "%s%s", getenv("CUDA_PATH"), "libcuda.so"); + + cudalib = dlopen(cuda_path, RTLD_LAZY); + if (cudalib == NULL) { + rte_gpu_cuda_log(ERR, "Failed to find CUDA library in %s (CUDA_PATH=%s).\n", + cuda_path, getenv("CUDA_PATH")); + return -1; + } + + return 0; +} + +static int +cuda_sym_func_loader(void) +{ + if (!cudalib) + return -1; + + sym_cuInit = dlsym(cudalib, "cuInit"); + if (sym_cuInit == NULL) { + rte_gpu_cuda_log(ERR, "Failed to load CUDA missing symbol cuInit\n"); + return -1; + } + + sym_cuDriverGetVersion = dlsym(cudalib, "cuDriverGetVersion"); + if (sym_cuDriverGetVersion == NULL) { + rte_gpu_cuda_log(ERR, "Failed to load CUDA missing symbol cuDriverGetVersion\n"); + return -1; + } + + sym_cuGetProcAddress = dlsym(cudalib, "cuGetProcAddress"); + if (sym_cuGetProcAddress == NULL) { + rte_gpu_cuda_log(ERR, "Failed to load CUDA missing symbol cuGetProcAddress\n"); + return -1; + } + + return 0; +} + +static int +cuda_pfn_func_loader(void) +{ + enum cuError res; + + res = sym_cuGetProcAddress("cuGetErrorString", (void **) (&pfn_cuGetErrorString), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuGetErrorString failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuGetErrorName", (void **) (&pfn_cuGetErrorName), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuGetErrorName failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuPointerSetAttribute", (void **) (&pfn_cuPointerSetAttribute), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuPointerSetAttribute failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuDeviceGetAttribute", (void **) (&pfn_cuDeviceGetAttribute), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuDeviceGetAttribute failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuDeviceGetByPCIBusId", (void **) (&pfn_cuDeviceGetByPCIBusId), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuDeviceGetByPCIBusId failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuDeviceGetName", (void **) (&pfn_cuDeviceGetName), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuDeviceGetName failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuDevicePrimaryCtxRetain", (void **) (&pfn_cuDevicePrimaryCtxRetain), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuDevicePrimaryCtxRetain failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuDevicePrimaryCtxRelease", (void **) (&pfn_cuDevicePrimaryCtxRelease), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuDevicePrimaryCtxRelease failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuDeviceTotalMem", (void **) (&pfn_cuDeviceTotalMem), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuDeviceTotalMem failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuCtxGetApiVersion", (void **) (&pfn_cuCtxGetApiVersion), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuCtxGetApiVersion failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuCtxGetDevice", (void **) (&pfn_cuCtxGetDevice), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuCtxGetDevice failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuCtxSetCurrent", (void **) (&pfn_cuCtxSetCurrent), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuCtxSetCurrent failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuCtxGetCurrent", (void **) (&pfn_cuCtxGetCurrent), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuCtxGetCurrent failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuCtxGetExecAffinity", (void **) (&pfn_cuCtxGetExecAffinity), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuCtxGetExecAffinity failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuMemAlloc", (void **) (&pfn_cuMemAlloc), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuMemAlloc failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuMemFree", (void **) (&pfn_cuMemFree), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuMemFree failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuMemHostRegister", (void **) (&pfn_cuMemHostRegister), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuMemHostRegister failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuMemHostUnregister", (void **) (&pfn_cuMemHostUnregister), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuMemHostUnregister failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuMemHostGetDevicePointer", (void **) (&pfn_cuMemHostGetDevicePointer), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve pfn_cuMemHostGetDevicePointer failed with %d\n", res); + return -1; + } + + res = sym_cuGetProcAddress("cuFlushGPUDirectRDMAWrites", (void **) (&pfn_cuFlushGPUDirectRDMAWrites), cuda_driver_version, 0); + if (res != 0) { + rte_gpu_cuda_log(ERR, "Retrieve cuFlushGPUDirectRDMAWrites failed with %d\n", res); + return -1; + } + + return 0; +} + +/* Generate a key from a memory pointer */ +static cuda_ptr_key +get_hash_from_ptr(void *ptr) +{ + return (uintptr_t) ptr; +} + +static uint32_t +mem_list_count_item(void) +{ + return mem_alloc_list_last_elem; +} + +/* Initiate list of memory allocations if not done yet */ +static struct mem_entry * +mem_list_add_item(void) +{ + /* Initiate list of memory allocations if not done yet */ + if (mem_alloc_list_head == NULL) { + mem_alloc_list_head = rte_zmalloc(NULL, + sizeof(struct mem_entry), + RTE_CACHE_LINE_SIZE); + if (mem_alloc_list_head == NULL) { + rte_gpu_cuda_log(ERR, "Failed to allocate memory for memory list.\n"); + return NULL; + } + + mem_alloc_list_head->next = NULL; + mem_alloc_list_head->prev = NULL; + mem_alloc_list_tail = mem_alloc_list_head; + } else { + struct mem_entry *mem_alloc_list_cur = rte_zmalloc(NULL, + sizeof(struct mem_entry), + RTE_CACHE_LINE_SIZE); + + if (mem_alloc_list_cur == NULL) { + rte_gpu_cuda_log(ERR, "Failed to allocate memory for memory list.\n"); + return NULL; + } + + mem_alloc_list_tail->next = mem_alloc_list_cur; + mem_alloc_list_cur->prev = mem_alloc_list_tail; + mem_alloc_list_tail = mem_alloc_list_tail->next; + mem_alloc_list_tail->next = NULL; + } + + mem_alloc_list_last_elem++; + + return mem_alloc_list_tail; +} + +static struct mem_entry * +mem_list_find_item(cuda_ptr_key pk) +{ + struct mem_entry *mem_alloc_list_cur = NULL; + + if (mem_alloc_list_head == NULL) { + rte_gpu_cuda_log(ERR, "Memory list doesn't exist\n"); + return NULL; + } + + if (mem_list_count_item() == 0) { + rte_gpu_cuda_log(ERR, "No items in memory list\n"); + return NULL; + } + + mem_alloc_list_cur = mem_alloc_list_head; + + while (mem_alloc_list_cur != NULL) { + if (mem_alloc_list_cur->pkey == pk) + return mem_alloc_list_cur; + mem_alloc_list_cur = mem_alloc_list_cur->next; + } + + return mem_alloc_list_cur; +} + +static int +mem_list_del_item(cuda_ptr_key pk) +{ + struct mem_entry *mem_alloc_list_cur = NULL; + + mem_alloc_list_cur = mem_list_find_item(pk); + if (mem_alloc_list_cur == NULL) + return -EINVAL; + + /* if key is in head */ + if (mem_alloc_list_cur->prev == NULL) + mem_alloc_list_head = mem_alloc_list_cur->next; + else { + mem_alloc_list_cur->prev->next = mem_alloc_list_cur->next; + if (mem_alloc_list_cur->next != NULL) + mem_alloc_list_cur->next->prev = mem_alloc_list_cur->prev; + } + + rte_free(mem_alloc_list_cur); + + mem_alloc_list_last_elem--; + + return 0; +} + +static int +cuda_dev_info_get(struct rte_gpu *dev, struct rte_gpu_info *info) +{ + int ret = 0; + enum cuError res; + struct rte_gpu_info parent_info; + struct cuExecAffinityParams affinityPrm; + const char *err_string; + struct cuda_info *private; + CUcontext current_ctx; + CUcontext input_ctx; + + if (dev == NULL) + return -EINVAL; + + /* Child initialization time probably called by rte_gpu_add_child() */ + if (dev->mpshared->info.parent != RTE_GPU_ID_NONE && dev->mpshared->dev_private == NULL) { + /* Store current ctx */ + res = pfn_cuCtxGetCurrent(¤t_ctx); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, "cuCtxGetCurrent failed with %s.\n", err_string); + + return -1; + } + + /* Set child ctx as current ctx */ + input_ctx = (CUcontext)((uintptr_t)dev->mpshared->info.context); + res = pfn_cuCtxSetCurrent(input_ctx); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuCtxSetCurrent input failed with %s.\n", + err_string); + + return -1; + } + + /* + * Ctx capacity info + */ + + /* MPS compatible */ + res = pfn_cuCtxGetExecAffinity(&affinityPrm, CU_EXEC_AFFINITY_TYPE_SM_COUNT); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, "cuCtxGetExecAffinity failed with %s.\n", err_string); + } + dev->mpshared->info.processor_count = (uint32_t)affinityPrm.param.smCount.val; + + ret = rte_gpu_info_get(dev->mpshared->info.parent, &parent_info); + if (ret) + return -ENODEV; + dev->mpshared->info.total_memory = parent_info.total_memory; + + /* + * GPU Device private info + */ + dev->mpshared->dev_private = rte_zmalloc(NULL, + sizeof(struct cuda_info), + RTE_CACHE_LINE_SIZE); + if (dev->mpshared->dev_private == NULL) { + rte_gpu_cuda_log(ERR, "Failed to allocate memory for GPU process private.\n"); + + return -1; + } + + private = (struct cuda_info *)dev->mpshared->dev_private; + + res = pfn_cuCtxGetDevice(&(private->cu_dev)); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, "cuCtxGetDevice failed with %s.\n", err_string); + + return -1; + } + + res = pfn_cuDeviceGetName(private->gpu_name, RTE_DEV_NAME_MAX_LEN, private->cu_dev); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, "cuDeviceGetName failed with %s.\n", err_string); + + return -1; + } + + /* Restore original ctx as current ctx */ + res = pfn_cuCtxSetCurrent(current_ctx); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuCtxSetCurrent current failed with %s.\n", + err_string); + + return -1; + } + } + + *info = dev->mpshared->info; + + return 0; +} + +/* + * GPU Memory + */ + +static int +cuda_mem_alloc(struct rte_gpu *dev, size_t size, void **ptr) +{ + enum cuError res; + const char *err_string; + CUcontext current_ctx; + CUcontext input_ctx; + unsigned int flag = 1; + + if (dev == NULL || size == 0) + return -EINVAL; + + /* Store current ctx */ + res = pfn_cuCtxGetCurrent(¤t_ctx); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, "cuCtxGetCurrent failed with %s.\n", err_string); + + return -1; + } + + /* Set child ctx as current ctx */ + input_ctx = (CUcontext)((uintptr_t)dev->mpshared->info.context); + res = pfn_cuCtxSetCurrent(input_ctx); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, "cuCtxSetCurrent input failed with %s.\n", err_string); + + return -1; + } + + /* Get next memory list item */ + mem_alloc_list_tail = mem_list_add_item(); + if (mem_alloc_list_tail == NULL) + return -ENOMEM; + + /* Allocate memory */ + mem_alloc_list_tail->size = size; + res = pfn_cuMemAlloc(&(mem_alloc_list_tail->ptr_d), mem_alloc_list_tail->size); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuCtxSetCurrent current failed with %s.\n", + err_string); + + return -1; + } + + /* GPUDirect RDMA attribute required */ + res = pfn_cuPointerSetAttribute(&flag, + CU_PTR_ATTR_SYNC_MEMOPS, + mem_alloc_list_tail->ptr_d); + if (res != 0) { + rte_gpu_cuda_log(ERR, + "Could not set SYNC MEMOP attribute for GPU memory at %"PRIu32", err %d\n", + (uint32_t) mem_alloc_list_tail->ptr_d, res); + return -1; + } + + mem_alloc_list_tail->pkey = get_hash_from_ptr((void *) mem_alloc_list_tail->ptr_d); + mem_alloc_list_tail->ptr_h = NULL; + mem_alloc_list_tail->size = size; + mem_alloc_list_tail->dev = dev; + mem_alloc_list_tail->ctx = (CUcontext)((uintptr_t)dev->mpshared->info.context); + mem_alloc_list_tail->mtype = GPU_MEM; + + /* Restore original ctx as current ctx */ + res = pfn_cuCtxSetCurrent(current_ctx); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, "cuCtxSetCurrent current failed with %s.\n", err_string); + + return -1; + } + + *ptr = (void *) mem_alloc_list_tail->ptr_d; + + return 0; +} + +static int +cuda_mem_register(struct rte_gpu *dev, size_t size, void *ptr) +{ + enum cuError res; + const char *err_string; + CUcontext current_ctx; + CUcontext input_ctx; + unsigned int flag = 1; + int use_ptr_h = 0; + + if (dev == NULL || size == 0 || ptr == NULL) + return -EINVAL; + + /* Store current ctx */ + res = pfn_cuCtxGetCurrent(¤t_ctx); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, "cuCtxGetCurrent failed with %s.\n", err_string); + + return -1; + } + + /* Set child ctx as current ctx */ + input_ctx = (CUcontext)((uintptr_t)dev->mpshared->info.context); + res = pfn_cuCtxSetCurrent(input_ctx); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, "cuCtxSetCurrent input failed with %s.\n", err_string); + + return -1; + } + + /* Get next memory list item */ + mem_alloc_list_tail = mem_list_add_item(); + if (mem_alloc_list_tail == NULL) + return -ENOMEM; + + /* Allocate memory */ + mem_alloc_list_tail->size = size; + mem_alloc_list_tail->ptr_h = ptr; + + res = pfn_cuMemHostRegister(mem_alloc_list_tail->ptr_h, + mem_alloc_list_tail->size, + CU_MHOST_REGISTER_PORTABLE | CU_MHOST_REGISTER_DEVICEMAP); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuMemHostRegister failed with %s ptr %p size %zd.\n", + err_string, mem_alloc_list_tail->ptr_h, mem_alloc_list_tail->size); + + return -1; + } + + res = pfn_cuDeviceGetAttribute(&(use_ptr_h), + CU_DEV_ATTR_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM, + ((struct cuda_info *)(dev->mpshared->dev_private))->cu_dev); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, "cuDeviceGetAttribute failed with %s.\n", + err_string + ); + + return -1; + } + + if (use_ptr_h == 0) { + res = pfn_cuMemHostGetDevicePointer(&(mem_alloc_list_tail->ptr_d), + mem_alloc_list_tail->ptr_h, + 0); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuMemHostGetDevicePointer failed with %s.\n", + err_string); + + return -1; + } + + if ((uintptr_t) mem_alloc_list_tail->ptr_d != (uintptr_t) mem_alloc_list_tail->ptr_h) { + rte_gpu_cuda_log(ERR, + "Host input pointer is different wrt GPU registered pointer\n"); + return -1; + } + } else { + mem_alloc_list_tail->ptr_d = (cuDevPtr) mem_alloc_list_tail->ptr_h; + } + + /* GPUDirect RDMA attribute required */ + res = pfn_cuPointerSetAttribute(&flag, + CU_PTR_ATTR_SYNC_MEMOPS, + mem_alloc_list_tail->ptr_d); + if (res != 0) { + rte_gpu_cuda_log(ERR, + "Could not set SYNC MEMOP attribute for GPU memory at %"PRIu32", err %d\n", + (uint32_t) mem_alloc_list_tail->ptr_d, res); + return -1; + } + + mem_alloc_list_tail->pkey = get_hash_from_ptr((void *) mem_alloc_list_tail->ptr_h); + mem_alloc_list_tail->size = size; + mem_alloc_list_tail->dev = dev; + mem_alloc_list_tail->ctx = (CUcontext)((uintptr_t)dev->mpshared->info.context); + mem_alloc_list_tail->mtype = CPU_REGISTERED; + + /* Restore original ctx as current ctx */ + res = pfn_cuCtxSetCurrent(current_ctx); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuCtxSetCurrent current failed with %s.\n", + err_string); + + return -1; + } + + return 0; +} + +static int +cuda_mem_free(struct rte_gpu *dev, void *ptr) +{ + enum cuError res; + struct mem_entry *mem_item; + const char *err_string; + cuda_ptr_key hk; + + if (dev == NULL || ptr == NULL) + return -EINVAL; + + hk = get_hash_from_ptr((void *) ptr); + + mem_item = mem_list_find_item(hk); + if (mem_item == NULL) { + rte_gpu_cuda_log(ERR, "Memory address 0x%p not found in driver memory\n", ptr); + return -1; + } + + if (mem_item->mtype == GPU_MEM) { + res = pfn_cuMemFree(mem_item->ptr_d); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, "cuMemFree current failed with %s.\n", err_string); + + return -1; + } + + return mem_list_del_item(hk); + } + + rte_gpu_cuda_log(ERR, "Memory type %d not supported\n", mem_item->mtype); + return -1; +} + +static int +cuda_mem_unregister(struct rte_gpu *dev, void *ptr) +{ + enum cuError res; + struct mem_entry *mem_item; + const char *err_string; + cuda_ptr_key hk; + + if (dev == NULL || ptr == NULL) + return -EINVAL; + + hk = get_hash_from_ptr((void *) ptr); + + mem_item = mem_list_find_item(hk); + if (mem_item == NULL) { + rte_gpu_cuda_log(ERR, "Memory address 0x%p not nd in driver memory\n", ptr); + return -1; + } + + if (mem_item->mtype == CPU_REGISTERED) { + res = pfn_cuMemHostUnregister(ptr); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuMemHostUnregister current failed with %s.\n", + err_string); + + return -1; + } + + return mem_list_del_item(hk); + } + + rte_gpu_cuda_log(ERR, "Memory type %d not supported\n", mem_item->mtype); + return -1; +} + +static int +cuda_dev_close(struct rte_gpu *dev) +{ + if (dev == NULL) + return -EINVAL; + + rte_free(dev->mpshared->dev_private); + + return 0; +} + +static int +cuda_wmb(struct rte_gpu *dev) +{ + enum cuError res; + const char *err_string; + CUcontext current_ctx; + CUcontext input_ctx; + struct cuda_info *private; + + if (dev == NULL) + return -EINVAL; + + private = (struct cuda_info *)dev->mpshared->dev_private; + + if (private->gdr_write_ordering != CU_GDR_WRITES_ORDERING_NONE) { + /* + * No need to explicitly force the write ordering because + * the device natively supports it + */ + return 0; + } + + if (private->gdr_flush_type != CU_FLUSH_GDR_WRITES_OPTION_HOST) { + /* + * Can't flush GDR writes with cuFlushGPUDirectRDMAWrites CUDA function. + * Application needs to use alternative methods. + */ + return -ENOTSUP; + } + + /* Store current ctx */ + res = pfn_cuCtxGetCurrent(¤t_ctx); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, "cuCtxGetCurrent failed with %s.\n", err_string); + + return -1; + } + + /* Set child ctx as current ctx */ + input_ctx = (CUcontext)((uintptr_t)dev->mpshared->info.context); + res = pfn_cuCtxSetCurrent(input_ctx); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, "cuCtxSetCurrent input failed with %s.\n", err_string); + + return -1; + } + + res = pfn_cuFlushGPUDirectRDMAWrites(CU_FLUSH_GDR_WRITES_TARGET_CURRENT_CTX, + CU_FLUSH_GDR_WRITES_TO_ALL_DEVICES); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuFlushGPUDirectRDMAWrites current failed with %s.\n", + err_string); + + return -1; + } + + /* Restore original ctx as current ctx */ + res = pfn_cuCtxSetCurrent(current_ctx); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuCtxSetCurrent current failed with %s.\n", + err_string); + + return -1; + } + + return 0; +} + +static int +cuda_gpu_probe(__rte_unused struct rte_pci_driver *pci_drv, struct rte_pci_device *pci_dev) +{ + struct rte_gpu *dev = NULL; + enum cuError res; + cuDev cu_dev_id; + CUcontext pctx; + char dev_name[RTE_DEV_NAME_MAX_LEN]; + const char *err_string; + int processor_count = 0; + struct cuda_info *private; + + if (pci_dev == NULL) { + rte_gpu_cuda_log(ERR, "NULL PCI device"); + return -EINVAL; + } + + rte_pci_device_name(&pci_dev->addr, dev_name, sizeof(dev_name)); + + /* Allocate memory to be used privately by drivers */ + dev = rte_gpu_allocate(pci_dev->device.name); + if (dev == NULL) + return -ENODEV; + + /* Initialize values only for the first CUDA driver call */ + if (dev->mpshared->info.dev_id == 0) { + mem_alloc_list_head = NULL; + mem_alloc_list_tail = NULL; + mem_alloc_list_last_elem = 0; + + /* Load libcuda.so library */ + if (cuda_loader()) { + rte_gpu_cuda_log(ERR, "CUDA Driver library not found.\n"); + return -ENOTSUP; + } + + /* Load initial CUDA functions */ + if (cuda_sym_func_loader()) { + rte_gpu_cuda_log(ERR, "CUDA functions not found in library.\n"); + return -ENOTSUP; + } + + /* + * Required to initialize the CUDA Driver. + * Multiple calls of cuInit() will return immediately + * without making any relevant change + */ + sym_cuInit(0); + + res = sym_cuDriverGetVersion(&cuda_driver_version); + if (res != 0) { + rte_gpu_cuda_log(ERR, "cuDriverGetVersion failed with %d\n", res); + return -ENOTSUP; + } + + if (cuda_driver_version < CUDA_DRIVER_MIN_VERSION) { + rte_gpu_cuda_log(ERR, "CUDA Driver version found is %d Minimum requirement is %d\n", + cuda_driver_version, CUDA_DRIVER_MIN_VERSION); + return -ENOTSUP; + } + + if (cuda_pfn_func_loader()) { + rte_gpu_cuda_log(ERR, "CUDA PFN functions not found in library.\n"); + return -ENOTSUP; + } + } + + /* Fill HW specific part of device structure */ + dev->device = &pci_dev->device; + dev->mpshared->info.numa_node = pci_dev->device.numa_node; + + /* Get NVIDIA GPU Device descriptor */ + res = pfn_cuDeviceGetByPCIBusId(&cu_dev_id, dev->device->name); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuDeviceGetByPCIBusId name %s failed with %d: %s.\n", + dev->device->name, res, err_string); + + return -1; + } + + res = pfn_cuDevicePrimaryCtxRetain(&pctx, cu_dev_id); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuDevicePrimaryCtxRetain name %s failed with %d: %s.\n", + dev->device->name, res, err_string); + + return -1; + } + + res = pfn_cuCtxGetApiVersion(pctx, &cuda_api_version); + if (res != 0) { + rte_gpu_cuda_log(ERR, "cuCtxGetApiVersion failed with %d\n", res); + return -ENOTSUP; + } + + if (cuda_api_version < CUDA_API_MIN_VERSION) { + rte_gpu_cuda_log(ERR, "CUDA API version found is %d Minimum requirement is %d\n", + cuda_api_version, CUDA_API_MIN_VERSION); + return -ENOTSUP; + } + + dev->mpshared->info.context = (uint64_t) pctx; + + /* + * GPU Device generic info + */ + + /* Processor count */ + res = pfn_cuDeviceGetAttribute(&(processor_count), + CU_DEV_ATTR_MULTIPROCESSOR_COUNT, + cu_dev_id); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuDeviceGetAttribute failed with %s.\n", + err_string); + + return -1; + } + dev->mpshared->info.processor_count = (uint32_t)processor_count; + + /* Total memory */ + res = pfn_cuDeviceTotalMem(&dev->mpshared->info.total_memory, cu_dev_id); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuDeviceTotalMem failed with %s.\n", + err_string); + + return -1; + } + + /* + * GPU Device private info + */ + dev->mpshared->dev_private = rte_zmalloc(NULL, + sizeof(struct cuda_info), + RTE_CACHE_LINE_SIZE); + if (dev->mpshared->dev_private == NULL) { + rte_gpu_cuda_log(ERR, + "Failed to allocate memory for GPU process private.\n"); + + return -1; + } + + private = (struct cuda_info *)dev->mpshared->dev_private; + private->cu_dev = cu_dev_id; + res = pfn_cuDeviceGetName(private->gpu_name, + RTE_DEV_NAME_MAX_LEN, + cu_dev_id); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuDeviceGetName failed with %s.\n", + err_string); + + return -1; + } + + res = pfn_cuDeviceGetAttribute(&(private->gdr_supported), + CU_DEV_ATTR_GPU_DIRECT_RDMA_SUPPORTED, + cu_dev_id); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuDeviceGetAttribute failed with %s.\n", + err_string); + + return -1; + } + + if (private->gdr_supported == 0) + rte_gpu_cuda_log(WARNING, + "GPU %s doesn't support GPUDirect RDMA.\n", + pci_dev->device.name); + + res = pfn_cuDeviceGetAttribute(&(private->gdr_write_ordering), + CU_DEV_ATTR_GPU_DIRECT_RDMA_WRITES_ORDERING, + cu_dev_id); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuDeviceGetAttribute failed with %s.\n", + err_string); + + return -1; + } + + if (private->gdr_write_ordering == CU_GDR_WRITES_ORDERING_NONE) { + res = pfn_cuDeviceGetAttribute(&(private->gdr_flush_type), + CU_DEV_ATTR_GPU_DIRECT_RDMA_FLUSH_WRITES_OPTIONS, + cu_dev_id); + if (res != 0) { + pfn_cuGetErrorString(res, &(err_string)); + rte_gpu_cuda_log(ERR, + "cuDeviceGetAttribute failed with %s.\n", + err_string); + + return -1; + } + + if (private->gdr_flush_type != CU_FLUSH_GDR_WRITES_OPTION_HOST) { + rte_gpu_cuda_log(ERR, + "GPUDirect RDMA flush writes API is not supported.\n"); + } + } + + dev->ops.dev_info_get = cuda_dev_info_get; + dev->ops.dev_close = cuda_dev_close; + dev->ops.mem_alloc = cuda_mem_alloc; + dev->ops.mem_free = cuda_mem_free; + dev->ops.mem_register = cuda_mem_register; + dev->ops.mem_unregister = cuda_mem_unregister; + dev->ops.wmb = cuda_wmb; + + rte_gpu_complete_new(dev); + + rte_gpu_cuda_log_debug("dev id = %u name = %s\n", dev->mpshared->info.dev_id, private->gpu_name); + + return 0; +} + +static int +cuda_gpu_remove(struct rte_pci_device *pci_dev) +{ + struct rte_gpu *dev; + int ret; + uint8_t gpu_id; + + if (pci_dev == NULL) + return -EINVAL; + + dev = rte_gpu_get_by_name(pci_dev->device.name); + if (dev == NULL) { + rte_gpu_cuda_log(ERR, + "Couldn't find HW dev \"%s\" to uninitialise it", + pci_dev->device.name); + return -ENODEV; + } + gpu_id = dev->mpshared->info.dev_id; + + /* release dev from library */ + ret = rte_gpu_release(dev); + if (ret) + rte_gpu_cuda_log(ERR, "Device %i failed to uninit: %i", gpu_id, ret); + + rte_gpu_cuda_log_debug("Destroyed dev = %u", gpu_id); + + return 0; +} + +static struct rte_pci_driver rte_cuda_driver = { + .id_table = pci_id_cuda_map, + .drv_flags = RTE_PCI_DRV_WC_ACTIVATE, + .probe = cuda_gpu_probe, + .remove = cuda_gpu_remove, +}; + +RTE_PMD_REGISTER_PCI(gpu_cuda, rte_cuda_driver); +RTE_PMD_REGISTER_PCI_TABLE(gpu_cuda, pci_id_cuda_map); +RTE_PMD_REGISTER_KMOD_DEP(gpu_cuda, "* nvidia & (nv_peer_mem | nvpeer_mem)"); diff --git a/drivers/gpu/cuda/cuda_loader.h b/drivers/gpu/cuda/cuda_loader.h new file mode 100644 index 0000000000..7d12ed5c8a --- /dev/null +++ b/drivers/gpu/cuda/cuda_loader.h @@ -0,0 +1,301 @@ +/* SPDX-License-Identifier: BSD-3-Clause + * Copyright (c) 2021 NVIDIA Corporation & Affiliates + */ + +/* + * This header is inspired from cuda.h and cudaTypes.h + * tipically found in /usr/local/cuda/include + */ + +#ifndef DPDK_CUDA_LOADER_H +#define DPDK_CUDA_LOADER_H + +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> +#include <rte_bitops.h> + +#if defined(__LP64__) +typedef unsigned long long cuDevPtr_v2; +#else +typedef unsigned int cuDevPtr_v2; +#endif +typedef cuDevPtr_v2 cuDevPtr; + +typedef int cuDev_v1; +typedef cuDev_v1 cuDev; +typedef struct CUctx_st *CUcontext; + +enum cuError { + SUCCESS = 0, + ERROR_INVALID_VALUE = 1, + ERROR_OUT_OF_MEMORY = 2, + ERROR_NOT_INITIALIZED = 3, + ERROR_DEINITIALIZED = 4, + ERROR_PROFILER_DISABLED = 5, + ERROR_PROFILER_NOT_INITIALIZED = 6, + ERROR_PROFILER_ALREADY_STARTED = 7, + ERROR_PROFILER_ALREADY_STOPPED = 8, + ERROR_STUB_LIBRARY = 34, + ERROR_NO_DEVICE = 100, + ERROR_INVALID_DEVICE = 101, + ERROR_DEVICE_NOT_LICENSED = 102, + ERROR_INVALID_IMAGE = 200, + ERROR_INVALID_CONTEXT = 201, + ERROR_CONTEXT_ALREADY_CURRENT = 202, + ERROR_MAP_FAILED = 205, + ERROR_UNMAP_FAILED = 206, + ERROR_ARRAY_IS_MAPPED = 207, + ERROR_ALREADY_MAPPED = 208, + ERROR_NO_BINARY_FOR_GPU = 209, + ERROR_ALREADY_ACQUIRED = 210, + ERROR_NOT_MAPPED = 211, + ERROR_NOT_MAPPED_AS_ARRAY = 212, + ERROR_NOT_MAPPED_AS_POINTER = 213, + ERROR_ECC_UNCORRECTABLE = 214, + ERROR_UNSUPPORTED_LIMIT = 215, + ERROR_CONTEXT_ALREADY_IN_USE = 216, + ERROR_PEER_ACCESS_UNSUPPORTED = 217, + ERROR_INVALID_PTX = 218, + ERROR_INVALID_GRAPHICS_CONTEXT = 219, + ERROR_NVLINK_UNCORRECTABLE = 220, + ERROR_JIT_COMPILER_NOT_FOUND = 221, + ERROR_UNSUPPORTED_PTX_VERSION = 222, + ERROR_JIT_COMPILATION_DISABLED = 223, + ERROR_UNSUPPORTED_EXEC_AFFINITY = 224, + ERROR_INVALID_SOURCE = 300, + ERROR_FILE_NOT_FOUND = 301, + ERROR_SHARED_OBJECT_SYMBOL_NOT_FOUND = 302, + ERROR_SHARED_OBJECT_INIT_FAILED = 303, + ERROR_OPERATING_SYSTEM = 304, + ERROR_INVALID_HANDLE = 400, + ERROR_ILLEGAL_STATE = 401, + ERROR_NOT_FOUND = 500, + ERROR_NOT_READY = 600, + ERROR_ILLEGAL_ADDRESS = 700, + ERROR_LAUNCH_OUT_OF_RESOURCES = 701, + ERROR_LAUNCH_TIMEOUT = 702, + ERROR_LAUNCH_INCOMPATIBLE_TEXTURING = 703, + ERROR_PEER_ACCESS_ALREADY_ENABLED = 704, + ERROR_PEER_ACCESS_NOT_ENABLED = 705, + ERROR_PRIMARY_CONTEXT_ACTIVE = 708, + ERROR_CONTEXT_IS_DESTROYED = 709, + ERROR_ASSERT = 710, + ERROR_TOO_MANY_PEERS = 711, + ERROR_HOST_MEMORY_ALREADY_REGISTERED = 712, + ERROR_HOST_MEMORY_NOT_REGISTERED = 713, + ERROR_HARDWARE_STACK_ERROR = 714, + ERROR_ILLEGAL_INSTRUCTION = 715, + ERROR_MISALIGNED_ADDRESS = 716, + ERROR_INVALID_ADDRESS_SPACE = 717, + ERROR_INVALID_PC = 718, + ERROR_LAUNCH_FAILED = 719, + ERROR_COOPERATIVE_LAUNCH_TOO_LARGE = 720, + ERROR_NOT_PERMITTED = 800, + ERROR_NOT_SUPPORTED = 801, + ERROR_SYSTEM_NOT_READY = 802, + ERROR_SYSTEM_DRIVER_MISMATCH = 803, + ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE = 804, + ERROR_MPS_CONNECTION_FAILED = 805, + ERROR_MPS_RPC_FAILURE = 806, + ERROR_MPS_SERVER_NOT_READY = 807, + ERROR_MPS_MAX_CLIENTS_REACHED = 808, + ERROR_MPS_MAX_CONNECTIONS_REACHED = 809, + ERROR_STREAM_CAPTURE_UNSUPPORTED = 900, + ERROR_STREAM_CAPTURE_INVALIDATED = 901, + ERROR_STREAM_CAPTURE_MERGE = 902, + ERROR_STREAM_CAPTURE_UNMATCHED = 903, + ERROR_STREAM_CAPTURE_UNJOINED = 904, + ERROR_STREAM_CAPTURE_ISOLATION = 905, + ERROR_STREAM_CAPTURE_IMPLICIT = 906, + ERROR_CAPTURED_EVENT = 907, + ERROR_STREAM_CAPTURE_WRONG_THREAD = 908, + ERROR_TIMEOUT = 909, + ERROR_GRAPH_EXEC_UPDATE_FAILURE = 910, + ERROR_EXTERNAL_DEVICE = 911, + ERROR_UNKNOWN = 999 +}; + +/* + * Execution Affinity Types. Useful for MPS to detect number of SMs + * associated to a CUDA context v3. + */ +enum cuExecAffinityParamType { + CU_EXEC_AFFINITY_TYPE_SM_COUNT = 0, + CU_EXEC_AFFINITY_TYPE_MAX +}; + +/* + * Number of SMs associated to a context. + */ +struct cuExecAffinitySMCount { + unsigned int val; + /* The number of SMs the context is limited to use. */ +} cuExecAffinitySMCount; + +/** + * Execution Affinity Parameters + */ +struct cuExecAffinityParams { + enum cuExecAffinityParamType type; + union { + struct cuExecAffinitySMCount smCount; + } param; +}; + +/* GPU device properties to query */ +enum cuDevAttr { + CU_DEV_ATTR_MULTIPROCESSOR_COUNT = 16, + /* Number of multiprocessors on device */ + CU_DEV_ATTR_CAN_USE_HOST_POINTER_FOR_REGISTERED_MEM = 91, + /* Device can access host registered memory at the same virtual address as the CPU */ + CU_DEV_ATTR_GPU_DIRECT_RDMA_SUPPORTED = 116, + /* Device supports GPUDirect RDMA APIs, like nvidia_p2p_get_pages (see https://docs.nvidia.com/cuda/gpudirect-rdma for more information) */ + CU_DEV_ATTR_GPU_DIRECT_RDMA_FLUSH_WRITES_OPTIONS = 117, + /* The returned attribute shall be interpreted as a bitmask, where the individual bits are described by the cuFlushGDRWriteOpts enum */ + CU_DEV_ATTR_GPU_DIRECT_RDMA_WRITES_ORDERING = 118, + /* GPUDirect RDMA writes to the device do not need to be flushed for consumers within the scope indicated by the returned attribute. See cuGDRWriteOrdering for the numerical values returned here. */ +}; + +/* Memory pointer info */ +enum cuPtrAttr { + CU_PTR_ATTR_CONTEXT = 1, + /* The CUcontext on which a pointer was allocated or registered */ + CU_PTR_ATTR_MEMORY_TYPE = 2, + /* The CUmemorytype describing the physical location of a pointer */ + CU_PTR_ATTR_DEVICE_POINTER = 3, + /* The address at which a pointer's memory may be accessed on the device */ + CU_PTR_ATTR_HOST_POINTER = 4, + /* The address at which a pointer's memory may be accessed on the host */ + CU_PTR_ATTR_P2P_TOKENS = 5, + /* A pair of tokens for use with the nv-p2p.h Linux kernel interface */ + CU_PTR_ATTR_SYNC_MEMOPS = 6, + /* Synchronize every synchronous memory operation initiated on this region */ + CU_PTR_ATTR_BUFFER_ID = 7, + /* A process-wide unique ID for an allocated memory region*/ + CU_PTR_ATTR_IS_MANAGED = 8, + /* Indicates if the pointer points to managed memory */ + CU_PTR_ATTR_DEVICE_ORDINAL = 9, + /* A device ordinal of a device on which a pointer was allocated or registered */ + CU_PTR_ATTR_IS_LEGACY_CUDA_IPC_CAPABLE = 10, + /* 1 if this pointer maps to an allocation that is suitable for cudaIpcGetMemHandle, 0 otherwise **/ + CU_PTR_ATTR_RANGE_START_ADDR = 11, + /* Starting address for this requested pointer */ + CU_PTR_ATTR_RANGE_SIZE = 12, + /* Size of the address range for this requested pointer */ + CU_PTR_ATTR_MAPPED = 13, + /* 1 if this pointer is in a valid address range that is mapped to a backing allocation, 0 otherwise **/ + CU_PTR_ATTR_ALLOWED_HANDLE_TYPES = 14, + /* Bitmask of allowed CUmemAllocationHandleType for this allocation **/ + CU_PTR_ATTR_IS_GPU_DIRECT_RDMA_CAPABLE = 15, + /* 1 if the memory this pointer is referencing can be used with the GPUDirect RDMA API **/ + CU_PTR_ATTR_ACCESS_FLAGS = 16, + /* Returns the access flags the device associated with the current context has on the corresponding memory referenced by the pointer given */ + CU_PTR_ATTR_MEMPOOL_HANDLE = 17 + /* Returns the mempool handle for the allocation if it was allocated from a mempool. Otherwise returns NULL. **/ +}; + +/* GPUDirect RDMA flush option types */ +#define CU_FLUSH_GDR_WRITES_OPTION_HOST RTE_BIT32(0) +/* cuFlushGPUDirectRDMAWrites() and its CUDA Runtime API counterpart are supported on the device. */ +#define CU_FLUSH_GDR_WRITES_OPTION_MEMOPS RTE_BIT32(1) +/* The CU_STREAM_WAIT_VALUE_FLUSH flag and the CU_STREAM_MEM_OP_FLUSH_REMOTE_WRITES MemOp are supported on the device. */ + +/* Type of platform native ordering for GPUDirect RDMA writes */ +#define CU_GDR_WRITES_ORDERING_NONE 0 +/* The device does not natively support ordering of remote writes. cuFlushGPUDirectRDMAWrites() can be leveraged if supported. */ +#define CU_GDR_WRITES_ORDERING_OWNER 100 +/* Natively, the device can consistently consume remote writes, although other CUDA devices may not. */ +#define CU_GDR_WRITES_ORDERING_ALL_DEVICES 200 +/* Any CUDA device in the system can consistently consume remote writes to this device. */ + +/* Device scope for cuFlushGPUDirectRDMAWrites */ +enum cuFlushGDRScope { + CU_FLUSH_GDR_WRITES_TO_OWNER = 100, + /* Blocks until remote writes are visible to the CUDA device context owning the data. */ + CU_FLUSH_GDR_WRITES_TO_ALL_DEVICES = 200 + /* Blocks until remote writes are visible to all CUDA device contexts. */ +}; + +/* Targets for cuFlushGPUDirectRDMAWrites */ +enum cuFlushGDRTarget { + /* Target is currently active CUDA device context. */ + CU_FLUSH_GDR_WRITES_TARGET_CURRENT_CTX = 0 +}; + +#define CU_MHOST_REGISTER_PORTABLE 0x01 +#define CU_MHOST_REGISTER_DEVICEMAP 0x02 +#define CU_MHOST_REGISTER_IOMEMORY 0x04 +#define CU_MHOST_REGISTER_READ_ONLY 0x08 + +extern enum cuError (*sym_cuInit)(unsigned int flags); +extern enum cuError (*sym_cuDriverGetVersion)(int *driverVersion); +extern enum cuError (*sym_cuGetProcAddress)(const char *symbol, void **pfn, int cudaVersion, uint64_t flags); + +/* Dynamically loaded symbols with cuGetProcAddress with proper API version */ + +#ifdef __cplusplus +extern "C" { +#endif + +/* Generic */ +#define PFN_cuGetErrorString PFN_cuGetErrorString_v6000 +#define PFN_cuGetErrorName PFN_cuGetErrorName_v6000 +#define PFN_cuPointerSetAttribute PFN_cuPointerSetAttribute_v6000 +#define PFN_cuDeviceGetAttribute PFN_cuDeviceGetAttribute_v2000 + +/* cuDevice */ +#define PFN_cuDeviceGetByPCIBusId PFN_cuDeviceGetByPCIBusId_v4010 +#define PFN_cuDevicePrimaryCtxRetain PFN_cuDevicePrimaryCtxRetain_v7000 +#define PFN_cuDevicePrimaryCtxRelease PFN_cuDevicePrimaryCtxRelease_v11000 +#define PFN_cuDeviceTotalMem PFN_cuDeviceTotalMem_v3020 +#define PFN_cuDeviceGetName PFN_cuDeviceGetName_v2000 + +/* cuCtx */ +#define PFN_cuCtxGetApiVersion PFN_cuCtxGetApiVersion_v3020 +#define PFN_cuCtxSetCurrent PFN_cuCtxSetCurrent_v4000 +#define PFN_cuCtxGetCurrent PFN_cuCtxGetCurrent_v4000 +#define PFN_cuCtxGetDevice PFN_cuCtxGetDevice_v2000 +#define PFN_cuCtxGetExecAffinity PFN_cuCtxGetExecAffinity_v11040 + +/* cuMem */ +#define PFN_cuMemAlloc PFN_cuMemAlloc_v3020 +#define PFN_cuMemFree PFN_cuMemFree_v3020 +#define PFN_cuMemHostRegister PFN_cuMemHostRegister_v6050 +#define PFN_cuMemHostUnregister PFN_cuMemHostUnregister_v4000 +#define PFN_cuMemHostGetDevicePointer PFN_cuMemHostGetDevicePointer_v3020 +#define PFN_cuFlushGPUDirectRDMAWrites PFN_cuFlushGPUDirectRDMAWrites_v11030 + +/* Generic */ +typedef enum cuError (*PFN_cuGetErrorString_v6000)(enum cuError error, const char **pStr); +typedef enum cuError (*PFN_cuGetErrorName_v6000)(enum cuError error, const char **pStr); +typedef enum cuError (*PFN_cuPointerSetAttribute_v6000)(const void *value, enum cuPtrAttr attribute, cuDevPtr_v2 ptr); +typedef enum cuError (*PFN_cuDeviceGetAttribute_v2000)(int *pi, enum cuDevAttr attrib, cuDev_v1 dev); + +/* Device */ +typedef enum cuError (*PFN_cuDeviceGetByPCIBusId_v4010)(cuDev_v1 *dev, const char *pciBusId); +typedef enum cuError (*PFN_cuDevicePrimaryCtxRetain_v7000)(CUcontext *pctx, cuDev_v1 dev); +typedef enum cuError (*PFN_cuDevicePrimaryCtxRelease_v11000)(cuDev_v1 dev); +typedef enum cuError (*PFN_cuDeviceTotalMem_v3020)(size_t *bytes, cuDev_v1 dev); +typedef enum cuError (*PFN_cuDeviceGetName_v2000)(char *name, int len, cuDev_v1 dev); + +/* Context */ +typedef enum cuError (*PFN_cuCtxGetApiVersion_v3020)(CUcontext ctx, unsigned int *version); +typedef enum cuError (*PFN_cuCtxSetCurrent_v4000)(CUcontext ctx); +typedef enum cuError (*PFN_cuCtxGetCurrent_v4000)(CUcontext *pctx); +typedef enum cuError (*PFN_cuCtxGetDevice_v2000)(cuDev_v1 *device); +typedef enum cuError (*PFN_cuCtxGetExecAffinity_v11040)(struct cuExecAffinityParams *pExecAffinity, enum cuExecAffinityParamType type); + +/* Memory */ +typedef enum cuError (*PFN_cuMemAlloc_v3020)(cuDevPtr_v2 *dptr, size_t bytesize); +typedef enum cuError (*PFN_cuMemFree_v3020)(cuDevPtr_v2 dptr); +typedef enum cuError (*PFN_cuMemHostRegister_v6050)(void *p, size_t bytesize, unsigned int Flags); +typedef enum cuError (*PFN_cuMemHostUnregister_v4000)(void *p); +typedef enum cuError (*PFN_cuMemHostGetDevicePointer_v3020)(cuDevPtr_v2 *pdptr, void *p, unsigned int Flags); +typedef enum cuError (*PFN_cuFlushGPUDirectRDMAWrites_v11030)(enum cuFlushGDRTarget target, enum cuFlushGDRScope scope); + +#ifdef __cplusplus +} +#endif /* __cplusplus */ + +#endif diff --git a/drivers/gpu/cuda/meson.build b/drivers/gpu/cuda/meson.build new file mode 100644 index 0000000000..f2a3095d8d --- /dev/null +++ b/drivers/gpu/cuda/meson.build @@ -0,0 +1,10 @@ +# SPDX-License-Identifier: BSD-3-Clause +# Copyright (c) 2021 NVIDIA Corporation & Affiliates + +if not is_linux + build = false + reason = 'only supported on Linux' +endif + +deps += ['gpudev','pci','bus_pci'] +sources = files('cuda.c') diff --git a/drivers/gpu/cuda/version.map b/drivers/gpu/cuda/version.map new file mode 100644 index 0000000000..4a76d1d52d --- /dev/null +++ b/drivers/gpu/cuda/version.map @@ -0,0 +1,3 @@ +DPDK_21 { + local: *; +}; diff --git a/drivers/gpu/meson.build b/drivers/gpu/meson.build index e51ad3381b..601bedcd61 100644 --- a/drivers/gpu/meson.build +++ b/drivers/gpu/meson.build @@ -1,4 +1,4 @@ # SPDX-License-Identifier: BSD-3-Clause # Copyright (c) 2021 NVIDIA Corporation & Affiliates -drivers = [] +drivers = [ 'cuda' ] -- 2.17.1

