Since dmadev was introduced in 21.11, this patch integrates dmadev into
the vhost asynchronous data path, which avoids the overhead of the vhost
DMA abstraction layer and simplifies application logic.

Signed-off-by: Jiayu Hu <jiayu...@intel.com>
Signed-off-by: Sunil Pai G <sunil.pa...@intel.com>
---
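
For reference, below is a minimal sketch of the dmadev setup an
application is expected to perform with this patch (it mirrors open_dma()
in examples/vhost/main.c; the helper name, device id handling and ring
size are illustrative and error handling is reduced):

    #include <rte_common.h>
    #include <rte_dmadev.h>

    static int
    setup_dma_device(int16_t dev_id)
    {
            struct rte_dma_info info;
            struct rte_dma_conf dev_config = { .nb_vchans = 1 };
            struct rte_dma_vchan_conf qconf = {
                    .direction = RTE_DMA_DIR_MEM_TO_MEM,
                    .nb_desc = 4096,
            };

            if (rte_dma_configure(dev_id, &dev_config) != 0)
                    return -1;

            /* Clamp the vchan ring size to what the device supports. */
            rte_dma_info_get(dev_id, &info);
            qconf.nb_desc = RTE_MIN(qconf.nb_desc, info.max_desc);

            if (rte_dma_vchan_setup(dev_id, 0, &qconf) != 0)
                    return -1;

            return rte_dma_start(dev_id);
    }
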
 doc/guides/prog_guide/vhost_lib.rst |  95 ++++-----
 examples/vhost/Makefile             |   2 +-
 examples/vhost/ioat.c               | 218 --------------------
 examples/vhost/ioat.h               |  63 ------
 examples/vhost/main.c               | 255 ++++++++++++++++++-----
 examples/vhost/main.h               |  11 +
 examples/vhost/meson.build          |   6 +-
 lib/vhost/meson.build               |   2 +-
 lib/vhost/rte_vhost.h               |   2 +
 lib/vhost/rte_vhost_async.h         | 132 +++++-------
 lib/vhost/version.map               |   3 +
 lib/vhost/vhost.c                   | 148 ++++++++++----
 lib/vhost/vhost.h                   |  64 +++++-
 lib/vhost/vhost_user.c              |   2 +
 lib/vhost/virtio_net.c              | 305 +++++++++++++++++++++++-----
 15 files changed, 744 insertions(+), 564 deletions(-)
 delete mode 100644 examples/vhost/ioat.c
 delete mode 100644 examples/vhost/ioat.h

diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index 76f5d303c9..acc10ea851 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -106,12 +106,11 @@ The following is an overview of some key Vhost API functions:
   - ``RTE_VHOST_USER_ASYNC_COPY``
 
     Asynchronous data path will be enabled when this flag is set. Async data
-    path allows applications to register async copy devices (typically
-    hardware DMA channels) to the vhost queues. Vhost leverages the copy
-    device registered to free CPU from memory copy operations. A set of
-    async data path APIs are defined for DPDK applications to make use of
-    the async capability. Only packets enqueued/dequeued by async APIs are
-    processed through the async data path.
+    path allows applications to register DMA channels to the vhost queues.
+    Vhost leverages the registered DMA devices to free the CPU from memory copy
+    operations. A set of async data path APIs are defined for DPDK applications
+    to make use of the async capability. Only packets enqueued/dequeued by
+    async APIs are processed through the async data path.
 
     Currently this feature is only implemented on split ring enqueue data
     path.
@@ -218,52 +217,30 @@ The following is an overview of some key Vhost API functions:
 
   Enable or disable zero copy feature of the vhost crypto backend.
 
-* ``rte_vhost_async_channel_register(vid, queue_id, config, ops)``
+* ``rte_vhost_async_dma_configure(dmas_id, count, poll_factor)``
 
-  Register an async copy device channel for a vhost queue after vring
-  is enabled. Following device ``config`` must be specified together
-  with the registration:
+  Tell vhost which DMA devices the application is going to use. This function
+  needs to be called before registering the async data path for any vring.
 
-  * ``features``
+* ``rte_vhost_async_channel_register(vid, queue_id)``
 
-    This field is used to specify async copy device features.
+  Register async DMA acceleration for a vhost queue after vring is enabled.
 
-    ``RTE_VHOST_ASYNC_INORDER`` represents the async copy device can
-    guarantee the order of copy completion is the same as the order
-    of copy submission.
+* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id)``
 
-    Currently, only ``RTE_VHOST_ASYNC_INORDER`` capable device is
-    supported by vhost.
-
-  Applications must provide following ``ops`` callbacks for vhost lib to
-  work with the async copy devices:
-
-  * ``transfer_data(vid, queue_id, descs, opaque_data, count)``
-
-    vhost invokes this function to submit copy data to the async devices.
-    For non-async_inorder capable devices, ``opaque_data`` could be used
-    for identifying the completed packets.
-
-  * ``check_completed_copies(vid, queue_id, opaque_data, max_packets)``
-
-    vhost invokes this function to get the copy data completed by async
-    devices.
-
-* ``rte_vhost_async_channel_register_thread_unsafe(vid, queue_id, config, ops)``
-
-  Register an async copy device channel for a vhost queue without
-  performing any locking.
+  Register async DMA acceleration for a vhost queue without performing
+  any locking.
 
   This function is only safe to call in vhost callback functions
   (i.e., struct rte_vhost_device_ops).
 
 * ``rte_vhost_async_channel_unregister(vid, queue_id)``
 
-  Unregister the async copy device channel from a vhost queue.
+  Unregister the async DMA acceleration from a vhost queue.
   Unregistration will fail, if the vhost queue has in-flight
   packets that are not completed.
 
-  Unregister async copy devices in vring_state_changed() may
+  Unregistering async DMA acceleration in vring_state_changed() may
   fail, as this API tries to acquire the spinlock of vhost
   queue. The recommended way is to unregister async copy
   devices for all vhost queues in destroy_device(), when a
@@ -271,24 +248,19 @@ The following is an overview of some key Vhost API functions:
 
 * ``rte_vhost_async_channel_unregister_thread_unsafe(vid, queue_id)``
 
-  Unregister the async copy device channel for a vhost queue without
-  performing any locking.
+  Unregister async DMA acceleration for a vhost queue without performing
+  any locking.
 
   This function is only safe to call in vhost callback functions
   (i.e., struct rte_vhost_device_ops).
 
-* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, comp_pkts, comp_count)``
+* ``rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, count, dma_id, vchan_id)``
 
   Submit an enqueue request to transmit ``count`` packets from host to guest
-  by async data path. Successfully enqueued packets can be transfer completed
-  or being occupied by DMA engines; transfer completed packets are returned in
-  ``comp_pkts``, but others are not guaranteed to finish, when this API
-  call returns.
+  by async data path. Applications must not free the packets submitted for
+  enqueue until the packets are completed.
 
-  Applications must not free the packets submitted for enqueue until the
-  packets are completed.
-
-* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count)``
+* ``rte_vhost_poll_enqueue_completed(vid, queue_id, pkts, count, dma_id, vchan_id)``
 
   Poll enqueue completion status from async data path. Completed packets
   are returned to applications through ``pkts``.
@@ -298,7 +270,7 @@ The following is an overview of some key Vhost API functions:
   This function returns the amount of in-flight packets for the vhost
   queue using async acceleration.
 
-* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count)``
+* ``rte_vhost_clear_queue_thread_unsafe(vid, queue_id, **pkts, count, dma_id, vchan_id)``
 
   Clear inflight packets which are submitted to DMA engine in vhost async data
   path. Completed packets are returned to applications through ``pkts``.
@@ -442,3 +414,26 @@ Finally, a set of device ops is defined for device specific operations:
 * ``get_notify_area``
 
   Called to get the notify area info of the queue.
+
+Vhost asynchronous data path
+----------------------------
+
+Vhost asynchronous data path leverages DMA devices to offload memory
+copies from the CPU, and the copies are performed asynchronously. It
+enables applications, such as OVS, to save CPU cycles and hide memory
+copy overhead, thus achieving higher throughput.
+
+Vhost doesn't manage DMA devices; applications, such as OVS, need to
+manage and configure them. Applications need to tell vhost which DMA
+devices to use in every data path function call. This design gives
+applications the flexibility to dynamically use DMA channels in
+different function modules, not limited to vhost.
+
+In addition, vhost supports M:N mapping between vrings and DMA virtual
+channels. Specifically, one vring can use multiple different DMA channels
+and one DMA channel can be shared by multiple vrings at the same time.
+The reason for enabling one vring to use multiple DMA channels is that
+it's possible for more than one data plane thread to enqueue packets to
+the same vring, each with its own DMA virtual channel. Besides, the number
+of DMA devices is limited. For the purpose of scaling, it's necessary to
+support sharing DMA channels among vrings.
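+
+A minimal sketch of the enqueue call flow from the application's point
+of view is shown below (the DMA device is assumed to be configured and
+started via dmadev beforehand; values are illustrative and error
+handling is omitted)::
+
+    int16_t dmas_id[] = { dma_id };
+
+    /* Before any vring enables the async data path. */
+    rte_vhost_async_dma_configure(dmas_id, 1, 1);
+
+    /* After the vring is enabled. */
+    rte_vhost_async_channel_register(vid, VIRTIO_RXQ);
+
+    /* Data path: submit copies, then poll for completions. */
+    n_enq = rte_vhost_submit_enqueue_burst(vid, VIRTIO_RXQ, pkts, count,
+                                           dma_id, 0);
+    n_cpl = rte_vhost_poll_enqueue_completed(vid, VIRTIO_RXQ, cpl_pkts,
+                                             MAX_PKT_BURST, dma_id, 0);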
diff --git a/examples/vhost/Makefile b/examples/vhost/Makefile
index 587ea2ab47..975a5dfe40 100644
--- a/examples/vhost/Makefile
+++ b/examples/vhost/Makefile
@@ -5,7 +5,7 @@
 APP = vhost-switch
 
 # all source are stored in SRCS-y
-SRCS-y := main.c virtio_net.c ioat.c
+SRCS-y := main.c virtio_net.c
 
 PKGCONF ?= pkg-config
 
diff --git a/examples/vhost/ioat.c b/examples/vhost/ioat.c
deleted file mode 100644
index 9aeeb12fd9..0000000000
--- a/examples/vhost/ioat.c
+++ /dev/null
@@ -1,218 +0,0 @@
-/* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2020 Intel Corporation
- */
-
-#include <sys/uio.h>
-#ifdef RTE_RAW_IOAT
-#include <rte_rawdev.h>
-#include <rte_ioat_rawdev.h>
-
-#include "ioat.h"
-#include "main.h"
-
-struct dma_for_vhost dma_bind[MAX_VHOST_DEVICE];
-
-struct packet_tracker {
-       unsigned short size_track[MAX_ENQUEUED_SIZE];
-       unsigned short next_read;
-       unsigned short next_write;
-       unsigned short last_remain;
-       unsigned short ioat_space;
-};
-
-struct packet_tracker cb_tracker[MAX_VHOST_DEVICE];
-
-int
-open_ioat(const char *value)
-{
-       struct dma_for_vhost *dma_info = dma_bind;
-       char *input = strndup(value, strlen(value) + 1);
-       char *addrs = input;
-       char *ptrs[2];
-       char *start, *end, *substr;
-       int64_t vid, vring_id;
-       struct rte_ioat_rawdev_config config;
-       struct rte_rawdev_info info = { .dev_private = &config };
-       char name[32];
-       int dev_id;
-       int ret = 0;
-       uint16_t i = 0;
-       char *dma_arg[MAX_VHOST_DEVICE];
-       int args_nr;
-
-       while (isblank(*addrs))
-               addrs++;
-       if (*addrs == '\0') {
-               ret = -1;
-               goto out;
-       }
-
-       /* process DMA devices within bracket. */
-       addrs++;
-       substr = strtok(addrs, ";]");
-       if (!substr) {
-               ret = -1;
-               goto out;
-       }
-       args_nr = rte_strsplit(substr, strlen(substr),
-                       dma_arg, MAX_VHOST_DEVICE, ',');
-       if (args_nr <= 0) {
-               ret = -1;
-               goto out;
-       }
-       while (i < args_nr) {
-               char *arg_temp = dma_arg[i];
-               uint8_t sub_nr;
-               sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@');
-               if (sub_nr != 2) {
-                       ret = -1;
-                       goto out;
-               }
-
-               start = strstr(ptrs[0], "txd");
-               if (start == NULL) {
-                       ret = -1;
-                       goto out;
-               }
-
-               start += 3;
-               vid = strtol(start, &end, 0);
-               if (end == start) {
-                       ret = -1;
-                       goto out;
-               }
-
-               vring_id = 0 + VIRTIO_RXQ;
-               if (rte_pci_addr_parse(ptrs[1],
-                               &(dma_info + vid)->dmas[vring_id].addr) < 0) {
-                       ret = -1;
-                       goto out;
-               }
-
-               rte_pci_device_name(&(dma_info + vid)->dmas[vring_id].addr,
-                               name, sizeof(name));
-               dev_id = rte_rawdev_get_dev_id(name);
-               if (dev_id == (uint16_t)(-ENODEV) ||
-               dev_id == (uint16_t)(-EINVAL)) {
-                       ret = -1;
-                       goto out;
-               }
-
-               if (rte_rawdev_info_get(dev_id, &info, sizeof(config)) < 0 ||
-               strstr(info.driver_name, "ioat") == NULL) {
-                       ret = -1;
-                       goto out;
-               }
-
-               (dma_info + vid)->dmas[vring_id].dev_id = dev_id;
-               (dma_info + vid)->dmas[vring_id].is_valid = true;
-               config.ring_size = IOAT_RING_SIZE;
-               config.hdls_disable = true;
-               if (rte_rawdev_configure(dev_id, &info, sizeof(config)) < 0) {
-                       ret = -1;
-                       goto out;
-               }
-               rte_rawdev_start(dev_id);
-               cb_tracker[dev_id].ioat_space = IOAT_RING_SIZE - 1;
-               dma_info->nr++;
-               i++;
-       }
-out:
-       free(input);
-       return ret;
-}
-
-int32_t
-ioat_transfer_data_cb(int vid, uint16_t queue_id,
-               struct rte_vhost_iov_iter *iov_iter,
-               struct rte_vhost_async_status *opaque_data, uint16_t count)
-{
-       uint32_t i_iter;
-       uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2 + VIRTIO_RXQ].dev_id;
-       struct rte_vhost_iov_iter *iter = NULL;
-       unsigned long i_seg;
-       unsigned short mask = MAX_ENQUEUED_SIZE - 1;
-       unsigned short write = cb_tracker[dev_id].next_write;
-
-       if (!opaque_data) {
-               for (i_iter = 0; i_iter < count; i_iter++) {
-                       iter = iov_iter + i_iter;
-                       i_seg = 0;
-                       if (cb_tracker[dev_id].ioat_space < iter->nr_segs)
-                               break;
-                       while (i_seg < iter->nr_segs) {
-                               rte_ioat_enqueue_copy(dev_id,
-                                       (uintptr_t)(iter->iov[i_seg].src_addr),
-                                       (uintptr_t)(iter->iov[i_seg].dst_addr),
-                                       iter->iov[i_seg].len,
-                                       0,
-                                       0);
-                               i_seg++;
-                       }
-                       write &= mask;
-                       cb_tracker[dev_id].size_track[write] = iter->nr_segs;
-                       cb_tracker[dev_id].ioat_space -= iter->nr_segs;
-                       write++;
-               }
-       } else {
-               /* Opaque data is not supported */
-               return -1;
-       }
-       /* ring the doorbell */
-       rte_ioat_perform_ops(dev_id);
-       cb_tracker[dev_id].next_write = write;
-       return i_iter;
-}
-
-int32_t
-ioat_check_completed_copies_cb(int vid, uint16_t queue_id,
-               struct rte_vhost_async_status *opaque_data,
-               uint16_t max_packets)
-{
-       if (!opaque_data) {
-               uintptr_t dump[255];
-               int n_seg;
-               unsigned short read, write;
-               unsigned short nb_packet = 0;
-               unsigned short mask = MAX_ENQUEUED_SIZE - 1;
-               unsigned short i;
-
-               uint16_t dev_id = dma_bind[vid].dmas[queue_id * 2
-                               + VIRTIO_RXQ].dev_id;
-               n_seg = rte_ioat_completed_ops(dev_id, 255, NULL, NULL, dump, dump);
-               if (n_seg < 0) {
-                       RTE_LOG(ERR,
-                               VHOST_DATA,
-                               "fail to poll completed buf on IOAT device %u",
-                               dev_id);
-                       return 0;
-               }
-               if (n_seg == 0)
-                       return 0;
-
-               cb_tracker[dev_id].ioat_space += n_seg;
-               n_seg += cb_tracker[dev_id].last_remain;
-
-               read = cb_tracker[dev_id].next_read;
-               write = cb_tracker[dev_id].next_write;
-               for (i = 0; i < max_packets; i++) {
-                       read &= mask;
-                       if (read == write)
-                               break;
-                       if (n_seg >= cb_tracker[dev_id].size_track[read]) {
-                               n_seg -= cb_tracker[dev_id].size_track[read];
-                               read++;
-                               nb_packet++;
-                       } else {
-                               break;
-                       }
-               }
-               cb_tracker[dev_id].next_read = read;
-               cb_tracker[dev_id].last_remain = n_seg;
-               return nb_packet;
-       }
-       /* Opaque data is not supported */
-       return -1;
-}
-
-#endif /* RTE_RAW_IOAT */
diff --git a/examples/vhost/ioat.h b/examples/vhost/ioat.h
deleted file mode 100644
index d9bf717e8d..0000000000
--- a/examples/vhost/ioat.h
+++ /dev/null
@@ -1,63 +0,0 @@
-/* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2020 Intel Corporation
- */
-
-#ifndef _IOAT_H_
-#define _IOAT_H_
-
-#include <rte_vhost.h>
-#include <rte_pci.h>
-#include <rte_vhost_async.h>
-
-#define MAX_VHOST_DEVICE 1024
-#define IOAT_RING_SIZE 4096
-#define MAX_ENQUEUED_SIZE 4096
-
-struct dma_info {
-       struct rte_pci_addr addr;
-       uint16_t dev_id;
-       bool is_valid;
-};
-
-struct dma_for_vhost {
-       struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2];
-       uint16_t nr;
-};
-
-#ifdef RTE_RAW_IOAT
-int open_ioat(const char *value);
-
-int32_t
-ioat_transfer_data_cb(int vid, uint16_t queue_id,
-               struct rte_vhost_iov_iter *iov_iter,
-               struct rte_vhost_async_status *opaque_data, uint16_t count);
-
-int32_t
-ioat_check_completed_copies_cb(int vid, uint16_t queue_id,
-               struct rte_vhost_async_status *opaque_data,
-               uint16_t max_packets);
-#else
-static int open_ioat(const char *value __rte_unused)
-{
-       return -1;
-}
-
-static int32_t
-ioat_transfer_data_cb(int vid __rte_unused, uint16_t queue_id __rte_unused,
-               struct rte_vhost_iov_iter *iov_iter __rte_unused,
-               struct rte_vhost_async_status *opaque_data __rte_unused,
-               uint16_t count __rte_unused)
-{
-       return -1;
-}
-
-static int32_t
-ioat_check_completed_copies_cb(int vid __rte_unused,
-               uint16_t queue_id __rte_unused,
-               struct rte_vhost_async_status *opaque_data __rte_unused,
-               uint16_t max_packets __rte_unused)
-{
-       return -1;
-}
-#endif
-#endif /* _IOAT_H_ */
diff --git a/examples/vhost/main.c b/examples/vhost/main.c
index 590a77c723..b2c272059e 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -24,8 +24,9 @@
 #include <rte_ip.h>
 #include <rte_tcp.h>
 #include <rte_pause.h>
+#include <rte_dmadev.h>
+#include <rte_vhost_async.h>
 
-#include "ioat.h"
 #include "main.h"
 
 #ifndef MAX_QUEUES
@@ -56,6 +57,13 @@
 #define RTE_TEST_TX_DESC_DEFAULT 512
 
 #define INVALID_PORT_ID 0xFF
+#define INVALID_DMA_ID -1
+
+#define DMA_RING_SIZE 4096
+
+struct dma_for_vhost dma_bind[RTE_MAX_VHOST_DEVICE];
+int16_t dmas_id[RTE_DMADEV_DEFAULT_MAX];
+static int dma_count;
 
 /* mask of enabled ports */
 static uint32_t enabled_port_mask = 0;
@@ -94,10 +102,6 @@ static int client_mode;
 
 static int builtin_net_driver;
 
-static int async_vhost_driver;
-
-static char *dma_type;
-
 /* Specify timeout (in useconds) between retries on RX. */
 static uint32_t burst_rx_delay_time = BURST_RX_WAIT_US;
 /* Specify the number of retries on RX. */
@@ -191,18 +195,150 @@ struct mbuf_table lcore_tx_queue[RTE_MAX_LCORE];
  * Every data core maintains a TX buffer for every vhost device,
  * which is used for batch pkts enqueue for higher performance.
  */
-struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * MAX_VHOST_DEVICE];
+struct vhost_bufftable *vhost_txbuff[RTE_MAX_LCORE * RTE_MAX_VHOST_DEVICE];
 
 #define MBUF_TABLE_DRAIN_TSC   ((rte_get_tsc_hz() + US_PER_S - 1) \
                                 / US_PER_S * BURST_TX_DRAIN_US)
 
+static inline bool
+is_dma_configured(int16_t dev_id)
+{
+       int i;
+
+       for (i = 0; i < dma_count; i++)
+               if (dmas_id[i] == dev_id)
+                       return true;
+       return false;
+}
+
 static inline int
 open_dma(const char *value)
 {
-       if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0)
-               return open_ioat(value);
+       struct dma_for_vhost *dma_info = dma_bind;
+       char *input = strndup(value, strlen(value) + 1);
+       char *addrs = input;
+       char *ptrs[2];
+       char *start, *end, *substr;
+       int64_t vid;
+
+       struct rte_dma_info info;
+       struct rte_dma_conf dev_config = { .nb_vchans = 1 };
+       struct rte_dma_vchan_conf qconf = {
+               .direction = RTE_DMA_DIR_MEM_TO_MEM,
+               .nb_desc = DMA_RING_SIZE
+       };
+
+       int dev_id;
+       int ret = 0;
+       uint16_t i = 0;
+       char *dma_arg[RTE_MAX_VHOST_DEVICE];
+       int args_nr;
+
+       while (isblank(*addrs))
+               addrs++;
+       if (*addrs == '\0') {
+               ret = -1;
+               goto out;
+       }
+
+       /* process DMA devices within bracket. */
+       addrs++;
+       substr = strtok(addrs, ";]");
+       if (!substr) {
+               ret = -1;
+               goto out;
+       }
+
+       args_nr = rte_strsplit(substr, strlen(substr), dma_arg, RTE_MAX_VHOST_DEVICE, ',');
+       if (args_nr <= 0) {
+               ret = -1;
+               goto out;
+       }
+
+       while (i < args_nr) {
+               char *arg_temp = dma_arg[i];
+               uint8_t sub_nr;
+
+               sub_nr = rte_strsplit(arg_temp, strlen(arg_temp), ptrs, 2, '@');
+               if (sub_nr != 2) {
+                       ret = -1;
+                       goto out;
+               }
+
+               start = strstr(ptrs[0], "txd");
+               if (start == NULL) {
+                       ret = -1;
+                       goto out;
+               }
+
+               start += 3;
+               vid = strtol(start, &end, 0);
+               if (end == start) {
+                       ret = -1;
+                       goto out;
+               }
+
+               dev_id = rte_dma_get_dev_id_by_name(ptrs[1]);
+               if (dev_id < 0) {
+                       RTE_LOG(ERR, VHOST_CONFIG, "Fail to find DMA %s.\n", ptrs[1]);
+                       ret = -1;
+                       goto out;
+               }
+
+               /* DMA device is already configured, so skip */
+               if (is_dma_configured(dev_id))
+                       goto done;
+
+               if (rte_dma_info_get(dev_id, &info) != 0) {
+                       RTE_LOG(ERR, VHOST_CONFIG, "Error with rte_dma_info_get()\n");
+                       ret = -1;
+                       goto out;
+               }
+
+               if (info.max_vchans < 1) {
+                       RTE_LOG(ERR, VHOST_CONFIG, "No channels available on device %d\n", dev_id);
+                       ret = -1;
+                       goto out;
+               }
 
-       return -1;
+               if (rte_dma_configure(dev_id, &dev_config) != 0) {
+                       RTE_LOG(ERR, VHOST_CONFIG, "Fail to configure DMA %d.\n", dev_id);
+                       ret = -1;
+                       goto out;
+               }
+
+               /* Check the max desc supported by DMA device */
+               rte_dma_info_get(dev_id, &info);
+               if (info.nb_vchans != 1) {
+                       RTE_LOG(ERR, VHOST_CONFIG, "No configured queues reported by DMA %d.\n",
+                                       dev_id);
+                       ret = -1;
+                       goto out;
+               }
+
+               qconf.nb_desc = RTE_MIN(DMA_RING_SIZE, info.max_desc);
+
+               if (rte_dma_vchan_setup(dev_id, 0, &qconf) != 0) {
+                       RTE_LOG(ERR, VHOST_CONFIG, "Fail to set up DMA %d.\n", dev_id);
+                       ret = -1;
+                       goto out;
+               }
+
+               if (rte_dma_start(dev_id) != 0) {
+                       RTE_LOG(ERR, VHOST_CONFIG, "Fail to start DMA %u.\n", dev_id);
+                       ret = -1;
+                       goto out;
+               }
+
+               dmas_id[dma_count++] = dev_id;
+
+done:
+               (dma_info + vid)->dmas[VIRTIO_RXQ].dev_id = dev_id;
+               i++;
+       }
+out:
+       free(input);
+       return ret;
 }
 
 /*
@@ -500,8 +636,6 @@ enum {
        OPT_CLIENT_NUM,
 #define OPT_BUILTIN_NET_DRIVER  "builtin-net-driver"
        OPT_BUILTIN_NET_DRIVER_NUM,
-#define OPT_DMA_TYPE            "dma-type"
-       OPT_DMA_TYPE_NUM,
 #define OPT_DMAS                "dmas"
        OPT_DMAS_NUM,
 };
@@ -539,8 +673,6 @@ us_vhost_parse_args(int argc, char **argv)
                                NULL, OPT_CLIENT_NUM},
                {OPT_BUILTIN_NET_DRIVER, no_argument,
                                NULL, OPT_BUILTIN_NET_DRIVER_NUM},
-               {OPT_DMA_TYPE, required_argument,
-                               NULL, OPT_DMA_TYPE_NUM},
                {OPT_DMAS, required_argument,
                                NULL, OPT_DMAS_NUM},
                {NULL, 0, 0, 0},
@@ -661,10 +793,6 @@ us_vhost_parse_args(int argc, char **argv)
                        }
                        break;
 
-               case OPT_DMA_TYPE_NUM:
-                       dma_type = optarg;
-                       break;
-
                case OPT_DMAS_NUM:
                        if (open_dma(optarg) == -1) {
                                RTE_LOG(INFO, VHOST_CONFIG,
@@ -672,7 +800,6 @@ us_vhost_parse_args(int argc, char **argv)
                                us_vhost_usage(prgname);
                                return -1;
                        }
-                       async_vhost_driver = 1;
                        break;
 
                case OPT_CLIENT_NUM:
@@ -841,9 +968,10 @@ complete_async_pkts(struct vhost_dev *vdev)
 {
        struct rte_mbuf *p_cpl[MAX_PKT_BURST];
        uint16_t complete_count;
+       int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id;
 
        complete_count = rte_vhost_poll_enqueue_completed(vdev->vid,
-                                       VIRTIO_RXQ, p_cpl, MAX_PKT_BURST);
+                                       VIRTIO_RXQ, p_cpl, MAX_PKT_BURST, dma_id, 0);
        if (complete_count) {
                free_pkts(p_cpl, complete_count);
                __atomic_sub_fetch(&vdev->pkts_inflight, complete_count, __ATOMIC_SEQ_CST);
@@ -877,17 +1005,18 @@ static __rte_always_inline void
 drain_vhost(struct vhost_dev *vdev)
 {
        uint16_t ret;
-       uint32_t buff_idx = rte_lcore_id() * MAX_VHOST_DEVICE + vdev->vid;
+       uint32_t buff_idx = rte_lcore_id() * RTE_MAX_VHOST_DEVICE + vdev->vid;
        uint16_t nr_xmit = vhost_txbuff[buff_idx]->len;
        struct rte_mbuf **m = vhost_txbuff[buff_idx]->m_table;
 
        if (builtin_net_driver) {
                ret = vs_enqueue_pkts(vdev, VIRTIO_RXQ, m, nr_xmit);
-       } else if (async_vhost_driver) {
+       } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) {
                uint16_t enqueue_fail = 0;
+               int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id;
 
                complete_async_pkts(vdev);
-               ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit);
+               ret = rte_vhost_submit_enqueue_burst(vdev->vid, VIRTIO_RXQ, m, nr_xmit, dma_id, 0);
                __atomic_add_fetch(&vdev->pkts_inflight, ret, __ATOMIC_SEQ_CST);
 
                enqueue_fail = nr_xmit - ret;
@@ -905,7 +1034,7 @@ drain_vhost(struct vhost_dev *vdev)
                                __ATOMIC_SEQ_CST);
        }
 
-       if (!async_vhost_driver)
+       if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled)
                free_pkts(m, nr_xmit);
 }
 
@@ -921,7 +1050,7 @@ drain_vhost_table(void)
                if (unlikely(vdev->remove == 1))
                        continue;
 
-               vhost_txq = vhost_txbuff[lcore_id * MAX_VHOST_DEVICE
+               vhost_txq = vhost_txbuff[lcore_id * RTE_MAX_VHOST_DEVICE
                                                + vdev->vid];
 
                cur_tsc = rte_rdtsc();
@@ -970,7 +1099,7 @@ virtio_tx_local(struct vhost_dev *vdev, struct rte_mbuf *m)
                return 0;
        }
 
-       vhost_txq = vhost_txbuff[lcore_id * MAX_VHOST_DEVICE + dst_vdev->vid];
+       vhost_txq = vhost_txbuff[lcore_id * RTE_MAX_VHOST_DEVICE + dst_vdev->vid];
        vhost_txq->m_table[vhost_txq->len++] = m;
 
        if (enable_stats) {
@@ -1211,12 +1340,13 @@ drain_eth_rx(struct vhost_dev *vdev)
        if (builtin_net_driver) {
                enqueue_count = vs_enqueue_pkts(vdev, VIRTIO_RXQ,
                                                pkts, rx_count);
-       } else if (async_vhost_driver) {
+       } else if (dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled) {
                uint16_t enqueue_fail = 0;
+               int16_t dma_id = dma_bind[vdev->vid].dmas[VIRTIO_RXQ].dev_id;
 
                complete_async_pkts(vdev);
                enqueue_count = rte_vhost_submit_enqueue_burst(vdev->vid,
-                                       VIRTIO_RXQ, pkts, rx_count);
+                                       VIRTIO_RXQ, pkts, rx_count, dma_id, 0);
                __atomic_add_fetch(&vdev->pkts_inflight, enqueue_count, __ATOMIC_SEQ_CST);
 
                enqueue_fail = rx_count - enqueue_count;
@@ -1235,7 +1365,7 @@ drain_eth_rx(struct vhost_dev *vdev)
                                __ATOMIC_SEQ_CST);
        }
 
-       if (!async_vhost_driver)
+       if (!dma_bind[vdev->vid].dmas[VIRTIO_RXQ].async_enabled)
                free_pkts(pkts, rx_count);
 }
 
@@ -1357,7 +1487,7 @@ destroy_device(int vid)
        }
 
        for (i = 0; i < RTE_MAX_LCORE; i++)
-               rte_free(vhost_txbuff[i * MAX_VHOST_DEVICE + vid]);
+               rte_free(vhost_txbuff[i * RTE_MAX_VHOST_DEVICE + vid]);
 
        if (builtin_net_driver)
                vs_vhost_net_remove(vdev);
@@ -1387,18 +1517,20 @@ destroy_device(int vid)
                "(%d) device has been removed from data core\n",
                vdev->vid);
 
-       if (async_vhost_driver) {
+       if (dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled) {
                uint16_t n_pkt = 0;
+               int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id;
                struct rte_mbuf *m_cpl[vdev->pkts_inflight];
 
                while (vdev->pkts_inflight) {
                        n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, VIRTIO_RXQ,
-                                               m_cpl, vdev->pkts_inflight);
+                                               m_cpl, vdev->pkts_inflight, dma_id, 0);
                        free_pkts(m_cpl, n_pkt);
                        __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST);
                }
 
                rte_vhost_async_channel_unregister(vid, VIRTIO_RXQ);
+               dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = false;
        }
 
        rte_free(vdev);
@@ -1425,12 +1557,12 @@ new_device(int vid)
        vdev->vid = vid;
 
        for (i = 0; i < RTE_MAX_LCORE; i++) {
-               vhost_txbuff[i * MAX_VHOST_DEVICE + vid]
+               vhost_txbuff[i * RTE_MAX_VHOST_DEVICE + vid]
                        = rte_zmalloc("vhost bufftable",
                                sizeof(struct vhost_bufftable),
                                RTE_CACHE_LINE_SIZE);
 
-               if (vhost_txbuff[i * MAX_VHOST_DEVICE + vid] == NULL) {
+               if (vhost_txbuff[i * RTE_MAX_VHOST_DEVICE + vid] == NULL) {
                        RTE_LOG(INFO, VHOST_DATA,
                          "(%d) couldn't allocate memory for vhost TX\n", vid);
                        return -1;
@@ -1468,20 +1600,13 @@ new_device(int vid)
                "(%d) device has been added to data core %d\n",
                vid, vdev->coreid);
 
-       if (async_vhost_driver) {
-               struct rte_vhost_async_config config = {0};
-               struct rte_vhost_async_channel_ops channel_ops;
-
-               if (dma_type != NULL && strncmp(dma_type, "ioat", 4) == 0) {
-                       channel_ops.transfer_data = ioat_transfer_data_cb;
-                       channel_ops.check_completed_copies =
-                               ioat_check_completed_copies_cb;
-
-                       config.features = RTE_VHOST_ASYNC_INORDER;
+       if (dma_bind[vid].dmas[VIRTIO_RXQ].dev_id != INVALID_DMA_ID) {
+               int ret;
 
-                       return rte_vhost_async_channel_register(vid, VIRTIO_RXQ,
-                               config, &channel_ops);
-               }
+               ret = rte_vhost_async_channel_register(vid, VIRTIO_RXQ);
+               if (ret == 0)
+                       dma_bind[vid].dmas[VIRTIO_RXQ].async_enabled = true;
+               return ret;
        }
 
        return 0;
@@ -1502,14 +1627,15 @@ vring_state_changed(int vid, uint16_t queue_id, int enable)
        if (queue_id != VIRTIO_RXQ)
                return 0;
 
-       if (async_vhost_driver) {
+       if (dma_bind[vid].dmas[queue_id].async_enabled) {
                if (!enable) {
                        uint16_t n_pkt = 0;
+                       int16_t dma_id = dma_bind[vid].dmas[VIRTIO_RXQ].dev_id;
                        struct rte_mbuf *m_cpl[vdev->pkts_inflight];
 
                        while (vdev->pkts_inflight) {
                                n_pkt = rte_vhost_clear_queue_thread_unsafe(vid, queue_id,
-                                                       m_cpl, vdev->pkts_inflight);
+                                                       m_cpl, vdev->pkts_inflight, dma_id, 0);
                                free_pkts(m_cpl, n_pkt);
                                __atomic_sub_fetch(&vdev->pkts_inflight, n_pkt, __ATOMIC_SEQ_CST);
                        }
@@ -1657,6 +1783,24 @@ create_mbuf_pool(uint16_t nr_port, uint32_t nr_switch_core, uint32_t mbuf_size,
                rte_exit(EXIT_FAILURE, "Cannot create mbuf pool\n");
 }
 
+static void
+reset_dma(void)
+{
+       int i;
+
+       for (i = 0; i < RTE_MAX_VHOST_DEVICE; i++) {
+               int j;
+
+               for (j = 0; j < RTE_MAX_QUEUES_PER_PORT * 2; j++) {
+                       dma_bind[i].dmas[j].dev_id = INVALID_DMA_ID;
+                       dma_bind[i].dmas[j].async_enabled = false;
+               }
+       }
+
+       for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++)
+               dmas_id[i] = INVALID_DMA_ID;
+}
+
 /*
  * Main function, does initialisation and calls the per-lcore functions.
  */
@@ -1679,6 +1823,9 @@ main(int argc, char *argv[])
        argc -= ret;
        argv += ret;
 
+       /* initialize dma structures */
+       reset_dma();
+
        /* parse app arguments */
        ret = us_vhost_parse_args(argc, argv);
        if (ret < 0)
@@ -1754,11 +1901,21 @@ main(int argc, char *argv[])
        if (client_mode)
                flags |= RTE_VHOST_USER_CLIENT;
 
+       if (dma_count) {
+               if (rte_vhost_async_dma_configure(dmas_id, dma_count, 1) < 0) {
+                       RTE_LOG(ERR, VHOST_PORT, "Failed to configure DMA in vhost.\n");
+                       for (i = 0; i < dma_count; i++)
+                               if (dmas_id[i] >= 0)
+                                       rte_dma_stop(dmas_id[i]);
+                       rte_exit(EXIT_FAILURE, "Cannot use given DMA devices\n");
+               }
+       }
+
        /* Register vhost user driver to handle vhost messages. */
        for (i = 0; i < nb_sockets; i++) {
                char *file = socket_files + i * PATH_MAX;
 
-               if (async_vhost_driver)
+               if (dma_count)
                        flags = flags | RTE_VHOST_USER_ASYNC_COPY;
 
                ret = rte_vhost_driver_register(file, flags);
diff --git a/examples/vhost/main.h b/examples/vhost/main.h
index e7b1ac60a6..b4a453e77e 100644
--- a/examples/vhost/main.h
+++ b/examples/vhost/main.h
@@ -8,6 +8,7 @@
 #include <sys/queue.h>
 
 #include <rte_ether.h>
+#include <rte_pci.h>
 
 /* Macros for printing using RTE_LOG */
 #define RTE_LOGTYPE_VHOST_CONFIG RTE_LOGTYPE_USER1
@@ -79,6 +80,16 @@ struct lcore_info {
        struct vhost_dev_tailq_list vdev_list;
 };
 
+struct dma_info {
+       struct rte_pci_addr addr;
+       int16_t dev_id;
+       bool async_enabled;
+};
+
+struct dma_for_vhost {
+       struct dma_info dmas[RTE_MAX_QUEUES_PER_PORT * 2];
+};
+
 /* we implement non-extra virtio net features */
 #define VIRTIO_NET_FEATURES    0
 
diff --git a/examples/vhost/meson.build b/examples/vhost/meson.build
index 3efd5e6540..87a637f83f 100644
--- a/examples/vhost/meson.build
+++ b/examples/vhost/meson.build
@@ -12,13 +12,9 @@ if not is_linux
 endif
 
 deps += 'vhost'
+deps += 'dmadev'
 allow_experimental_apis = true
 sources = files(
         'main.c',
         'virtio_net.c',
 )
-
-if dpdk_conf.has('RTE_RAW_IOAT')
-    deps += 'raw_ioat'
-    sources += files('ioat.c')
-endif
diff --git a/lib/vhost/meson.build b/lib/vhost/meson.build
index cdb37a4814..bc7272053b 100644
--- a/lib/vhost/meson.build
+++ b/lib/vhost/meson.build
@@ -36,4 +36,4 @@ headers = files(
 driver_sdk_headers = files(
         'vdpa_driver.h',
 )
-deps += ['ethdev', 'cryptodev', 'hash', 'pci']
+deps += ['ethdev', 'cryptodev', 'hash', 'pci', 'dmadev']
diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
index b454c05868..15c37dd26e 100644
--- a/lib/vhost/rte_vhost.h
+++ b/lib/vhost/rte_vhost.h
@@ -113,6 +113,8 @@ extern "C" {
 #define VHOST_USER_F_PROTOCOL_FEATURES 30
 #endif
 
+#define RTE_MAX_VHOST_DEVICE   1024
+
 struct rte_vdpa_device;
 
 /**
diff --git a/lib/vhost/rte_vhost_async.h b/lib/vhost/rte_vhost_async.h
index a87ea6ba37..758a80f403 100644
--- a/lib/vhost/rte_vhost_async.h
+++ b/lib/vhost/rte_vhost_async.h
@@ -26,73 +26,6 @@ struct rte_vhost_iov_iter {
        unsigned long nr_segs;
 };
 
-/**
- * dma transfer status
- */
-struct rte_vhost_async_status {
-       /** An array of application specific data for source memory */
-       uintptr_t *src_opaque_data;
-       /** An array of application specific data for destination memory */
-       uintptr_t *dst_opaque_data;
-};
-
-/**
- * dma operation callbacks to be implemented by applications
- */
-struct rte_vhost_async_channel_ops {
-       /**
-        * instruct async engines to perform copies for a batch of packets
-        *
-        * @param vid
-        *  id of vhost device to perform data copies
-        * @param queue_id
-        *  queue id to perform data copies
-        * @param iov_iter
-        *  an array of IOV iterators
-        * @param opaque_data
-        *  opaque data pair sending to DMA engine
-        * @param count
-        *  number of elements in the "descs" array
-        * @return
-        *  number of IOV iterators processed, negative value means error
-        */
-       int32_t (*transfer_data)(int vid, uint16_t queue_id,
-               struct rte_vhost_iov_iter *iov_iter,
-               struct rte_vhost_async_status *opaque_data,
-               uint16_t count);
-       /**
-        * check copy-completed packets from the async engine
-        * @param vid
-        *  id of vhost device to check copy completion
-        * @param queue_id
-        *  queue id to check copy completion
-        * @param opaque_data
-        *  buffer to receive the opaque data pair from DMA engine
-        * @param max_packets
-        *  max number of packets could be completed
-        * @return
-        *  number of async descs completed, negative value means error
-        */
-       int32_t (*check_completed_copies)(int vid, uint16_t queue_id,
-               struct rte_vhost_async_status *opaque_data,
-               uint16_t max_packets);
-};
-
-/**
- *  async channel features
- */
-enum {
-       RTE_VHOST_ASYNC_INORDER = 1U << 0,
-};
-
-/**
- *  async channel configuration
- */
-struct rte_vhost_async_config {
-       uint32_t features;
-       uint32_t rsvd[2];
-};
-
 /**
  * Register an async channel for a vhost queue
  *
@@ -100,17 +33,11 @@ struct rte_vhost_async_config {
  *  vhost device id async channel to be attached to
  * @param queue_id
  *  vhost queue id async channel to be attached to
- * @param config
- *  Async channel configuration structure
- * @param ops
- *  Async channel operation callbacks
  * @return
  *  0 on success, -1 on failures
  */
 __rte_experimental
-int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
-       struct rte_vhost_async_config config,
-       struct rte_vhost_async_channel_ops *ops);
+int rte_vhost_async_channel_register(int vid, uint16_t queue_id);
 
 /**
  * Unregister an async channel for a vhost queue
@@ -136,17 +63,11 @@ int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);
  *  vhost device id async channel to be attached to
  * @param queue_id
  *  vhost queue id async channel to be attached to
- * @param config
- *  Async channel configuration
- * @param ops
- *  Async channel operation callbacks
  * @return
  *  0 on success, -1 on failures
  */
 __rte_experimental
-int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id,
-       struct rte_vhost_async_config config,
-       struct rte_vhost_async_channel_ops *ops);
+int rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id);
 
 /**
  * Unregister an async channel for a vhost queue without performing any
@@ -179,12 +100,17 @@ int rte_vhost_async_channel_unregister_thread_unsafe(int vid,
  *  array of packets to be enqueued
  * @param count
  *  packets num to be enqueued
+ * @param dma_id
+ *  the identifier of the DMA device
+ * @param vchan_id
+ *  the identifier of virtual DMA channel
  * @return
  *  num of packets enqueued
  */
 __rte_experimental
 uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
-               struct rte_mbuf **pkts, uint16_t count);
+               struct rte_mbuf **pkts, uint16_t count, int16_t dma_id,
+               uint16_t vchan_id);
 
 /**
  * This function checks async completion status for a specific vhost
@@ -199,12 +125,17 @@ uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
  *  blank array to get return packet pointer
  * @param count
  *  size of the packet array
+ * @param dma_id
+ *  the identifier of the DMA device
+ * @param vchan_id
+ *  the identifier of virtual DMA channel
  * @return
  *  num of packets returned
  */
 __rte_experimental
 uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
-               struct rte_mbuf **pkts, uint16_t count);
+               struct rte_mbuf **pkts, uint16_t count, int16_t dma_id,
+               uint16_t vchan_id);
 
 /**
  * This function returns the amount of in-flight packets for the vhost
@@ -235,11 +166,44 @@ int rte_vhost_async_get_inflight(int vid, uint16_t queue_id);
  *  Blank array to get return packet pointer
  * @param count
  *  Size of the packet array
+ * @param dma_id
+ *  the identifier of the DMA device
+ * @param vchan_id
+ *  the identifier of virtual DMA channel
  * @return
  *  Number of packets returned
  */
 __rte_experimental
 uint16_t rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id,
-               struct rte_mbuf **pkts, uint16_t count);
+               struct rte_mbuf **pkts, uint16_t count, int16_t dma_id,
+               uint16_t vchan_id);
+/**
+ * The DMA vChannels used in the asynchronous data path must be configured
+ * first. So this function needs to be called before enabling DMA
+ * acceleration for any vring. If this function fails, the asynchronous
+ * data path cannot be enabled for any vring afterwards.
+ *
+ * DMA devices used in the data path must be among the DMA devices given
+ * to this function. However, users are free to use these DMA devices in
+ * non-vhost scenarios as well, as long as it is guaranteed that no vhost
+ * copies are offloaded to them at the same time.
+ *
+ * @param dmas_id
+ *  DMA ID array
+ * @param count
+ *  Element number of 'dmas_id'
+ * @param poll_factor
+ *  For large or scatter-gather packets, one packet may consist of several
+ *  small buffers. In this case, vhost will issue several DMA copy
+ *  operations for the packet. Therefore, the number of copies to check
+ *  by rte_dma_completed() is calculated as "nb_pkts_to_poll * poll_factor"
+ *  and used in rte_vhost_poll_enqueue_completed(). The default value of
+ *  "poll_factor" is 1.
+ * @return
+ *  0 on success, and -1 on failure
+ */
+__rte_experimental
+int rte_vhost_async_dma_configure(int16_t *dmas_id, uint16_t count,
+               uint16_t poll_factor);
 
 #endif /* _RTE_VHOST_ASYNC_H_ */
diff --git a/lib/vhost/version.map b/lib/vhost/version.map
index a7ef7f1976..1202ba9c1a 100644
--- a/lib/vhost/version.map
+++ b/lib/vhost/version.map
@@ -84,6 +84,9 @@ EXPERIMENTAL {
 
        # added in 21.11
        rte_vhost_get_monitor_addr;
+
+       # added in 22.03
+       rte_vhost_async_dma_configure;
 };
 
 INTERNAL {
diff --git a/lib/vhost/vhost.c b/lib/vhost/vhost.c
index 13a9bb9dd1..c408cee63e 100644
--- a/lib/vhost/vhost.c
+++ b/lib/vhost/vhost.c
@@ -25,7 +25,7 @@
 #include "vhost.h"
 #include "vhost_user.h"
 
-struct virtio_net *vhost_devices[MAX_VHOST_DEVICE];
+struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE];
 pthread_mutex_t vhost_dev_lock = PTHREAD_MUTEX_INITIALIZER;
 
 /* Called with iotlb_lock read-locked */
@@ -344,6 +344,7 @@ vhost_free_async_mem(struct vhost_virtqueue *vq)
                return;
 
        rte_free(vq->async->pkts_info);
+       rte_free(vq->async->pkts_cmpl_flag);
 
        rte_free(vq->async->buffers_packed);
        vq->async->buffers_packed = NULL;
@@ -667,12 +668,12 @@ vhost_new_device(void)
        int i;
 
        pthread_mutex_lock(&vhost_dev_lock);
-       for (i = 0; i < MAX_VHOST_DEVICE; i++) {
+       for (i = 0; i < RTE_MAX_VHOST_DEVICE; i++) {
                if (vhost_devices[i] == NULL)
                        break;
        }
 
-       if (i == MAX_VHOST_DEVICE) {
+       if (i == RTE_MAX_VHOST_DEVICE) {
                VHOST_LOG_CONFIG(ERR,
                        "Failed to find a free slot for new device.\n");
                pthread_mutex_unlock(&vhost_dev_lock);
@@ -1626,8 +1627,7 @@ rte_vhost_extern_callback_register(int vid,
 }
 
 static __rte_always_inline int
-async_channel_register(int vid, uint16_t queue_id,
-               struct rte_vhost_async_channel_ops *ops)
+async_channel_register(int vid, uint16_t queue_id)
 {
        struct virtio_net *dev = get_device(vid);
        struct vhost_virtqueue *vq = dev->virtqueue[queue_id];
@@ -1656,6 +1656,14 @@ async_channel_register(int vid, uint16_t queue_id,
                goto out_free_async;
        }
 
+       async->pkts_cmpl_flag = rte_zmalloc_socket(NULL, vq->size * sizeof(bool),
+                       RTE_CACHE_LINE_SIZE, node);
+       if (!async->pkts_cmpl_flag) {
+               VHOST_LOG_CONFIG(ERR, "failed to allocate async pkts_cmpl_flag (vid %d, qid: %d)\n",
+                               vid, queue_id);
+               goto out_free_async;
+       }
+
        if (vq_is_packed(dev)) {
                async->buffers_packed = rte_malloc_socket(NULL,
                                vq->size * sizeof(struct vring_used_elem_packed),
@@ -1676,9 +1684,6 @@ async_channel_register(int vid, uint16_t queue_id,
                }
        }
 
-       async->ops.check_completed_copies = ops->check_completed_copies;
-       async->ops.transfer_data = ops->transfer_data;
-
        vq->async = async;
 
        return 0;
@@ -1691,15 +1696,13 @@ async_channel_register(int vid, uint16_t queue_id,
 }
 
 int
-rte_vhost_async_channel_register(int vid, uint16_t queue_id,
-               struct rte_vhost_async_config config,
-               struct rte_vhost_async_channel_ops *ops)
+rte_vhost_async_channel_register(int vid, uint16_t queue_id)
 {
        struct vhost_virtqueue *vq;
        struct virtio_net *dev = get_device(vid);
        int ret;
 
-       if (dev == NULL || ops == NULL)
+       if (dev == NULL)
                return -1;
 
        if (queue_id >= VHOST_MAX_VRING)
@@ -1710,33 +1713,20 @@ rte_vhost_async_channel_register(int vid, uint16_t queue_id,
        if (unlikely(vq == NULL || !dev->async_copy))
                return -1;
 
-       if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) {
-               VHOST_LOG_CONFIG(ERR,
-                       "async copy is not supported on non-inorder mode "
-                       "(vid %d, qid: %d)\n", vid, queue_id);
-               return -1;
-       }
-
-       if (unlikely(ops->check_completed_copies == NULL ||
-               ops->transfer_data == NULL))
-               return -1;
-
        rte_spinlock_lock(&vq->access_lock);
-       ret = async_channel_register(vid, queue_id, ops);
+       ret = async_channel_register(vid, queue_id);
        rte_spinlock_unlock(&vq->access_lock);
 
        return ret;
 }
 
 int
-rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id,
-               struct rte_vhost_async_config config,
-               struct rte_vhost_async_channel_ops *ops)
+rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id)
 {
        struct vhost_virtqueue *vq;
        struct virtio_net *dev = get_device(vid);
 
-       if (dev == NULL || ops == NULL)
+       if (dev == NULL)
                return -1;
 
        if (queue_id >= VHOST_MAX_VRING)
@@ -1747,18 +1737,7 @@ rte_vhost_async_channel_register_thread_unsafe(int vid, uint16_t queue_id,
        if (unlikely(vq == NULL || !dev->async_copy))
                return -1;
 
-       if (unlikely(!(config.features & RTE_VHOST_ASYNC_INORDER))) {
-               VHOST_LOG_CONFIG(ERR,
-                       "async copy is not supported on non-inorder mode "
-                       "(vid %d, qid: %d)\n", vid, queue_id);
-               return -1;
-       }
-
-       if (unlikely(ops->check_completed_copies == NULL ||
-               ops->transfer_data == NULL))
-               return -1;
-
-       return async_channel_register(vid, queue_id, ops);
+       return async_channel_register(vid, queue_id);
 }
 
 int
@@ -1835,6 +1814,95 @@ rte_vhost_async_channel_unregister_thread_unsafe(int vid, uint16_t queue_id)
        return 0;
 }
 
+static __rte_always_inline void
+vhost_free_async_dma_mem(void)
+{
+       uint16_t i;
+
+       for (i = 0; i < RTE_DMADEV_DEFAULT_MAX; i++) {
+               struct async_dma_info *dma = &dma_copy_track[i];
+               int16_t j;
+
+               if (dma->max_vchans == 0)
+                       continue;
+
+               for (j = 0; j < dma->max_vchans; j++)
+                       rte_free(dma->vchans[j].pkts_completed_flag);
+
+               rte_free(dma->vchans);
+               dma->vchans = NULL;
+               dma->max_vchans = 0;
+       }
+}
+
+int
+rte_vhost_async_dma_configure(int16_t *dmas_id, uint16_t count, uint16_t poll_factor)
+{
+       uint16_t i;
+
+       if (!dmas_id) {
+               VHOST_LOG_CONFIG(ERR, "Invalid DMA configuration parameter.\n");
+               return -1;
+       }
+
+       if (poll_factor == 0) {
+               VHOST_LOG_CONFIG(ERR, "Invalid DMA poll factor %u\n", poll_factor);
+               return -1;
+       }
+       dma_poll_factor = poll_factor;
+
+       for (i = 0; i < count; i++) {
+               struct async_dma_vchan_info *vchans;
+               struct rte_dma_info info;
+               uint16_t max_vchans;
+               uint16_t max_desc;
+               uint16_t j;
+
+               if (!rte_dma_is_valid(dmas_id[i])) {
+                       VHOST_LOG_CONFIG(ERR, "DMA %d is not found. Cannot enable async"
+                                      " data-path\n.", dmas_id[i]);
+                       vhost_free_async_dma_mem();
+                       return -1;
+               }
+
+               rte_dma_info_get(dmas_id[i], &info);
+
+               max_vchans = info.max_vchans;
+               max_desc = info.max_desc;
+
+               if (!rte_is_power_of_2(max_desc))
+                       max_desc = rte_align32pow2(max_desc);
+
+               vchans = rte_zmalloc(NULL, sizeof(struct async_dma_vchan_info) * max_vchans,
+                               RTE_CACHE_LINE_SIZE);
+               if (vchans == NULL) {
+                       VHOST_LOG_CONFIG(ERR, "Failed to allocate vchans for dma-%d."
+                                       " Cannot enable async data-path.\n", dmas_id[i]);
+                       vhost_free_async_dma_mem();
+                       return -1;
+               }
+
+               for (j = 0; j < max_vchans; j++) {
+                       vchans[j].pkts_completed_flag = rte_zmalloc(NULL, sizeof(bool *) * max_desc,
+                                       RTE_CACHE_LINE_SIZE);
+                       if (!vchans[j].pkts_completed_flag) {
+                               VHOST_LOG_CONFIG(ERR, "Failed to allocate pkts_completed_flag for "
+                                               "dma-%d vchan-%u\n", dmas_id[i], j);
+                               vhost_free_async_dma_mem();
+                               return -1;
+                       }
+
+                       vchans[j].ring_size = max_desc;
+                       vchans[j].ring_mask = max_desc - 1;
+               }
+
+               dma_copy_track[dmas_id[i]].vchans = vchans;
+               dma_copy_track[dmas_id[i]].max_vchans = max_vchans;
+       }
+
+       return 0;
+}
+
 int
 rte_vhost_async_get_inflight(int vid, uint16_t queue_id)
 {
diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
index 7085e0885c..475843fec0 100644
--- a/lib/vhost/vhost.h
+++ b/lib/vhost/vhost.h
@@ -19,6 +19,7 @@
 #include <rte_ether.h>
 #include <rte_rwlock.h>
 #include <rte_malloc.h>
+#include <rte_dmadev.h>
 
 #include "rte_vhost.h"
 #include "rte_vdpa.h"
@@ -50,6 +51,7 @@
 
 #define VHOST_MAX_ASYNC_IT (MAX_PKT_BURST)
 #define VHOST_MAX_ASYNC_VEC 2048
+#define VHOST_ASYNC_DMA_BATCHING_SIZE 32
 
 #define PACKED_DESC_ENQUEUE_USED_FLAG(w)       \
        ((w) ? (VRING_DESC_F_AVAIL | VRING_DESC_F_USED | VRING_DESC_F_WRITE) : \
@@ -119,6 +121,42 @@ struct vring_used_elem_packed {
        uint32_t count;
 };
 
+struct async_dma_vchan_info {
+       /* circular array to track if packet copy completes */
+       bool **pkts_completed_flag;
+
+       /* max elements in 'pkts_completed_flag' */
+       uint16_t ring_size;
+       /* ring index mask for 'pkts_completed_flag' */
+       uint16_t ring_mask;
+
+       /* batching copies before a DMA doorbell */
+       uint16_t nr_batching;
+
+       /**
+        * DMA virtual channel lock. Although DMA virtual channels can be
+        * bound to data plane threads, the vhost control plane thread could
+        * call data plane functions too, thus causing DMA device contention.
+        *
+        * For example, in a VM exit case, the vhost control plane thread
+        * needs to clear in-flight packets before disabling a vring, but
+        * another data plane thread could be enqueuing packets to the same
+        * vring with the same DMA virtual channel. Since dmadev PMD functions
+        * are lock-free, the control plane and data plane threads could
+        * otherwise operate on the same DMA virtual channel at the same time.
+        */
+       rte_spinlock_t dma_lock;
+};
+
+struct async_dma_info {
+       uint16_t max_vchans;
+       struct async_dma_vchan_info *vchans;
+};
+
+extern struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX];
+extern uint16_t dma_poll_factor;
+
 /**
  * inflight async packet information
  */
@@ -129,9 +167,6 @@ struct async_inflight_info {
 };
 
 struct vhost_async {
-       /* operation callbacks for DMA */
-       struct rte_vhost_async_channel_ops ops;
-
        struct rte_vhost_iov_iter iov_iter[VHOST_MAX_ASYNC_IT];
        struct rte_vhost_iovec iovec[VHOST_MAX_ASYNC_VEC];
        uint16_t iter_idx;
@@ -139,6 +174,25 @@ struct vhost_async {
 
        /* data transfer status */
        struct async_inflight_info *pkts_info;
+       /**
+        * Packet reorder array. "true" indicates that the DMA device has
+        * completed all copies for the packet.
+        *
+        * Note that this array could be written by multiple threads
+        * simultaneously. For example, if thread0 and thread1 receive
+        * packets from the NIC and enqueue them to vring0 and vring1 with
+        * their own DMA devices, DMA0 and DMA1, it's possible for thread0
+        * to get completed copies belonging to vring1 from DMA0, while
+        * thread0 is calling rte_vhost_poll_enqueue_completed() for vring0
+        * and thread1 is calling rte_vhost_submit_enqueue_burst() for
+        * vring1. In this case, vq->access_lock cannot protect
+        * pkts_cmpl_flag of vring1.
+        *
+        * However, since offloading is done on a per-packet basis, each
+        * packet flag will only be written by one thread. And a single-byte
+        * write is atomic, so no lock for pkts_cmpl_flag is needed.
+        */
+       bool *pkts_cmpl_flag;
        uint16_t pkts_idx;
        uint16_t pkts_inflight_n;
        union {
@@ -198,6 +252,7 @@ struct vhost_virtqueue {
        /* Record packed ring first dequeue desc index */
        uint16_t                shadow_last_used_idx;
 
+       uint16_t                batch_copy_max_elems;
        uint16_t                batch_copy_nb_elems;
        struct batch_copy_elem  *batch_copy_elems;
        int                     numa_node;
@@ -568,8 +623,7 @@ extern int vhost_data_log_level;
 #define PRINT_PACKET(device, addr, size, header) do {} while (0)
 #endif
 
-#define MAX_VHOST_DEVICE       1024
-extern struct virtio_net *vhost_devices[MAX_VHOST_DEVICE];
+extern struct virtio_net *vhost_devices[RTE_MAX_VHOST_DEVICE];
 
 #define VHOST_BINARY_SEARCH_THRESH 256
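
The structures above pair every in-flight DMA copy with the completion flag
of the packet it belongs to. As a rough sketch of the intent (not part of
this patch; "vchan" and "idx" are placeholder names, and ring_mask is
assumed to be ring_size - 1 with ring_size a power of two), harvesting one
completed copy index returned by rte_dma_completed() would look like:

static inline void
mark_copy_done(struct async_dma_vchan_info *vchan, uint16_t idx)
{
	/* Only the slot of a packet's last enqueued copy holds a flag pointer. */
	bool *flag = vchan->pkts_completed_flag[idx & vchan->ring_mask];

	if (flag != NULL) {
		*flag = true;	/* single-byte write, atomic per packet */
		vchan->pkts_completed_flag[idx & vchan->ring_mask] = NULL;
	}
}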
 
diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
index 5eb1dd6812..3147e72f04 100644
--- a/lib/vhost/vhost_user.c
+++ b/lib/vhost/vhost_user.c
@@ -527,6 +527,8 @@ vhost_user_set_vring_num(struct virtio_net **pdev,
                return RTE_VHOST_MSG_RESULT_ERR;
        }
 
+       vq->batch_copy_max_elems = vq->size;
+
        return RTE_VHOST_MSG_RESULT_OK;
 }
 
diff --git a/lib/vhost/virtio_net.c b/lib/vhost/virtio_net.c
index b3d954aab4..305f6cd562 100644
--- a/lib/vhost/virtio_net.c
+++ b/lib/vhost/virtio_net.c
@@ -11,6 +11,7 @@
 #include <rte_net.h>
 #include <rte_ether.h>
 #include <rte_ip.h>
+#include <rte_dmadev.h>
 #include <rte_vhost.h>
 #include <rte_tcp.h>
 #include <rte_udp.h>
@@ -25,6 +26,10 @@
 
 #define MAX_BATCH_LEN 256
 
+/* DMA device copy operation tracking array. */
+struct async_dma_info dma_copy_track[RTE_DMADEV_DEFAULT_MAX];
+uint16_t dma_poll_factor = 1;
+
 static  __rte_always_inline bool
 rxvq_is_mergeable(struct virtio_net *dev)
 {
@@ -43,6 +48,140 @@ is_valid_virt_queue_idx(uint32_t idx, int is_tx, uint32_t nr_vring)
        return (is_tx ^ (idx & 1)) == 0 && idx < nr_vring;
 }
 
+static __rte_always_inline uint16_t
+vhost_async_dma_transfer(struct vhost_virtqueue *vq, int16_t dma_id,
+               uint16_t vchan_id, uint16_t head_idx,
+               struct rte_vhost_iov_iter *pkts, uint16_t nr_pkts)
+{
+       struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id];
+       uint16_t ring_mask = dma_info->ring_mask;
+       uint16_t pkt_idx, bce_idx = 0;
+
+       rte_spinlock_lock(&dma_info->dma_lock);
+
+       for (pkt_idx = 0; pkt_idx < nr_pkts; pkt_idx++) {
+               struct rte_vhost_iovec *iov = pkts[pkt_idx].iov;
+               int copy_idx, last_copy_idx = 0;
+               uint16_t nr_segs = pkts[pkt_idx].nr_segs;
+               uint16_t nr_sw_copy = 0;
+               uint16_t i;
+
+               if (rte_dma_burst_capacity(dma_id, vchan_id) < nr_segs)
+                       goto out;
+
+               for (i = 0; i < nr_segs; i++) {
+                       /* Fallback to SW copy if error happens */
+                       copy_idx = rte_dma_copy(dma_id, vchan_id, (rte_iova_t)iov[i].src_addr,
+                                       (rte_iova_t)iov[i].dst_addr, iov[i].len,
+                                       RTE_DMA_OP_FLAG_LLC);
+                       if (unlikely(copy_idx < 0)) {
+                               /* Find corresponding VA pair and do SW copy */
+                               rte_memcpy(vq->batch_copy_elems[bce_idx].dst,
+                                               vq->batch_copy_elems[bce_idx].src,
+                                               vq->batch_copy_elems[bce_idx].len);
+                               nr_sw_copy++;
+
+                               /**
+                                * All copies of the packet were performed
+                                * by the CPU, so the packet is complete:
+                                * set its completion flag to true.
+                                */
+                               if (nr_sw_copy == nr_segs) {
+                                       vq->async->pkts_cmpl_flag[head_idx % vq->size] = true;
+                                       break;
+                               } else if (i == (nr_segs - 1)) {
+                                       /**
+                                        * Some copies of the current packet
+                                        * were enqueued to the DMA
+                                        * successfully, but this last copy
+                                        * failed: store the packet completion
+                                        * flag address in the slot of the
+                                        * last successful DMA copy.
+                                        */
+                                       dma_info->pkts_completed_flag[last_copy_idx & ring_mask] =
+                                               &vq->async->pkts_cmpl_flag[head_idx % vq->size];
+                                       break;
+                               }
+                       } else
+                               last_copy_idx = copy_idx;
+
+                       bce_idx++;
+
+                       /**
+                        * Only store the packet completion flag address in
+                        * the last copy's slot; the other slots remain NULL.
+                        */
+                       if (i == (nr_segs - 1)) {
+                               dma_info->pkts_completed_flag[copy_idx & ring_mask] =
+                                       &vq->async->pkts_cmpl_flag[head_idx % vq->size];
+                       }
+               }
+
+               dma_info->nr_batching += nr_segs;
+               if (unlikely(dma_info->nr_batching >= VHOST_ASYNC_DMA_BATCHING_SIZE)) {
+                       rte_dma_submit(dma_id, vchan_id);
+                       dma_info->nr_batching = 0;
+               }
+
+               head_idx++;
+       }
+
+out:
+       if (dma_info->nr_batching > 0) {
+               rte_dma_submit(dma_id, vchan_id);
+               dma_info->nr_batching = 0;
+       }
+       rte_spinlock_unlock(&dma_info->dma_lock);
+       vq->batch_copy_nb_elems = 0;
+
+       return pkt_idx;
+}
+
+static __rte_always_inline uint16_t
+vhost_async_dma_check_completed(int16_t dma_id, uint16_t vchan_id, uint16_t max_pkts)
+{
+       struct async_dma_vchan_info *dma_info = &dma_copy_track[dma_id].vchans[vchan_id];
+       uint16_t ring_mask = dma_info->ring_mask;
+       uint16_t last_idx = 0;
+       uint16_t nr_copies;
+       uint16_t copy_idx;
+       uint16_t i;
+       bool has_error = false;
+
+       rte_spinlock_lock(&dma_info->dma_lock);
+
+       /**
+        * Print an error log for debugging if the DMA reports an error
+        * during the transfer. Errors are not handled at the vhost level.
+        */
+       nr_copies = rte_dma_completed(dma_id, vchan_id, max_pkts, &last_idx, &has_error);
+       if (unlikely(has_error)) {
+               VHOST_LOG_DATA(ERR, "dma %d vchannel %u reports error in rte_dma_completed()\n",
+                               dma_id, vchan_id);
+       } else if (nr_copies == 0)
+               goto out;
+
+       copy_idx = last_idx - nr_copies + 1;
+       for (i = 0; i < nr_copies; i++) {
+               bool *flag;
+
+               flag = dma_info->pkts_completed_flag[copy_idx & ring_mask];
+               if (flag) {
+                       /**
+                        * Mark the packet copy as completed. The flag
+                        * could belong to another virtqueue, but the
+                        * write is atomic.
+                        */
+                       *flag = true;
+                       dma_info->pkts_completed_flag[copy_idx & ring_mask] = NULL;
+               }
+               copy_idx++;
+       }
+
+out:
+       rte_spinlock_unlock(&dma_info->dma_lock);
+       return nr_copies;
+}
+
 static inline void
 do_data_copy_enqueue(struct virtio_net *dev, struct vhost_virtqueue *vq)
 {
@@ -865,12 +1004,13 @@ async_iter_reset(struct vhost_async *async)
 static __rte_always_inline int
 async_mbuf_to_desc_seg(struct virtio_net *dev, struct vhost_virtqueue *vq,
                struct rte_mbuf *m, uint32_t mbuf_offset,
-               uint64_t buf_iova, uint32_t cpy_len)
+               uint64_t buf_addr, uint64_t buf_iova, uint32_t cpy_len)
 {
        struct vhost_async *async = vq->async;
        uint64_t mapped_len;
        uint32_t buf_offset = 0;
        void *hpa;
+       struct batch_copy_elem *bce = vq->batch_copy_elems;
 
        while (cpy_len) {
                hpa = (void *)(uintptr_t)gpa_to_first_hpa(dev,
@@ -886,6 +1026,31 @@ async_mbuf_to_desc_seg(struct virtio_net *dev, struct vhost_virtqueue *vq,
                                                hpa, (size_t)mapped_len)))
                        return -1;
 
+               /**
+                * Keep the VA of every IOVA segment so that we can fall
+                * back to SW copy if rte_dma_copy() fails.
+                */
+               if (unlikely(vq->batch_copy_nb_elems >= vq->batch_copy_max_elems)) {
+                       struct batch_copy_elem *tmp;
+                       uint16_t nb_elems = 2 * vq->batch_copy_max_elems;
+
+                       VHOST_LOG_DATA(DEBUG, "(%d) %s: ran out of batch_copy_elems, "
+                                       "reallocating with double the elements.\n", dev->vid, __func__);
+                       tmp = rte_realloc_socket(vq->batch_copy_elems, nb_elems * sizeof(*tmp),
+                                       RTE_CACHE_LINE_SIZE, vq->numa_node);
+                       if (!tmp) {
+                               VHOST_LOG_DATA(ERR, "Failed to re-alloc batch_copy_elems\n");
+                               return -1;
+                       }
+
+                       vq->batch_copy_max_elems = nb_elems;
+                       vq->batch_copy_elems = tmp;
+                       bce = tmp;
+               }
+               bce[vq->batch_copy_nb_elems].dst = (void *)((uintptr_t)(buf_addr + buf_offset));
+               bce[vq->batch_copy_nb_elems].src = rte_pktmbuf_mtod_offset(m, void *, mbuf_offset);
+               bce[vq->batch_copy_nb_elems++].len = mapped_len;
+
                cpy_len -= (uint32_t)mapped_len;
                mbuf_offset += (uint32_t)mapped_len;
                buf_offset += (uint32_t)mapped_len;
@@ -901,7 +1066,8 @@ sync_mbuf_to_desc_seg(struct virtio_net *dev, struct vhost_virtqueue *vq,
 {
        struct batch_copy_elem *batch_copy = vq->batch_copy_elems;
 
-       if (likely(cpy_len > MAX_BATCH_LEN || vq->batch_copy_nb_elems >= vq->size)) {
+       if (likely(cpy_len > MAX_BATCH_LEN ||
+                               vq->batch_copy_nb_elems >= vq->batch_copy_max_elems)) {
                rte_memcpy((void *)((uintptr_t)(buf_addr)),
                                rte_pktmbuf_mtod_offset(m, void *, mbuf_offset),
                                cpy_len);
@@ -1020,8 +1186,10 @@ mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
 
                if (is_async) {
                        if (async_mbuf_to_desc_seg(dev, vq, m, mbuf_offset,
+                                               buf_addr + buf_offset,
                                                buf_iova + buf_offset, cpy_len) 
< 0)
                                goto error;
+
                } else {
                        sync_mbuf_to_desc_seg(dev, vq, m, mbuf_offset,
                                        buf_addr + buf_offset,
@@ -1449,9 +1617,9 @@ store_dma_desc_info_packed(struct vring_used_elem_packed *s_ring,
 }
 
 static __rte_noinline uint32_t
-virtio_dev_rx_async_submit_split(struct virtio_net *dev,
-       struct vhost_virtqueue *vq, uint16_t queue_id,
-       struct rte_mbuf **pkts, uint32_t count)
+virtio_dev_rx_async_submit_split(struct virtio_net *dev, struct vhost_virtqueue *vq,
+               uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count,
+               int16_t dma_id, uint16_t vchan_id)
 {
        struct buf_vector buf_vec[BUF_VECTOR_MAX];
        uint32_t pkt_idx = 0;
@@ -1503,17 +1671,16 @@ virtio_dev_rx_async_submit_split(struct virtio_net *dev,
        if (unlikely(pkt_idx == 0))
                return 0;
 
-       n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx);
-       if (unlikely(n_xfer < 0)) {
-               VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue id %d.\n",
-                               dev->vid, __func__, queue_id);
-               n_xfer = 0;
-       }
+       n_xfer = vhost_async_dma_transfer(vq, dma_id, vchan_id, async->pkts_idx, async->iov_iter,
+                       pkt_idx);
 
        pkt_err = pkt_idx - n_xfer;
        if (unlikely(pkt_err)) {
                uint16_t num_descs = 0;
 
+               VHOST_LOG_DATA(DEBUG, "(%d) %s: failed to transfer %u packets for queue %u.\n",
+                               dev->vid, __func__, pkt_err, queue_id);
+
                /* update number of completed packets */
                pkt_idx = n_xfer;
 
@@ -1656,13 +1823,13 @@ dma_error_handler_packed(struct vhost_virtqueue *vq, uint16_t slot_idx,
 }
 
 static __rte_noinline uint32_t
-virtio_dev_rx_async_submit_packed(struct virtio_net *dev,
-       struct vhost_virtqueue *vq, uint16_t queue_id,
-       struct rte_mbuf **pkts, uint32_t count)
+virtio_dev_rx_async_submit_packed(struct virtio_net *dev, struct vhost_virtqueue *vq,
+               uint16_t queue_id, struct rte_mbuf **pkts, uint32_t count,
+               int16_t dma_id, uint16_t vchan_id)
 {
        uint32_t pkt_idx = 0;
        uint32_t remained = count;
-       int32_t n_xfer;
+       uint16_t n_xfer;
        uint16_t num_buffers;
        uint16_t num_descs;
 
@@ -1670,6 +1837,7 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev,
        struct async_inflight_info *pkts_info = async->pkts_info;
        uint32_t pkt_err = 0;
        uint16_t slot_idx = 0;
+       uint16_t head_idx = async->pkts_idx % vq->size;
 
        do {
                rte_prefetch0(&vq->desc_packed[vq->last_avail_idx]);
@@ -1694,19 +1862,17 @@ virtio_dev_rx_async_submit_packed(struct virtio_net *dev,
        if (unlikely(pkt_idx == 0))
                return 0;
 
-       n_xfer = async->ops.transfer_data(dev->vid, queue_id, async->iov_iter, 0, pkt_idx);
-       if (unlikely(n_xfer < 0)) {
-               VHOST_LOG_DATA(ERR, "(%d) %s: failed to transfer data for queue id %d.\n",
-                               dev->vid, __func__, queue_id);
-               n_xfer = 0;
-       }
-
-       pkt_err = pkt_idx - n_xfer;
+       n_xfer = vhost_async_dma_transfer(vq, dma_id, vchan_id, head_idx,
+                       async->iov_iter, pkt_idx);
 
        async_iter_reset(async);
 
-       if (unlikely(pkt_err))
+       pkt_err = pkt_idx - n_xfer;
+       if (unlikely(pkt_err)) {
+               VHOST_LOG_DATA(DEBUG, "(%d) %s: failed to transfer %u packets for queue %u.\n",
+                               dev->vid, __func__, pkt_err, queue_id);
                dma_error_handler_packed(vq, slot_idx, pkt_err, &pkt_idx);
+       }
 
        if (likely(vq->shadow_used_idx)) {
                /* keep used descriptors. */
@@ -1826,28 +1992,43 @@ write_back_completed_descs_packed(struct vhost_virtqueue *vq,
 
 static __rte_always_inline uint16_t
 vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id,
-               struct rte_mbuf **pkts, uint16_t count)
+               struct rte_mbuf **pkts, uint16_t count, int16_t dma_id,
+               uint16_t vchan_id)
 {
        struct vhost_virtqueue *vq = dev->virtqueue[queue_id];
        struct vhost_async *async = vq->async;
        struct async_inflight_info *pkts_info = async->pkts_info;
-       int32_t n_cpl;
+       uint32_t max_count;
+       uint16_t nr_cpl_pkts = 0;
        uint16_t n_descs = 0, n_buffers = 0;
        uint16_t start_idx, from, i;
 
-       n_cpl = async->ops.check_completed_copies(dev->vid, queue_id, 0, count);
-       if (unlikely(n_cpl < 0)) {
-               VHOST_LOG_DATA(ERR, "(%d) %s: failed to check completed copies for queue id %d.\n",
-                               dev->vid, __func__, queue_id);
-               return 0;
+       /* Check completed copies for the given DMA vChannel */
+       max_count = count * dma_poll_factor;
+       vhost_async_dma_check_completed(dma_id, vchan_id, max_count <= UINT16_MAX ? max_count :
+                       UINT16_MAX);
+
+       start_idx = async_get_first_inflight_pkt_idx(vq);
+
+       /**
+        * Calculate the number of packets whose copies have completed.
+        * Note that there may be completed packets even if no copies are
+        * reported done by the given DMA vChannel, as DMA vChannels could
+        * be shared by other threads.
+        */
+       from = start_idx;
+       while (vq->async->pkts_cmpl_flag[from] && count--) {
+               vq->async->pkts_cmpl_flag[from] = false;
+               from++;
+               if (from >= vq->size)
+                       from -= vq->size;
+               nr_cpl_pkts++;
        }
 
-       if (n_cpl == 0)
+       if (nr_cpl_pkts == 0)
                return 0;
 
-       start_idx = async_get_first_inflight_pkt_idx(vq);
-
-       for (i = 0; i < n_cpl; i++) {
+       for (i = 0; i < nr_cpl_pkts; i++) {
                from = (start_idx + i) % vq->size;
                /* Only used with packed ring */
                n_buffers += pkts_info[from].nr_buffers;
@@ -1856,7 +2037,7 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id,
                pkts[i] = pkts_info[from].mbuf;
        }
 
-       async->pkts_inflight_n -= n_cpl;
+       async->pkts_inflight_n -= nr_cpl_pkts;
 
        if (likely(vq->enabled && vq->access_ok)) {
                if (vq_is_packed(dev)) {
@@ -1877,12 +2058,13 @@ vhost_poll_enqueue_completed(struct virtio_net *dev, uint16_t queue_id,
                }
        }
 
-       return n_cpl;
+       return nr_cpl_pkts;
 }
 
 uint16_t
 rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
-               struct rte_mbuf **pkts, uint16_t count)
+               struct rte_mbuf **pkts, uint16_t count, int16_t dma_id,
+               uint16_t vchan_id)
 {
        struct virtio_net *dev = get_device(vid);
        struct vhost_virtqueue *vq;
@@ -1906,9 +2088,20 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
                return 0;
        }
 
-       rte_spinlock_lock(&vq->access_lock);
+       if (unlikely(!dma_copy_track[dma_id].vchans ||
+                               vchan_id > dma_copy_track[dma_id].max_vchans)) {
+               VHOST_LOG_DATA(ERR, "(%d) %s: invalid DMA %d vchan %u.\n",
+                              dev->vid, __func__, dma_id, vchan_id);
+               return 0;
+       }
 
-       n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count);
+       if (!rte_spinlock_trylock(&vq->access_lock)) {
+               VHOST_LOG_CONFIG(DEBUG, "Failed to poll completed packets from queue id %u. "
+                       "virt queue busy.\n", queue_id);
+               return 0;
+       }
+
+       n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan_id);
 
        rte_spinlock_unlock(&vq->access_lock);
 
@@ -1917,7 +2110,8 @@ rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
 
 uint16_t
 rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id,
-               struct rte_mbuf **pkts, uint16_t count)
+               struct rte_mbuf **pkts, uint16_t count, int16_t dma_id,
+               uint16_t vchan_id)
 {
        struct virtio_net *dev = get_device(vid);
        struct vhost_virtqueue *vq;
@@ -1941,14 +2135,21 @@ rte_vhost_clear_queue_thread_unsafe(int vid, uint16_t queue_id,
                return 0;
        }
 
-       n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count);
+       if (unlikely(!dma_copy_track[dma_id].vchans ||
+                               vchan_id > dma_copy_track[dma_id].max_vchans)) {
+               VHOST_LOG_DATA(ERR, "(%d) %s: invalid DMA %d vchan %u.\n",
+                              dev->vid, __func__, dma_id, vchan_id);
+               return 0;
+       }
+
+       n_pkts_cpl = vhost_poll_enqueue_completed(dev, queue_id, pkts, count, dma_id, vchan_id);
 
        return n_pkts_cpl;
 }
 
 static __rte_always_inline uint32_t
 virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id,
-       struct rte_mbuf **pkts, uint32_t count)
+       struct rte_mbuf **pkts, uint32_t count, int16_t dma_id, uint16_t vchan_id)
 {
        struct vhost_virtqueue *vq;
        uint32_t nb_tx = 0;
@@ -1960,6 +2161,13 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id,
                return 0;
        }
 
+       if (unlikely(!dma_copy_track[dma_id].vchans ||
+                               vchan_id > dma_copy_track[dma_id].max_vchans)) {
+               VHOST_LOG_DATA(ERR, "(%d) %s: invalid DMA %d vchan %u.\n", dev->vid, __func__,
+                               dma_id, vchan_id);
+               return 0;
+       }
+
        vq = dev->virtqueue[queue_id];
 
        rte_spinlock_lock(&vq->access_lock);
@@ -1980,10 +2188,10 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id,
 
        if (vq_is_packed(dev))
                nb_tx = virtio_dev_rx_async_submit_packed(dev, vq, queue_id,
-                               pkts, count);
+                               pkts, count, dma_id, vchan_id);
        else
                nb_tx = virtio_dev_rx_async_submit_split(dev, vq, queue_id,
-                               pkts, count);
+                               pkts, count, dma_id, vchan_id);
 
 out:
        if (dev->features & (1ULL << VIRTIO_F_IOMMU_PLATFORM))
@@ -1997,7 +2205,8 @@ virtio_dev_rx_async_submit(struct virtio_net *dev, uint16_t queue_id,
 
 uint16_t
 rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
-               struct rte_mbuf **pkts, uint16_t count)
+               struct rte_mbuf **pkts, uint16_t count, int16_t dma_id,
+               uint16_t vchan_id)
 {
        struct virtio_net *dev = get_device(vid);
 
@@ -2011,7 +2220,7 @@ rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
                return 0;
        }
 
-       return virtio_dev_rx_async_submit(dev, queue_id, pkts, count);
+       return virtio_dev_rx_async_submit(dev, queue_id, pkts, count, dma_id, vchan_id);
 }
 
 static inline bool
@@ -2369,7 +2578,7 @@ copy_desc_to_mbuf(struct virtio_net *dev, struct vhost_virtqueue *vq,
                cpy_len = RTE_MIN(buf_avail, mbuf_avail);
 
                if (likely(cpy_len > MAX_BATCH_LEN ||
-                                       vq->batch_copy_nb_elems >= vq->size ||
+                                       vq->batch_copy_nb_elems >= vq->batch_copy_max_elems ||
                                        (hdr && cur == m))) {
                        rte_memcpy(rte_pktmbuf_mtod_offset(cur, void *,
                                                mbuf_offset),
-- 
2.25.1
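
For context, a minimal sketch of how an application could drive the reworked
API end to end (illustration only, not from this patch: "dma_id", "queue_id",
"vid" and the burst sizes are placeholders, the poll factor is left at 1, and
return values are not checked):

#include <rte_mbuf.h>
#include <rte_vhost_async.h>

static void
async_enqueue_example(int vid, uint16_t queue_id, int16_t dma_id,
		struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	struct rte_mbuf *cpl[32];
	int16_t dmas_id[1] = { dma_id };

	/* One-time setup: declare the DMA devices vhost will use, then
	 * register the async channel once the vring is enabled. */
	rte_vhost_async_dma_configure(dmas_id, 1, 1 /* poll_factor */);
	rte_vhost_async_channel_register(vid, queue_id);

	/* Data path: each call now names the DMA device and vChannel. */
	rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, nb_pkts, dma_id, 0);
	rte_vhost_poll_enqueue_completed(vid, queue_id, cpl, 32, dma_id, 0);
}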
