Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin <maxime.coque...@redhat.com>
> Sent: Thursday, March 19, 2020 5:10 PM
> To: Hu, Jiayu <jiayu...@intel.com>; dev@dpdk.org
> Cc: Ye, Xiaolong <xiaolong...@intel.com>; Wang, Zhihong <zhihong.w...@intel.com>
> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-user PMD
>
> Hi Jiayu,
>
> On 3/19/20 8:33 AM, Hu, Jiayu wrote:
> > Hi Maxime,
> >
> > Thanks for your comments. Replies are inline.
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin <maxime.coque...@redhat.com>
> >> Sent: Tuesday, March 17, 2020 5:54 PM
> >> To: Hu, Jiayu <jiayu...@intel.com>; dev@dpdk.org
> >> Cc: Ye, Xiaolong <xiaolong...@intel.com>; Wang, Zhihong <zhihong.w...@intel.com>
> >> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-user PMD
> >>
> >> Hi Jiayu,
> >>
> >> On 3/17/20 10:21 AM, Jiayu Hu wrote:
> >>> In vhost-user PMD's Tx operations, where data movement is heavily
> >>> involved, performing large memory copies usually takes up a major
> >>> part of CPU cycles and becomes the hot spot. To offload expensive
> >>> memory operations from the CPU, this patch set proposes to leverage
> >>> DMA engines, e.g. I/OAT, a DMA engine in Intel processors, to
> >>> accelerate large copies for vhost-user.
> >>>
> >>> Large copies are offloaded from the CPU to the DMA in an asynchronous
> >>> manner. The CPU just submits copy jobs to the DMA without waiting
> >>> for copy completion. Thus, there is no CPU intervention during data
> >>> transfer; we can save precious CPU cycles and improve the overall
> >>> throughput for vhost-user PMD based applications, like OVS. During
> >>> packet transmission, it offloads large copies to the DMA and performs
> >>> small copies with the CPU, due to the startup overheads associated
> >>> with the DMA.
> >>>
> >>> The vhost-user PMD is able to support various DMA engines, but it
> >>> currently supports only I/OAT devices. In addition, I/OAT acceleration
> >>> is only enabled for Tx operations of split rings. Users can explicitly
> >>> assign an I/OAT device to a queue with the parameter 'dmas'. However,
> >>> one I/OAT device can only be used by one queue, and a queue can use
> >>> one I/OAT device at a time.
> >>>
> >>> We measure the performance in testpmd. With 1024-byte packets,
> >>> compared with the original SW data path, DMA-enabled vhost-user PMD
> >>> can improve the throughput by around 20%~30% in the VM2VM and PVP
> >>> cases. Furthermore, with larger packets, the throughput improvement
> >>> will be higher.
> >>
> >>
> >> I'm not sure it should be done like that, for several reasons.
> >>
> >> First, it seems really complex for the user to get the command line
> >> right. There is no mention in the doc patch on how to bind the DMAs to
> >> the DPDK application. Are all the DMAs on the system capable of doing
> >> it?
> >
> > DMA engines in Intel CPUs are able to move data within system memory.
> > Currently, we have I/OAT and we will have DSA in the future.
>
> OK, can you give me an example of how many I/OAT instances there are on
> a given CPU?
One Xeon Platinum 8180 CPU has 8 I/OAT instances.

> >> I think it should be made transparent to the user, who should not have
> >> to specify the DMA device address in the command line. The user should
> >> just pass a devarg specifying he wants to use DMAs, if available.
> >
> > What do you think of replacing the DMA address with specific DMA
> > capabilities, like "dmas=[txq0@DMACOPY]"? As I/OAT only supports data
> > movement, we can just provide basic DMA copy ability now. But when
> > there are more DMA devices, we can add capabilities in devargs later.
>
> "dmas=[txq0@DMACOPY]" is still too complex IMHO. We should just have a
> flag to enable DMA or not (tx_dma=1 / tx_dma=0), and this would be used
> for all queues as we do for zero-copy.
>
> >>
> >> Second, it looks too vendor-specific. IMHO, we should have a DMA
> >> framework, so that the driver can request DMA channels based on
> >> capabilities.
> >
> > We only have one DMA engine, I/OAT, in DPDK, and it is implemented as
> > a rawdev. IMO, it will be very hard to provide a generic DMA
> > abstraction currently. In addition, the I/OAT-specific API is called
> > inside the vhost-user PMD, so we can replace these function calls when
> > we have a DMA framework in the future. Users are unaware of the
> > changes. Does it make sense to you?
>
> Having an abstraction might be hard, but it does not seem impossible.
> Such a DMA abstraction has been done in the kernel for IOAT. For example:
> https://lore.kernel.org/patchwork/cover/56714/
>
> >>
> >> Also, I don't think implementing ring processing in the Vhost PMD is
> >> welcome; the Vhost PMD should just be a wrapper for the Vhost library.
> >> Doing that in the Vhost PMD causes code duplication, and will be a
> >> maintenance burden in the long run.
> >>
> >> As IOAT is a kind of acceleration, why not implement it through the
> >> vDPA framework? The vDPA framework should be extended to support this
> >> kind of acceleration which requires some CPU processing, as opposed to
> >> the full offload of ring processing that it only supports today.
> >
> > The main reason for implementing the data path in the vhost PMD is to
> > avoid impacting the SW data path in the vhost library. Even if we
> > implement it as an instance of vDPA, we also have to implement the data
> > path in a new vdev PMD, as DMA just accelerates memory copies and all
> > ring operations have to be done by the CPU. There is still the code
> > duplication issue.
>
> Ok, so what about:
>
> Introducing a pair of callbacks in struct virtio_net for DMA enqueue and
> dequeue.
>
> lib/librte_vhost/ioat.c which would implement dma_enqueue and
> dma_dequeue callbacks for IOAT. As it will live in the vhost lib
> directory, it will be easy to refactor the code to share as much as
> possible and so avoid code duplication.
>
> In rte_vhost_enqueue/dequeue_burst, if the dma callback is set, then
> call it instead of the SW datapath. It adds a few cycles, but this is
> much more sane IMHO.

The problem is that the current semantics of the rte_vhost_enqueue/dequeue
APIs conflict with the I/OAT accelerated data path. To improve performance,
the I/OAT works in an asynchronous manner, where the CPU just submits copy
jobs to the I/OAT without waiting for copy completion. For
rte_vhost_enqueue_burst, users cannot reuse the enqueued pktmbufs when it
returns, as the I/OAT may still be using them. For rte_vhost_dequeue_burst,
users will not get the incoming packets, as the I/OAT is still performing
the packet copies.
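To make the conflict concrete, here is a minimal sketch of the usage
pattern that the current contract allows; the completion helper named in
the comment does not exist, it is only there to show what an application
would additionally need once the copies become asynchronous:

#include <rte_mbuf.h>
#include <rte_vhost.h>

/* Sketch only: with the existing SW data path, the caller may free the
 * mbufs as soon as rte_vhost_enqueue_burst() returns, because the copies
 * into the guest buffers are already done at that point.
 */
static void
send_burst_sw(int vid, uint16_t queue_id, struct rte_mbuf **pkts,
		uint16_t count)
{
	uint16_t sent = rte_vhost_enqueue_burst(vid, queue_id, pkts, count);

	rte_pktmbuf_free_bulk(pkts, sent); /* safe with the SW data path */
}

/* With asynchronous I/OAT copies, that same free would race with the DMA
 * engine, which may still be reading from the mbufs. The application
 * would instead have to poll some completion API before releasing them,
 * e.g. a hypothetical rte_vhost_poll_enqueue_completed(vid, queue_id,
 * pkts, count), which is exactly the semantic change described below.
 */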
As you can see, when enabling I/OAT acceleration, the semantics of the two
APIs change. If we keep the same API names but change their semantics, this
may confuse users, IMHO.

Thanks,
Jiayu

> What do you think?
>
> Thanks,
> Maxime
>
> > Thanks,
> > Jiayu
> >
> >>
> >> Kind regards,
> >> Maxime
> >>
> >>> Jiayu Hu (4):
> >>>   vhost: populate guest memory for DMA-accelerated vhost-user
> >>>   net/vhost: setup vrings for DMA-accelerated datapath
> >>>   net/vhost: leverage DMA engines to accelerate Tx operations
> >>>   doc: add I/OAT acceleration support for vhost-user PMD
> >>>
> >>>  doc/guides/nics/vhost.rst         |  14 +
> >>>  drivers/Makefile                  |   2 +-
> >>>  drivers/net/vhost/Makefile        |   6 +-
> >>>  drivers/net/vhost/internal.h      | 160 +++++++
> >>>  drivers/net/vhost/meson.build     |   5 +-
> >>>  drivers/net/vhost/rte_eth_vhost.c | 308 +++++++++++---
> >>>  drivers/net/vhost/virtio_net.c    | 861 ++++++++++++++++++++++++++++++++++++++
> >>>  drivers/net/vhost/virtio_net.h    | 288 +++++++++++++
> >>>  lib/librte_vhost/rte_vhost.h      |   1 +
> >>>  lib/librte_vhost/socket.c         |  20 +
> >>>  lib/librte_vhost/vhost.h          |   2 +
> >>>  lib/librte_vhost/vhost_user.c     |   3 +-
> >>>  12 files changed, 1597 insertions(+), 73 deletions(-)
> >>>  create mode 100644 drivers/net/vhost/internal.h
> >>>  create mode 100644 drivers/net/vhost/virtio_net.c
> >>>  create mode 100644 drivers/net/vhost/virtio_net.h
> >>>
> >