> -----Original Message-----
> From: Ilya Maximets <i.maxim...@ovn.org>
> Sent: Thursday, April 7, 2022 3:40 PM
> To: Maxime Coquelin <maxime.coque...@redhat.com>; Van Haaren, Harry <harry.van.haa...@intel.com>; Morten Brørup <m...@smartsharesystems.com>; Richardson, Bruce <bruce.richard...@intel.com>
> Cc: i.maxim...@ovn.org; Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Hu, Jiayu <jiayu...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; ovs-...@openvswitch.org; dev@dpdk.org; Mcnamara, John <john.mcnam...@intel.com>; O'Driscoll, Tim <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>
> On 4/7/22 16:25, Maxime Coquelin wrote:
> > Hi Harry,
> >
> > On 4/7/22 16:04, Van Haaren, Harry wrote:
> >> Hi OVS & DPDK, Maintainers & Community,
> >>
> >> Top posting overview of discussion as replies to the thread become slower:
> >> perhaps it is a good time to review and plan for next steps?
> >>
> >> From my perspective, those most vocal in the thread seem to be in favour of the clean
> >> rx/tx split ("defer work"), with the tradeoff that the application must be aware of handling
> >> the async DMA completions. If there are any concerns opposing upstreaming of this method,
> >> please indicate this promptly, and we can continue technical discussions here now.
> >
> > Wasn't there some discussion about handling the Virtio completions with
> > the DMA engine? With that, we wouldn't need the deferral of work.
>
> +1
Yes there was; the DMA/virtq completions thread is here for reference:
https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392908.html

I do not believe that there is a viable path to actually implementing it, and particularly
not in the more complex cases, e.g. virtio with guest-interrupt enabled.

The thread above mentions additional threads and various other options, none of which
I believe to be a clean or workable solution. I'd like input from other folks more familiar
with the exact implementations of VHost/vrings, as well as those with DMA engine expertise.

> With the virtio completions handled by DMA itself, the vhost port
> turns almost into a real HW NIC. With that we will not need any
> extra manipulations from the OVS side, i.e. no need to defer any
> work while maintaining clear split between rx and tx operations.
>
> I'd vote for that.
>
> > Thanks,
> > Maxime

Thanks for the prompt responses, and let's understand if there is a viable, workable way
to totally hide DMA completions from the application.

Regards, -Harry

> >> In absence of continued technical discussion here, I suggest Sunil and Ian collaborate on getting
> >> the OVS Defer-work approach, and DPDK VHost Async patchsets available on GitHub for easier
> >> consumption and future development (as suggested in slides presented on last call).
> >>
> >> Regards, -Harry
> >>
> >> No inline-replies below; message just for context.
> >>
> >>> -----Original Message-----
> >>> From: Van Haaren, Harry
> >>> Sent: Wednesday, March 30, 2022 10:02 AM
> >>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> >>>
> >>>> -----Original Message-----
> >>>> From: Morten Brørup <m...@smartsharesystems.com>
> >>>> Sent: Tuesday, March 29, 2022 8:59 PM
> >>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> >>>>
> >>>>> From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
> >>>>> Sent: Tuesday, 29 March 2022 19.46
> >>>>>
> >>>>>> From: Morten Brørup <m...@smartsharesystems.com>
> >>>>>> Sent: Tuesday, March 29, 2022 6:14 PM
> >>>>>>
> >>>>>>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> >>>>>>> Sent: Tuesday, 29 March 2022 19.03
> >>>>>>>
> >>>>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> >>>>>>>>> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> >>>>>>>>> Sent: Tuesday, 29 March 2022 18.24
> >>>>>>>>>
> >>>>>>>>> Hi Morten,
> >>>>>>>>>
> >>>>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
> >>>>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
> >>>>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
> >>>>>>>>>>>
> >>>>>>>>>>>> From: Morten Brørup <m...@smartsharesystems.com>
> >>>>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
> >>>>>>>>>>>>
> >>>>>>>>>>>> Having thought more about it, I think that a completely different
> >>>>>>>>>>>> architectural approach is required:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX
> >>>>>>>>>>>> packet burst functions, each optimized for different CPU vector
> >>>>>>>>>>>> instruction sets. The availability of a DMA engine should be treated
> >>>>>>>>>>>> the same way. So I suggest that PMDs copying packet contents, e.g.
> >>>>>>>>>>>> memif, pcap, vmxnet3, should implement DMA optimized RX and TX
> >>>>>>>>>>>> packet burst functions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Similarly for the DPDK vhost library.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In such an architecture, it would be the application's job to
> >>>>>>>>>>>> allocate DMA channels and assign them to the specific PMDs that
> >>>>>>>>>>>> should use them. But the actual use of the DMA channels would move
> >>>>>>>>>>>> down below the application and into the DPDK PMDs and libraries.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Med venlig hilsen / Kind regards,
> >>>>>>>>>>>> -Morten Brørup
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Morten,
> >>>>>>>>>>>
> >>>>>>>>>>> That's *exactly* how this architecture is designed & implemented.
> >>>>>>>>>>> 1. The DMA configuration and initialization is up to the application (OVS).
> >>>>>>>>>>> 2. The VHost library is passed the DMA-dev ID, and its new async
> >>>>>>>>>>>    rx/tx APIs, and uses the DMA device to accelerate the copy.
> >>>>>>>>>>>
> >>>>>>>>>>> Looking forward to talking on the call that just started. Regards, -Harry
> >>>>>>>>>>
> >>>>>>>>>> OK, thanks - as I said on the call, I haven't looked at the patches.
> >>>>>>>>>>
> >>>>>>>>>> Then, I suppose that the TX completions can be handled in the TX
> >>>>>>>>>> function, and the RX completions can be handled in the RX function,
> >>>>>>>>>> just like the Ethdev PMDs handle packet descriptors:
> >>>>>>>>>>
> >>>>>>>>>> TX_Burst(tx_packet_array):
> >>>>>>>>>> 1. Clean up descriptors processed by the NIC chip. --> Process TX DMA
> >>>>>>>>>>    channel completions. (Effectively, the 2nd pipeline stage.)
> >>>>>>>>>> 2. Pass on the tx_packet_array to the NIC chip descriptors. --> Pass on
> >>>>>>>>>>    the tx_packet_array to the TX DMA channel. (Effectively, the 1st
> >>>>>>>>>>    pipeline stage.)
> >>>>>>>>>
> >>>>>>>>> The problem is Tx function might not be called again, so enqueued
> >>>>>>>>> packets in 2. may never be completed from a Virtio point of view. IOW,
> >>>>>>>>> the packets will be copied to the Virtio descriptors buffers, but the
> >>>>>>>>> descriptors will not be made available to the Virtio driver.
> >>>>>>>>
> >>>>>>>> In that case, the application needs to call TX_Burst() periodically
> >>>>>>>> with an empty array, for completion purposes.
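For illustration only, a minimal C sketch of the two-stage TX pattern described above. The helpers dma_tx_completions() and dma_tx_enqueue() are hypothetical stand-ins for whatever a DMA-capable PMD or the vhost library would expose; they are not existing DPDK or OVS APIs.

    /*
     * Illustrative sketch: ethdev-style two-stage TX burst with a DMA engine.
     * dma_tx_completions() and dma_tx_enqueue() are hypothetical helpers.
     */
    #include <stdint.h>
    #include <rte_mbuf.h>

    struct txq;                                    /* opaque per-queue context */
    uint16_t dma_tx_completions(struct txq *txq);  /* reap finished DMA copies */
    uint16_t dma_tx_enqueue(struct txq *txq, struct rte_mbuf **pkts, uint16_t nb);

    static uint16_t
    tx_burst(struct txq *txq, struct rte_mbuf **pkts, uint16_t nb_pkts)
    {
        /* "2nd pipeline stage": process TX DMA completions from earlier calls,
         * making the already-copied descriptors available to the peer. */
        dma_tx_completions(txq);

        /* "1st pipeline stage": hand the new packets to the TX DMA channel. */
        return dma_tx_enqueue(txq, pkts, nb_pkts);
    }

    /* The "call periodically with an empty array" idea then becomes simply
     *     tx_burst(txq, NULL, 0);
     * which only flushes completions and submits nothing new. */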
> >>>>>
> >>>>> This is what the "defer work" does at the OVS thread-level, but instead of
> >>>>> "brute-forcing" and *always* making the call, the defer work concept tracks
> >>>>> *when* there is outstanding work (DMA copies) to be completed ("deferred work")
> >>>>> and calls the generic completion function at that point.
> >>>>>
> >>>>> So "defer work" is generic infrastructure at the OVS thread level to handle
> >>>>> work that needs to be done "later", e.g. DMA completion handling.
> >>>>>
> >>>>>>>> Or some sort of TX_Keepalive() function can be added to the DPDK
> >>>>>>>> library, to handle DMA completion. It might even handle multiple DMA
> >>>>>>>> channels, if convenient - and if possible without locking or other
> >>>>>>>> weird complexity.
> >>>>>
> >>>>> That's exactly how it is done; the VHost library has a new API added, which
> >>>>> allows for handling completions. And in the "Netdev layer" (~OVS ethdev
> >>>>> abstraction) we add a function to allow the OVS thread to do those completions
> >>>>> in a new Netdev-abstraction API called "async_process" where the completions
> >>>>> can be checked.
> >>>>>
> >>>>> The only method to abstract them is to "hide" them somewhere that will always
> >>>>> be polled, e.g. an ethdev port's RX function. Both V3 and V4 approaches use
> >>>>> this method. This allows "completions" to be transparent to the app, at the
> >>>>> tradeoff of having bad separation of concerns, as Rx and Tx are now tied together.
> >>>>>
> >>>>> The point is, the Application layer must *somehow* handle completions.
> >>>>> So fundamentally there are 2 options for the Application level:
> >>>>>
> >>>>> A) Make the application periodically call a "handle completions" function.
> >>>>>    A1) Defer work: call when needed, and track "needed" at the app layer,
> >>>>>        calling into vhost txq complete as required. Elegant in that
> >>>>>        "no work" means "no cycles spent" on checking DMA completions.
> >>>>>    A2) Brute-force-always-call, and pay some overhead when not required.
> >>>>>        Cycle-cost in "no work" scenarios. Depending on the # of vhost
> >>>>>        queues, this adds up, as polling is required *per vhost txq*.
> >>>>>        Also note that "checking DMA completions" means taking a
> >>>>>        virtq-lock, so this "brute-force" can needlessly increase
> >>>>>        x-thread contention!
> >>>>
> >>>> A side note: I don't see why locking is required to test for DMA completions.
> >>>> rte_dma_vchan_status() is lockless, e.g.:
> >>>> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L560
> >>>
> >>> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a lock-free manner
> >>> from a single thread.
> >>>
> >>> The locks I refer to are at the OVS-netdev level, as virtqs are shared across OVS's
> >>> dataplane threads. So the "M to N" comes from M dataplane threads to N virtqs,
> >>> hence requiring some locking.
> >>>
> >>>>> B) Hide completions and live with the complexity/architectural
> >>>>>    sacrifice of mixed-RxTx. Various downsides here in my opinion, see
> >>>>>    the slide deck presented earlier today for a summary.
> >>>>>
> >>>>> In my opinion, A1 is the most elegant solution, as it has a clean separation
> >>>>> of concerns, does not cause avoidable contention on virtq locks, and spends
> >>>>> no cycles when there is no completion work to do.
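As a rough, conceptual C sketch of what A1's bookkeeping could look like at the thread level (this is not the actual OVS patch code; netdev_async_process() is a placeholder name for the netdev-level completion hook described above):

    /*
     * Conceptual sketch of option A1 ("defer work") only. Each dataplane
     * thread remembers which txqs have outstanding DMA copies, and only
     * those queues get their completion routine called.
     */
    #include <stdint.h>

    #define MAX_DEFERRED_TXQS 64

    struct defer_work {
        void *txq[MAX_DEFERRED_TXQS];  /* txqs with DMA copies in flight */
        uint32_t n;
    };

    int netdev_async_process(void *txq);  /* placeholder completion hook */

    /* Called by the tx path after submitting DMA copies for a queue. */
    static inline void
    defer_work_record(struct defer_work *dw, void *txq)
    {
        if (dw->n < MAX_DEFERRED_TXQS) {
            dw->txq[dw->n++] = txq;
        }
    }

    /* Called once per dataplane loop iteration: when nothing was deferred,
     * n == 0 and no virtq lock is ever taken ("no work" == "no cycles"). */
    static inline void
    defer_work_run(struct defer_work *dw)
    {
        for (uint32_t i = 0; i < dw->n; i++) {
            netdev_async_process(dw->txq[i]);
        }
        dw->n = 0;
    }

A real implementation would also re-record a queue whose completions were not all drained in one pass; that detail is omitted here for brevity.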
> >>>>
> >>>> Thank you for elaborating, Harry.
> >>>
> >>> Thanks for taking part in the discussion & providing your insight!
> >>>
> >>>> I strongly oppose hiding any part of TX processing in an RX function. It is
> >>>> just wrong in so many ways!
> >>>>
> >>>> I agree that A1 is the most elegant solution. And being the most elegant
> >>>> solution, it is probably also the most future-proof solution. :-)
> >>>
> >>> I think so too, yes.
> >>>
> >>>> I would also like to stress that DMA completion handling belongs in the DPDK
> >>>> library, not in the application. And yes, the application will be required to
> >>>> call some "handle DMA completions" function in the DPDK library. But since the
> >>>> application already knows that it uses DMA, the application should also know
> >>>> that it needs to call this extra function - so I consider this requirement
> >>>> perfectly acceptable.
> >>>
> >>> Agree here.
> >>>
> >>>> I prefer if the DPDK vhost library can hide its inner workings from the
> >>>> application, and just expose the additional "handle completions" function. This
> >>>> also means that the inner workings can be implemented as "defer work", or by
> >>>> some other algorithm. And it can be tweaked and optimized later.
> >>>
> >>> Yes, the choice of how to call the handle_completions function is at the
> >>> Application layer. For OVS we designed Defer Work, V3 and V4. But it is an
> >>> App-level choice, and every application is free to choose its own method.
> >>>
> >>>> Thinking about the long term perspective, this design pattern is common for
> >>>> both the vhost library and other DPDK libraries that could benefit from DMA
> >>>> (e.g. vmxnet3 and pcap PMDs), so it could be abstracted into the DMA library or
> >>>> a separate library. But for now, we should focus on the vhost use case, and
> >>>> just keep the long term roadmap for using DMA in mind.
> >>>
> >>> Totally agree to keep the long term roadmap in mind; but I'm not sure we can
> >>> refactor logic out of vhost. When DMA-completions arrive, the virtq needs to be
> >>> updated; this causes a tight coupling between the DMA completion count and the
> >>> vhost library.
> >>>
> >>> As Ilya raised on the call yesterday, there is an "in_order" requirement in the
> >>> vhost library, that per virtq the packets are presented to the guest "in order"
> >>> of enqueue. (To be clear, *not* order of DMA-completion! As Jiayu mentioned, the
> >>> Vhost library handles this today by re-ordering the DMA completions.)
> >>>
> >>>> Rephrasing what I said on the conference call: This vhost design will become
> >>>> the common design pattern for using DMA in DPDK libraries. If we get it wrong,
> >>>> we are stuck with it.
> >>>
> >>> Agree, and if we get it right, then we're stuck with it too! :)
> >>>
> >>>>>>>> Here is another idea, inspired by a presentation at one of the DPDK
> >>>>>>>> Userspace conferences. It may be wishful thinking, though:
> >>>>>>>>
> >>>>>>>> Add an additional transaction to each DMA burst; a special transaction
> >>>>>>>> containing the memory write operation that makes the descriptors
> >>>>>>>> available to the Virtio driver.
> >>>>>>>
> >>>>>>> That is something that can work, so long as the receiver is operating in
> >>>>>>> polling mode.
> >>>>>>> For cases where virtio interrupts are enabled, you still need to do a
> >>>>>>> write to the eventfd in the kernel in vhost to signal the virtio side.
> >>>>>>> That's not something that can be offloaded to a DMA engine, sadly, so
> >>>>>>> we still need some form of completion call.
> >>>>>>
> >>>>>> I guess that virtio interrupts are the most widely deployed scenario, so
> >>>>>> let's ignore the DMA TX completion transaction for now - and call it a
> >>>>>> possible future optimization for specific use cases. So it seems that some
> >>>>>> form of completion call is unavoidable.
> >>>>>
> >>>>> Agree to leave this aside; there is in theory a potential optimization, but
> >>>>> it is unlikely to be of large value.
> >>>>
> >>>> One more thing: When using DMA to pass on packets into a guest, there could be
> >>>> a delay from when the DMA completes until the guest is signaled. Is there any
> >>>> CPU cache hotness regarding the guest's access to the packet data to consider
> >>>> here? I.e. if we delay signaling the guest, the packet data may get cold.
> >>>
> >>> Interesting question; we can likely spawn a new thread around this topic!
> >>> In short, it depends on how/where the DMA hardware writes the copy.
> >>>
> >>> With technologies like DDIO, the "dest" part of the copy will be in LLC. The
> >>> core reading the dest data will benefit from the LLC locality (instead of
> >>> snooping it from a remote core's L1/L2).
> >>>
> >>> Delays in notifying the guest could result in LLC capacity eviction, yes.
> >>> The application layer decides how often/promptly to check for completions,
> >>> and notify the guest of them. Calling the function more often will result in
> >>> less delay in that portion of the pipeline.
> >>>
> >>> Overall, there are caching benefits with DMA acceleration, and the application
> >>> can control the latency introduced between DMA completion in HW and the guest
> >>> vring update.
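A rough sketch of that latency knob, under the assumption of a placeholder vhost_txq_complete() helper standing in for the vhost async completion plus guest-notification step (not a real DPDK API):

    /*
     * Sketch only: the dataplane thread chooses how often to harvest DMA
     * completions and notify the guest, trading per-iteration overhead
     * against completion-to-notification latency (and data "hotness").
     */
    #include <stdint.h>

    int vhost_txq_complete(void *vhost_txq);  /* placeholder completion step */

    void
    dataplane_loop(void *vhost_txq)
    {
        /* Harvesting every iteration minimises the DMA-done-to-guest-notify
         * delay, keeping the DDIO-placed data warm in LLC; harvesting every
         * N iterations amortises the call overhead but risks the copied data
         * cooling, or being evicted from LLC, before the guest reads it. */
        const uint32_t complete_interval = 1;  /* tunable; assumption */
        uint32_t iter = 0;

        for (;;) {
            /* ... normal rx/tx burst processing ... */

            if (++iter % complete_interval == 0) {
                vhost_txq_complete(vhost_txq);
            }
        }
    }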