> -----Original Message-----
> From: Ilya Maximets <i.maxim...@ovn.org>
> Sent: Thursday, April 7, 2022 3:40 PM
> To: Maxime Coquelin <maxime.coque...@redhat.com>; Van Haaren, Harry
> <harry.van.haa...@intel.com>; Morten Brørup <m...@smartsharesystems.com>;
> Richardson, Bruce <bruce.richard...@intel.com>
> Cc: i.maxim...@ovn.org; Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian
> <ian.sto...@intel.com>; Hu, Jiayu <jiayu...@intel.com>; Ferriter, Cian
> <cian.ferri...@intel.com>; ovs-...@openvswitch.org; dev@dpdk.org; Mcnamara,
> John <john.mcnam...@intel.com>; O'Driscoll, Tim <tim.odrisc...@intel.com>;
> Finn, Emma <emma.f...@intel.com>
> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> 
> On 4/7/22 16:25, Maxime Coquelin wrote:
> > Hi Harry,
> >
> > On 4/7/22 16:04, Van Haaren, Harry wrote:
> >> Hi OVS & DPDK, Maintainers & Community,
> >>
> >> Top posting overview of discussion as replies to thread become slower:
> >> perhaps it is a good time to review and plan for next steps?
> >>
> >>  From my perspective, those most vocal in the thread seem to be in favour
> >> of the clean rx/tx split ("defer work"), with the tradeoff that the
> >> application must be aware of handling the async DMA completions. If there
> >> are any concerns opposing upstreaming of this method, please indicate this
> >> promptly, and we can continue technical discussions here now.
> >
> > Wasn't there some discussions about handling the Virtio completions with
> > the DMA engine? With that, we wouldn't need the deferral of work.
> 
> +1

Yes, there was; the DMA/virtq completions thread is here for reference:
https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392908.html

I do not believe that there is a viable path to actually implementing it, and
particularly not in the more complex cases, e.g. virtio with guest interrupts
enabled.

The thread above mentions additional threads and various other options; none of
which I believe to be a clean or workable solution. I'd like input from other
folks more familiar with the exact implementations of VHost/vrings, as well as
those with DMA engine expertise.
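
To make the guest-interrupt point concrete: even if the DMA engine could write
the used-ring update itself, the actual guest notification with vhost-user is a
write to the vring's "call" eventfd, which only the CPU can issue. A rough
sketch of that step (illustrative only, not the vhost library code):

    #include <stdint.h>
    #include <unistd.h>

    /* Illustrative only: after the DMA engine has finished the copies and the
     * used ring has been updated, the guest still has to be notified.  With
     * vhost-user this is a plain write to the vring's "call" eventfd, which a
     * DMA engine cannot issue on our behalf. */
    static void
    notify_guest(int callfd, int interrupts_enabled)
    {
        if (!interrupts_enabled)
            return;               /* polling guest: no kick needed */

        uint64_t kick = 1;
        /* eventfd write: a CPU syscall, not something DMA can offload. */
        if (write(callfd, &kick, sizeof(kick)) < 0) {
            /* best effort; a missed kick is retried on the next pass */
        }
    }

So some CPU involvement per completion batch seems unavoidable whenever the
guest has interrupts enabled.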


> With the virtio completions handled by DMA itself, the vhost port
> turns almost into a real HW NIC.  With that we will not need any
> extra manipulations from the OVS side, i.e. no need to defer any
> work while maintaining a clear split between rx and tx operations.
> 
> I'd vote for that.
> 
> >
> > Thanks,
> > Maxime

Thanks for the prompt responses, and let's understand if there is a viable,
workable way to totally hide DMA completions from the application.

Regards,  -Harry


> >> In the absence of continued technical discussion here, I suggest Sunil and
> >> Ian collaborate on getting the OVS Defer-work approach and the DPDK VHost
> >> Async patchsets available on GitHub for easier consumption and future
> >> development (as suggested in slides presented on the last call).
> >>
> >> Regards, -Harry
> >>
> >> No inline-replies below; message just for context.
> >>
> >>> -----Original Message-----
> >>> From: Van Haaren, Harry
> >>> Sent: Wednesday, March 30, 2022 10:02 AM
> >>> To: Morten Brørup <m...@smartsharesystems.com>; Richardson, Bruce
> >>> <bruce.richard...@intel.com>
> >>> Cc: Maxime Coquelin <maxime.coque...@redhat.com>; Pai G, Sunil
> >>> <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Hu, Jiayu
> >>> <jiayu...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; Ilya
> >>> Maximets <i.maxim...@ovn.org>; ovs-...@openvswitch.org; dev@dpdk.org;
> >>> Mcnamara, John <john.mcnam...@intel.com>; O'Driscoll, Tim
> >>> <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> >>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> >>>
> >>>> -----Original Message-----
> >>>> From: Morten Brørup <m...@smartsharesystems.com>
> >>>> Sent: Tuesday, March 29, 2022 8:59 PM
> >>>> To: Van Haaren, Harry <harry.van.haa...@intel.com>; Richardson, Bruce
> >>>> <bruce.richard...@intel.com>
> >>>> Cc: Maxime Coquelin <maxime.coque...@redhat.com>; Pai G, Sunil
> >>>> <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Hu, Jiayu
> >>>> <jiayu...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; Ilya
> >>>> Maximets <i.maxim...@ovn.org>; ovs-...@openvswitch.org; dev@dpdk.org;
> >>>> Mcnamara, John <john.mcnam...@intel.com>; O'Driscoll, Tim
> >>>> <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> >>>> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> >>>>
> >>>>> From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
> >>>>> Sent: Tuesday, 29 March 2022 19.46
> >>>>>
> >>>>>> From: Morten Brørup <m...@smartsharesystems.com>
> >>>>>> Sent: Tuesday, March 29, 2022 6:14 PM
> >>>>>>
> >>>>>>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> >>>>>>> Sent: Tuesday, 29 March 2022 19.03
> >>>>>>>
> >>>>>>> On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> >>>>>>>>> From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
> >>>>>>>>> Sent: Tuesday, 29 March 2022 18.24
> >>>>>>>>>
> >>>>>>>>> Hi Morten,
> >>>>>>>>>
> >>>>>>>>> On 3/29/22 16:44, Morten Brørup wrote:
> >>>>>>>>>>> From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
> >>>>>>>>>>> Sent: Tuesday, 29 March 2022 15.02
> >>>>>>>>>>>
> >>>>>>>>>>>> From: Morten Brørup <m...@smartsharesystems.com>
> >>>>>>>>>>>> Sent: Tuesday, March 29, 2022 1:51 PM
> >>>>>>>>>>>>
> >>>>>>>>>>>> Having thought more about it, I think that a completely different
> >>>>>>>>>>>> architectural approach is required:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX
> >>>>>>>>>>>> packet burst functions, each optimized for different CPU vector
> >>>>>>>>>>>> instruction sets. The availability of a DMA engine should be
> >>>>>>>>>>>> treated the same way. So I suggest that PMDs copying packet
> >>>>>>>>>>>> contents, e.g. memif, pcap, vmxnet3, should implement DMA
> >>>>>>>>>>>> optimized RX and TX packet burst functions.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Similarly for the DPDK vhost library.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In such an architecture, it would be the application's job to
> >>>>>>>>>>>> allocate DMA channels and assign them to the specific PMDs that
> >>>>>>>>>>>> should use them. But the actual use of the DMA channels would
> >>>>>>>>>>>> move down below the application and into the DPDK PMDs and
> >>>>>>>>>>>> libraries.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Med venlig hilsen / Kind regards,
> >>>>>>>>>>>> -Morten Brørup
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Morten,
> >>>>>>>>>>>
> >>>>>>>>>>> That's *exactly* how this architecture is designed & implemented.
> >>>>>>>>>>> 1.    The DMA configuration and initialization is up to the
> >>>>>>>>>>>       application (OVS).
> >>>>>>>>>>> 2.    The VHost library is passed the DMA-dev ID and its new async
> >>>>>>>>>>>       rx/tx APIs, and uses the DMA device to accelerate the copy.
> >>>>>>>>>>>
> >>>>>>>>>>> Looking forward to talking on the call that just started.
> >>>>>>>>>>> Regards, -Harry
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> OK, thanks - as I said on the call, I haven't looked at the patches.
> >>>>>>>>>>
> >>>>>>>>>> Then, I suppose that the TX completions can be handled in the TX
> >>>>>>>>>> function, and the RX completions can be handled in the RX function,
> >>>>>>>>>> just like the Ethdev PMDs handle packet descriptors:
> >>>>>>>>>>
> >>>>>>>>>> TX_Burst(tx_packet_array):
> >>>>>>>>>> 1.    Clean up descriptors processed by the NIC chip. --> Process
> >>>>>>>>>>       TX DMA channel completions. (Effectively, the 2nd pipeline
> >>>>>>>>>>       stage.)
> >>>>>>>>>> 2.    Pass on the tx_packet_array to the NIC chip descriptors. -->
> >>>>>>>>>>       Pass on the tx_packet_array to the TX DMA channel.
> >>>>>>>>>>       (Effectively, the 1st pipeline stage.)
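
As an illustration of how that two-stage pattern maps onto the DPDK dmadev API
(the txq bookkeeping struct and the destination-address handling below are
assumptions for the sketch, not the proposed patchset):

    #include <rte_dmadev.h>
    #include <rte_mbuf.h>

    /* Hypothetical per-txq state, for illustration only. */
    struct dma_txq {
        int16_t dma_id;
        uint16_t vchan;
        uint16_t inflight;   /* copies submitted but not yet completed */
    };

    static uint16_t
    tx_burst(struct dma_txq *q, struct rte_mbuf **pkts, uint16_t nb_pkts)
    {
        /* Stage 2 (of the previous call): reap DMA completions and make the
         * corresponding descriptors available to the other side. */
        if (q->inflight) {
            uint16_t last_idx;
            bool error = false;
            uint16_t done = rte_dma_completed(q->dma_id, q->vchan,
                                              q->inflight, &last_idx, &error);
            q->inflight -= done;
            /* ... mark 'done' descriptors as used / free mbufs here ... */
        }

        /* Stage 1: hand the new packets to the DMA channel. */
        uint16_t i;
        for (i = 0; i < nb_pkts; i++) {
            rte_iova_t src = rte_pktmbuf_iova(pkts[i]);
            rte_iova_t dst = 0;  /* placeholder: destination buffer IOVA */
            if (rte_dma_copy(q->dma_id, q->vchan, src, dst,
                             rte_pktmbuf_data_len(pkts[i]), 0) < 0)
                break;
        }
        rte_dma_submit(q->dma_id, q->vchan);
        q->inflight += i;
        return i;
    }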
> >>>>>>>>>
> >>>>>>>>> The problem is the Tx function might not be called again, so packets
> >>>>>>>>> enqueued in 2. may never be completed from a Virtio point of view.
> >>>>>>>>> IOW, the packets will be copied to the Virtio descriptor buffers, but
> >>>>>>>>> the descriptors will not be made available to the Virtio driver.
> >>>>>>>>
> >>>>>>>> In that case, the application needs to call TX_Burst() periodically
> >>>>>>>> with an empty array, for completion purposes.
> >>>>>
> >>>>> This is what the "defer work" does at the OVS thread-level, but instead
> >>>>> of "brute-forcing" and *always* making the call, the defer work concept
> >>>>> tracks *when* there is outstanding work (DMA copies) to be completed
> >>>>> ("deferred work") and calls the generic completion function at that point.
> >>>>>
> >>>>> So "defer work" is generic infrastructure at the OVS thread level to
> >>>>> handle work that needs to be done "later", e.g. DMA completion handling.
> >>>>>
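
For reference, a minimal sketch of the defer-work idea at the thread level (all
names here are illustrative, not the actual OVS patches): a txq with
outstanding DMA copies is put on a per-thread list, and the thread drains that
list once per loop iteration, so no cycles are spent when nothing is deferred.

    #include <stddef.h>

    /* Illustrative defer-work sketch; not the OVS implementation. */
    struct deferred_txq {
        struct deferred_txq *next;
        void *txq;                     /* queue with outstanding DMA copies */
    };

    static struct deferred_txq *deferred_head;  /* per-PMD-thread list */

    /* Called from the send path when copies were enqueued to DMA. */
    static void
    defer_work(struct deferred_txq *d)
    {
        d->next = deferred_head;
        deferred_head = d;
    }

    /* Called once per PMD-thread loop iteration: costs nothing if empty. */
    static void
    process_deferred_work(void)
    {
        struct deferred_txq *d = deferred_head;
        deferred_head = NULL;
        while (d != NULL) {
            struct deferred_txq *next = d->next;
            /* Check DMA completions for this txq here; if copies are still
             * in flight, put it back on the list: defer_work(d); */
            d = next;
        }
    }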
> >>>>>
> >>>>>>>> Or some sort of TX_Keepalive() function can be added to the DPDK
> >>>>>>>> library, to handle DMA completion. It might even handle multiple DMA
> >>>>>>>> channels, if convenient - and if possible without locking or other
> >>>>>>>> weird complexity.
> >>>>>
> >>>>> That's exactly how it is done; the VHost library has a new API added
> >>>>> which allows for handling completions. And in the "Netdev layer" (~OVS
> >>>>> ethdev abstraction) we add a function to allow the OVS thread to do
> >>>>> those completions in a new Netdev-abstraction API called "async_process"
> >>>>> where the completions can be checked.
> >>>>>
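
As a sketch of what that completion path could look like at the netdev layer:
rte_vhost_poll_enqueue_completed() is the vhost async completion API, but the
exact signature depends on the DPDK release, and the wrapper name below is only
an assumption for illustration.

    #include <rte_mbuf.h>
    #include <rte_vhost_async.h>

    /* Sketch of a netdev-level "async_process" hook (name assumed): poll the
     * vhost async path for DMA completions and free the mbufs whose copies
     * are now visible to the guest. */
    static void
    netdev_vhost_async_process(int vid, uint16_t virtq_id,
                               int16_t dma_id, uint16_t dma_vchan)
    {
        struct rte_mbuf *done[64];
        uint16_t n;

        /* Signature per the DPDK 22.03 async API; may differ in other
         * releases. */
        n = rte_vhost_poll_enqueue_completed(vid, virtq_id, done, 64,
                                             dma_id, dma_vchan);
        rte_pktmbuf_free_bulk(done, n);
    }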
> >>>>> The only method to abstract them is to "hide" them somewhere that will
> >>>>> always be polled, e.g. an ethdev port's RX function. Both V3 and V4
> >>>>> approaches use this method. This allows "completions" to be transparent
> >>>>> to the app, at the tradeoff of having bad separation of concerns, as Rx
> >>>>> and Tx are now tied together.
> >>>>>
> >>>>> The point is, the Application layer must *somehow* handle completions.
> >>>>> So fundamentally there are 2 options for the Application level:
> >>>>>
> >>>>> A) Make the application periodically call a "handle completions" function.
> >>>>>    A1) Defer work: call when needed, and track "needed" at the app layer,
> >>>>>        calling into the vhost txq complete as required.
> >>>>>        Elegant in that "no work" means "no cycles spent" on checking DMA
> >>>>>        completions.
> >>>>>    A2) Brute-force-always-call, and pay some overhead when not required.
> >>>>>        Cycle-cost in "no work" scenarios. Depending on # of vhost queues,
> >>>>>        this adds up as polling is required *per vhost txq*.
> >>>>>        Also note that "checking DMA completions" means taking a
> >>>>>        virtq-lock, so this "brute-force" can needlessly increase x-thread
> >>>>>        contention!
> >>>>
> >>>> A side note: I don't see why locking is required to test for DMA
> >>>> completions. rte_dma_vchan_status() is lockless, e.g.:
> >>>>
> >>>> https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L560
> >>>
> >>> Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a lock-free
> >>> manner from a single thread.
> >>>
> >>> The locks I refer to are at the OVS-netdev level, as virtqs are shared
> >>> across OVS's dataplane threads. So the "M to N" comes from M dataplane
> >>> threads to N virtqs, hence requiring some locking.
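
To illustrate the layering: the dmadev query itself is lock-free, but touching
the shared virtq afterwards still needs the per-queue lock at the netdev level.
A rough sketch (the lock placement and the "wait until the channel drains"
policy are assumptions for illustration, not the patchset):

    #include <rte_dmadev.h>
    #include <rte_spinlock.h>

    static void
    maybe_complete(int16_t dma_id, uint16_t vchan, rte_spinlock_t *txq_lock)
    {
        enum rte_dma_vchan_status st;

        /* Lock-free query: is the DMA channel still busy with our copies?
         * (Simplification: only gather once the channel has drained.) */
        if (rte_dma_vchan_status(dma_id, vchan, &st) != 0 ||
            st == RTE_DMA_VCHAN_ACTIVE)
            return;                 /* copies still in flight; retry later */

        /* The virtq itself is shared across dataplane threads, so making the
         * completed buffers visible to the guest still needs the netdev-level
         * lock. */
        rte_spinlock_lock(txq_lock);
        /* ... gather completions and update the virtq here ... */
        rte_spinlock_unlock(txq_lock);
    }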
> >>>
> >>>
> >>>>> B) Hide completions and live with the complexity/architectural sacrifice
> >>>>>    of mixed-RxTx.
> >>>>>    Various downsides here in my opinion; see the slide deck presented
> >>>>>    earlier today for a summary.
> >>>>>
> >>>>> In my opinion, A1 is the most elegant solution, as it has a clean
> >>>>> separation of concerns, does not cause avoidable contention on virtq
> >>>>> locks, and spends no cycles when there is no completion work to do.
> >>>>>
> >>>>
> >>>> Thank you for elaborating, Harry.
> >>>
> >>> Thanks for taking part in the discussion & providing your insight!
> >>>
> >>>> I strongly oppose hiding any part of TX processing in an RX function. It
> >>>> is just wrong in so many ways!
> >>>>
> >>>> I agree that A1 is the most elegant solution. And being the most elegant
> >>>> solution, it is probably also the most future-proof solution. :-)
> >>>
> >>> I think so too, yes.
> >>>
> >>>> I would also like to stress that DMA completion handling belongs in the
> >>>> DPDK library, not in the application. And yes, the application will be
> >>>> required to call some "handle DMA completions" function in the DPDK
> >>>> library. But since the application already knows that it uses DMA, the
> >>>> application should also know that it needs to call this extra function -
> >>>> so I consider this requirement perfectly acceptable.
> >>>
> >>> Agree here.
> >>>
> >>>> I prefer if the DPDK vhost library can hide its inner workings from the
> >>>> application, and just expose the additional "handle completions" function.
> >>>> This also means that the inner workings can be implemented as "defer
> >>>> work", or by some other algorithm. And it can be tweaked and optimized
> >>>> later.
> >>>
> >>> Yes, the choice of how to call the handle_completions function is an
> >>> Application-layer one. For OVS we designed Defer Work, V3 and V4. But it is
> >>> an App-level choice, and every application is free to choose its own
> >>> method.
> >>>
> >>>> Thinking about the long term perspective, this design pattern is common
> >>>> for both the vhost library and other DPDK libraries that could benefit
> >>>> from DMA (e.g. vmxnet3 and pcap PMDs), so it could be abstracted into the
> >>>> DMA library or a separate library. But for now, we should focus on the
> >>>> vhost use case, and just keep the long term roadmap for using DMA in mind.
> >>>
> >>> Totally agree to keep the long term roadmap in mind; but I'm not sure we
> >>> can refactor logic out of vhost. When DMA-completions arrive, the virtQ
> >>> needs to be updated; this causes a tight coupling between the DMA
> >>> completion count and the vhost library.
> >>>
> >>> As Ilya raised on the call yesterday, there is an "in_order" requirement
> >>> in the vhost library: per virtq, the packets are presented to the guest
> >>> "in order" of enqueue.
> >>> (To be clear, *not* order of DMA-completion! As Jiayu mentioned, the Vhost
> >>> library handles this today by re-ordering the DMA completions.)
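
A rough illustration of that in-order requirement (purely illustrative, not the
vhost implementation): DMA completions are recorded as they arrive, but buffers
are only exposed to the guest over a contiguous prefix of the enqueue order.

    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SZ 256             /* illustrative in-flight window */

    static bool     done[RING_SZ];  /* DMA finished for slot i? */
    static uint16_t head;           /* next slot to expose to the guest */

    /* Record an out-of-order DMA completion for a given enqueue slot. */
    static void
    mark_dma_done(uint16_t slot)
    {
        done[slot % RING_SZ] = true;
    }

    /* Advance only over the contiguous prefix of completed slots, so the
     * guest sees buffers strictly in enqueue order even when DMA completes
     * out of order. */
    static uint16_t
    advance_in_order(void)
    {
        uint16_t n = 0;
        while (done[head % RING_SZ]) {
            done[head % RING_SZ] = false;
            head++;
            n++;
            /* ... make the corresponding used-ring entry visible here ... */
        }
        return n;
    }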
> >>>
> >>>
> >>>> Rephrasing what I said on the conference call: this vhost design will
> >>>> become the common design pattern for using DMA in DPDK libraries. If we
> >>>> get it wrong, we are stuck with it.
> >>>
> >>> Agree, and if we get it right, then we're stuck with it too! :)
> >>>
> >>>
> >>>>>>>> Here is another idea, inspired by a presentation at one of the DPDK
> >>>>>>>> Userspace conferences. It may be wishful thinking, though:
> >>>>>>>>
> >>>>>>>> Add an additional transaction to each DMA burst; a special transaction
> >>>>>>>> containing the memory write operation that makes the descriptors
> >>>>>>>> available to the Virtio driver.
> >>>>>>>>
> >>>>>>>
> >>>>>>> That is something that can work, so long as the receiver is operating
> >>>>>>> in polling mode. For cases where virtio interrupts are enabled, you
> >>>>>>> still need to do a write to the eventfd in the kernel in vhost to
> >>>>>>> signal the virtio side. That's not something that can be offloaded to a
> >>>>>>> DMA engine, sadly, so we still need some form of completion call.
> >>>>>>
> >>>>>> I guess that virtio interrupts are the most widely deployed scenario,
> >>>>>> so let's ignore the DMA TX completion transaction for now - and call it
> >>>>>> a possible future optimization for specific use cases. So it seems that
> >>>>>> some form of completion call is unavoidable.
> >>>>>
> >>>>> Agreed, let's leave this aside; there is in theory a potential
> >>>>> optimization, but it is unlikely to be of large value.
> >>>>>
> >>>>
> >>>> One more thing: when using DMA to pass packets into a guest, there could
> >>>> be a delay from when the DMA completes until the guest is signaled. Is
> >>>> there any CPU cache hotness regarding the guest's access to the packet
> >>>> data to consider here? I.e. if we delay signaling the guest, the packet
> >>>> data may get cold.
> >>>
> >>> Interesting question; we can likely spawn a new thread around this topic!
> >>> In short, it depends on how/where the DMA hardware writes the copy.
> >>>
> >>> With technologies like DDIO, the "dest" part of the copy will be in LLC.
> >>> The core reading the dest data will benefit from the LLC locality (instead
> >>> of snooping it from a remote core's L1/L2).
> >>>
> >>> Delays in notifying the guest could result in LLC capacity eviction, yes.
> >>> The application layer decides how often/promptly to check for completions
> >>> and notify the guest of them. Calling the function more often will result
> >>> in less delay in that portion of the pipeline.
> >>>
> >>> Overall, there are caching benefits with DMA acceleration, and the
> >>> application can control the latency introduced between DMA completion done
> >>> in HW and the Guest vring update.
> >>
> >
