RE: OVS DPDK DMA-Dev library/Design Discussion

Van Haaren, Harry Tue, 29 Mar 2022 10:46:27 -0700

> -----Original Message-----
> From: Morten Brørup <[email protected]>
> Sent: Tuesday, March 29, 2022 6:14 PM
> To: Richardson, Bruce <[email protected]>
> Cc: Maxime Coquelin <[email protected]>; Van Haaren, Harry
> <[email protected]>; Pai G, Sunil <[email protected]>; Stokes, Ian
> <[email protected]>; Hu, Jiayu <[email protected]>; Ferriter, Cian
> <[email protected]>; Ilya Maximets <[email protected]>; ovs-
> [email protected]; [email protected]; Mcnamara, John
> <[email protected]>; O'Driscoll, Tim <[email protected]>; Finn,
> Emma <[email protected]>
> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
> 
> > From: Bruce Richardson [mailto:[email protected]]
> > Sent: Tuesday, 29 March 2022 19.03
> >
> > On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
> > > > From: Maxime Coquelin [mailto:[email protected]]
> > > > Sent: Tuesday, 29 March 2022 18.24
> > > >
> > > > Hi Morten,
> > > >
> > > > On 3/29/22 16:44, Morten Brørup wrote:
> > > > >> From: Van Haaren, Harry [mailto:[email protected]]
> > > > >> Sent: Tuesday, 29 March 2022 15.02
> > > > >>
> > > > >>> From: Morten Brørup <[email protected]>
> > > > >>> Sent: Tuesday, March 29, 2022 1:51 PM
> > > > >>>
> > > > >>> Having thought more about it, I think that a completely
> > different
> > > > architectural approach is required:
> > > > >>>
> > > > >>> Many of the DPDK Ethernet PMDs implement a variety of RX and TX
> > > > packet burst functions, each optimized for different CPU vector
> > > > instruction sets. The availability of a DMA engine should be
> > treated
> > > > the same way. So I suggest that PMDs copying packet contents, e.g.
> > > > memif, pcap, vmxnet3, should implement DMA optimized RX and TX
> > packet
> > > > burst functions.
> > > > >>>
> > > > >>> Similarly for the DPDK vhost library.
> > > > >>>
> > > > >>> In such an architecture, it would be the application's job to
> > > > allocate DMA channels and assign them to the specific PMDs that
> > should
> > > > use them. But the actual use of the DMA channels would move down
> > below
> > > > the application and into the DPDK PMDs and libraries.
> > > > >>>
> > > > >>>
> > > > >>> Med venlig hilsen / Kind regards,
> > > > >>> -Morten Brørup
> > > > >>
> > > > >> Hi Morten,
> > > > >>
> > > > >> That's *exactly* how this architecture is designed &
> > implemented.
> > > > >> 1.   The DMA configuration and initialization is up to the
> > application
> > > > (OVS).
> > > > >> 2.   The VHost library is passed the DMA-dev ID, and its new
> > async
> > > > rx/tx APIs, and uses the DMA device to accelerate the copy.
> > > > >>
> > > > >> Looking forward to talking on the call that just started.
> > Regards, -
> > > > Harry
> > > > >>
> > > > >
> > > > > OK, thanks - as I said on the call, I haven't looked at the
> > patches.
> > > > >
> > > > > Then, I suppose that the TX completions can be handled in the TX
> > > > function, and the RX completions can be handled in the RX function,
> > > > just like the Ethdev PMDs handle packet descriptors:
> > > > >
> > > > > TX_Burst(tx_packet_array):
> > > > > 1.    Clean up descriptors processed by the NIC chip. --> Process
> > TX
> > > > DMA channel completions. (Effectively, the 2nd pipeline stage.)
> > > > > 2.    Pass on the tx_packet_array to the NIC chip descriptors. --
> > > Pass
> > > > on the tx_packet_array to the TX DMA channel. (Effectively, the 1st
> > > > pipeline stage.)
> > > >
> > > > The problem is Tx function might not be called again, so enqueued
> > > > packets in 2. may never be completed from a Virtio point of view.
> > IOW,
> > > > the packets will be copied to the Virtio descriptors buffers, but
> > the
> > > > descriptors will not be made available to the Virtio driver.
> > >
> > > In that case, the application needs to call TX_Burst() periodically
> > with an empty array, for completion purposes.


This is what the "defer work" does at the OVS thread-level, but instead of
"brute-forcing" and *always* making the call, the defer work concept tracks
*when* there is outstanding work (DMA copies) to be completed ("deferred work")
and calls the generic completion function at that point.

So "defer work" is generic infrastructure at the OVS thread level to handle
work that needs to be done "later", e.g. DMA completion handling.
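
To illustrate the idea only (this is a minimal sketch, not the code from the
OVS patches; every name below is hypothetical), a per-thread deferred-work
ring could look roughly like this. The callback returns true once its work is
finished, and false if it needs to be retried on a later pass:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: none of these names are OVS or DPDK APIs. */

/* Returns true when the work is finished, false to retry later. */
typedef bool (*work_fn)(void *arg);

struct work_item {
    work_fn fn;
    void *arg;
};

#define WORK_RING_SIZE 8 /* Power of two, so unsigned wrap-around is safe. */

struct defer_ring {
    struct work_item items[WORK_RING_SIZE];
    unsigned head; /* Next item to run. */
    unsigned tail; /* Next free slot. */
};

/* Queue work to be done "later". Returns false if the ring is full. */
static bool defer_work(struct defer_ring *r, work_fn fn, void *arg)
{
    if (r->tail - r->head == WORK_RING_SIZE) {
        return false;
    }
    r->items[r->tail % WORK_RING_SIZE] = (struct work_item){ fn, arg };
    r->tail++;
    return true;
}

/* Run each item that was pending on entry exactly once and return the
 * number of callbacks invoked. With nothing pending, nothing is called. */
static unsigned run_deferred(struct defer_ring *r)
{
    unsigned pending = r->tail - r->head;
    unsigned calls = 0;

    for (unsigned i = 0; i < pending; i++) {
        struct work_item w = r->items[r->head % WORK_RING_SIZE];
        r->head++;
        calls++;
        if (!w.fn(w.arg)) {
            /* Not finished yet: requeue for the next pass. */
            bool ok = defer_work(r, w.fn, w.arg);
            assert(ok); /* A slot was freed just above. */
            (void) ok;
        }
    }
    return calls;
}

/* Toy stand-in for a vhost txq with DMA copies still in flight. */
struct fake_txq {
    int polls_until_done;
    int completed;
};

static bool fake_txq_complete(void *arg)
{
    struct fake_txq *q = arg;

    if (q->polls_until_done > 0) {
        q->polls_until_done--;
    }
    if (q->polls_until_done == 0) {
        q->completed = 1;
        return true;
    }
    return false;
}
```

The thread's main loop would call run_deferred() once per iteration; when no
DMA copies are outstanding, the ring is empty and the call does no work beyond
comparing head and tail.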


> > > Or some sort of TX_Keepalive() function can be added to the DPDK
> > library, to handle DMA completion. It might even handle multiple DMA
> > channels, if convenient - and if possible without locking or other
> > weird complexity.

That's exactly how it is done: the VHost library has a new API added, which
allows for handling completions. In the "Netdev layer" (~OVS ethdev
abstraction) we add a new Netdev-abstraction API called "async_process", which
allows the OVS thread to check for and handle those completions.

The only way to abstract the completions away is to "hide" them somewhere that
will always be polled, e.g. an ethdev port's RX function. Both the V3 and V4
approaches use this method. This makes "completions" transparent to the app,
at the cost of poor separation of concerns, as Rx and Tx are now tied together.

The point is, the Application layer must *somehow* handle completions.
So fundamentally there are 2 options at the Application level:

A) Make the application periodically call a "handle completions" function.
        A1) Defer work: call only when needed, tracking "needed" at the app
            layer and calling into the vhost txq completion as required.
            Elegant in that "no work" means "no cycles spent" on checking
            DMA completions.
        A2) Brute-force-always-call, and pay some overhead when not required.
            Cycle-cost in "no work" scenarios. Depending on the # of vhost
            queues, this adds up, as polling is required *per vhost txq*.
            Also note that "checking DMA completions" means taking a
            virtq-lock, so this "brute-force" approach can needlessly
            increase cross-thread contention!

B) Hide completions and live with the complexity/architectural sacrifice of
   mixed-RxTx.
        Various downsides here in my opinion, see the slide deck presented
        earlier today for a summary.

In my opinion, A1 is the most elegant solution, as it has a clean separation
of concerns, does not cause avoidable contention on virtq locks, and spends no
cycles when there is no completion work to do.
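
As a rough way to see the A1 vs A2 difference (a toy model with hypothetical
names, not measurements and not the real vhost API), count how often the
virtq lock would be taken per poll loop iteration. Here the app-level
"needed" tracking is reduced to an in-flight counter per txq:

```c
#include <stddef.h>

/* Toy model only: names and fields are hypothetical, not OVS/DPDK APIs. */

#define NUM_TXQ 4

struct toy_txq {
    unsigned inflight;   /* DMA copies submitted but not yet completed. */
    unsigned lock_taken; /* Times the (modelled) virtq lock was taken. */
};

/* Models "checking DMA completions": takes the virtq lock and retires
 * everything in flight on this txq. */
static void toy_txq_complete(struct toy_txq *q)
{
    q->lock_taken++;
    q->inflight = 0;
}

/* A2: always poll every txq, whether or not it has outstanding copies.
 * Returns the number of completion checks performed. */
static unsigned poll_brute_force(struct toy_txq *qs, size_t n)
{
    unsigned polls = 0;

    for (size_t i = 0; i < n; i++) {
        toy_txq_complete(&qs[i]);
        polls++;
    }
    return polls;
}

/* A1: poll only the txqs the application knows have copies in flight.
 * Returns the number of completion checks performed. */
static unsigned poll_when_needed(struct toy_txq *qs, size_t n)
{
    unsigned polls = 0;

    for (size_t i = 0; i < n; i++) {
        if (qs[i].inflight > 0) {
            toy_txq_complete(&qs[i]);
            polls++;
        }
    }
    return polls;
}
```

With 4 txqs and copies in flight on only one of them, the brute-force loop
takes 4 locks per iteration while the "when needed" loop takes 1, and 0 once
that txq has drained. A real implementation would keep a list of pending work
rather than scanning every txq, but the lock-count argument is the same.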


> > > Here is another idea, inspired by a presentation at one of the DPDK
> > Userspace conferences. It may be wishful thinking, though:
> > >
> > > Add an additional transaction to each DMA burst; a special
> > transaction containing the memory write operation that makes the
> > descriptors available to the Virtio driver.
> > >
> >
> > That is something that can work, so long as the receiver is operating
> > in
> > polling mode. For cases where virtio interrupts are enabled, you still
> > need
> > to do a write to the eventfd in the kernel in vhost to signal the
> > virtio
> > side. That's not something that can be offloaded to a DMA engine,
> > sadly, so
> > we still need some form of completion call.
> 
> I guess that virtio interrupts is the most widely deployed scenario, so let's
> ignore the DMA TX completion transaction for now - and call it a possible
> future optimization for specific use cases. So it seems that some form of
> completion call is unavoidable.

Agreed, let's leave this aside. There is in theory a potential optimization
here, but it is unlikely to be of large value.
