On Thu, Apr 28, 2022 at 02:59:37PM +0200, Ilya Maximets wrote:
> On 4/27/22 22:34, Bruce Richardson wrote:
> > On Mon, Apr 25, 2022 at 11:46:01PM +0200, Ilya Maximets wrote:
> >> On 4/20/22 18:41, Mcnamara, John wrote:
> >>>> -----Original Message-----
> >>>> From: Ilya Maximets <i.maxim...@ovn.org>
> >>>> Sent: Friday, April 8, 2022 10:58 AM
> >>>> To: Hu, Jiayu <jiayu...@intel.com>; Maxime Coquelin <maxime.coque...@redhat.com>; Van Haaren, Harry <harry.van.haa...@intel.com>; Morten Brørup <m...@smartsharesystems.com>; Richardson, Bruce <bruce.richard...@intel.com>
> >>>> Cc: i.maxim...@ovn.org; Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; ovs-d...@openvswitch.org; dev@dpdk.org; Mcnamara, John <john.mcnam...@intel.com>; O'Driscoll, Tim <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>>>
> >>>> On 4/8/22 09:13, Hu, Jiayu wrote:
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Ilya Maximets <i.maxim...@ovn.org>
> >>>>>> Sent: Thursday, April 7, 2022 10:40 PM
> >>>>>> To: Maxime Coquelin <maxime.coque...@redhat.com>; Van Haaren, Harry <harry.van.haa...@intel.com>; Morten Brørup <m...@smartsharesystems.com>; Richardson, Bruce <bruce.richard...@intel.com>
> >>>>>> Cc: i.maxim...@ovn.org; Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Hu, Jiayu <jiayu...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; ovs-...@openvswitch.org; dev@dpdk.org; Mcnamara, John <john.mcnam...@intel.com>; O'Driscoll, Tim <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> >>>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >>>>>>
> >>>>>> On 4/7/22 16:25, Maxime Coquelin wrote:
> >>>>>>> Hi Harry,
> >>>>>>>
> >>>>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
> >>>>>>>> Hi OVS & DPDK, Maintainers & Community,
> >>>>>>>>
> >>>>>>>> Top posting an overview of the discussion as replies to the thread become slower: perhaps it is a good time to review and plan for next steps?
> >>>>>>>>
> >>>>>>>> From my perspective, those most vocal in the thread seem to be in favour of the clean rx/tx split ("defer work"), with the tradeoff that the application must be aware of handling the async DMA completions. If there are any concerns opposing upstreaming of this method, please indicate this promptly, and we can continue technical discussions here now.
> >>>>>>>
> >>>>>>> Weren't there some discussions about handling the Virtio completions with the DMA engine? With that, we wouldn't need the deferral of work.
> >>>>>>
> >>>>>> +1
> >>>>>>
> >>>>>> With the virtio completions handled by the DMA itself, the vhost port turns almost into a real HW NIC. With that we will not need any extra manipulations from the OVS side, i.e. no need to defer any work while maintaining a clear split between rx and tx operations.
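As an illustration of what "completions handled by the DMA itself" could look like with the generic DPDK dmadev API, here is a minimal sketch. The addresses and helper parameters are hypothetical placeholders supplied by a caller, not the actual vhost library implementation; the point is only that a fenced descriptor can publish the used-ring index after the payload copies, so the PMD thread never has to come back to finish the send.

#include <stdint.h>
#include <rte_dmadev.h>

/* Hypothetical sketch: enqueue the payload copies for a batch and let the
 * DMA engine itself publish the guest-visible used-ring index afterwards. */
static int
vhost_tx_batch_via_dma(int16_t dma_id, uint16_t vchan,
                       const rte_iova_t *src, const rte_iova_t *dst,
                       const uint32_t *len, uint16_t nb_pkts,
                       rte_iova_t used_idx_shadow, rte_iova_t used_idx_guest)
{
    uint16_t i;
    int ret;

    /* Payload copies for the whole batch. */
    for (i = 0; i < nb_pkts; i++) {
        ret = rte_dma_copy(dma_id, vchan, src[i], dst[i], len[i], 0);
        if (ret < 0)
            return ret; /* DMA ring full: the caller can fall back to a CPU copy. */
    }

    /* The 2-byte used-ring index update is the last, fenced descriptor:
     * the fence tells the engine to process it only after all previous
     * copies have completed, so the guest never sees the index ahead of
     * the data and no CPU-side completion polling is needed on transmit. */
    ret = rte_dma_copy(dma_id, vchan, used_idx_shadow, used_idx_guest,
                       sizeof(uint16_t),
                       RTE_DMA_OP_FLAG_FENCE | RTE_DMA_OP_FLAG_SUBMIT);
    return ret < 0 ? ret : 0;
}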
> >>>>>
> >>>>> First, making the DMA do the 2B copy would sacrifice performance, and I think we all agree on that.
> >>>>
> >>>> I do not agree with that. Yes, a 2B copy by DMA will likely be slower than one done by the CPU; however, the CPU goes away for dozens or even hundreds of thousands of cycles to process a new packet batch or service other ports, hence the DMA will likely complete the transmission faster than waiting for the CPU thread to come back to that task. In any case, this has to be tested.
> >>>>
> >>>>> Second, this method comes with an issue of ordering. For example, PMD thread0 enqueues 10 packets to vring0 first, then PMD thread1 enqueues 20 packets to vring0. If PMD thread0 and thread1 have their own dedicated DMA devices dma0 and dma1, the flag/index update for the first 10 packets is done by dma0, and the flag/index update for the remaining 20 packets is done by dma1. But there is no ordering guarantee among different DMA devices, so the flag/index updates may be applied in the wrong order. If PMD threads don't have dedicated DMA devices, which means DMA devices are shared among threads, we need a lock and have to pay for lock contention in the data path. Or we can allocate DMA devices to vrings dynamically to avoid DMA sharing among threads. But what's the overhead of the allocation mechanism? Who does it? Any thoughts?
> >>>>
> >>>> 1. DMA completion was discussed in the context of per-queue allocation, so there is no re-ordering in this case.
> >>>>
> >>>> 2. Overhead can be minimal if the allocated device can stick to the queue for a reasonable amount of time without re-allocation on every send (see the sketch below). You may look at the XPS implementation in lib/dpif-netdev.c in OVS for an example of such a mechanism. For sure it cannot be the same, but ideas can be re-used.
> >>>>
> >>>> 3. Locking doesn't mean contention if resources are allocated/distributed thoughtfully.
> >>>>
> >>>> 4. Allocation can be done by either OVS or the vhost library itself. I'd vote for doing that inside the vhost library, so any DPDK application and the vhost ethdev can use it without re-inventing it from scratch. It also should be simpler from the API point of view if allocation and usage are in the same place. But I don't have a strong opinion here for now, since no real code examples exist, so it's hard to evaluate how they could look.
> >>>>
> >>>> But I feel like we're starting to run in circles here, as I already said most of that before.
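For what an XPS-style assignment from point 2 above could look like in practice, here is a minimal sketch. The names, the table and the re-allocation interval are hypothetical placeholders, not the actual lib/dpif-netdev.c code; it only illustrates how a queue keeps its DMA channel while it stays busy, so the data path has no per-batch allocation and no lock contention in the common case.

#include <stdint.h>

#define NB_DMA_DEVS              4            /* DMA channels available to this pool. */
#define MAX_VHOST_TXQS           128
#define DMA_REALLOC_INTERVAL_US  (500 * 1000) /* Revisit an idle mapping every 500 ms. */

struct dma_map_entry {
    int16_t  dma_id;        /* DMA channel currently owned by this txq, -1 if none. */
    uint64_t last_used_us;  /* Last time the queue actually transmitted. */
};

static struct dma_map_entry dma_map[MAX_VHOST_TXQS] = {
    [0 ... MAX_VHOST_TXQS - 1] = { .dma_id = -1 },
};

/* Trivial round-robin pick; a real allocator would track per-channel load
 * and take a lock, but it only runs on the slow path below. */
static int16_t
dma_pool_pick(void)
{
    static uint16_t next;
    return (int16_t)(next++ % NB_DMA_DEVS);
}

static int16_t
dma_channel_for_txq(uint32_t txq_id, uint64_t now_us)
{
    struct dma_map_entry *e = &dma_map[txq_id];

    /* Fast path: keep the current assignment while the queue stays busy,
     * so there is no allocation (and no contention) per packet batch. */
    if (e->dma_id >= 0 && now_us - e->last_used_us < DMA_REALLOC_INTERVAL_US) {
        e->last_used_us = now_us;
        return e->dma_id;
    }

    /* Slow path: (re)allocate from the shared pool, at most once per
     * interval per queue. */
    e->dma_id = dma_pool_pick();
    e->last_used_us = now_us;
    return e->dma_id;
}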
> >> Hi, John.
> >>
> >> Just reading this email as I was on PTO for the last 1.5 weeks and didn't get through all the emails yet.
> >>
> >>> This does seem to be going in circles, especially since there seemed to be technical alignment on the last public call on March 29th.
> >>
> >> I guess there is a typo in the date here. It seems to be the 26th, not the 29th.
> >>
> >>> It is not feasible to do a real world implementation/POC of every design proposal.
> >>
> >> FWIW, I think it makes sense to PoC and test options that are going to be simply unavailable going forward if not explored now. Especially because we don't have any good solutions anyway ("Deferral of Work" is an architecturally wrong solution for OVS).
> >
> > Hi Ilya,
> >
> > For those of us who haven't spent a long time working on OVS, can you perhaps explain a bit more as to why it is architecturally wrong? From my experience with DPDK, use of any lookaside accelerator, not just DMA but any crypto, compression or otherwise, requires asynchronous operation, and therefore some form of setting work aside temporarily to do other tasks.
>
> OVS doesn't use any lookaside accelerators and doesn't have any infrastructure for them.
>
> Let me create a DPDK analogy of what is proposed for OVS.
>
> DPDK has an ethdev API that abstracts the different device drivers for the application. This API has a rte_eth_tx_burst() function that is supposed to send packets through a particular network interface.
>
> Imagine now that there is a network card that is not capable of sending packets right away and requires the application to come back later to finish the operation. That is an obvious problem, because rte_eth_tx_burst() doesn't require any extra actions and doesn't take ownership of packets that weren't consumed.
>
> The proposed solution for this problem is to change the ethdev API:
>
> 1. Allow rte_eth_tx_burst() to return -EINPROGRESS, which effectively means that the packets were acknowledged, but not actually sent yet.
>
> 2. Require the application to call the new rte_eth_process_async() function sometime later, until it no longer returns -EINPROGRESS, in case the original rte_eth_tx_burst() call returned -EINPROGRESS (a sketch of what this means for the application follows below).
>
> The main reason why this proposal is questionable:
>
> It's only one specific device that requires this special handling; all other devices are capable of sending packets right away. However, every DPDK application now has to implement some kind of "Deferral of Work" mechanism in order to be compliant with the updated DPDK ethdev API.
>
> Will DPDK make this API change? I have no voice in DPDK API design decisions, but I'd argue against.
>
> Interestingly, that's not really an imaginary proposal. That is exactly the change required to the DPDK ethdev API in order to add vhost async support to the vhost ethdev driver.
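To make the burden on applications concrete, here is a rough sketch of the machinery every application would have to carry under such an API. Neither the -EINPROGRESS return value nor rte_eth_process_async() exist in the real DPDK ethdev API; the prototypes below are the imaginary extension from the analogy, and the structure and names are placeholders.

#include <stdbool.h>
#include <stdint.h>
#include <errno.h>

struct rte_mbuf;

/* Hypothetical signatures from the analogy above, NOT real DPDK:
 * tx_burst may now return -EINPROGRESS, and process_async must be
 * polled until the deferred work is finished. */
int rte_eth_tx_burst(uint16_t port_id, uint16_t queue_id,
                     struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
int rte_eth_process_async(uint16_t port_id, uint16_t queue_id);

struct deferred_tx {
    uint16_t port_id;
    uint16_t queue_id;
    bool     pending;
};

static void
send_burst(uint16_t port_id, uint16_t queue_id,
           struct rte_mbuf **pkts, uint16_t nb_pkts,
           struct deferred_tx *def)
{
    int ret = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);

    if (ret == -EINPROGRESS) {
        /* Packets were acknowledged but not sent yet; the application
         * now owns the obligation to come back to this queue later. */
        def->port_id = port_id;
        def->queue_id = queue_id;
        def->pending = true;
    }
}

static void
poll_deferred(struct deferred_tx *def)
{
    /* Has to be wired into the main loop of every application, for
     * every queue that ever returned -EINPROGRESS. */
    if (def->pending &&
        rte_eth_process_async(def->port_id, def->queue_id) != -EINPROGRESS)
        def->pending = false;
}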
> Going back to OVS:
>
> An oversimplified architecture of OVS has 3 layers (top to bottom):
>
> 1. OFproto - the layer that handles OpenFlow.
> 2. Datapath Interface - packet processing.
> 3. Netdev - abstraction on top of all the different port types.
>
> Each layer has its own API that allows different implementations of the same layer to be used interchangeably without any modifications to the higher layers. That's what APIs and encapsulation are for.
>
> So, the Netdev layer has its own API, and this API is actually very similar to DPDK's ethdev API, simply because they serve the same purpose - abstraction on top of different network interfaces. Besides the different types of DPDK ports, there are also several types of native Linux, BSD and Windows ports, and a variety of different tunnel ports.
>
> The Datapath Interface layer is the "application" from the ethdev analogy above.
>
> What is proposed by the "Deferral of Work" solution is to make pretty much the same API change that I described, but to the netdev layer API inside OVS, and to introduce a fairly complex (and questionable, but I'm not going into that right now) machinery to handle that API change in the datapath interface layer.
>
> So, exactly the same problem exists here:
>
> If the API change is needed only for a single port type in a very specific hardware environment, why do we need to change the common API and rework a lot of code in the upper layers to accommodate that API change, while it makes no practical sense for any other port types or more generic hardware setups? And similar changes will have to be made in any other DPDK application that is not bound to specific hardware but wants to support vhost async.
>
> The right solution, IMO, is to make vhost async behave like any other physical NIC, since it is essentially a physical NIC now (we're not using DMA directly, it's a combined vhost+DMA solution), instead of propagating the quirks of a single device into a common API.
>
> And going back to DPDK, this implementation doesn't allow use of vhost async in DPDK's own vhost ethdev driver.
>
> My initial reply to the "Deferral of Work" RFC, with pretty much the same concerns:
>
> https://patchwork.ozlabs.org/project/openvswitch/patch/20210907111725.43672-2-cian.ferri...@intel.com/#2751799
>
> Best regards, Ilya Maximets.
Thanks for the clear explanation. Gives me a much better idea of the view from your side of things. /Bruce