> -----Original Message----- > From: Xueming(Steven) Li <xuemi...@nvidia.com> > Sent: Wednesday, September 29, 2021 1:09 PM > To: jerinjac...@gmail.com; Ananyev, Konstantin <konstantin.anan...@intel.com> > Cc: NBU-Contact-Thomas Monjalon <tho...@monjalon.net>; > andrew.rybche...@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh > <ferruh.yi...@intel.com> > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue> > On Wed, 2021-09-29 at 09:52 +0000, Ananyev, Konstantin wrote: > > > > > -----Original Message----- > > > From: Xueming(Steven) Li <xuemi...@nvidia.com> > > > Sent: Wednesday, September 29, 2021 10:13 AM > > > To: jerinjac...@gmail.com; Ananyev, Konstantin > > > <konstantin.anan...@intel.com> > > > Cc: NBU-Contact-Thomas Monjalon <tho...@monjalon.net>; > > > andrew.rybche...@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh > > > <ferruh.yi...@intel.com> > > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue > > > > > > On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote: > > > > > > > > > > > > > > > > > > In current DPDK framework, each RX queue > > > > > > > > > > > > > > > > > > is > > > > > > > > > > > > > > > > > > pre-loaded with mbufs > > > > > > > > > > > > > > > > > > for incoming packets. When number of > > > > > > > > > > > > > > > > > > representors scale out in a > > > > > > > > > > > > > > > > > > switch domain, the memory consumption > > > > > > > > > > > > > > > > > > became > > > > > > > > > > > > > > > > > > significant. Most > > > > > > > > > > > > > > > > > > important, polling all ports leads to > > > > > > > > > > > > > > > > > > high > > > > > > > > > > > > > > > > > > cache miss, high > > > > > > > > > > > > > > > > > > latency and low throughput. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This patch introduces shared RX queue. > > > > > > > > > > > > > > > > > > Ports > > > > > > > > > > > > > > > > > > with same > > > > > > > > > > > > > > > > > > configuration in a switch domain could > > > > > > > > > > > > > > > > > > share > > > > > > > > > > > > > > > > > > RX queue set by specifying sharing group. > > > > > > > > > > > > > > > > > > Polling any queue using same shared RX > > > > > > > > > > > > > > > > > > queue > > > > > > > > > > > > > > > > > > receives packets from > > > > > > > > > > > > > > > > > > all member ports. Source port is > > > > > > > > > > > > > > > > > > identified > > > > > > > > > > > > > > > > > > by mbuf->port. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Port queue number in a shared group > > > > > > > > > > > > > > > > > > should be > > > > > > > > > > > > > > > > > > identical. Queue > > > > > > > > > > > > > > > > > > index is > > > > > > > > > > > > > > > > > > 1:1 mapped in shared group. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Share RX queue is supposed to be polled > > > > > > > > > > > > > > > > > > on > > > > > > > > > > > > > > > > > > same thread. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Multiple groups is supported by group ID. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Is this offload specific to the > > > > > > > > > > > > > > > > > representor? If > > > > > > > > > > > > > > > > > so can this name be changed specifically to > > > > > > > > > > > > > > > > > representor? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yes, PF and representor in switch domain > > > > > > > > > > > > > > > > could > > > > > > > > > > > > > > > > take advantage. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If it is for a generic case, how the flow > > > > > > > > > > > > > > > > > ordering will be maintained? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Not quite sure that I understood your > > > > > > > > > > > > > > > > question. > > > > > > > > > > > > > > > > The control path of is > > > > > > > > > > > > > > > > almost same as before, PF and representor > > > > > > > > > > > > > > > > port > > > > > > > > > > > > > > > > still needed, rte flows not impacted. > > > > > > > > > > > > > > > > Queues still needed for each member port, > > > > > > > > > > > > > > > > descriptors(mbuf) will be > > > > > > > > > > > > > > > > supplied from shared Rx queue in my PMD > > > > > > > > > > > > > > > > implementation. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > My question was if create a generic > > > > > > > > > > > > > > > RTE_ETH_RX_OFFLOAD_SHARED_RXQ offload, multiple > > > > > > > > > > > > > > > ethdev receive queues land into > > > > > > > the same > > > > > > > > > > > > > > > receive queue, In that case, how the flow order > > > > > > > > > > > > > > > is > > > > > > > > > > > > > > > maintained for respective receive queues. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I guess the question is testpmd forward stream? > > > > > > > > > > > > > > The > > > > > > > > > > > > > > forwarding logic has to be changed slightly in > > > > > > > > > > > > > > case > > > > > > > > > > > > > > of shared rxq. > > > > > > > > > > > > > > basically for each packet in rx_burst result, > > > > > > > > > > > > > > lookup > > > > > > > > > > > > > > source stream according to mbuf->port, forwarding > > > > > > > > > > > > > > to > > > > > > > > > > > > > > target fs. > > > > > > > > > > > > > > Packets from same source port could be grouped as > > > > > > > > > > > > > > a > > > > > > > > > > > > > > small burst to process, this will accelerates the > > > > > > > > > > > > > > performance if traffic > > > > > > > come from > > > > > > > > > > > > > > limited ports. I'll introduce some common api to > > > > > > > > > > > > > > do > > > > > > > > > > > > > > shard rxq forwarding, call it with packets > > > > > > > > > > > > > > handling > > > > > > > > > > > > > > callback, so it suites for > > > > > > > > > > > > > > all forwarding engine. Will sent patches soon. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > All ports will put the packets in to the same queue > > > > > > > > > > > > > (share queue), right? Does > > > > > > > > > > > > > this means only single core will poll only, what > > > > > > > > > > > > > will > > > > > > > > > > > > > happen if there are > > > > > > > > > > > > > multiple cores polling, won't it cause problem? > > > > > > > > > > > > > > > > > > > > > > > > > > And if this requires specific changes in the > > > > > > > > > > > > > application, I am not sure about > > > > > > > > > > > > > the solution, can't this work in a transparent way > > > > > > > > > > > > > to > > > > > > > > > > > > > the application? > > > > > > > > > > > > > > > > > > > > > > > > Discussed with Jerin, new API introduced in v3 2/8 > > > > > > > > > > > > that > > > > > > > > > > > > aggregate ports > > > > > > > > > > > > in same group into one new port. Users could schedule > > > > > > > > > > > > polling on the > > > > > > > > > > > > aggregated port instead of all member ports. > > > > > > > > > > > > > > > > > > > > > > The v3 still has testpmd changes in fastpath. Right? > > > > > > > > > > > IMO, > > > > > > > > > > > For this > > > > > > > > > > > feature, we should not change fastpath of testpmd > > > > > > > > > > > application. Instead, testpmd can use aggregated ports > > > > > > > > > > > probably as > > > > > > > > > > > separate fwd_engine to show how to use this feature. > > > > > > > > > > > > > > > > > > > > Good point to discuss :) There are two strategies to > > > > > > > > > > polling > > > > > > > > > > a shared > > > > > > > > > > Rxq: > > > > > > > > > > 1. polling each member port > > > > > > > > > > All forwarding engines can be reused to work as > > > > > > > > > > before. > > > > > > > > > > My testpmd patches are efforts towards this direction. > > > > > > > > > > Does your PMD support this? > > > > > > > > > > > > > > > > > > Not unfortunately. More than that, every application needs > > > > > > > > > to > > > > > > > > > change > > > > > > > > > to support this model. > > > > > > > > > > > > > > > > Both strategies need user application to resolve port ID from > > > > > > > > mbuf and > > > > > > > > process accordingly. > > > > > > > > This one doesn't demand aggregated port, no polling schedule > > > > > > > > change. > > > > > > > > > > > > > > I was thinking, mbuf will be updated from driver/aggregator > > > > > > > port as > > > > > > > when it > > > > > > > comes to application. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. polling aggregated port > > > > > > > > > > Besides forwarding engine, need more work to to demo > > > > > > > > > > it. > > > > > > > > > > This is an optional API, not supported by my PMD yet. > > > > > > > > > > > > > > > > > > We are thinking of implementing this PMD when it comes to > > > > > > > > > it, > > > > > > > > > ie. > > > > > > > > > without application change in fastpath > > > > > > > > > logic. > > > > > > > > > > > > > > > > Fastpath have to resolve port ID anyway and forwarding > > > > > > > > according > > > > > > > > to > > > > > > > > logic. Forwarding engine need to adapt to support shard Rxq. > > > > > > > > Fortunately, in testpmd, this can be done with an abstract > > > > > > > > API. > > > > > > > > > > > > > > > > Let's defer part 2 until some PMD really support it and > > > > > > > > tested, > > > > > > > > how do > > > > > > > > you think? > > > > > > > > > > > > > > We are not planning to use this feature so either way it is OK > > > > > > > to > > > > > > > me. > > > > > > > I leave to ethdev maintainers decide between 1 vs 2. > > > > > > > > > > > > > > I do have a strong opinion not changing the testpmd basic > > > > > > > forward > > > > > > > engines > > > > > > > for this feature.I would like to keep it simple as fastpath > > > > > > > optimized and would > > > > > > > like to add a separate Forwarding engine as means to verify > > > > > > > this > > > > > > > feature. > > > > > > > > > > > > +1 to that. > > > > > > I don't think it a 'common' feature. > > > > > > So separate FWD mode seems like a best choice to me. > > > > > > > > > > -1 :) > > > > > There was some internal requirement from test team, they need to > > > > > verify > > > > > all features like packet content, rss, vlan, checksum, rte_flow... > > > > > to > > > > > be working based on shared rx queue. > > > > > > > > Then I suppose you'll need to write really comprehensive fwd-engine > > > > to satisfy your test team :) > > > > Speaking seriously, I still don't understand why do you need all > > > > available fwd-engines to verify this feature. > > > > From what I understand, main purpose of your changes to test-pmd: > > > > allow to fwd packet though different fwd_stream (TX through different > > > > HW queue). > > > > In theory, if implemented in generic and extendable way - that > > > > might be a useful add-on to tespmd fwd functionality. > > > > But current implementation looks very case specific. > > > > And as I don't think it is a common case, I don't see much point to > > > > pollute > > > > basic fwd cases with it. > > > > > > > > BTW, as a side note, the code below looks bogus to me: > > > > +void > > > > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx, > > > > + struct rte_mbuf **pkts_burst, packet_fwd_cb > > > > fwd) > > > > +{ > > > > + uint16_t i, nb_fs_rx = 1, port; > > > > + > > > > + /* Locate real source fs according to mbuf->port. */ > > > > + for (i = 0; i < nb_rx; ++i) { > > > > + rte_prefetch0(pkts_burst[i + 1]); > > > > > > > > you access pkt_burst[] beyond array boundaries, > > > > also you ask cpu to prefetch some unknown and possibly invalid > > > > address. > > > > > > Sorry I forgot this topic. It's too late to prefetch current packet, so > > > perfetch next is better. Prefetch an invalid address at end of a look > > > doesn't hurt, it's common in DPDK. > > > > First of all it is usually never 'OK' to access array beyond its bounds. > > Second prefetching invalid address *does* hurt performance badly on many > > CPUs > > (TLB misses, consumed memory bandwidth etc.). > > As a reference: https://lwn.net/Articles/444346/ > > If some existing DPDK code really does that - then I believe it is an issue > > and has to be addressed. > > More important - it is really bad attitude to submit bogus code to DPDK > > community > > and pretend that it is 'OK'. > > Thanks for the link! > From instruction spec, "The PREFETCHh instruction is merely a hint and > does not affect program behavior." > There are 3 choices here: > 1: no prefetch. D$ miss will happen on each packet, time cost depends > on where data sits(close or far) and burst size. > 2: prefetch with loop end check to avoid random address. Pro is free of > TLB miss per burst, Cons is "if" instruction per packet. Cost depends > on burst size. > 3: brute force prefetch, cost is TLB miss, but no addtional > instructions per packet. Not sure how random the last address could be > in testpmd and how many TLB miss could happen.
There are plenty of standard techniques to avoid that issue while keeping prefetch() in place. Probably the easiest one: for (i=0; i < nb_rx - 1; i++) { prefetch(pkt[i+1]; /* do your stuff with pkt[i[ here */ } /* do your stuff with pkt[nb_rx - 1]; */ > Based on my expericen of performance optimization, IIRC, option 3 has > the best performance. But for this case, result depends on how many > sub-burst inside and how sub-burst get processed, maybe callback will > flush prefetch data completely or not. So it's hard to get a > conclusion, what I said is that the code in PMD driver should have a > reason. > > On the other hand, the latency and throughput saving of this featue on > multiple ports is huge, I perfer to down play this prefetch discussion > if you agree. > Honestly, I don't know how else to explain to you that there is a bug in that piece of code. From my perspective it is a trivial bug, with a trivial fix. But you simply keep ignoring the arguments. Till it get fixed and other comments addressed - my vote is NACK for these series. I don't think we need bogus code in testpmd.