> -----Original Message-----
> From: Xueming(Steven) Li <xuemi...@nvidia.com>
> Sent: Wednesday, September 29, 2021 1:09 PM
> To: jerinjac...@gmail.com; Ananyev, Konstantin <konstantin.anan...@intel.com>
> Cc: NBU-Contact-Thomas Monjalon <tho...@monjalon.net>; 
> andrew.rybche...@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> <ferruh.yi...@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> 
> On Wed, 2021-09-29 at 09:52 +0000, Ananyev, Konstantin wrote:
> >
> > > -----Original Message-----
> > > From: Xueming(Steven) Li <xuemi...@nvidia.com>
> > > Sent: Wednesday, September 29, 2021 10:13 AM
> > > To: jerinjac...@gmail.com; Ananyev, Konstantin 
> > > <konstantin.anan...@intel.com>
> > > Cc: NBU-Contact-Thomas Monjalon <tho...@monjalon.net>; 
> > > andrew.rybche...@oktetlabs.ru; dev@dpdk.org; Yigit, Ferruh
> > > <ferruh.yi...@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v1] ethdev: introduce shared Rx queue
> > >
> > > On Wed, 2021-09-29 at 00:26 +0000, Ananyev, Konstantin wrote:
> > > > > > > > > > > > > > > > > > In the current DPDK framework, each Rx queue is pre-loaded with
> > > > > > > > > > > > > > > > > > mbufs for incoming packets. When the number of representors scales
> > > > > > > > > > > > > > > > > > out in a switch domain, the memory consumption becomes significant.
> > > > > > > > > > > > > > > > > > More importantly, polling all ports leads to high cache-miss rates,
> > > > > > > > > > > > > > > > > > high latency and low throughput.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > This patch introduces a shared Rx queue. Ports with the same
> > > > > > > > > > > > > > > > > > configuration in a switch domain can share an Rx queue set by
> > > > > > > > > > > > > > > > > > specifying a sharing group. Polling any queue that uses the same
> > > > > > > > > > > > > > > > > > shared Rx queue receives packets from all member ports. The source
> > > > > > > > > > > > > > > > > > port is identified by mbuf->port.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > The per-port queue number in a shared group should be identical,
> > > > > > > > > > > > > > > > > > and queue indexes are 1:1 mapped within the group.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > A shared Rx queue is supposed to be polled on the same thread.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Multiple groups are supported by group ID.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Is this offload specific to the representor? If so can this name be
> > > > > > > > > > > > > > > > > changed specifically to representor?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yes, PF and representor in switch domain could take advantage.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > If it is for a generic case, how will the flow ordering be maintained?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Not quite sure that I understood your question. The control path is
> > > > > > > > > > > > > > > > almost the same as before: PF and representor ports are still needed,
> > > > > > > > > > > > > > > > and rte flows are not impacted. Queues are still needed for each
> > > > > > > > > > > > > > > > member port; descriptors (mbufs) will be supplied from the shared Rx
> > > > > > > > > > > > > > > > queue in my PMD implementation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > My question was: if we create a generic RTE_ETH_RX_OFFLOAD_SHARED_RXQ
> > > > > > > > > > > > > > > offload, multiple ethdev receive queues land in the same receive
> > > > > > > > > > > > > > > queue. In that case, how is the flow order maintained for the
> > > > > > > > > > > > > > > respective receive queues?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I guess the question is about the testpmd forward stream? The
> > > > > > > > > > > > > > forwarding logic has to be changed slightly in case of shared rxq:
> > > > > > > > > > > > > > basically, for each packet in the rx_burst result, look up the source
> > > > > > > > > > > > > > stream according to mbuf->port and forward it to the target fs.
> > > > > > > > > > > > > > Packets from the same source port could be grouped as a small burst
> > > > > > > > > > > > > > to process; this accelerates performance if traffic comes from a
> > > > > > > > > > > > > > limited number of ports. I'll introduce a common API to do shared
> > > > > > > > > > > > > > rxq forwarding, called with a packet-handling callback, so it suits
> > > > > > > > > > > > > > all forwarding engines. Will send patches soon.
> > > > > > > > > > > > > >
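For illustration, a minimal sketch of the per-port demultiplexing described
above, in the spirit of the quoted forward_shared_rxq(); the
lookup_fs_by_port() helper and the exact packet_fwd_cb signature are
assumptions, not part of the actual patches:

/*
 * Split one rx_burst() result by mbuf->port and hand each sub-burst to
 * the forward stream owning that source port (testpmd context assumed
 * for struct fwd_stream and packet_fwd_cb).
 */
static void
demux_by_port(struct fwd_stream *fs, struct rte_mbuf **pkts_burst,
              uint16_t nb_rx, packet_fwd_cb fwd)
{
        uint16_t i, start = 0;

        for (i = 1; i <= nb_rx; i++) {
                /* Flush the current sub-burst when the source port
                 * changes or the burst ends. */
                if (i == nb_rx ||
                    pkts_burst[i]->port != pkts_burst[start]->port) {
                        struct fwd_stream *src_fs =
                                lookup_fs_by_port(fs, pkts_burst[start]->port);

                        fwd(src_fs, &pkts_burst[start], i - start);
                        start = i;
                }
        }
}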
> > > > > > > > > > > > >
> > > > > > > > > > > > > All ports will put the packets into the same queue (shared queue),
> > > > > > > > > > > > > right? Does this mean only a single core will poll it? What will
> > > > > > > > > > > > > happen if there are multiple cores polling, won't it cause a problem?
> > > > > > > > > > > > >
> > > > > > > > > > > > > And if this requires specific changes in the application, I am not
> > > > > > > > > > > > > sure about the solution; can't this work in a way that is transparent
> > > > > > > > > > > > > to the application?
> > > > > > > > > > > >
> > > > > > > > > > > > Discussed with Jerin; a new API is introduced in v3 2/8 that
> > > > > > > > > > > > aggregates ports in the same group into one new port. Users could
> > > > > > > > > > > > schedule polling on the aggregated port instead of all member ports.
> > > > > > > > > > >
> > > > > > > > > > > The v3 still has testpmd changes in the fastpath, right? IMO, for this
> > > > > > > > > > > feature, we should not change the fastpath of the testpmd application.
> > > > > > > > > > > Instead, testpmd can probably use aggregated ports as a separate
> > > > > > > > > > > fwd_engine to show how to use this feature.
> > > > > > > > > >
> > > > > > > > > > Good point to discuss :) There are two strategies for polling a shared
> > > > > > > > > > Rxq:
> > > > > > > > > > 1. Polling each member port.
> > > > > > > > > >    All forwarding engines can be reused to work as before.
> > > > > > > > > >    My testpmd patches are efforts in this direction.
> > > > > > > > > >    Does your PMD support this?
> > > > > > > > >
> > > > > > > > > Unfortunately not. More than that, every application needs to change to
> > > > > > > > > support this model.
> > > > > > > >
> > > > > > > > Both strategies need the user application to resolve the port ID from the
> > > > > > > > mbuf and process accordingly. This one doesn't demand an aggregated port
> > > > > > > > and requires no polling schedule change.
> > > > > > >
> > > > > > > I was thinking the mbuf will be updated by the driver/aggregator port by
> > > > > > > the time it comes to the application.
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > 2. Polling the aggregated port.
> > > > > > > > > >    Besides the forwarding engine, more work is needed to demo it.
> > > > > > > > > >    This is an optional API, not supported by my PMD yet.
> > > > > > > > >
> > > > > > > > > We are thinking of implementing this in the PMD when it comes to it,
> > > > > > > > > i.e. without application changes in the fastpath logic.
> > > > > > > >
> > > > > > > > The fastpath has to resolve the port ID anyway and forward according to
> > > > > > > > its logic. The forwarding engine needs to adapt to support shared Rxq.
> > > > > > > > Fortunately, in testpmd, this can be done with an abstract API.
> > > > > > > >
> > > > > > > > Let's defer part 2 until some PMD really supports it and it has been
> > > > > > > > tested. What do you think?
> > > > > > >
> > > > > > > We are not planning to use this feature, so either way it is OK to me.
> > > > > > > I leave it to the ethdev maintainers to decide between 1 vs 2.
> > > > > > >
> > > > > > > I do have a strong opinion against changing the testpmd basic forwarding
> > > > > > > engines for this feature. I would like to keep them simple and
> > > > > > > fastpath-optimized, and would like to add a separate forwarding engine as
> > > > > > > a means to verify this feature.
> > > > > >
> > > > > > +1 to that.
> > > > > > I don't think it is a 'common' feature.
> > > > > > So a separate FWD mode seems like the best choice to me.
> > > > >
> > > > > -1 :)
> > > > > There was an internal requirement from the test team: they need to verify
> > > > > that all features (packet content, rss, vlan, checksum, rte_flow...) work
> > > > > on top of a shared Rx queue.
> > > >
> > > > Then I suppose you'll need to write a really comprehensive fwd-engine
> > > > to satisfy your test team :)
> > > > Speaking seriously, I still don't understand why you need all
> > > > available fwd-engines to verify this feature.
> > > > From what I understand, the main purpose of your changes to test-pmd is
> > > > to allow forwarding a packet through a different fwd_stream (TX through
> > > > a different HW queue).
> > > > In theory, if implemented in a generic and extendable way, that
> > > > might be a useful add-on to testpmd fwd functionality.
> > > > But the current implementation looks very case-specific.
> > > > And as I don't think it is a common case, I don't see much point in
> > > > polluting the basic fwd cases with it.
> > > >
> > > > BTW, as a side note, the code below looks bogus to me:
> > > > +void
> > > > +forward_shared_rxq(struct fwd_stream *fs, uint16_t nb_rx,
> > > > +                  struct rte_mbuf **pkts_burst, packet_fwd_cb fwd)
> > > > +{
> > > > +       uint16_t i, nb_fs_rx = 1, port;
> > > > +
> > > > +       /* Locate real source fs according to mbuf->port. */
> > > > +       for (i = 0; i < nb_rx; ++i) {
> > > > +               rte_prefetch0(pkts_burst[i + 1]);
> > > >
> > > > you access pkts_burst[] beyond the array boundaries; also, you ask the
> > > > CPU to prefetch an unknown and possibly invalid address.
> > >
> > > Sorry, I forgot this topic. It's too late to prefetch the current packet,
> > > so prefetching the next one is better. Prefetching an invalid address at
> > > the end of a loop doesn't hurt; it's common in DPDK.
> >
> > First of all, it is usually never 'OK' to access an array beyond its bounds.
> > Second, prefetching an invalid address *does* hurt performance badly on many
> > CPUs (TLB misses, consumed memory bandwidth, etc.).
> > As a reference: https://lwn.net/Articles/444346/
> > If some existing DPDK code really does that, then I believe it is an issue
> > and has to be addressed.
> > More importantly, it is a really bad attitude to submit bogus code to the
> > DPDK community and pretend that it is 'OK'.
> 
> Thanks for the link!
> From the instruction spec, "The PREFETCHh instruction is merely a hint and
> does not affect program behavior."
> There are 3 choices here:
> 1: No prefetch. A D$ miss will happen on each packet; the time cost depends
> on where the data sits (close or far) and on the burst size.
> 2: Prefetch with a loop-end check to avoid a random address. The pro is no
> TLB miss per burst; the con is an "if" instruction per packet. The cost
> depends on the burst size.
> 3: Brute-force prefetch; the cost is a TLB miss, but no additional
> instructions per packet. Not sure how random the last address could be in
> testpmd and how many TLB misses could happen.

There are plenty of standard techniques to avoid that issue while keeping
prefetch() in place.
Probably the easiest one:

for (i = 0; i < nb_rx - 1; i++) {
    prefetch(pkt[i + 1]);
    /* do your stuff with pkt[i] here */
}

/* do your stuff with pkt[nb_rx - 1] here */
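
An equivalent variant that keeps the bounds check inside the loop (option 2
from the list above), at the price of one well-predicted branch per packet,
would be something like:

for (i = 0; i < nb_rx; i++) {
    if (i + 1 < nb_rx)
        prefetch(pkt[i + 1]);
    /* do your stuff with pkt[i] here */
}

Either form avoids reading past the end of the array.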
 
> Based on my experience of performance optimization, IIRC, option 3 has the
> best performance. But for this case, the result depends on how many
> sub-bursts there are and how each sub-burst gets processed; the callback may
> or may not flush the prefetched data completely. So it's hard to draw a
> conclusion; what I meant is that such code in a PMD driver should have a
> reason.
> 
> On the other hand, the latency and throughput savings of this feature on
> multiple ports are huge, so I prefer to downplay this prefetch discussion
> if you agree.
> 

Honestly, I don't know how else to explain to you that there is a bug in that
piece of code.
From my perspective it is a trivial bug, with a trivial fix.
But you simply keep ignoring the arguments.
Till it gets fixed and the other comments are addressed, my vote is NACK for
this series.
I don't think we need bogus code in testpmd.

