Hi guys,
> > > In many usage scenarios input mbufs for rte_sched_port_enqueue() are not
> > > yet in the CPU cache(s). That causes quite significant stalls due to memory
> > > latency. Current implementation tries to mitigate it using SW pipeline and
> > > SW prefetch techniques, but stalls are still present.
> > > Rework rte_sched_port_enqueue() to do an actual fetch of all mbufs metadata
> > > as a first stage of that function.
> > > That helps to minimise load stalls at further stages of enqueue() and
> > > improves overall enqueue performance.
> > > With examples/qos_sched I observed:
> > > on ICX box: up to 30% cycles reduction
> > > on CSX and BDX: 15-20% cycles reduction
> > > I also ran tests with mbufs already in the cache (one core doing RX, QOS
> > > and TX). With such scenario, on all mentioned above IA boxes no
> > > performance drop was observed.
> > >
> > > Signed-off-by: Konstantin Ananyev <konstantin.anan...@intel.com>
> > > ---
> > > v2: fix clang and checkpatch complaints
> > > ---
> > >  lib/librte_sched/rte_sched.c | 219 +++++------------------------------
> > >  1 file changed, 31 insertions(+), 188 deletions(-)
> > >
> > > diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
> > > index 7c5688068..41ef147e0 100644
> > > --- a/lib/librte_sched/rte_sched.c
> > > +++ b/lib/librte_sched/rte_sched.c
> > > @@ -1861,24 +1861,23 @@ debug_check_queue_slab(struct rte_sched_subport *subport, uint32_t bmp_pos,
> > >  #endif /* RTE_SCHED_DEBUG */
> > >
> > >  static inline struct rte_sched_subport *
> > > -rte_sched_port_subport(struct rte_sched_port *port,
> > > -	struct rte_mbuf *pkt)
> > > +sched_port_subport(const struct rte_sched_port *port,
> > > +	struct rte_mbuf_sched sch)
> > >  {
> > > -	uint32_t queue_id = rte_mbuf_sched_queue_get(pkt);
> > > +	uint32_t queue_id = sch.queue_id;
> > >  	uint32_t subport_id = queue_id >> (port->n_pipes_per_subport_log2 + 4);
> > >
> > >  	return port->subports[subport_id];
> > >  }
> > >
> > >  static inline uint32_t
> > > -rte_sched_port_enqueue_qptrs_prefetch0(struct rte_sched_subport *subport,
> > > -	struct rte_mbuf *pkt, uint32_t subport_qmask)
> > > +sched_port_enqueue_qptrs_prefetch0(const struct rte_sched_subport *subport,
> > > +	struct rte_mbuf_sched sch, uint32_t subport_qmask)
> > >  {
> > >  	struct rte_sched_queue *q;
> > >  #ifdef RTE_SCHED_COLLECT_STATS
> > >  	struct rte_sched_queue_extra *qe;
> > >  #endif
> > > -	uint32_t qindex = rte_mbuf_sched_queue_get(pkt);
> > > +	uint32_t qindex = sch.queue_id;
> > >  	uint32_t subport_queue_id = subport_qmask & qindex;
> > >
> > >  	q = subport->queue + subport_queue_id;
> > > @@ -1971,197 +1970,41 @@ int
> > >  rte_sched_port_enqueue(struct rte_sched_port *port, struct rte_mbuf **pkts,
> > >  	uint32_t n_pkts)
> > >  {
> > > -	struct rte_mbuf *pkt00, *pkt01, *pkt10, *pkt11, *pkt20, *pkt21,
> > > -		*pkt30, *pkt31, *pkt_last;
> > > -	struct rte_mbuf **q00_base, **q01_base, **q10_base, **q11_base,
> > > -		**q20_base, **q21_base, **q30_base, **q31_base, **q_last_base;
> > > -	struct rte_sched_subport *subport00, *subport01, *subport10, *subport11,
> > > -		*subport20, *subport21, *subport30, *subport31, *subport_last;
> > > -	uint32_t q00, q01, q10, q11, q20, q21, q30, q31, q_last;
> > > -	uint32_t r00, r01, r10, r11, r20, r21, r30, r31, r_last;
> > > -	uint32_t subport_qmask;
> > >  	uint32_t result, i;
> > > +	struct rte_mbuf_sched sch[n_pkts];
> > > +	struct rte_sched_subport *subports[n_pkts];
> > > +	struct rte_mbuf **q_base[n_pkts];
> > > +	uint32_t q[n_pkts];
> > > +
> > > +	const uint32_t subport_qmask =
> > > +		(1 << (port->n_pipes_per_subport_log2 + 4)) - 1;
> > >
> > >  	result = 0;
> > > -	subport_qmask = (1 << (port->n_pipes_per_subport_log2 + 4)) - 1;
> > >
> > > -	/*
> > > -	 * Less then 6 input packets available, which is not enough to
> > > -	 * feed the pipeline
> > > -	 */
> > > -	if (unlikely(n_pkts < 6)) {
> > > -		struct rte_sched_subport *subports[5];
> > > -		struct rte_mbuf **q_base[5];
> > > -		uint32_t q[5];
> > > -
> > > -		/* Prefetch the mbuf structure of each packet */
> > > -		for (i = 0; i < n_pkts; i++)
> > > -			rte_prefetch0(pkts[i]);
> > > -
> > > -		/* Prefetch the subport structure for each packet */
> > > -		for (i = 0; i < n_pkts; i++)
> > > -			subports[i] = rte_sched_port_subport(port, pkts[i]);
> > > -
> > > -		/* Prefetch the queue structure for each queue */
> > > -		for (i = 0; i < n_pkts; i++)
> > > -			q[i] = rte_sched_port_enqueue_qptrs_prefetch0(subports[i],
> > > -					pkts[i], subport_qmask);
> > > -
> > > -		/* Prefetch the write pointer location of each queue */
> > > -		for (i = 0; i < n_pkts; i++) {
> > > -			q_base[i] = rte_sched_subport_pipe_qbase(subports[i], q[i]);
> > > -			rte_sched_port_enqueue_qwa_prefetch0(port, subports[i],
> > > -				q[i], q_base[i]);
> > > -		}
> > > +	/* Prefetch the mbuf structure of each packet */
> > > +	for (i = 0; i < n_pkts; i++)
> > > +		sch[i] = pkts[i]->hash.sched;
> > >
> >
> > Hi Konstantin, thanks for the patch. In the above case, all packets are touched
> > straight away without any prefetch. If we consider an input burst size of 64 pkts,
> > it means 512 bytes of packet addresses (8 cache lines), which is likely to be
> > available in the cache. For a larger burst size, e.g. 128 or 256, there might be
> > instances when some addresses are not available in the cache, which may stall
> > the core. How about adding an explicit prefetch before starting to iterate
> > through the packets, if that helps?

I don't think we need any prefetch here. pkts[] is a sequential array, so the HW
prefetcher should be able to do a good job here. Again, in the majority of
use-cases pkts[] contents will already be present in the cache.
Though there is a valid concern here: n_pkts can be big, and in that case we
probably don't want to store too much on the stack and read too much from
pkts[] at once. It is better to work in fixed-size chunks (64 or so).
I can prepare a v2 with these changes, if there is still interest in this patch.
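Just to make the chunking idea concrete, below is a rough, untested sketch of
what I have in mind for v2. The chunk size (64) is only a placeholder to be
tuned, the per-stage helpers are the ones introduced by this patch, and I am
assuming the last stage can keep using the existing per-packet write helper
rte_sched_port_enqueue_qwa() with its current signature:

/* rough sketch only, in the context of lib/librte_sched/rte_sched.c */

#define SCHED_ENQ_CHUNK	64	/* placeholder value, to be tuned */

int
rte_sched_port_enqueue(struct rte_sched_port *port, struct rte_mbuf **pkts,
	uint32_t n_pkts)
{
	uint32_t i, j, n, result = 0;
	struct rte_mbuf_sched sch[SCHED_ENQ_CHUNK];
	struct rte_sched_subport *subports[SCHED_ENQ_CHUNK];
	struct rte_mbuf **q_base[SCHED_ENQ_CHUNK];
	uint32_t q[SCHED_ENQ_CHUNK];

	const uint32_t subport_qmask =
		(1 << (port->n_pipes_per_subport_log2 + 4)) - 1;

	for (i = 0; i < n_pkts; i += n) {
		n = RTE_MIN(n_pkts - i, (uint32_t)SCHED_ENQ_CHUNK);

		/* stage 1: fetch sched metadata of each mbuf in the chunk */
		for (j = 0; j < n; j++)
			sch[j] = pkts[i + j]->hash.sched;

		/* stage 2: resolve the subport of each packet */
		for (j = 0; j < n; j++)
			subports[j] = sched_port_subport(port, sch[j]);

		/* stage 3: resolve queue index, prefetch queue structure */
		for (j = 0; j < n; j++)
			q[j] = sched_port_enqueue_qptrs_prefetch0(subports[j],
				sch[j], subport_qmask);

		/* stage 4: prefetch the write location of each queue */
		for (j = 0; j < n; j++) {
			q_base[j] = rte_sched_subport_pipe_qbase(subports[j], q[j]);
			rte_sched_port_enqueue_qwa_prefetch0(port, subports[j],
				q[j], q_base[j]);
		}

		/* stage 5: actual write of packets into their queues */
		for (j = 0; j < n; j++)
			result += rte_sched_port_enqueue_qwa(port, subports[j],
				q[j], q_base[j], pkts[i + j]);
	}

	return result;
}

That way the stack usage and the amount of data read ahead from pkts[] stay
bounded regardless of n_pkts, while all metadata for a chunk is still fetched
before any queue is touched.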
> Exactly. Konstantin, you might not be a fan of prefetches, but the current
> enqueue implementation (as well as the dequeue) uses a prefetch
> state machine. Please keep the prefetch state machine in the scalar code.

It is not about our own preferences. From my measurements the new version is
faster, and it is definitely simpler.

> Even if the examples/qos_sched might not show an advantage,
> this is just a sample app and there are some more relevant use-cases as well.

Well, I hope that examples/qos_sched reflects at least some real-world
use-cases for the QoS library. Otherwise, why do we have it inside the DPDK
codebase?
About 'more relevant use-cases': if you know of such, can you try them with
this patch? I would really appreciate that.
In fact, this is an ask not only to Cristian, but to all other interested
parties: if your app does use librte_sched, please try this patch and provide
feedback. If some tests flag a regression, I am absolutely fine with dropping
the patch.

Konstantin