> -----Original Message-----
> From: Ananyev, Konstantin <konstantin.anan...@intel.com>
> Sent: Thursday, March 18, 2021 11:56 AM
> To: dev@dpdk.org
> Cc: Dumitrescu, Cristian <cristian.dumitre...@intel.com>; Singh, Jasvinder
> <jasvinder.si...@intel.com>; Ananyev, Konstantin
> <konstantin.anan...@intel.com>
> Subject: [PATCH v2 2/2] qos: rearrange enqueue procedure
>
> In many usage scenarios input mbufs for rte_sched_port_enqueue() are not
> yet in the CPU cache(s). That causes quite significant stalls due to memory
> latency. Current implementation tries to migitate it using SW pipeline and SW
> prefetch techniques, but stalls are still present.
> Rework rte_sched_port_enqueue() to do actual fetch of all mbufs metadata
> as a first stage of that function.
> That helps to minimise load stalls at further stages of enqueue() and
> improves overall enqueue performance.
> With examples/qos_sched I observed:
> on ICX box: up to 30% cycles reduction
> on CSX AND BDX: 20-15% cycles reduction
> I also run tests with mbufs already in the cache (one core doing RX, QOS and
> TX).
> With such scenario, on all mentioned above IA boxes no performance drop
> was observed.
>
> Signed-off-by: Konstantin Ananyev <konstantin.anan...@intel.com>
> ---
> v2: fix clang and checkpatch complains
> ---
> lib/librte_sched/rte_sched.c | 219 +++++------------------------------
> 1 file changed, 31 insertions(+), 188 deletions(-)
>
> diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c index
> 7c5688068..41ef147e0 100644
> --- a/lib/librte_sched/rte_sched.c
> +++ b/lib/librte_sched/rte_sched.c
> @@ -1861,24 +1861,23 @@ debug_check_queue_slab(struct
> rte_sched_subport *subport, uint32_t bmp_pos, #endif /*
> RTE_SCHED_DEBUG */
>
> static inline struct rte_sched_subport * -rte_sched_port_subport(struct
> rte_sched_port *port,
> - struct rte_mbuf *pkt)
> +sched_port_subport(const struct rte_sched_port *port, struct
> +rte_mbuf_sched sch)
> {
> - uint32_t queue_id = rte_mbuf_sched_queue_get(pkt);
> + uint32_t queue_id = sch.queue_id;
> uint32_t subport_id = queue_id >> (port-
> >n_pipes_per_subport_log2 + 4);
>
> return port->subports[subport_id];
> }
>
> static inline uint32_t
> -rte_sched_port_enqueue_qptrs_prefetch0(struct rte_sched_subport
> *subport,
> - struct rte_mbuf *pkt, uint32_t subport_qmask)
> +sched_port_enqueue_qptrs_prefetch0(const struct rte_sched_subport
> *subport,
> + struct rte_mbuf_sched sch, uint32_t subport_qmask)
> {
> struct rte_sched_queue *q;
> #ifdef RTE_SCHED_COLLECT_STATS
> struct rte_sched_queue_extra *qe;
> #endif
> - uint32_t qindex = rte_mbuf_sched_queue_get(pkt);
> + uint32_t qindex = sch.queue_id;
> uint32_t subport_queue_id = subport_qmask & qindex;
>
> q = subport->queue + subport_queue_id; @@ -1971,197 +1970,41
> @@ int rte_sched_port_enqueue(struct rte_sched_port *port, struct
> rte_mbuf **pkts,
> uint32_t n_pkts)
> {
> - struct rte_mbuf *pkt00, *pkt01, *pkt10, *pkt11, *pkt20, *pkt21,
> - *pkt30, *pkt31, *pkt_last;
> - struct rte_mbuf **q00_base, **q01_base, **q10_base,
> **q11_base,
> - **q20_base, **q21_base, **q30_base, **q31_base,
> **q_last_base;
> - struct rte_sched_subport *subport00, *subport01, *subport10,
> *subport11,
> - *subport20, *subport21, *subport30, *subport31,
> *subport_last;
> - uint32_t q00, q01, q10, q11, q20, q21, q30, q31, q_last;
> - uint32_t r00, r01, r10, r11, r20, r21, r30, r31, r_last;
> - uint32_t subport_qmask;
> uint32_t result, i;
> + struct rte_mbuf_sched sch[n_pkts];
> + struct rte_sched_subport *subports[n_pkts];
> + struct rte_mbuf **q_base[n_pkts];
> + uint32_t q[n_pkts];
> +
> + const uint32_t subport_qmask =
> + (1 << (port->n_pipes_per_subport_log2 + 4)) - 1;
>
> result = 0;
> - subport_qmask = (1 << (port->n_pipes_per_subport_log2 + 4)) - 1;
>
> - /*
> - * Less then 6 input packets available, which is not enough to
> - * feed the pipeline
> - */
> - if (unlikely(n_pkts < 6)) {
> - struct rte_sched_subport *subports[5];
> - struct rte_mbuf **q_base[5];
> - uint32_t q[5];
> -
> - /* Prefetch the mbuf structure of each packet */
> - for (i = 0; i < n_pkts; i++)
> - rte_prefetch0(pkts[i]);
> -
> - /* Prefetch the subport structure for each packet */
> - for (i = 0; i < n_pkts; i++)
> - subports[i] = rte_sched_port_subport(port, pkts[i]);
> -
> - /* Prefetch the queue structure for each queue */
> - for (i = 0; i < n_pkts; i++)
> - q[i] =
> rte_sched_port_enqueue_qptrs_prefetch0(subports[i],
> - pkts[i], subport_qmask);
> -
> - /* Prefetch the write pointer location of each queue */
> - for (i = 0; i < n_pkts; i++) {
> - q_base[i] =
> rte_sched_subport_pipe_qbase(subports[i], q[i]);
> - rte_sched_port_enqueue_qwa_prefetch0(port,
> subports[i],
> - q[i], q_base[i]);
> - }
> + /* Prefetch the mbuf structure of each packet */
> + for (i = 0; i < n_pkts; i++)
> + sch[i] = pkts[i]->hash.sched;
>
Hi Konstantin, thanks for the patch. In above case, all packets are touched
straight with any prefetch. If we consider the input burst size of 64 pkts, it
means 512 bytes of packet addresses (8 cache-lines) which is likely to be
available in cache. For larger size burst, e.g. 128 or 256, there might be
instances when some addresses are not available the cache, may stall core. How
about adding explicit prefetch before starting to iterate through the packets
if that helps?