> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Ananyev, Konstantin
> Sent: Thursday, November 5, 2020 1:26 AM
>
> > Hi,
> >
> > On Tue, Nov 03, 2020 at 04:03:46PM +0100, Morten Brørup wrote:
> > > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Slava Ovsiienko
> > > > Sent: Tuesday, November 3, 2020 3:03 PM
> > > >
> > > > Hi, Morten
> > > >
> > > > > From: Morten Brørup <m...@smartsharesystems.com>
> > > > > Sent: Tuesday, November 3, 2020 14:10
> > > > >
> > > > > > From: Thomas Monjalon [mailto:tho...@monjalon.net]
> > > > > > Sent: Monday, November 2, 2020 4:58 PM
> > > > > >
> > > > > > +Cc techboard
> > > > > >
> > > > > > We need benchmark numbers in order to take a decision.
> > > > > > Please all, prepare some arguments and numbers so we can discuss
> > > > > > the mbuf layout in the next techboard meeting.
> >
> > I did some quick tests, and it appears to me that just moving the pool
> > pointer to the first cache line does not have a significant impact.
>
> Hmm, as I remember Thomas mentioned about 5%+ improvement with that
> change. Though I suppose a lot depends on the actual test case.
> Would be good to know when it does help and when it doesn't.
>
> > However, I agree with Morten that there is some room for optimization
> > around m->pool: I did a hack in the ixgbe driver to assume there is only
> > one mbuf pool. This greatly simplifies the freeing of mbufs in Tx, because
> > we don't have to group them in bulks that share the same pool (see
> > ixgbe_tx_free_bufs()). The impact of this hack is quite good: +~5% on a
> > real-life forwarding use case.
>
> I think we already have such optimization ability within DPDK:
> #define DEV_TX_OFFLOAD_MBUF_FAST_FREE 0x00010000
> /**< Device supports optimization for fast release of mbufs.
>  * When set application must guarantee that per-queue all mbufs comes from
>  * the same mempool and has refcnt = 1.
>  */
>
> Seems over-optimistic to me, but many PMDs do support it.
Looking at a few drivers using this flag: The Intel drivers use rte_mempool_put(m->pool), and thus still read the second cache line. Only the ThunderX driver seems to take advantage of the optimization, using rte_mempool_put_bulk(q->pool). (A sketch contrasting the two styles follows at the end of this mail.)

I would rather see a generic optimization for freeing non-segmented packets in the mbuf library, where free() and free_seg() take advantage of knowing whether they are working on the first segment or not - like the is_header indicator in many of the mbuf check functions - and, when working on the first segment, gate access to m->next by checking m->nb_segs > 1.

Concept:

/* Forward declarations, so the functions can be presented top-down. */
static __rte_always_inline void
__rte_pktmbuf_free_seg(struct rte_mbuf *m, int is_header);
static __rte_always_inline struct rte_mbuf *
__rte_pktmbuf_prefree_seg(struct rte_mbuf *m, int is_header);

static inline void rte_pktmbuf_free(struct rte_mbuf *m)
{
	struct rte_mbuf *m_next;

	/* NOTE: The sanity check of the header has moved to
	 * __rte_pktmbuf_prefree_seg(). */

	if (m != NULL) {
		if (m->nb_segs == 1) {
			/* Common case: non-segmented packet. */
			__rte_pktmbuf_free_seg(m, 1);
		} else {
			m_next = m->next;
			__rte_pktmbuf_free_seg(m, 1);
			m = m_next;
			while (m != NULL) {
				m_next = m->next;
				__rte_pktmbuf_free_seg(m, 0);
				m = m_next;
			}
		}
	}
}

static __rte_always_inline void
__rte_pktmbuf_free_seg(struct rte_mbuf *m, int is_header)
{
	m = __rte_pktmbuf_prefree_seg(m, is_header);
	if (likely(m != NULL))
		rte_mbuf_raw_free(m);
}

static __rte_always_inline struct rte_mbuf *
__rte_pktmbuf_prefree_seg(struct rte_mbuf *m, int is_header)
{
	__rte_mbuf_sanity_check(m, is_header);

	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
		if (!RTE_MBUF_DIRECT(m)) {
			rte_pktmbuf_detach(m);
			if (RTE_MBUF_HAS_EXTBUF(m) &&
			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
			    __rte_pktmbuf_pinned_extbuf_decref(m))
				return NULL;
		}

		/* NOTE: Avoid touching (writing to) the second cache line. */
		if (is_header && m->nb_segs == 1)
			return m;

		if (m->next != NULL) {
			m->next = NULL;
			m->nb_segs = 1;
		}

		return m;

	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
		if (!RTE_MBUF_DIRECT(m)) {
			rte_pktmbuf_detach(m);
			if (RTE_MBUF_HAS_EXTBUF(m) &&
			    RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
			    __rte_pktmbuf_pinned_extbuf_decref(m))
				return NULL;
		}

		/* NOTE: Avoid touching (writing to) the second cache line. */
		if (is_header && m->nb_segs == 1) {
			rte_mbuf_refcnt_set(m, 1);
			return m;
		}

		if (m->next != NULL) {
			m->next = NULL;
			m->nb_segs = 1;
		}
		rte_mbuf_refcnt_set(m, 1);

		return m;
	}
	return NULL;
}

Furthermore, this concept might provide an additional performance improvement by moving m->pool to the first cache line, so that rte_mempool_put() in rte_mbuf_raw_free() would not have to touch the second cache line either.
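To illustrate where that additional gain would come from, here is a rough sketch of the field placement involved, assuming the current (pre-move) mbuf layout. The assertions are mine, purely for illustration, and should be verified against the actual rte_mbuf_core.h:

#include <stddef.h>
#include <rte_common.h>
#include <rte_mbuf_core.h>

/* nb_segs lives in the first cache line; this is what makes the
 * nb_segs == 1 gate in the concept above cheap. */
_Static_assert(offsetof(struct rte_mbuf, nb_segs) < RTE_CACHE_LINE_MIN_SIZE,
	"nb_segs expected in the first cache line");

/* next and pool currently live in the second cache line, so writing
 * m->next and reading m->pool (in rte_mbuf_raw_free()) both touch it. */
_Static_assert(offsetof(struct rte_mbuf, next) >= RTE_CACHE_LINE_MIN_SIZE,
	"next expected in the second cache line");
_Static_assert(offsetof(struct rte_mbuf, pool) >= RTE_CACHE_LINE_MIN_SIZE,
	"pool expected in the second cache line until it is moved");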
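And, as promised above, a minimal sketch contrasting the two driver styles of freeing under DEV_TX_OFFLOAD_MBUF_FAST_FREE. The struct txq and the function names are hypothetical, not taken from any real PMD:

#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Per-mbuf style (as in the Intel drivers): m->pool is read for every
 * mbuf, touching its second cache line even though the application has
 * guaranteed a single pool per queue. */
static void
tx_free_per_mbuf(struct rte_mbuf **pkts, uint16_t n)
{
	uint16_t i;

	for (i = 0; i < n; i++)
		rte_mempool_put(pkts[i]->pool, pkts[i]);
}

/* Queue-level style (as in the ThunderX driver): the pool pointer is
 * taken from the queue, so the mbufs' second cache lines are not read,
 * and the bulk put amortizes the mempool access. */
struct txq {
	struct rte_mempool *mp; /* the one pool guaranteed by the app */
};

static void
tx_free_bulk(struct txq *q, struct rte_mbuf **pkts, uint16_t n)
{
	rte_mempool_put_bulk(q->mp, (void **)pkts, n);
}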