> 
> Executive Summary:
> 
> My analysis shows that the mbuf library is not a barrier for fast-freeing
> segmented packet mbufs, and thus fast-free of jumbo frames is possible.
> 
> 
> Detailed Analysis:
> 
> The purpose of the mbuf fast-free Tx optimization is to reduce
> rte_pktmbuf_free_seg() to something much simpler in the ethdev drivers, by
> eliminating the code path related to indirect mbufs.
> Optimally, we want to simplify the ethdev driver's function that frees the
> transmitted mbufs, so it can free them directly to their mempool without
> accessing the mbufs themselves.
> 
> If the driver cannot access the mbuf itself, it cannot determine which
> mempool it belongs to.
> We don't want the driver to access every mbuf being freed; but if all
> mbufs of a Tx queue belong to the same mempool, the driver can determine
> which mempool by looking into just one of the mbufs.
> 
> REQUIREMENT 1: The mbufs of a Tx queue must come from the same mempool.
> 
> 
> When an mbuf is freed to its mempool, some of the fields in the mbuf must
> be initialized.
> So, for fast-free, this must be done by the driver's function that
> prepares the Tx descriptor.
> This is a requirement to the driver, not a requirement to the application.
> 
> Now, let's dig into the code for freeing an mbuf.
> Note: For readability purposes, I'll cut out some code and comments
> unrelated to this topic.
> 
> static __rte_always_inline void
> rte_pktmbuf_free_seg(struct rte_mbuf *m)
> {
>       m = rte_pktmbuf_prefree_seg(m);
>       if (likely(m != NULL))
>               rte_mbuf_raw_free(m);
> }
> 
> 
> rte_mbuf_raw_free(m) is simple, so nothing to gain there:
> 
> /**
>  * Put mbuf back into its original mempool.
>  *
>  * The caller must ensure that the mbuf is direct and properly
>  * reinitialized (refcnt=1, next=NULL, nb_segs=1), as done by
>  * rte_pktmbuf_prefree_seg().
>  */
> static __rte_always_inline void
> rte_mbuf_raw_free(struct rte_mbuf *m)
> {
>       rte_mbuf_history_mark(m, RTE_MBUF_HISTORY_OP_LIB_FREE);
>       rte_mempool_put(m->pool, m);
> }
> 
> Note that the description says that the mbuf must be direct.
> This is not entirely accurate; the mbuf is allowed to use a pinned
> external buffer, if the mbuf holds the only reference to it.
> (Most of the mbuf library functions have this documentation inaccuracy,
> which should be fixed some day.)
> 
> So, the fast-free optimization really comes down to
> rte_pktmbuf_prefree_seg(m), which must not return NULL.
> 
> Let's dig into that.
> 
> /**
>  * Decrease reference counter and unlink a mbuf segment
>  *
>  * This function does the same than a free, except that it does not
>  * return the segment to its pool.
>  * It decreases the reference counter, and if it reaches 0, it is
>  * detached from its parent for an indirect mbuf.
>  *
>  * @return
>  *   - (m) if it is the last reference. It can be recycled or freed.
>  *   - (NULL) if the mbuf still has remaining references on it.
>  */
> static __rte_always_inline struct rte_mbuf *
> rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
>       bool refcnt_not_one;
> 
>       refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
>       if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
>               return NULL;
> 
>       if (unlikely(!RTE_MBUF_DIRECT(m))) {
>               rte_pktmbuf_detach(m);
>               if (RTE_MBUF_HAS_EXTBUF(m) &&
>                               RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
>                               __rte_pktmbuf_pinned_extbuf_decref(m))
>                       return NULL;
>       }
> 
>       if (refcnt_not_one)
>               rte_mbuf_refcnt_set(m, 1);
>       if (m->nb_segs != 1)
>               m->nb_segs = 1;
>       if (m->next != NULL)
>               m->next = NULL;
> 
>       return m;
> }
> 
> This function can only succeed (i.e. return non-NULL) when 'refcnt' is 1
> (or reaches 0).
> 
> REQUIREMENT 2: The driver must hold the only reference to the mbuf,
> i.e. 'm->refcnt' must be 1.
> 
> 
> When the function succeeds, it initializes the mbuf fields as required by
> rte_mbuf_raw_free() before returning.
> 
> Now, since the driver has exclusive access to the mbuf, it is free to
> initialize the 'm->next' and 'm->nb_segs' at any time.
> It could do that when preparing the Tx descriptor.
> 
> This is very interesting, because it means that fast-free does not
> prohibit segmented packets!
> (But the driver must have sufficient Tx descriptors for all segments in
> the mbuf.)
> 
> 
> Now, lets dig into rte_pktmbuf_prefree_seg()'s block handling non-direct
> mbufs, i.e. cloned mbufs and mbufs with external buffer:
> 
>       if (unlikely(!RTE_MBUF_DIRECT(m))) {
>               rte_pktmbuf_detach(m);
>               if (RTE_MBUF_HAS_EXTBUF(m) &&
>                               RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
>                               __rte_pktmbuf_pinned_extbuf_decref(m))
>                       return NULL;
>       }
> 
> Starting with rte_pktmbuf_detach():
> 
> static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
> {
>       struct rte_mempool *mp = m->pool;
>       uint32_t mbuf_size, buf_len;
>       uint16_t priv_size;
> 
>       if (RTE_MBUF_HAS_EXTBUF(m)) {
>               /*
>                * The mbuf has the external attached buffer,
>                * we should check the type of the memory pool where
>                * the mbuf was allocated from to detect the pinned
>                * external buffer.
>                */
>               uint32_t flags = rte_pktmbuf_priv_flags(mp);
> 
>               if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
>                       /*
>                        * The pinned external buffer should not be
>                        * detached from its backing mbuf, just exit.
>                        */
>                       return;
>               }
>               __rte_pktmbuf_free_extbuf(m);
>       } else {
>               __rte_pktmbuf_free_direct(m);
>       }
>       priv_size = rte_pktmbuf_priv_size(mp);
>       mbuf_size = (uint32_t)(sizeof(struct rte_mbuf) + priv_size);
>       buf_len = rte_pktmbuf_data_room_size(mp);
> 
>       m->priv_size = priv_size;
>       m->buf_addr = (char *)m + mbuf_size;
>       rte_mbuf_iova_set(m, rte_mempool_virt2iova(m) + mbuf_size);
>       m->buf_len = (uint16_t)buf_len;
>       rte_pktmbuf_reset_headroom(m);
>       m->data_len = 0;
>       m->ol_flags = 0;
> }
> 
> The only quick and simple code path through this function is when the mbuf
> uses a pinned external buffer:
>       if (RTE_MBUF_HAS_EXTBUF(m)) {
>               uint32_t flags = rte_pktmbuf_priv_flags(mp);
>               if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
>                       return;
> 
> REQUIREMENT 3: The mbuf must not be cloned or use a non-pinned external
> buffer.
> 
> 
> Continuing with the next part of rte_pktmbuf_prefree_seg()'s block:
>               if (RTE_MBUF_HAS_EXTBUF(m) &&
>                               RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
>                               __rte_pktmbuf_pinned_extbuf_decref(m))
>                       return NULL;
> 
> Continuing with the next part of the block in rte_pktmbuf_prefree_seg():
> 
> /**
>  * @internal Handle the packet mbufs with attached pinned external buffer
>  * on the mbuf freeing:
>  *
>  *  - return zero if reference counter in shinfo is one. It means there is
>  *  no more reference to this pinned buffer and mbuf can be returned to
>  *  the pool
>  *
>  *  - otherwise (if reference counter is not one), decrement reference
>  *  counter and return non-zero value to prevent freeing the backing mbuf.
>  *
>  * Returns non zero if mbuf should not be freed.
>  */
> static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
> {
>       struct rte_mbuf_ext_shared_info *shinfo;
> 
>       /* Clear flags, mbuf is being freed. */
>       m->ol_flags = RTE_MBUF_F_EXTERNAL;
>       shinfo = m->shinfo;
> 
>       /* Optimize for performance - do not dec/reinit */
>       if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1))
>               return 0;
> 
>       /*
>        * Direct usage of add primitive to avoid
>        * duplication of comparing with one.
>        */
>       if (likely(rte_atomic_fetch_add_explicit(&shinfo->refcnt, -1,
>                                    rte_memory_order_acq_rel) - 1))
>               return 1;
> 
>       /* Reinitialize counter before mbuf freeing. */
>       rte_mbuf_ext_refcnt_set(shinfo, 1);
>       return 0;
> }
> 
> Essentially, if the mbuf does use a pinned external buffer,
> rte_pktmbuf_prefree_seg() only succeeds if that pinned external buffer is
> only referred to by the mbuf.
> 
> REQUIREMENT 4: If the mbuf uses a pinned external buffer, the mbuf must
> hold the only reference to that pinned external buffer, i.e. in that case,
> 'm->shinfo->refcnt' must be 1.
> 
> 
> Please review.
> 
> If I'm not mistaken, the mbuf library is not a barrier for fast-freeing
> segmented packet mbufs, and thus fast-free of jumbo frames is possible.
> 
> We need a driver developer to confirm that my suggested approach -
> resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
> preparing the Tx descriptor - is viable.

Great analysis, makes a lot of sense to me.
Shall we add then a special API to make PMD maintainers life a bit easier:
Something like rte_mbuf_fast_free_prep(mp, mb), that will optionally check
that requirements outlined  above are satisfied for given mbuf and
also reset mbuf fields to expected values?
Konstantin



Reply via email to