On Thu, Apr 26, 2018 at 07:05:01PM +0300, Andrew Rybchenko wrote: > On 04/26/2018 04:10 AM, Yongseok Koh wrote: > > This patch introduces a new way of attaching an external buffer to a mbuf. > > > > Attaching an external buffer is quite similar to mbuf indirection in > > replacing buffer addresses and length of a mbuf, but a few differences: > > - When an indirect mbuf is attached, refcnt of the direct mbuf would be > > 2 as long as the direct mbuf itself isn't freed after the attachment. > > In such cases, the buffer area of a direct mbuf must be read-only. But > > external buffer has its own refcnt and it starts from 1. Unless > > multiple mbufs are attached to a mbuf having an external buffer, the > > external buffer is writable. > > - There's no need to allocate buffer from a mempool. Any buffer can be > > attached with appropriate free callback. > > - Smaller metadata is required to maintain shared data such as refcnt. > > I'm still unsure how reference counters for external buffer work. > > Let's consider the following example: > > |<-mbuf1 buf_len-->|<-mbuf2 buf_len-->| > +--+-----------+----+--------------+----+--------------+---+- - - > | global |head|mbuf1 data |head|mbuf2 data | | > | shinfo=2 |room| |room| | | > +--+-----------+----+--------------+----+--------------+---+- - - > ^ ^ > +----------+ | +----------+ | > | mbuf1 +---+ | mbuf2 +-+ > | refcnt=1 | | refcnt=1 | > +----------+ +----------+ > > I.e. we have big buffer which is sliced into many small data > buffers referenced from mbufs. > > shinfo reference counter is used to control when big buffer > may be freed. But what controls sharing of each block? > > headroom and following mbuf data (buf_len) is owned by > corresponding mbuf and the mbuf owner can do everything > with the space (prepend data, append data, modify etc). > I.e. it is read-write in above terminology. > > What should happen if mbuf1 is cloned? Right now it will result > in a new mbuf1a with reference counter 1 and incremented shinfo > reference counter. And what does say that corresponding area > is read-only now? It looks like nothing. > > As I understand it should be solved using per data area shinfo > which free callback decrements big buffer reference counter.
I have to admit that I was confused at the moment and I mixed two different use-cases. 1) Transmitting a large storage block. |<--mbuf1 buf_len-->|<--mbuf2 buf_len-->| +--+-----------+----+--------------+----+--------------+---+- - - | global |head|mbuf1 data |head|mbuf2 data | | | shinfo=2 |room| |room| | | +--+-----------+----+--------------+----+--------------+---+- - - ^ ^ +----------+ | +----------+ | | mbuf1 +---+ | mbuf2 +-+ | refcnt=1 | | refcnt=1 | +----------+ +----------+ ^ ^ |next |next +-----+----+ +----------+ | mbuf1_hdr| | mbuf2_hdr| | refcnt=1 | | refcnt=1 | +----------+ +----------+ Yes, in this case, the large external buffer should always be read-only. And necessary network headers should be linked via m->next. Owners of m1 or m2 can't alter any bit in the external buffer because shinfo->refcnt > 1. 2) Slicing a large buffer and provide r-w buffers. |<--mbuf1 buf_len-->| |<--mbuf2 buf_len-->| +--+-----------+----+--------------+------+----+--------------+------+----+- - - | user data |head|mbuf1 data |shinfo|head|mbuf2 data |shinfo| | | refc=2 |room| |refc=1|room| |refc=1| | +--+-----------+----+--------------+------+----+--------------+------+----+- - - ^ ^ +----------+ | +----------+ | | mbuf1 +---+ | mbuf2 +-+ | refcnt=1 | | refcnt=1 | +----------+ +----------+ Here, the user data for the large chunk isn't rte_mbuf_ext_shared_info but a custom structure managed by user in order to free the whole chunk. free_cb would decrement a custom refcnt in custom way. But librte_mbuf doesn't need to be aware of it. It is user's responsibility. The library is just responsible for calling free_cb when shinfo->refcnt gets to zero. > So, we have two reference counters per each mbuf with external > buffer (plus reference counter per big buffer). > Two reference counters sounds too much and it looks like > mbuf-with-extbuf reference counter is not really used > (since on clone/attach update shinfo refcnt). > It is still two counters to check on free. Each refcnt implies whether it is r or r-w. Even for direct mbuf, if two users are accessing it, refcnt is 2 and it is read-only. This should mean both mbuf metadata and its data area are all read-only. Users can alter neither various length fields nor packet data, for example. For non-direct mbufs, still its refcnt should be used, but refcnt only represents the metadata is shared and read-only if it is more than 1. So, refcnt of mbuf-with-extbuf is still used. Incrementing refcnt means an entity acquired access to the object, including cases of attachment (indirec/extbuf). > Have you considered alternative approach to use mbuf refcnt > as sharing indicator for extbuf data? However, in this case > indirect referencing extbuf would logically look like: > > +----------+ +--------+ +--------+ > | indirect +--->| extbuf +---->| data | > | mbuf | | mbuf | | | > +----------+ +--------+ +--------+ > > It looks like it would allow to avoid two reference counters > per data block as above. Personally I'm not sure which approach > is better and would like to hear what you and other reviewers > think about it. So, I still think this patch is okay. > Some minor notes below as well. > > > Signed-off-by: Yongseok Koh <ys...@mellanox.com> > > --- > > > > ** This patch can pass the mbuf_autotest. ** > > > > Submitting only non-mlx5 patches to meet deadline for RC1. mlx5 patches > > will be submitted separately rebased on a differnet patchset which > > accommodates new memory hotplug design to mlx PMDs. > > > > v6: > > * rte_pktmbuf_attach_extbuf() doesn't take NULL shinfo. Instead, > > rte_pktmbuf_ext_shinfo_init_helper() is added. > > * bug fix in rte_pktmbuf_attach() - shinfo wasn't saved to mi. > > * minor changes from review. > > > > v5: > > * rte_pktmbuf_attach_extbuf() sets headroom to 0. > > * if shinfo is provided when attaching, user should initialize it. > > * minor changes from review. > > > > v4: > > * rte_pktmbuf_attach_extbuf() takes new arguments - buf_iova and shinfo. > > user can pass memory for shared data via shinfo argument. > > * minor changes from review. > > > > v3: > > * implement external buffer attachment instead of introducing buf_off for > > mbuf indirection. > > > > lib/librte_mbuf/rte_mbuf.h | 335 > > +++++++++++++++++++++++++++++++++++++++++---- > > 1 file changed, 306 insertions(+), 29 deletions(-) > > > > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h > > index 43aaa9c5f..0a6885281 100644 > > --- a/lib/librte_mbuf/rte_mbuf.h > > +++ b/lib/librte_mbuf/rte_mbuf.h [...] > > /** Mbuf prefetch */ > > #define RTE_MBUF_PREFETCH_TO_FREE(m) do { \ > > if ((m) != NULL) \ > > @@ -1213,11 +1307,157 @@ static inline int rte_pktmbuf_alloc_bulk(struct > > rte_mempool *pool, > > } > > /** > > + * Initialize shared data at the end of an external buffer before attaching > > + * to a mbuf by ``rte_pktmbuf_attach_extbuf()``. This is not a mandatory > > + * initialization but a helper function to simply spare a few bytes at the > > + * end of the buffer for shared data. If shared data is allocated > > + * separately, this should not be called but application has to properly > > + * initialize the shared data according to its need. > > + * > > + * Free callback and its argument is saved and the refcnt is set to 1. > > + * > > + * @warning > > + * buf_len must be adjusted to RTE_PTR_DIFF(shinfo, buf_addr) after this > > + * initialization. For example, > > May be buf_len should be inout and it should be done by the function? > Just a question since current approach looks fragile. Yeah, I thought about that but I didn't want to alter user's variable, I thought it could be error-prone. Anyway either way is okay to me. Will wait for a day to get input because I will send out a new version (hopefully last :-) to fix the nit you mentioned below. > > + * > > + * struct rte_mbuf_ext_shared_info *shinfo = > > + * rte_pktmbuf_ext_shinfo_init_helpfer(buf_addr, buf_len, > > + * free_cb, fcb_arg); > > + * buf_len = RTE_PTR_DIFF(shinfo, buf_addr); > > + * rte_pktmbuf_attach_extbuf(m, buf_addr, buf_iova, buf_len, shinfo); > > + * > > + * @param buf_addr > > + * The pointer to the external buffer. > > + * @param buf_len > > + * The size of the external buffer. buf_len must be larger than the size > > + * of ``struct rte_mbuf_ext_shared_info`` and padding for alignment. If > > + * not enough, this function will return NULL. > > + * @param free_cb > > + * Free callback function to call when the external buffer needs to be > > + * freed. > > + * @param fcb_opaque > > + * Argument for the free callback function. > > + * > > + * @return > > + * A pointer to the initialized shared data on success, return NULL > > + * otherwise. > > + */ > > +static inline struct rte_mbuf_ext_shared_info * > > +rte_pktmbuf_ext_shinfo_init_helper(void *buf_addr, uint16_t buf_len, > > + rte_mbuf_extbuf_free_callback_t free_cb, void *fcb_opaque) > > +{ > > + struct rte_mbuf_ext_shared_info *shinfo; > > + void *buf_end = RTE_PTR_ADD(buf_addr, buf_len); > > + > > + shinfo = RTE_PTR_ALIGN_FLOOR(RTE_PTR_SUB(buf_end, > > + sizeof(*shinfo)), sizeof(uintptr_t)); > > + if ((void *)shinfo <= buf_addr) > > + return NULL; > > + > > + rte_mbuf_ext_refcnt_set(shinfo, 1); > > + > > + shinfo->free_cb = free_cb; > > + shinfo->fcb_opaque = fcb_opaque; > > Just a nit, but I'd suggest to initialize in the same order as in the > struct. > (if there is no reasons why reference counter should be initialized first) Will do. Thanks, Yongseok