On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Br?rup wrote: > Comments inline. > > Med venlig hilsen / kind regards > - Morten Br?rup > > > -----Original Message----- > > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson > > Sent: Monday, October 24, 2016 6:26 PM > > To: Wiles, Keith > > Cc: Morten Br?rup; dev at dpdk.org; Olivier Matz > > Subject: Re: [dpdk-dev] mbuf changes > > > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote: > > > > > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup <mb at > > > > smartsharesystems.com> > > wrote: > > > > > > > > First of all: Thanks for a great DPDK Userspace 2016! > > > > > > > > > > > > > > > > Continuing the Userspace discussion about Olivier Matz?s proposed mbuf > > changes... > > > > Thanks for keeping the discussion going! > > > > > > > > > > > > > > > > 1. > > > > > > > > Stephen Hemminger had a noteworthy general comment about keeping > > metadata for the NIC in the appropriate section of the mbuf: Metadata > > generated by the NIC?s RX handler belongs in the first cache line, and > > metadata required by the NIC?s TX handler belongs in the second cache line. > > This also means that touching the second cache line on ingress should be > > avoided if possible; and Bruce Richardson mentioned that for this reason m- > > >next was zeroed on free(). > > > > > > Thinking about it, I suspect there are more fields we can reset on free > > to save time on alloc. Refcnt, as discussed below is one of them, but so > > too could be the nb_segs field and possibly others. > Yes. Consider the use of rte_pktmbuf_reset() or add a rte_pktmbuf_prealloc() > for this purpose. > Possibly. We just want to make sure that we don't reset too much either! :-)
> > > > > 2. > > > > > > > > There seemed to be consensus that the size of m->refcnt should match > > the size of m->port because a packet could be duplicated on all physical > > ports for L3 multicast and L2 flooding. > > > > > > > > Furthermore, although a single physical machine (i.e. a single server) > > with 255 physical ports probably doesn?t exist, it might contain more than > > 255 virtual machines with a virtual port each, so it makes sense extending > > these mbuf fields from 8 to 16 bits. > > > > > > I thought we also talked about removing the m->port from the mbuf as it > > is not really needed. > > > > > Yes, this was mentioned, and also the option of moving the port value to > > the second cacheline, but it appears that NXP are using the port value > > in their NIC drivers for passing in metadata, so we'd need their > > agreement on any move (or removal). > > > If a single driver instance services multiple ports, so the ports are not > polled individually, the m->port member will be required to identify the > port. In that case, it might also used elsewhere in the ingress path, and > thus appropriate having in the first cache line. So yes, this needs further > discussion with NXP. > > > > > > 3. > > > > > > > > Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to > > the second cache line, which then generated questions from the audience > > about the real life purpose of m->port, and if m->port could be removed > > from the mbuf structure. > > > > > > > > > > > > > > > > 4. > > > > > > > > I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on > > first allocation. This is based on the assumption that other mbuf fields > > must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper > > than setting it to 1. > > > > > > > > Furthermore (regardless of m->refcnt offset), I suggested that it is > > not required to modify m->refcnt when allocating and freeing the mbuf, thus > > saving one write operation on both alloc() and free(). However, this > > assumes that m->refcnt debugging, e.g. underrun detection, is not required. > > > > I don't think it really matters what sentinal value is used for the > > refcnt because it can't be blindly assigned on free like other fields. > > However, I think 0 as first reference value becomes more awkward > > than 1, because we need to deal with underflow. Consider the situation > > where we have two references to the mbuf, so refcnt is 1, and both are > > freed at the same time. Since the refcnt is not-zero, then both cores > > will do an atomic decrement simultaneously giving a refcnt of -1. We can > > then set this back to zero before freeing, however, I'd still prefer to > > have refcnt be an accurate value so that it always stays positive, and > > we can still set it to "one" on free to avoid having to set on alloc. > > > Good example. It explains very clearly why my suggested optimization is > tricky. My optimization is targeting the most likely scenario where m->refcnt > is only "one", but at the expense of some added complexity for the unlikely > scenarios. (I only suggested 0 as "one" because I assumed the value could be > zeroed faster than set to 1, so forget about that, if you like.) > > I still argue that if we keep m->refcnt at "one" value while in the free > pool, a write operation can be avoided at both free() and alloc(). Here is a > conceptual example (using m->refcnt==1 for "one"): > Yes, I agree with you on this. > void __rte_mbuf_raw_free(struct rte_mbuf *m) > { > RTE_ASSERT(rte_mbuf_refcnt_read(m) == 1); > rte_mempool_put(m->pool, m); > } > > void __rte_pktmbuf_free_seg_inner(struct rte_mbuf *m) > { > /* if this is an indirect mbuf, it is detached. */ > if (RTE_MBUF_INDIRECT(m)) > rte_pktmbuf_detach(m); > m->next = NULL; /* Preload to avoid setting in alloc(). */ > __rte_mbuf_raw_free(m); > } > > void rte_pktmbuf_free_seg(struct rte_mbuf *m) > { > if (likely(rte_mbuf_refcnt_read(m) == 1)) { > /* We have the last reference, so we return it to the free > pool. */ > __rte_pktmbuf_free_seg_inner(m); > } else { > /* More references, so decrement the refcnt. */ > if (unlikely(rte_mbuf_refcnt_update(m, -1) == 0)) { > /* Someone raced to decrement the refcnt before us, > so now we have the last reference, > so we return it to the free pool. */ > rte_mbuf_refcnt_set(m, 1); /* Preload to avoid in alloc(). > */ > __rte_pktmbuf_free_seg_inner(m); > } > } > } > > > > Also, if we set refcnt on free rather than alloc, it does set itself up > > as a good candidate for moving to the second cacheline. Fast-path > > processing does not normally update the value. > > > Agreed. But also consider that TX handling should be optimized too. And this > includes free() processing and preloading values for alloc(); this should > only be done in free() instead of alloc() if it provides a total performance > improvement. We should not improve RX handling performance if the TX handling > performance decreases much more. Agreed. > > > > > > 5. > > > > > > > > And here?s something new to think about: > > > > > > > > m->next already reveals if there are more segments to a packet. Which > > purpose does m->nb_segs serve that is not already covered by m->next? > > > > It is duplicate info, but nb_segs can be used to check the validity of > > the next pointer without having to read the second mbuf cacheline. > > > > Whether it's worth having is something I'm happy enough to discuss, > > though. > If anybody needs to check for additional segments (m->next != NULL) without > hitting the second cache line, a single bit in the first cache line will > suffice. (If this is a common case, they probably need to read the m->next > value anyway, and then it would optimally belong in the first cache line. I > am just mentioning this. It's not my preference.) > > > > One other point I'll mention is that we need to have a discussion on > > how/where to add in a timestamp value into the mbuf. Personally, I think > > it can be in a union with the sequence number value, but I also suspect > > that 32-bits of a timestamp is not going to be enough for many. > > > > Thoughts? > Speaking for myself, a high resolution RX timestamp is "nice to have", so I > like it, but I would hate to sacrifice the sequence number for it. Generally, > I would assume that the sequence number used for reordering is more important > than a timestamp, except for applications dealing with timing measurements or > high precision packet capturing. > > Conversion of timestamps from one unit to another can be relatively > expensive, so I recommend keeping timestamps in whatever resolution the NIC > hardware provides, leaving the conversion to e.g. nanoseconds up to the > applications that use timestamps. A lot of protocols use opaque timestamps, > e.g. RTP, so this would not be exceptional. Since the timestamp belongs in > the first cache line, I suggest we limit it to 32 bit. This can be the least > significant 32 bit of a 64 bit NIC hardware timestamp, leaving it up to the > application to keep track of the most significant 32 bit by some other means. > If it's counting nanoseconds, 32 bit can hold 4 seconds. > I worry that 32-bits is not enough. If we do use a 32-bit value, I also think that in many cases we can't just take the low 32-bits. For a 2GHz IA CPU the rdtsc() value would wrap around in just 2 seconds if truncated to 32 bits, for instance. In a case like that, we may be better to take 32-bits from somewhere in the middle - trading accuracy for better wrap-around time. /Bruce