On Mon, Oct 24, 2016 at 11:47:16PM +0200, Morten Br?rup wrote:
> Comments inline.
> 
> Med venlig hilsen / kind regards
> - Morten Br?rup
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Monday, October 24, 2016 6:26 PM
> > To: Wiles, Keith
> > Cc: Morten Br?rup; dev at dpdk.org; Olivier Matz
> > Subject: Re: [dpdk-dev] mbuf changes
> > 
> > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote:
> > >
> > > > On Oct 24, 2016, at 10:49 AM, Morten Br?rup <mb at 
> > > > smartsharesystems.com>
> > wrote:
> > > >
> > > > First of all: Thanks for a great DPDK Userspace 2016!
> > > >
> > > >
> > > >
> > > > Continuing the Userspace discussion about Olivier Matz?s proposed mbuf
> > changes...
> > 
> > Thanks for keeping the discussion going!
> > > >
> > > >
> > > >
> > > > 1.
> > > >
> > > > Stephen Hemminger had a noteworthy general comment about keeping
> > metadata for the NIC in the appropriate section of the mbuf: Metadata
> > generated by the NIC?s RX handler belongs in the first cache line, and
> > metadata required by the NIC?s TX handler belongs in the second cache line.
> > This also means that touching the second cache line on ingress should be
> > avoided if possible; and Bruce Richardson mentioned that for this reason m-
> > >next was zeroed on free().
> > > >
> > Thinking about it, I suspect there are more fields we can reset on free
> > to save time on alloc. Refcnt, as discussed below is one of them, but so
> > too could be the nb_segs field and possibly others.
> Yes. Consider the use of rte_pktmbuf_reset() or add a rte_pktmbuf_prealloc() 
> for this purpose.
> 
Possibly. We just want to make sure that we don't reset too much either!
:-)

> 
> > > > 2.
> > > >
> > > > There seemed to be consensus that the size of m->refcnt should match
> > the size of m->port because a packet could be duplicated on all physical
> > ports for L3 multicast and L2 flooding.
> > > >
> > > > Furthermore, although a single physical machine (i.e. a single server)
> > with 255 physical ports probably doesn?t exist, it might contain more than
> > 255 virtual machines with a virtual port each, so it makes sense extending
> > these mbuf fields from 8 to 16 bits.
> > >
> > > I thought we also talked about removing the m->port from the mbuf as it
> > is not really needed.
> > >
> > Yes, this was mentioned, and also the option of moving the port value to
> > the second cacheline, but it appears that NXP are using the port value
> > in their NIC drivers for passing in metadata, so we'd need their
> > agreement on any move (or removal).
> > 
> If a single driver instance services multiple ports, so the ports are not 
> polled individually, the m->port member will be required to identify the 
> port. In that case, it might also used elsewhere in the ingress path, and 
> thus appropriate having in the first cache line. So yes, this needs further 
> discussion with NXP.
> 
> 
> > > > 3.
> > > >
> > > > Someone (Bruce Richardson?) suggested moving m->refcnt and m->port to
> > the second cache line, which then generated questions from the audience
> > about the real life purpose of m->port, and if m->port could be removed
> > from the mbuf structure.
> > > >
> > > >
> > > >
> > > > 4.
> > > >
> > > > I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 on
> > first allocation. This is based on the assumption that other mbuf fields
> > must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is cheaper
> > than setting it to 1.
> > > >
> > > > Furthermore (regardless of m->refcnt offset), I suggested that it is
> > not required to modify m->refcnt when allocating and freeing the mbuf, thus
> > saving one write operation on both alloc() and free(). However, this
> > assumes that m->refcnt debugging, e.g. underrun detection, is not required.
> > 
> > I don't think it really matters what sentinal value is used for the
> > refcnt because it can't be blindly assigned on free like other fields.
> > However, I think 0 as first reference value becomes more awkward
> > than 1, because we need to deal with underflow. Consider the situation
> > where we have two references to the mbuf, so refcnt is 1, and both are
> > freed at the same time. Since the refcnt is not-zero, then both cores
> > will do an atomic decrement simultaneously giving a refcnt of -1. We can
> > then set this back to zero before freeing, however, I'd still prefer to
> > have refcnt be an accurate value so that it always stays positive, and
> > we can still set it to "one" on free to avoid having to set on alloc.
> > 
> Good example. It explains very clearly why my suggested optimization is 
> tricky. My optimization is targeting the most likely scenario where m->refcnt 
> is only "one", but at the expense of some added complexity for the unlikely 
> scenarios. (I only suggested 0 as "one" because I assumed the value could be 
> zeroed faster than set to 1, so forget about that, if you like.)
> 
> I still argue that if we keep m->refcnt at "one" value while in the free 
> pool, a write operation can be avoided at both free() and alloc(). Here is a 
> conceptual example (using m->refcnt==1 for "one"):
> 
Yes, I agree with you on this.

> void __rte_mbuf_raw_free(struct rte_mbuf *m)
> {
>       RTE_ASSERT(rte_mbuf_refcnt_read(m) == 1);
>       rte_mempool_put(m->pool, m);
> }
> 
> void __rte_pktmbuf_free_seg_inner(struct rte_mbuf *m)
> {
>       /* if this is an indirect mbuf, it is detached. */
>       if (RTE_MBUF_INDIRECT(m))
>               rte_pktmbuf_detach(m);
>       m->next = NULL; /* Preload to avoid setting in alloc(). */
>       __rte_mbuf_raw_free(m);
> }
> 
> void rte_pktmbuf_free_seg(struct rte_mbuf *m)
> {
>       if (likely(rte_mbuf_refcnt_read(m) == 1)) {
>               /* We have the last reference, so we return it to the free 
> pool. */
>               __rte_pktmbuf_free_seg_inner(m);
>       } else {
>               /* More references, so decrement the refcnt. */
>               if (unlikely(rte_mbuf_refcnt_update(m, -1) == 0)) {
>                       /* Someone raced to decrement the refcnt before us,
>                          so now we have the last reference,
>                          so we return it to the free pool. */
>                   rte_mbuf_refcnt_set(m, 1); /* Preload to avoid in alloc(). 
> */
>                       __rte_pktmbuf_free_seg_inner(m);
>               }
>       }
> }
> 
> 
> > Also, if we set refcnt on free rather than alloc, it does set itself up
> > as a good candidate for moving to the second cacheline. Fast-path
> > processing does not normally update the value.
> > 
> Agreed. But also consider that TX handling should be optimized too. And this 
> includes free() processing and preloading values for alloc(); this should 
> only be done in free() instead of alloc() if it provides a total performance 
> improvement. We should not improve RX handling performance if the TX handling 
> performance decreases much more.

Agreed.
> 
> 
> > > > 5.
> > > >
> > > > And here?s something new to think about:
> > > >
> > > > m->next already reveals if there are more segments to a packet. Which
> > purpose does m->nb_segs serve that is not already covered by m->next?
> > 
> > It is duplicate info, but nb_segs can be used to check the validity of
> > the next pointer without having to read the second mbuf cacheline.
> > 
> > Whether it's worth having is something I'm happy enough to discuss,
> > though.
> If anybody needs to check for additional segments (m->next != NULL) without 
> hitting the second cache line, a single bit in the first cache line will 
> suffice. (If this is a common case, they probably need to read the m->next 
> value anyway, and then it would optimally belong in the first cache line. I 
> am just mentioning this. It's not my preference.)
> 
> 
> > One other point I'll mention is that we need to have a discussion on
> > how/where to add in a timestamp value into the mbuf. Personally, I think
> > it can be in a union with the sequence number value, but I also suspect
> > that 32-bits of a timestamp is not going to be enough for many.
> > 
> > Thoughts?
> Speaking for myself, a high resolution RX timestamp is "nice to have", so I 
> like it, but I would hate to sacrifice the sequence number for it. Generally, 
> I would assume that the sequence number used for reordering is more important 
> than a timestamp, except for applications dealing with timing measurements or 
> high precision packet capturing.
> 
> Conversion of timestamps from one unit to another can be relatively 
> expensive, so I recommend keeping timestamps in whatever resolution the NIC 
> hardware provides, leaving the conversion to e.g. nanoseconds up to the 
> applications that use timestamps. A lot of protocols use opaque timestamps, 
> e.g. RTP, so this would not be exceptional. Since the timestamp belongs in 
> the first cache line, I suggest we limit it to 32 bit. This can be the least 
> significant 32 bit of a 64 bit NIC hardware timestamp, leaving it up to the 
> application to keep track of the most significant 32 bit by some other means. 
> If it's counting nanoseconds, 32 bit can hold 4 seconds.
> 

I worry that 32-bits is not enough. If we do use a 32-bit value, I also
think that in many cases we can't just take the low 32-bits. For a 2GHz
IA CPU the rdtsc() value would wrap around in just 2 seconds if
truncated to 32 bits, for instance. In a case like that, we may be
better to take 32-bits from somewhere in the middle - trading accuracy
for better wrap-around time.

/Bruce

Reply via email to