Comments inline. Med venlig hilsen / kind regards - Morten Br?rup
> -----Original Message----- > From: Shreyansh Jain [mailto:shreyansh.jain at nxp.com] > Sent: Tuesday, October 25, 2016 1:54 PM > To: Bruce Richardson > Cc: Wiles, Keith; Morten Br?rup; dev at dpdk.org; Olivier Matz > Subject: Re: mbuf changes > > On Monday 24 October 2016 09:55 PM, Bruce Richardson wrote: > > On Mon, Oct 24, 2016 at 04:11:33PM +0000, Wiles, Keith wrote: > >> > >>> On Oct 24, 2016, at 10:49 AM, Morten Br?rup > <mb at smartsharesystems.com> wrote: > >>> > >>> First of all: Thanks for a great DPDK Userspace 2016! > >>> > >>> > >>> > >>> Continuing the Userspace discussion about Olivier Matz?s proposed > mbuf changes... > > > > Thanks for keeping the discussion going! > >>> > >>> > >>> > >>> 1. > >>> > >>> Stephen Hemminger had a noteworthy general comment about keeping > metadata for the NIC in the appropriate section of the mbuf: Metadata > generated by the NIC?s RX handler belongs in the first cache line, and > metadata required by the NIC?s TX handler belongs in the second cache > line. This also means that touching the second cache line on ingress > should be avoided if possible; and Bruce Richardson mentioned that for > this reason m->next was zeroed on free(). > >>> > > Thinking about it, I suspect there are more fields we can reset on > free > > to save time on alloc. Refcnt, as discussed below is one of them, but > so > > too could be the nb_segs field and possibly others. > > > >>> > >>> > >>> 2. > >>> > >>> There seemed to be consensus that the size of m->refcnt should > match the size of m->port because a packet could be duplicated on all > physical ports for L3 multicast and L2 flooding. > >>> > >>> Furthermore, although a single physical machine (i.e. a single > server) with 255 physical ports probably doesn?t exist, it might > contain more than 255 virtual machines with a virtual port each, so it > makes sense extending these mbuf fields from 8 to 16 bits. > >> > >> I thought we also talked about removing the m->port from the mbuf as > it is not really needed. > >> > > Yes, this was mentioned, and also the option of moving the port value > to > > the second cacheline, but it appears that NXP are using the port > value > > in their NIC drivers for passing in metadata, so we'd need their > > agreement on any move (or removal). > > I am not sure where NXP's NIC came into picture on this, but now that > it > is highlighted, this field is required for libevent implementation [1]. > > A scheduler sending an event, which can be a packet, would only have > information of a flow_id. From this matching it back to a port, without > mbuf->port, would be very difficult (costly). There may be way around > this but at least in current proposal I think port would be important > to > have - even if in second cache line. > Since it's a natural part of the RX handler, it rightfully belongs in the first cache line. Admitting that I haven't read the referenced proposal in detail, I would like to reiterate that perhaps the Flow Director fields in the mbuf could be used instead of the port field. > But, off the top of my head, as of now it is not being used for any > specific purpose in NXP's PMD implementation. > > Even the SoC patches don't necessarily rely on it except using it > because it is available. > > @Bruce: where did you get the NXP context here from? > > [1] http://dpdk.org/ml/archives/dev/2016-October/048592.html > > > > >>> > >>> > >>> > >>> 3. > >>> > >>> Someone (Bruce Richardson?) suggested moving m->refcnt and m->port > to the second cache line, which then generated questions from the > audience about the real life purpose of m->port, and if m->port could > be removed from the mbuf structure. > >>> > >>> > >>> > >>> 4. > >>> > >>> I suggested using offset -1 for m->refcnt, so m->refcnt becomes 0 > on first allocation. This is based on the assumption that other mbuf > fields must be zeroed at alloc()/free() anyway, so zeroing m->refcnt is > cheaper than setting it to 1. > >>> > >>> Furthermore (regardless of m->refcnt offset), I suggested that it > is not required to modify m->refcnt when allocating and freeing the > mbuf, thus saving one write operation on both alloc() and free(). > However, this assumes that m->refcnt debugging, e.g. underrun > detection, is not required. > > > > I don't think it really matters what sentinal value is used for the > > refcnt because it can't be blindly assigned on free like other > fields. > > However, I think 0 as first reference value becomes more awkward > > than 1, because we need to deal with underflow. Consider the > situation > > where we have two references to the mbuf, so refcnt is 1, and both > are > > freed at the same time. Since the refcnt is not-zero, then both cores > > will do an atomic decrement simultaneously giving a refcnt of -1. We > can > > then set this back to zero before freeing, however, I'd still prefer > to > > have refcnt be an accurate value so that it always stays positive, > and > > we can still set it to "one" on free to avoid having to set on alloc. > > > > Also, if we set refcnt on free rather than alloc, it does set itself > up > > as a good candidate for moving to the second cacheline. Fast-path > > processing does not normally update the value. > > > >>> > >>> > >>> > >>> 5. > >>> > >>> And here?s something new to think about: > >>> > >>> m->next already reveals if there are more segments to a packet. > Which purpose does m->nb_segs serve that is not already covered by m- > >next? > > > > It is duplicate info, but nb_segs can be used to check the validity > of > > the next pointer without having to read the second mbuf cacheline. > > > > Whether it's worth having is something I'm happy enough to discuss, > > though. > > > > One other point I'll mention is that we need to have a discussion on > > how/where to add in a timestamp value into the mbuf. Personally, I > think > > it can be in a union with the sequence number value, but I also > suspect > > that 32-bits of a timestamp is not going to be enough for many. > > > > Thoughts? > > > > /Bruce > > > > > -- > - > Shreyansh