> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Slava Ovsiienko
> Sent: Tuesday, November 3, 2020 3:03 PM
> 
> Hi, Morten
> 
> > From: Morten Brørup <m...@smartsharesystems.com>
> > Sent: Tuesday, November 3, 2020 14:10
> >
> > > From: Thomas Monjalon [mailto:tho...@monjalon.net]
> > > Sent: Monday, November 2, 2020 4:58 PM
> > >
> > > +Cc techboard
> > >
> > > We need benchmark numbers in order to take a decision.
> > > Please all, prepare some arguments and numbers so we can discuss the
> > > mbuf layout in the next techboard meeting.
> >
> > I propose that the techboard consider this from two angles:
> >
> > 1. Long term goals and their relative priority. I.e. what can be
> > achieved with wide-ranging modifications, requiring yet another ABI
> > break and due notices.
> >
> > 2. Short term goals, i.e. what can be achieved for this release.
> >
> >
> > My suggestions follow...
> >
> > 1. Regarding long term goals:
> >
> > I have argued that simple forwarding of non-segmented packets using
> > only the first mbuf cache line can be achieved by making three
> > modifications:
> >
> > a) Move m->tx_offload to the first cache line.
> Not all PMDs use this field on Tx. HW might support the checksum
> offloads directly, not requiring these fields at all.
> 
> 
> > b) Use an 8 bit pktmbuf mempool index in the first cache line,
> >    instead of the 64 bit m->pool pointer in the second cache line.
> 256 mempools look like enough to me. Regarding the indirect access to
> the pool (via some table) - it might introduce some performance impact.

It might, but I hope that it is negligible, so the benefits outweigh the 
disadvantages.

It would have to be measured, though.

And m->pool is only used for free()'ing (and detach()'ing) mbufs.
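
A minimal sketch of what the 8 bit index lookup could look like, assuming a
hypothetical registry of mempool pointers maintained by the mempool library
(the table, the pool_idx field and the accessor name are all made up for
illustration):

    #include <rte_mbuf.h>

    /* Hypothetical table of up to 256 registered mempools. */
    extern struct rte_mempool *rte_mempool_table[UINT8_MAX + 1];

    static inline struct rte_mempool *
    rte_mbuf_pool(const struct rte_mbuf *m)
    {
        /* pool_idx would be an 8 bit field in the mbuf's first cache
         * line, replacing the 64 bit m->pool pointer for free()/detach(). */
        return rte_mempool_table[m->pool_idx];
    }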

> For example, the mlx5 PMD strongly relies on the pool field for
> allocating mbufs in the Rx datapath. We're going to update that
> (oh-oh, we found a point to optimize), but for now it does.

Without looking at the source code, I don't think the PMD is using m->pool in
the RX datapath; I think it is using a pool dedicated to a receive queue, used
for RX descriptors in the PMD (i.e. driver->queue->pool).
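
Just to illustrate what I mean by driver->queue->pool (a schematic sketch
only, not the actual mlx5 code; the rxq struct and rxq_refill() are invented):

    #include <rte_mbuf.h>

    struct rxq {
        struct rte_mempool *mp; /* attached at rte_eth_rx_queue_setup() */
        /* ... descriptor ring, etc. ... */
    };

    /* Rx replenishment takes fresh mbufs from the queue's own mempool,
     * so m->pool of the received mbufs is never read here. */
    static void
    rxq_refill(struct rxq *q, struct rte_mbuf **bufs, unsigned int n)
    {
        if (rte_pktmbuf_alloc_bulk(q->mp, bufs, n) != 0)
            return; /* allocation failure: retry later */
    }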

> 
> > c) Do not access m->next when we know that it is NULL.
> >    We can use m->nb_segs == 1 or some other invariant as the gate.
> >    It can be implemented by adding an m->next accessor function:
> >    struct rte_mbuf * rte_mbuf_next(struct rte_mbuf * m)
> >    {
> >        return m->nb_segs == 1 ? NULL : m->next;
> >    }
> 
> Sorry, not sure about this. IIRC, nb_segs is valid in the first
> segment/mbuf only. If we have 4 segments in the packet, we see
> nb_segs=4 in the first one and nb_segs=1 in the others. The next field
> is NULL in the last mbuf only. Am I wrong and missing something?

You are correct.

This would have to be updated too, either by increasing m->nb_segs in the
following segments, or by splitting up the relevant functions into functions
for working on first segments (incl. non-segmented packets) and functions for
working on the following segments of segmented packets.
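
If we go with the first option, the chaining code could keep nb_segs valid in
every segment as "number of segments from this one to the end", so the
accessor above works on any segment. A rough sketch (purely illustrative, not
existing mbuf API; refcount and offload details are omitted):

    #include <rte_mbuf.h>

    /* Append a tail chain to a head chain while maintaining the proposed
     * invariant that nb_segs counts the remaining segments in every mbuf. */
    static inline void
    sketch_chain_append(struct rte_mbuf *head, struct rte_mbuf *tail)
    {
        struct rte_mbuf *m = head;

        while (m->nb_segs > 1) {        /* every non-last segment ... */
            m->nb_segs += tail->nb_segs;
            m = m->next;
        }
        m->nb_segs += tail->nb_segs;    /* ... and the old last segment */
        m->next = tail;
        head->pkt_len += tail->pkt_len;
    }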

> 
> > Regarding the priority of this goal, I guess that simple forwarding
> > of non-segmented packets is probably the path taken by the majority
> > of packets handled by DPDK.
> >
> > An alternative goal could be:
> > Do not touch the second cache line during RX.
> > A comment in the mbuf structure says so, but it is not true anymore.
> >
> > (I guess that regression testing didn't catch this because the tests
> > perform TX immediately after RX, so the cache miss just moves from
> > the TX to the RX part of the test application.)
> >
> >
> > 2. Regarding short term goals:
> >
> > The current DPDK source code looks to me like m->next is the most
> > frequently accessed field in the second cache line, so it makes sense
> > to move this to the first cache line, rather than m->pool.
> > Benchmarking may help here.
> 
> Moreover, for segmented packets the packet size is supposed to be
> large, which imposes a relatively low packet rate, so the optimization
> of moving next to the 1st cache line might be negligible. Just compare
> 148 Mpps of 64B pkts and 4 Mpps of 3000B pkts over a 100Gbps link.
> Currently we are benchmarking and have not yet succeeded in finding a
> difference. The benefit can't be expressed as an mpps delta; we should
> measure CPU clocks, but the Rx queue is almost always empty - we have
> empty loops. So, if there is a boost, it is extremely hard to catch.

Very good point regarding the value of such an optimization, Slava!
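
(Those rates follow directly from the 20 byte per-packet wire overhead:
100 Gbit/s / ((64 + 20) bytes * 8) ~= 148.8 Mpps, versus
100 Gbit/s / ((3000 + 20) bytes * 8) ~= 4.1 Mpps, i.e. roughly 36 times more
CPU cycles available per packet at the larger size.)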

And when free()'ing packets, both m->next and m->pool are touched.

So perhaps the free()/detach() functions in the mbuf library can be modified to 
handle first segments (and non-segmented packets) and following segments 
differently, so accessing m->next can be avoided for non-segmented packets. 
Then m->pool should be moved to the first cache line.
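
Something along these lines (only a sketch of the idea; refcount and indirect
mbuf handling are omitted, the helper name is invented, and it assumes m->pool
- or a pool index - is available in the first cache line):

    #include <rte_branch_prediction.h>
    #include <rte_mbuf.h>

    static inline void
    sketch_pktmbuf_free(struct rte_mbuf *m)
    {
        if (likely(m->nb_segs == 1)) {
            /* Non-segmented: m->next is known to be NULL, so the second
             * cache line is never touched; the mbuf goes straight back
             * into its pool. */
            rte_mbuf_raw_free(m);
            return;
        }
        /* Segmented: fall back to the regular walk over m->next. */
        rte_pktmbuf_free(m);
    }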
