Hi, Morten

> -----Original Message-----
> From: Morten Brørup <m...@smartsharesystems.com>
> Sent: Tuesday, November 3, 2020 14:10
> To: NBU-Contact-Thomas Monjalon <tho...@monjalon.net>; dev@dpdk.org;
> techbo...@dpdk.org
> Cc: Ajit Khaparde <ajit.khapa...@broadcom.com>; Ananyev, Konstantin
> <konstantin.anan...@intel.com>; Andrew Rybchenko
> <andrew.rybche...@oktetlabs.ru>; dev@dpdk.org; Yigit, Ferruh
> <ferruh.yi...@intel.com>; david.march...@redhat.com; Richardson, Bruce
> <bruce.richard...@intel.com>; olivier.m...@6wind.com; jer...@marvell.com;
> Slava Ovsiienko <viachesl...@nvidia.com>; honnappa.nagaraha...@arm.com;
> maxime.coque...@redhat.com; step...@networkplumber.org;
> hemant.agra...@nxp.com; Slava Ovsiienko <viachesl...@nvidia.com>; Matan
> Azrad <ma...@nvidia.com>; Shahaf Shuler <shah...@nvidia.com>
> Subject: RE: [dpdk-dev] [PATCH 15/15] mbuf: move pool pointer in hotter first
> half
> 
> > From: Thomas Monjalon [mailto:tho...@monjalon.net]
> > Sent: Monday, November 2, 2020 4:58 PM
> >
> > +Cc techboard
> >
> > We need benchmark numbers in order to take a decision.
> > Please all, prepare some arguments and numbers so we can discuss the
> > mbuf layout in the next techboard meeting.
> 
> I propose that the techboard considers this from two angels:
> 
> 1. Long term goals and their relative priority. I.e. what can be achieved with
> wide-ranging modifications, requiring yet another ABI break and due notices.
> 
> 2. Short term goals, i.e. what can be achieved for this release.
> 
> 
> My suggestions follow...
> 
> 1. Regarding long term goals:
> 
> I have argued that simple forwarding of non-segmented packets using only the
> first mbuf cache line can be achieved by making three
> modifications:
> 
> a) Move m->tx_offload to the first cache line.
Not all PMDs use this field on Tx. The HW might support the checksum offloads
directly, without requiring these fields at all.


> b) Use an 8 bit pktmbuf mempool index in the first cache line,
>    instead of the 64 bit m->pool pointer in the second cache line.
256 mempools look sufficient to me. Regarding the indirect access to the pool
(via some table) - it might introduce some performance impact. For example,
the mlx5 PMD strongly relies on the pool field for allocating mbufs in the Rx
datapath. We're going to update that (oh, we found a point to optimize), but
for now it does.
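
For clarity, a minimal sketch of what such an indirection could look like
(the table size, table and helper names below are purely illustrative, not an
existing DPDK API):

    #include <stdint.h>
    #include <rte_mempool.h>

    /* Hypothetical compile-time limit, analogous to RTE_MAX_ETHPORTS. */
    #define MBUF_POOL_TABLE_SIZE 256

    /* Hypothetical global table mapping an 8-bit index to a mempool. */
    extern struct rte_mempool *mbuf_pool_table[MBUF_POOL_TABLE_SIZE];

    /* Resolve the pool from an 8-bit index kept in the first cache line,
     * instead of dereferencing the 64-bit m->pool pointer in the second. */
    static inline struct rte_mempool *
    mbuf_pool_from_index(uint8_t pool_idx)
    {
        return mbuf_pool_table[pool_idx];
    }

The extra load through that table is exactly the impact I am concerned about;
if the table stays resident in L1 it should be cheap, but it has to be
measured.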

> c) Do not access m->next when we know that it is NULL.
>    We can use m->nb_segs == 1 or some other invariant as the gate.
>    It can be implemented by adding an m->next accessor function:
>    struct rte_mbuf * rte_mbuf_next(struct rte_mbuf * m)
>    {
>        return m->nb_segs == 1 ? NULL : m->next;
>    }

Sorry, I'm not sure about this. IIRC, nb_segs is valid in the first
segment/mbuf only. If we have 4 segments in the packet, we see nb_segs=4 in
the first one and nb_segs=1 in the others. The next field is NULL in the last
mbuf only. Am I wrong or missing something?
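
Just to illustrate what I mean (a sketch of the invariants as I understand
them, not code taken from any PMD):

    #include <rte_mbuf.h>

    /* 4-segment packet, as I understand the current layout:
     *
     *   seg0: nb_segs = 4, next = seg1   <- head; nb_segs is valid here only
     *   seg1: nb_segs = 1, next = seg2
     *   seg2: nb_segs = 1, next = seg3
     *   seg3: nb_segs = 1, next = NULL   <- only the tail has next == NULL
     *
     * If that is right, the proposed rte_mbuf_next() would only be safe on
     * the head segment; called on a middle segment it would return NULL too
     * early, because nb_segs == 1 there while next is not NULL.
     */
    static inline void
    walk_chain(struct rte_mbuf *head)
    {
        struct rte_mbuf *seg = head;
        uint16_t n = head->nb_segs;    /* trust the head only */

        while (n-- > 0) {
            /* ... process seg ... */
            seg = seg->next;           /* becomes NULL only after the tail */
        }
    }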

> Regarding the priority of this goal, I guess that simple forwarding of non-
> segmented packets is probably the path taken by the majority of packets
> handled by DPDK.
> 
> An alternative goal could be:
> Do not touch the second cache line during RX.
> A comment in the mbuf structure says so, but it is not true anymore.
> 
> (I guess that regression testing didn't catch this because the tests perform 
> TX
> immediately after RX, so the cache miss just moves from the TX to the RX part
> of the test application.)
> 
> 
> 2. Regarding short term goals:
> 
> The current DPDK source code looks to me like m->next is the most frequently
> accessed field in the second cache line, so it makes sense moving this to the
> first cache line, rather than m->pool.
> Benchmarking may help here.

Moreover, for segmented packets the packet size is supposed to be large, which
implies a relatively low packet rate, so the optimization of moving next to
the 1st cache line might be negligible. Just compare 148 Mpps of 64B packets
vs 4 Mpps of 3000B packets over a 100Gbps link. We are currently benchmarking
and have not yet succeeded in finding a difference. The benefit can't be
expressed as an Mpps delta; we would have to measure CPU clocks, but the Rx
queue is almost always empty - we have empty polling loops. So even if there
is a boost, it is extremely hard to catch.
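
For reference, the numbers above are just line-rate arithmetic (a
back-of-the-envelope sketch; the 20 extra bytes per frame are preamble + SFD +
inter-frame gap):

    #include <stdio.h>

    /* Theoretical packet rate on a given link for a given frame size. */
    static double
    mpps(double link_gbps, unsigned int frame_bytes)
    {
        /* 20B per-frame overhead: preamble (7) + SFD (1) + IFG (12). */
        return link_gbps * 1e9 / ((frame_bytes + 20) * 8) / 1e6;
    }

    int
    main(void)
    {
        printf("64B   @ 100G: %.1f Mpps\n", mpps(100.0, 64));   /* ~148.8 */
        printf("3000B @ 100G: %.1f Mpps\n", mpps(100.0, 3000)); /* ~4.1   */
        return 0;
    }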

With best regards, Slava

>
> 
> If we - without breaking the ABI - can introduce a gate to avoid accessing m-
> >next when we know that it is NULL, we should keep it in the second cache
> line.
> 
> In this case, I would prefer to move m->tx_offload to the first cache line,
> thereby providing a field available for application use, until the application
> prepares the packet for transmission.
> 
> 
> >
> >
> > 01/11/2020 21:59, Morten Brørup:
> > > > From: Thomas Monjalon [mailto:tho...@monjalon.net]
> > > > Sent: Sunday, November 1, 2020 5:38 PM
> > > >
> > > > 01/11/2020 10:12, Morten Brørup:
> > > > > One thing has always puzzled me:
> > > > > Why do we use 64 bits to indicate which memory pool an mbuf
> > > > > belongs to?
> > > > > The portid only uses 16 bits and an indirection index.
> > > > > Why don't we use the same kind of indirection index for mbuf
> > pools?
> > > >
> > > > I wonder what would be the cost of indirection. Probably
> > negligible.
> > >
> > > Probably. The portid does it, and that indirection is heavily used
> > everywhere.
> > >
> > > The size of mbuf memory pool indirection array should be compile
> > > time
> > configurable, like the size of the portid indirection array.
> > >
> > > And for reference, the indirection array will fit into one cache
> > > line
> > if we default to 8 mbuf pools, thus supporting an 8 CPU socket system
> > with one mbuf pool per CPU socket, or a 4 CPU socket system with two
> > mbuf pools per CPU socket.
> > >
> > > (And as a side note: Our application is optimized for single-socket
> > systems, and we only use one mbuf pool. I guess many applications were
> > developed without carefully optimizing for multi-socket systems, and
> > also just use one mbuf pool. In these cases, the mbuf structure
> > doesn't really need a pool field. But it is still there, and the DPDK
> > libraries use it, so we didn't bother removing it.)
> > >
> > > > I think it is a good proposal...
> > > > ... for next year, after a deprecation notice.
> > > >
> > > > > I can easily imagine using one mbuf pool (or perhaps a few
> > > > > pools) per CPU socket (or per physical memory bus closest to an
> > > > > attached
> > NIC),
> > > > > but not more than 256 mbuf memory pools in total.
> > > > > So, let's introduce an mbufpoolid like the portid, and cut this
> > > > > mbuf field down from 64 to 8 bits.
> >
> > We will need to measure the perf of the solution.
> > There is a chance for the cost to be too much high.
> >
> >
> > > > > If we also cut down m->pkt_len from 32 to 24 bits,
> > > >
> > > > Who is using packets larger than 64k? Are 16 bits enough?
> > >
> > > I personally consider 64k a reasonable packet size limit. Exotic
> > applications with even larger packets would have to live with this
> > constraint. But let's see if there are any objections. For reference,
> > 64k corresponds to ca. 44 Ethernet (1500 byte) packets.
> > >
> > > (The limit could be 65535 bytes, to avoid translation of the value 0
> > into 65536 bytes.)
> > >
> > > This modification would go nicely hand in hand with the mbuf pool
> > indirection modification.
> > >
> > > ... after yet another round of ABI stability discussions,
> > deprecation notices, and so on. :-)
> >
> > After more thoughts, I'm afraid 64k is too small in some cases.
> > And 24-bit manipulation would probably break performance.
> > I'm afraid we are stuck with 32-bit length.
> 
> Yes, 24 bit manipulation would probably break performance.
> 
> Perhaps a solution exists with 16 bits (least significant bits) for the common
> cases, and 8 bits more (most significant bits) for the less common cases. Just
> thinking out loud here...
> 
> >
> > > > > we can get the 8 bit mbuf pool index into the first cache line
> > > > > at no additional cost.
> > > >
> > > > I like the idea.
> > > > It means we don't need to move the pool pointer now, i.e. it does
> > > > not have to replace the timestamp field.
> > >
> > > Agreed! Don't move m->pool to the first cache line; it is not used
> > for RX.
> > >
> > > >
> > > > > In other words: This would free up another 64 bit field in the
> > mbuf
> > > > structure!
> > > >
> > > > That would be great!
> > > >
> > > >
> > > > > And even though the m->next pointer for scattered packets
> > > > > resides in the second cache line, the libraries and application
> > > > > knows that m->next is NULL when m->nb_segs is 1.
> > > > > This proves that my suggestion would make touching the second
> > > > > cache line unnecessary (in simple cases), even for
> > > > > re-initializing the mbuf.
> > > >
> > > > So you think the "next" pointer should stay in the second half of
> > mbuf?
> > > >
> > > > I feel you would like to move the Tx offloads in the first half to
> > > > improve performance of very simple apps.
> > >
> > > "Very simple apps" sounds like a minority of apps. I would rather
> > > say
> > "very simple packet handling scenarios", e.g. forwarding of normal
> > size non-segmented packets. I would guess that the vast majority of
> > packets handled by DPDK applications actually match this scenario. So
> > I'm proposing to optimize for what I think is the most common scenario.
> > >
> > > If segmented packets are common, then m->next could be moved to the
> > first cache line. But it will only improve the pure RX steps of the
> > pipeline. When preparing the packet for TX, m->tx_offloads will need
> > to be set, and the second cache line comes into play. So I'm wondering
> > how big the benefit of having m->next in the first cache line really
> > is - assuming that m->nb_segs will be checked before accessing m->next.
> > >
> > > > I am thinking the opposite: we could have some dynamic fields
> > > > space in the first half to improve performance of complex Rx.
> > > > Note: we can add a flag hint for field registration in this first
> > half.
> > > >
> > >
> > > I have had the same thoughts. However, I would prefer being able to
> > forward ordinary packets without using the second mbuf cache line at
> > all (although only in specific scenarios like my example above).
> > >
> > > Furthermore, the application can abuse the 64 bit m->tx_offload
> > > field
> > for private purposes until it is time to prepare the packet for TX and
> > pass it on to the driver. This hack somewhat resembles a dynamic field
> > in the first cache line, and will not be possible if the m->pool or m-
> > >next field is moved there.
> >
> >
> >
