On Tue, Jan 18, 2022 at 09:25:22AM +0100, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> > Sent: Monday, 17 January 2022 18.35
> > 
> > On Fri, Jan 14, 2022 at 05:36:50PM +0100, Morten Brørup wrote:
> > > A flush threshold for the mempool cache was introduced in DPDK
> > version
> > > 1.3, but rte_mempool_do_generic_get() was not completely updated back
> > > then, and some inefficiencies were introduced.
> > >
> > > This patch fixes the following in rte_mempool_do_generic_get():
> > >
> > > 1. The code that initially screens the cache request was not updated
> > > with the change in DPDK version 1.3.
> > > The initial screening compared the request length to the cache size,
> > > which was correct before, but became irrelevant with the introduction
> > of
> > > the flush threshold. E.g. the cache can hold up to flushthresh
> > objects,
> > > which is more than its size, so some requests were not served from
> > the
> > > cache, even though they could be.
> > > The initial screening has now been corrected to match the initial
> > > screening in rte_mempool_do_generic_put(), which verifies that a
> > cache
> > > is present, and that the length of the request does not overflow the
> > > memory allocated for the cache.
> > >
> > > 2. The function is a helper for rte_mempool_generic_get(), so it must
> > > behave according to the description of that function.
> > > Specifically, objects must first be returned from the cache,
> > > subsequently from the ring.
> > > After the change in DPDK version 1.3, this was not the behavior when
> > > the request was partially satisfied from the cache; instead, the
> > objects
> > > from the ring were returned ahead of the objects from the cache. This
> > is
> > > bad for CPUs with a small L1 cache, which benefit from having the hot
> > > objects first in the returned array. (This is also the reason why
> > > the function returns the objects in reverse order.)
> > > Now, all code paths first return objects from the cache, subsequently
> > > from the ring.
> > >
> > > 3. If the cache could not be backfilled, the function would attempt
> > > to get all the requested objects from the ring (instead of only the
> > > number of requested objects minus the objects available in the ring),
> > > and the function would fail if that failed.
> > > Now, the first part of the request is always satisfied from the
> > cache,
> > > and if the subsequent backfilling of the cache from the ring fails,
> > only
> > > the remaining requested objects are retrieved from the ring.
> > >
> > > 4. The code flow for satisfying the request from the cache was
> > slightly
> > > inefficient:
> > > The likely code path where the objects are simply served from the
> > cache
> > > was treated as unlikely. Now it is treated as likely.
> > > And in the code path where the cache was backfilled first, numbers
> > were
> > > added and subtracted from the cache length; now this code path simply
> > > sets the cache length to its final value.
> > >
> > > 5. Some comments were not correct anymore.
> > > The comments have been updated.
> > > Most importanly, the description of the succesful return value was
> > > inaccurate. Success only returns 0, not >= 0.
> > >
> > > Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
> > > ---
> > 
> > I am a little uncertain about the reversing of the copies taking things
> > out
> > of the mempool - for machines where we are not that cache constrainted
> > will
> > we lose out in possible optimizations where the compiler optimizes the
> > copy
> > loop as a memcpy?
> 
> The objects are also returned in reverse order in the code it replaces, so 
> this behavior is not introduced by this patch; I only describe the reason for 
> it.
> 
> I floated a previous patch, in which the objects were returned in order, but 
> Jerin argued [1] that we should keep it the way it was, unless I could show a 
> performance improvement.
> 
> So I retracted that patch to split it up in two independent patches instead. 
> This patch for get(), and [3] for put().
> 
> While experimenting using rte_memcpy() for these, I couldn't achieve a 
> performance boost - quite the opposite. So I gave up on it.
> 
> Reviewing the x86 variant of rte_memcpy() [2] makes me think that it is 
> inefficient for copying small bulks of pointers, especially when n is unknown 
> at compile time, and its code path goes through a great deal of branches.
>
Thanks for all the explanation.

Reviewed-by: Bruce Richardson <bruce.richard...@intel.com> 

Reply via email to