RE: [PATCH] mempool: optimize incomplete cache handling

Morten Brørup Mon, 10 Jan 2022 02:55:47 -0800

+Bruce; you seemed interested in my work in this area.

> From: Jerin Jacob [mailto:[email protected]]
> Sent: Monday, 10 January 2022 08.27
> 
> On Fri, Jan 7, 2022 at 2:16 PM Morten Brørup <[email protected]>
> wrote:
> >
> > > From: Jerin Jacob [mailto:[email protected]]
> > > Sent: Thursday, 6 January 2022 17.55
> > >
> > > On Thu, Jan 6, 2022 at 5:54 PM Morten Brørup
> <[email protected]>
> > > wrote:
> > > >
> > > > A flush threshold for the mempool cache was introduced in DPDK
> > > version
> > > > 1.3, but rte_mempool_do_generic_get() was not completely updated
> back
> > > > then.
> > > >
> > > > The incompleteness did not cause any functional bugs, so this
> patch
> > > > could be considered refactoring for the purpose of cleaning up.
> > > >
> > > > This patch completes the update of rte_mempool_do_generic_get()
> as
> > > > follows:
> > > >
> > > > 1. A few comments were malplaced or no longer correct.
> > > > Some comments have been updated/added/corrected.
> > > >
> > > > 2. The code that initially screens the cache request was not
> updated.
> > > > The initial screening compared the request length to the cache
> size,
> > > > which was correct before, but became irrelevant with the
> introduction
> > > of
> > > > the flush threshold. E.g. the cache can hold up to flushthresh
> > > objects,
> > > > which is more than its size, so some requests were not served
> from
> > > the
> > > > cache, even though they could be.
> > > > The initial screening has now been corrected to match the initial
> > > > screening in rte_mempool_do_generic_put(), which verifies that a
> > > cache
> > > > is present, and that the length of the request does not overflow
> the
> > > > memory allocated for the cache.
> > > >
> > > > 3. The code flow for satisfying the request from the cache was
> weird.
> > > > The likely code path where the objects are simply served from the
> > > cache
> > > > was treated as unlikely; now it is treated as likely.
> > > > And in the code path where the cache was backfilled first,
> numbers
> > > were
> > > > added and subtracted from the cache length; now this code path
> simply
> > > > sets the cache length to its final value.
> > > >
> > > > 4. The objects were returned in reverse order.
> > > > Returning the objects in reverse order is not necessary, so
> > > rte_memcpy()
> > > > is now used instead.
> > >
> > > Have you checked the performance with network workload?
> > > IMO, reverse order makes sense(LIFO vs FIFO).
> > > The LIFO makes the cache warm as the same buffers are reused
> > > frequently.
> >
> > I have not done any performance testing. We probably agree that the
> only major difference lies in how the objects are returned. And we
> probably also agree that rte_memcpy() is faster than the copy loop it
> replaced, especially when n is constant at compile time. So the
> performance difference mainly depends on the application, which I will
> discuss below.
> >
> > Let's first consider LIFO vs. FIFO.
> >
> > The key argument for the rte_memcpy() optimization is that we are
> still getting the burst of objects from the top of the stack (LIFO);
> only the order of the objects inside the burst is not reverse anymore.
> >
> > Here is an example:
> >
> > The cache initially contains 8 objects: 01234567.
> >
> > 8 more objects are put into the cache: 89ABCDEF.
> >
> > The cache now holds: 0123456789ABCDEF.
> 
> Agree. However I think, it may matter with less sized L1 cache
> machines and burst size is more where it plays role what can be in L1
> with the scheme.


Good point! Thinking further about it made me realize that the mempool cache 
flushing algorithm is fundamentally flawed, at least in some cases...


rte_mempool_do_generic_put():

When putting objects into the cache, and the cache length exceeds the flush 
threshold, the most recent (hot) objects are flushed to the ring, thus leaving 
the less recent (colder) objects at the top of the cache stack.

Example (cache size: 8, flush threshold: 12, put 8 objects):

Initial cache: 01234567

Cache after putting (hot) objects 89ABCDEF: 0123456789ABCDEF

Cache flush threshold reached. Resulting cache: 01234567

Furthermore, the cache has to be completely depleted before the hot objects 
that were flushed to the ring are retrieved from the ring again.


rte_mempool_do_generic_get():

When getting objects from the cache, and the cache does not hold the requested 
number of objects, the cache will first be backfilled from the ring, thus 
putting colder objects at the top of the cache stack, and then the objects will 
be returned from the top of the cache stack, i.e. the backfilled (cold) objects 
will be returned first.

Example (cache size: 8, get 8 objects):

Initial cache: 0123 (hot or lukewarm)

Cache after backfill to size + requested objects: 0123456789ABCDEF

Returned objects: FEDCBA98 (cold)

Cache after returning objects: 012345678 (i.e. cold objects at the top)


> 
> I would suggest splitting each performance improvement as a separate
> patch for better tracking and quantity of the performance improvement.

With the new realizations above, I should reconsider my patch from scratch.

I have also been wondering if the mempool cache size really needs to be 
configurable, or it could be fixed size?

Bruce mentioned in another thread 
(http://inbox.dpdk.org/dev/ydv%[email protected]/T/#m02cabb25655c08a0980888df8c41aba9ac8dd6ff)
 that the typical configuration of the cache size is RTE_MEMPOOL_CACHE_MAX_SIZE.

I would dare to say that it probably suffices to configure if the mempool has a 
cache or not! The cache_size parameter of rte_mempool_create() is not respected 
1:1 anyway (because each per-lcore cache may consume up to 1.5 x cache_size 
objects from the mempool backing store), so the cache_size parameter could be 
parsed as non-zero vs. zero to determine if a cache is wanted or not.

> 
> I think, mempool performance test and tx only stream mode in testpmd
> can quantify patches.
> 
> 
> 
> >
> > Getting 4 objects from the cache gives us CDEF instead of FEDC, i.e.
> we are still getting the 4 objects most recently put into the cache.
> >
> > Furthermore, if the application is working with fixed size bursts, it
> will usually put and get the same size burst, i.e. put the burst
> 89ABCDEF into the cache, and then get the burst 89ABCDEF from the cache
> again.
> >
> >
> > Here is an example unfavorable scenario:
> >
> > The cache initially contains 4 objects, which have gone cold: 0123.
> >
> > 4 more objects, which happen to be hot, are put into the cache: 4567.
> >
> > Getting 8 objects from the cache gives us 01234567 instead of
> 76543210.
> >
> > Now, if the application only processes the first 4 of the 8 objects
> in the burst, it would have benefitted from those objects being the hot
> 7654 objects instead of the cold 0123 objects.
> >
> > However, I think that most applications process the complete burst,
> so I do consider this scenario unlikely.
> >
> > Similarly, a pipelined application doesn't process objects in reverse
> order at every other step in the pipeline, even though the previous
> step in the pipeline most recently touched the last object of the
> burst.
> >
> >
> > My overall conclusion was that the benefit of using rte_memcpy()
> outweighs the disadvantage of the unfavorable scenario, because I
> consider the probability of the unfavorable scenario occurring very
> low. But again: it mainly depends on the application.
> >
> > If anyone disagrees with the risk analysis described above, I will
> happily provide a version 2 of the patch, where the objects are still
> returned in reverse order. After all, the rte_memcpy() benefit is
> relatively small compared to the impact if the unlikely scenario
> occurs.
> >

RE: [PATCH] mempool: optimize incomplete cache handling

Reply via email to