> From: Bruce Richardson [mailto:bruce.richard...@intel.com] > Sent: Tuesday, 18 April 2023 13.07 > > On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote: > > When getting objects from the mempool, the number of objects to get is > > often constant at build time. > > > > This patch adds another code path for this case, so the compiler can > > optimize more, e.g. unroll the copy loop when the entire request is > > satisfied from the cache. > > > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0, > > mempool_perf_test with constant n shows an increase in rate_persec by an > > average of 17 %, minimum 9.5 %, maximum 24 %. > > > > The code path where the number of objects to get is unknown at build time > > remains essentially unchanged. > > > > Signed-off-by: Morten Brørup <m...@smartsharesystems.com> > > Change looks a good idea. Some suggestions inline below, which you may want to > take on board for any future version. I'd strongly suggest adding some > extra clarifying code comments, as I suggest below. > With those exta code comments: > > Acked-by: Bruce Richardson <bruce.richard...@intel.com> > > > --- > > lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++--- > > 1 file changed, 21 insertions(+), 3 deletions(-) > > > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h > > index 9f530db24b..ade0100ec7 100644 > > --- a/lib/mempool/rte_mempool.h > > +++ b/lib/mempool/rte_mempool.h > > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, > void **obj_table, > > if (unlikely(cache == NULL)) > > goto driver_dequeue; > > > > - /* Use the cache as much as we have to return hot objects first */ > > - len = RTE_MIN(remaining, cache->len); > > cache_objs = &cache->objs[cache->len]; > > + > > + if (__extension__(__builtin_constant_p(n)) && n <= cache->len) { > > + /* > > + * The request size is known at build time, and > > + * the entire request can be satisfied from the cache, > > + * so let the compiler unroll the fixed length copy loop. > > + */ > > + cache->len -= n; > > + for (index = 0; index < n; index++) > > + *obj_table++ = *--cache_objs; > > + > > This loop looks a little awkward to me. Would it be clearer (and perhaps > easier for compilers to unroll efficiently if it was rewritten as: > > cache->len -= n; > cache_objs = &cache->objs[cache->len]; > for (index = 0; index < n; index++) > obj_table[index] = cache_objs[index];
The mempool cache is a stack, so the copy loop needs get the objects in decrementing order. I.e. the source index decrements and the destination index increments. Regardless, your point here is still valid! I expected that any unrolling capable compiler can unroll *dst++ = *--src; but I can experiment with different compilers on godbolt.org to see if dst[index] = src[-index] is better. A future version could be hand coded with vector instructions here, like in the Ethdev drivers. > > alternatively those last two lines can be replaced by a memcpy, which the > compiler should nicely optimize itself, for constant size copy: > > memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n); Just for reference: It was previously discussed to ignore the stack ordering and simply copy the objects, but the idea was rejected due to the potential performance impact of not returning the hottest objects, i.e. the objects at the top of the stack, as the first in the returned array. > > > + RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1); > > + RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n); > > + > > + return 0; > > + } > > + > > + /* Use the cache as much as we have to return hot objects first */ > > + len = __extension__(__builtin_constant_p(n)) ? cache->len : > > + RTE_MIN(remaining, cache->len); > > This line confused me a lot, until I applied the patch and stared at it a > lot :-). Why would the length value depend upon whether or not n is a > compile-time constant? > I think it would really help here to add in a comment stating that when n > is a compile-time constant, at this point it much be "> cache->len" since > the previous block was untaken, therefore we don't need to call RTE_MIN. I agree. This is almost like perl... write-only source code. I will add a few explanatory comments. > > > cache->len -= len; > > remaining -= len; > > for (index = 0; index < len; index++) > > *obj_table++ = *--cache_objs; > > > > - if (remaining == 0) { > > + if (!__extension__(__builtin_constant_p(n)) && remaining == 0) { > > /* The entire request is satisfied from the cache. */ > > > > RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1); > > Once I'd worked out the logic for the above conditional check, then this > conditional adjustment was clear. I just wonder if a further comment might > help here. > > I am also wondering if having larger blocks for the constant and > non-constant cases may help. It would lead to some duplication but may > clarify the code. I get your point, but the code is still short enough to grasp (after comments have been added). So I prefer to avoid code duplication. > > > -- > > 2.17.1 > >