On Tue, Apr 11, 2023 at 08:48:45AM +0200, Morten Brørup wrote: > When getting objects from the mempool, the number of objects to get is > often constant at build time. > > This patch adds another code path for this case, so the compiler can > optimize more, e.g. unroll the copy loop when the entire request is > satisfied from the cache. > > On an Intel(R) Xeon(R) E5-2620 v4 CPU, and compiled with gcc 9.4.0, > mempool_perf_test with constant n shows an increase in rate_persec by an > average of 17 %, minimum 9.5 %, maximum 24 %. > > The code path where the number of objects to get is unknown at build time > remains essentially unchanged. > > Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
Change looks a good idea. Some suggestions inline below, which you may want to take on board for any future version. I'd strongly suggest adding some extra clarifying code comments, as I suggest below. With those exta code comments: Acked-by: Bruce Richardson <bruce.richard...@intel.com> > --- > lib/mempool/rte_mempool.h | 24 +++++++++++++++++++++--- > 1 file changed, 21 insertions(+), 3 deletions(-) > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h > index 9f530db24b..ade0100ec7 100644 > --- a/lib/mempool/rte_mempool.h > +++ b/lib/mempool/rte_mempool.h > @@ -1500,15 +1500,33 @@ rte_mempool_do_generic_get(struct rte_mempool *mp, > void **obj_table, > if (unlikely(cache == NULL)) > goto driver_dequeue; > > - /* Use the cache as much as we have to return hot objects first */ > - len = RTE_MIN(remaining, cache->len); > cache_objs = &cache->objs[cache->len]; > + > + if (__extension__(__builtin_constant_p(n)) && n <= cache->len) { > + /* > + * The request size is known at build time, and > + * the entire request can be satisfied from the cache, > + * so let the compiler unroll the fixed length copy loop. > + */ > + cache->len -= n; > + for (index = 0; index < n; index++) > + *obj_table++ = *--cache_objs; > + This loop looks a little awkward to me. Would it be clearer (and perhaps easier for compilers to unroll efficiently if it was rewritten as: cache->len -= n; cache_objs = &cache->objs[cache->len]; for (index = 0; index < n; index++) obj_table[index] = cache_objs[index]; alternatively those last two lines can be replaced by a memcpy, which the compiler should nicely optimize itself, for constant size copy: memcpy(obj_table, cache_objs, sizeof(obj_table[0]) * n); > + RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1); > + RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_objs, n); > + > + return 0; > + } > + > + /* Use the cache as much as we have to return hot objects first */ > + len = __extension__(__builtin_constant_p(n)) ? cache->len : > + RTE_MIN(remaining, cache->len); This line confused me a lot, until I applied the patch and stared at it a lot :-). Why would the length value depend upon whether or not n is a compile-time constant? I think it would really help here to add in a comment stating that when n is a compile-time constant, at this point it much be "> cache->len" since the previous block was untaken, therefore we don't need to call RTE_MIN. > cache->len -= len; > remaining -= len; > for (index = 0; index < len; index++) > *obj_table++ = *--cache_objs; > > - if (remaining == 0) { > + if (!__extension__(__builtin_constant_p(n)) && remaining == 0) { > /* The entire request is satisfied from the cache. */ > > RTE_MEMPOOL_CACHE_STAT_ADD(cache, get_success_bulk, 1); Once I'd worked out the logic for the above conditional check, then this conditional adjustment was clear. I just wonder if a further comment might help here. I am also wondering if having larger blocks for the constant and non-constant cases may help. It would lead to some duplication but may clarify the code. > -- > 2.17.1 >