Hi,

On Mon, Jul 01, 2019 at 02:21:41PM +0000, Wang, Xiao W wrote:
> Hi,
>
> > -----Original Message-----
> > From: Olivier Matz [mailto:olivier.m...@6wind.com]
> > Sent: Monday, July 1, 2019 9:11 PM
> > To: Andrew Rybchenko <arybche...@solarflare.com>
> > Cc: Wang, Xiao W <xiao.w.w...@intel.com>; dev@dpdk.org
> > Subject: Re: [PATCH] mempool: optimize copy in cache get
> >
> > Hi,
> >
> > On Tue, May 21, 2019 at 12:34:55PM +0300, Andrew Rybchenko wrote:
> > > On 5/21/19 12:03 PM, Xiao Wang wrote:
> > > > Use rte_memcpy to improve the pointer array copy. This optimization method
> > > > has already been applied to __mempool_generic_put() [1], this patch applies
> > > > it to __mempool_generic_get(). Slight performance gain can be observed in
> > > > testpmd txonly test.
> > > >
> > > > [1] 863bfb47449 ("mempool: optimize copy in cache")
> > > >
> > > > Signed-off-by: Xiao Wang <xiao.w.w...@intel.com>
> > > > ---
> > > >  lib/librte_mempool/rte_mempool.h | 7 +------
> > > >  1 file changed, 1 insertion(+), 6 deletions(-)
> > > >
> > > > diff --git a/lib/librte_mempool/rte_mempool.h b/lib/librte_mempool/rte_mempool.h
> > > > index 8053f7a04..975da8d22 100644
> > > > --- a/lib/librte_mempool/rte_mempool.h
> > > > +++ b/lib/librte_mempool/rte_mempool.h
> > > > @@ -1344,15 +1344,11 @@ __mempool_generic_get(struct rte_mempool *mp, void **obj_table,
> > > >  	unsigned int n, struct rte_mempool_cache *cache)
> > > >  {
> > > >  	int ret;
> > > > -	uint32_t index, len;
> > > > -	void **cache_objs;
> > > >
> > > >  	/* No cache provided or cannot be satisfied from cache */
> > > >  	if (unlikely(cache == NULL || n >= cache->size))
> > > >  		goto ring_dequeue;
> > > >
> > > > -	cache_objs = cache->objs;
> > > > -
> > > >  	/* Can this be satisfied from the cache? */
> > > >  	if (cache->len < n) {
> > > >  		/* No. Backfill the cache first, and then fill from it */
> > > > @@ -1375,8 +1371,7 @@ __mempool_generic_get(struct rte_mempool *mp, void **obj_table,
> > > >  	}
> > > >
> > > >  	/* Now fill in the response ... */
> > > > -	for (index = 0, len = cache->len - 1; index < n; ++index, len--, obj_table++)
> > > > -		*obj_table = cache_objs[len];
> > > > +	rte_memcpy(obj_table, &cache->objs[cache->len - n], sizeof(void *) * n);
> > > >
> > > >  	cache->len -= n;
> > >
> > > I think the idea of the loop above is to get objects in reverse order in order
> > > to reuse cache top objects (put last) first. It should improve cache hit etc.
> > > So, performance effect of the patch could be very different on various CPUs
> > > (with different cache sizes) and various work-loads.
> > >
> > > So, I doubt that it is a step in right direction.
> >
> > For reference, this was already discussed 3 years ago:
> >
> > https://mails.dpdk.org/archives/dev/2016-May/039873.html
> > https://mails.dpdk.org/archives/dev/2016-June/040029.html
> >
> > I'm still not convinced that reversing object addresses (as it's done
> > today) is really important. But Andrew is probably right, the impact of
> > this kind of patch probably varies depending on many factors. More
> > performance numbers on real-life use-cases would help to decide what to
> > do.
> >
> > Regards,
> > Olivier
>
> I agree, and thanks for the reference link. So theoretically neither way can be
> a definite best choice, it depends on various real-life factors. I'm thinking about
> how to let app developers be aware of this so that they themselves could make
> the choice. Or is it not worth doing due to the small perf gain?
I don't think it's worth doing a dynamic selection for this. On the other
hand, having performance numbers for different use-cases/archs is welcome.
From a pure CPU cycles perspective, the rte_memcpy() should be faster.

Regards,
Olivier
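For readers following along, below is a minimal, self-contained sketch of the
two copy strategies under discussion. It is not the actual rte_mempool code:
plain memcpy() stands in for rte_memcpy(), the cache is a plain array of fake
pointers, and the helper names get_reverse() and get_flat() are invented for
this example. The only point is to show which cache slots each approach hands
back to the caller.

/*
 * Illustrative sketch only, not DPDK code: contrast the pre-patch
 * reverse-order per-pointer copy with the proposed flat forward copy
 * of the same cache slice.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CACHE_LEN 8

/* Reverse-order copy: hands out the most recently returned ("hot")
 * objects first, like the loop removed by the patch. */
static void
get_reverse(void **obj_table, void **cache_objs,
	    unsigned int cache_len, unsigned int n)
{
	unsigned int index, len;

	for (index = 0, len = cache_len - 1; index < n; index++, len--)
		obj_table[index] = cache_objs[len];
}

/* Flat forward copy of the top n slots, like the proposed rte_memcpy()
 * call; objects come out in cache order, not most-recent-first. */
static void
get_flat(void **obj_table, void **cache_objs,
	 unsigned int cache_len, unsigned int n)
{
	memcpy(obj_table, &cache_objs[cache_len - n], sizeof(void *) * n);
}

int
main(void)
{
	void *cache[CACHE_LEN];
	void *out_rev[4], *out_flat[4];
	unsigned int i;

	/* Fake "object" addresses 0..7; slot 7 is the most recently put. */
	for (i = 0; i < CACHE_LEN; i++)
		cache[i] = (void *)(uintptr_t)i;

	get_reverse(out_rev, cache, CACHE_LEN, 4);
	get_flat(out_flat, cache, CACHE_LEN, 4);

	for (i = 0; i < 4; i++)
		printf("slot %u: reverse=%p flat=%p\n",
		       i, out_rev[i], out_flat[i]);

	return 0;
}

With an 8-entry cache and n = 4, the reverse loop returns slots 7, 6, 5, 4
(most recently put first), while the flat copy returns slots 4, 5, 6, 7 in
place order; that ordering difference is exactly what Andrew's cache-hit
argument is about.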