> From: Konstantin Ananyev [mailto:konstantin.anan...@huawei.com] > Sent: Friday, 23 December 2022 17.58 > > > > From: Konstantin Ananyev [mailto:konstantin.anan...@huawei.com] > > > Sent: Thursday, 22 December 2022 16.57 > > > > > > > Zero-copy access to mempool caches is beneficial for PMD > performance, > > > and > > > > must be provided by the mempool library to fix [Bug 1052] without > a > > > > performance regression. > > > > > > LGTM in general, thank you for working on it. > > > Few comments below.
[...] > > > RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE); > > > I think it is too excessive. > > > Just: > > > if (n <= RTE_MEMPOOL_CACHE_MAX_SIZE) return NULL; > > > seems much more convenient for the users here and > > > more close to other mempool/ring API behavior. > > > In terms of performance - I don’t think one extra comparison here > > > would really count. > > > > The insignificant performance degradation seems like a good tradeoff > for making the function more generic. > > I will update the function documentation and place the run-time check > here, so both trace and stats reflect what happened: > > > > RTE_ASSERT(cache != NULL); > > RTE_ASSERT(mp != NULL); > > - RTE_ASSERT(n <= RTE_MEMPOOL_CACHE_MAX_SIZE); > > > > rte_mempool_trace_cache_zc_put_bulk(cache, mp, n); > > + > > + if (unlikely(n > RTE_MEMPOOL_CACHE_MAX_SIZE)) { > > + rte_errno = -ENOSPC; // Or EINVAL? > > + return NULL; > > + } > > > > /* Increment stats now, adding in mempool always succeeds. */ > > > > I will probably also be able to come up with solution for > zc_get_bulk(), so both trace and stats make sense if called with n > > > RTE_MEMPOOL_CACHE_MAX_SIZE. I have sent a new patch, where I switched to the same code flow as in the micro-optimization patch, so this run-time check doesn't affect the most common case. Also, I realized that I need to compare to the cache flush threshold instead of RTE_MEMPOOL_CACHE_MAX_SIZE, to respect the cache size. Otherwise, a zc_cache_get() operation could deplete a small mempool; and zc_cache_put() could leave the cache with too many objects, thus violating the invariant that cache->len <= cache->flushthreshold. > > > > > > > > I also think would be really good to add: > > > add zc_(get|put)_bulk_start(), zc_(get|put)_bulk_finish(). > > > Where _start would check/fill the cache and return the pointer, > > > while _finsih would updathe cache->len. > > > Similar to what we have for rte_ring _peek_ API. > > > That would allow to extend this API usage - let say inside PMDs > > > it could be used not only for MBUF_FAST_FREE case, but for generic > > > TX code path (one that have to call rte_mbuf_prefree()) also. > > > > I don't see a use case for zc_get_start()/_finish(). > > > > And since the mempool cache is a stack, it would *require* that the > application reads the array in reverse order. In such case, the > > function should not return a pointer to the array of objects, but a > pointer to the top of the stack. > > > > So I prefer to stick with the single-function zero-copy get, i.e. > without start/finish. > > Yes, it would be more complicated than just update cache->len. > I don't have any real use-case for _get_ too - mostly just for symmetry > with put. > > > > > > > I do agree with you about the use case for zc_put_start()/_finish(). > > > > Unlike the ring, there is no need for locking with the mempool cache, > so we can implement something much simpler: > > > > Instead of requiring calling both zc_put_start() and _finish() for > every zero-copy burst, we could add a zc_put_rewind() function, only > > to be called if some number of objects were not added anyway: > > > > /* FIXME: Function documentation here. */ > > __rte_experimental > > static __rte_always_inline void > > rte_mempool_cache_zc_put_rewind(struct rte_mempool_cache *cache, > > unsigned int n) > > { > > RTE_ASSERT(cache != NULL); > > RTE_ASSERT(n <= cache->len); > > > > rte_mempool_trace_cache_zc_put_rewind(cache, n); > > > > /* Rewind stats. */ > > RTE_MEMPOOL_CACHE_STAT_ADD(cache, put_objs, -n); > > > > cache->len -= n; > > } > > > > I have a strong preference for _rewind() over _start() and _finish(), > because in the full burst case, it only touches the > > rte_mempool_cache structure once, whereas splitting it up into > _start() and _finish() touches the rte_mempool_cache structure both > > before and after copying the array of objects. > > > > What do you think? > > And your concern is that between _get_start(_C_) and get_finish(_C_) > the _C_ > cache line can be bumped out of CPU Dcache, right? > I don't think such situation would be a common one. Yes, that is the essence of my concern. And I agree that it is probably uncommon. There might also be some performance benefits by having the load/store/modify of _C_ closely together; but I don't know enough about CPU internals to determine if significant or not. > But, if you think _rewind_ is a better approach - I am ok with it. Thank you. [...] > > > Would be great to add some test-cases in app/test to cover this new > > > API. > > > > Good point. I will look at it. > > > > BTW: Akshitha already has zc_put_bulk working in the i40e PMD. > > That's great news, but I suppose it would be good to have some UT for > it anyway. > Konstantin I don't have time for adding unit tests now, but sent an updated patch anyway, so the invariant bug doesn't bite Akshitha. Merry Christmas, everyone! -Morten