Hi Jerin,

On 06/02/2016 11:39 AM, Jerin Jacob wrote:
> On Thu, Jun 02, 2016 at 09:36:34AM +0200, Olivier MATZ wrote:
>> I think the LIFO behavior should occur on a per-bulk basis. I mean,
>> it should behave like in the examples below:
>>
>> // pool cache is in state X
>> obj1 = mempool_get(mp)
>> obj2 = mempool_get(mp)
>> mempool_put(mp, obj2)
>> mempool_put(mp, obj1)
>> // pool cache is back in state X
>>
>> // pool cache is in state X
>> bulk1 = mempool_get_bulk(mp, 16)
>> bulk2 = mempool_get_bulk(mp, 16)
>> mempool_put_bulk(mp, bulk2, 16)
>> mempool_put_bulk(mp, bulk1, 16)
>> // pool cache is back in state X
>>
>
> Per-entry LIFO behavior makes more sense in the _bulk_ case, as a
> recently enqueued buffer comes out on the next "get", which makes it
> more likely that the buffer is still in the last-level cache.
Yes, from a memory cache perspective, I think you are right.

In practice, I'm not sure it is that important: with many hardware drivers,
a packet buffer returns to the pool only after a full round of the TX ring,
so I'd say it won't make a big difference here.

>> Note that today it's not the case for bulks, since object addresses
>> are reversed only in get(), we are not back in the original state.
>> I don't really see the advantage of this.
>>
>> Removing the reversing may accelerate the cache in case of bulk get,
>> I think.
>
> I tried in my setup, it was dropping the performance. Have you got
> improvement in any setup?

I know that the mempool_perf autotest is not representative of a real use
case, but it gives a trend.

I did a quick test with:
- the legacy code,
- rte_memcpy in put(),
- rte_memcpy in both put() and get() (no reverse).

It seems that removing the reversing brings a ~50% enhancement with bulks
of 32 (on a Westmere):

legacy:
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=839922483
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=849792204
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=1617022156
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=1675087052
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=3202914713
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=3268725963

rte_memcpy in put() (your patch proposal):
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=762157465
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=744593817
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=1500276326
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=1461347942
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=2974076107
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=2928122264

rte_memcpy in put() and get():
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=974834892
mempool_autotest cache=512 cores=1 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=1129329459
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=2147798220
mempool_autotest cache=512 cores=2 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=2232457625
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=32 rate_persec=4510816664
mempool_autotest cache=512 cores=4 n_get_bulk=32 n_put_bulk=32 n_keep=128 rate_persec=4582421298

This is probably more a measure of the pure CPU cost of the mempool
functions, without taking the memory cache aspect into account. So, of
course, a real use-case test should be done to confirm whether it really
improves performance. I'll manage to run such a test and let you know the
result.

By the way, not all drivers allocate or free mbufs in bulk, so this
modification would only affect the drivers that do. Which driver are you
using for your test?

Regards,
Olivier
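
PS: to make the comparison concrete, here is a simplified, standalone sketch
of the two get() strategies discussed above. It is not the actual rte_mempool
code nor the exact patch: the fake_cache structure and its size are made up
for illustration, and plain memcpy() stands in for rte_memcpy() so the
snippet builds without DPDK.

#include <stdio.h>
#include <string.h>

#define FAKE_CACHE_SIZE 512

/* toy model of a per-lcore mempool cache: a stack of object pointers */
struct fake_cache {
	unsigned int len;              /* number of objects currently cached */
	void *objs[FAKE_CACHE_SIZE];   /* cached object pointers */
};

/* legacy behavior: copy the last n objects one by one, in reverse order,
 * so the most recently put object is returned first (per-entry LIFO) */
static void
get_bulk_reversed(struct fake_cache *c, void **obj_table, unsigned int n)
{
	unsigned int i, idx;

	for (i = 0, idx = c->len - 1; i < n; i++, idx--)
		obj_table[i] = c->objs[idx];
	c->len -= n;
}

/* "no reverse" behavior: one straight copy of the last n objects
 * (rte_memcpy() in the real code, plain memcpy() here) */
static void
get_bulk_no_reverse(struct fake_cache *c, void **obj_table, unsigned int n)
{
	memcpy(obj_table, &c->objs[c->len - n], n * sizeof(void *));
	c->len -= n;
}

int main(void)
{
	struct fake_cache c = { .len = 4,
		.objs = { (void *)1, (void *)2, (void *)3, (void *)4 } };
	void *out[2];

	get_bulk_reversed(&c, out, 2);      /* returns {4, 3}: newest first */
	printf("reversed:   %p %p\n", out[0], out[1]);

	c.len = 4;                          /* reset the toy cache */
	get_bulk_no_reverse(&c, out, 2);    /* returns {3, 4}: order kept */
	printf("no reverse: %p %p\n", out[0], out[1]);

	return 0;
}

The second version replaces the per-object loop with a single copy, which is
presumably where the CPU-cost gain in the numbers above comes from, at the
price of no longer returning the most recently freed object first.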