On Thu, Oct 27, 2022 at 02:11:29PM +0200, Morten Brørup wrote:
> > From: Olivier Matz [mailto:olivier.m...@6wind.com]
> > Sent: Thursday, 27 October 2022 13.43
> >
> > On Thu, Oct 27, 2022 at 11:22:07AM +0200, Morten Brørup wrote:
> > > > From: Olivier Matz [mailto:olivier.m...@6wind.com]
> > > > Sent: Thursday, 27 October 2022 10.35
> > > >
> > > > Hi Morten,
> > > >
> > > > On Wed, Oct 26, 2022 at 04:44:36PM +0200, Morten Brørup wrote:
> > > > > Add __rte_cache_aligned to the objs array.
> > > > >
> > > > > It makes no difference in the general case, but if get/put
> > > > > operations are always 32 objects, it will reduce the number of
> > > > > memory (or last level cache) accesses from five to four 64 B
> > > > > cache lines for every get/put operation.
> > > > >
> > > > > For readability reasons, an example using 16 objects follows:
> > > > >
> > > > > Currently, with 16 objects (128B), we access 3 cache lines:
> > > > >
> > > > >         ┌────────┐
> > > > >         │len     │
> > > > > cache   │********│---
> > > > > line0   │********│ ^
> > > > >         │********│ |
> > > > >         ├────────┤ | 16 objects
> > > > >         │********│ | 128B
> > > > > cache   │********│ |
> > > > > line1   │********│ |
> > > > >         │********│ |
> > > > >         ├────────┤ |
> > > > >         │********│_v_
> > > > > cache   │        │
> > > > > line2   │        │
> > > > >         │        │
> > > > >         └────────┘
> > > > >
> > > > > With the alignment, it is also 3 cache lines:
> > > > >
> > > > >         ┌────────┐
> > > > >         │len     │
> > > > > cache   │        │
> > > > > line0   │        │
> > > > >         │        │
> > > > >         ├────────┤---
> > > > >         │********│ ^
> > > > > cache   │********│ |
> > > > > line1   │********│ |
> > > > >         │********│ |
> > > > >         ├────────┤ | 16 objects
> > > > >         │********│ | 128B
> > > > > cache   │********│ |
> > > > > line2   │********│ |
> > > > >         │********│ v
> > > > >         └────────┘---
> > > > >
> > > > > However, accessing the objects at the bottom of the mempool cache
> > > > > is a special case, where cache line0 is also used for objects.
> > > > >
> > > > > Consider the next burst (and any following bursts):
> > > > >
> > > > > Current:
> > > > >         ┌────────┐
> > > > >         │len     │
> > > > > cache   │        │
> > > > > line0   │        │
> > > > >         │        │
> > > > >         ├────────┤
> > > > >         │        │
> > > > > cache   │        │
> > > > > line1   │        │
> > > > >         │        │
> > > > >         ├────────┤
> > > > >         │        │
> > > > > cache   │********│---
> > > > > line2   │********│ ^
> > > > >         │********│ |
> > > > >         ├────────┤ | 16 objects
> > > > >         │********│ | 128B
> > > > > cache   │********│ |
> > > > > line3   │********│ |
> > > > >         │********│ |
> > > > >         ├────────┤ |
> > > > >         │********│_v_
> > > > > cache   │        │
> > > > > line4   │        │
> > > > >         │        │
> > > > >         └────────┘
> > > > > 4 cache lines touched, incl. line0 for len.
> > > > >
> > > > > With the proposed alignment:
> > > > >         ┌────────┐
> > > > >         │len     │
> > > > > cache   │        │
> > > > > line0   │        │
> > > > >         │        │
> > > > >         ├────────┤
> > > > >         │        │
> > > > > cache   │        │
> > > > > line1   │        │
> > > > >         │        │
> > > > >         ├────────┤
> > > > >         │        │
> > > > > cache   │        │
> > > > > line2   │        │
> > > > >         │        │
> > > > >         ├────────┤
> > > > >         │********│---
> > > > > cache   │********│ ^
> > > > > line3   │********│ |
> > > > >         │********│ | 16 objects
> > > > >         ├────────┤ | 128B
> > > > >         │********│ |
> > > > > cache   │********│ |
> > > > > line4   │********│ |
> > > > >         │********│_v_
> > > > >         └────────┘
> > > > > Only 3 cache lines touched, incl. line0 for len.
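To see the same layout difference in numbers, the following stand-alone C sketch can be compiled and run; the structs below are simplified stand-ins for struct rte_mempool_cache (the real definition has more fields), so the exact offsets are illustrative assumptions rather than DPDK's actual values.

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64
#define CACHE_MAX_SIZE 512	/* stand-in for RTE_MEMPOOL_CACHE_MAX_SIZE */

/* Current layout: objs[] follows len directly, sharing cache line 0. */
struct cache_packed {
	uint32_t size;
	uint32_t flushthresh;
	uint32_t len;
	void *objs[CACHE_MAX_SIZE * 2];
} __attribute__((aligned(CACHE_LINE_SIZE)));

/* Proposed layout: objs[] is pushed to the next cache line boundary,
 * leaving the remainder of cache line 0 as padding after len. */
struct cache_aligned {
	uint32_t size;
	uint32_t flushthresh;
	uint32_t len;
	void *objs[CACHE_MAX_SIZE * 2] __attribute__((aligned(CACHE_LINE_SIZE)));
} __attribute__((aligned(CACHE_LINE_SIZE)));

int main(void)
{
	/* Packed: objs[] starts at offset 16, so a cache-line-aligned burst
	 * of 32 pointers (256 B) straddles 5 lines of the structure.
	 * Aligned: objs[] starts at offset 64, so the same burst covers 4. */
	printf("packed:  objs[] at offset %zu\n", offsetof(struct cache_packed, objs));
	printf("aligned: objs[] at offset %zu\n", offsetof(struct cache_aligned, objs));
	return 0;
}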
> > > >
> > > > I understand your logic, but are we sure that having an application
> > > > that works with bulks of 32 means that the cache will stay aligned to
> > > > 32 elements for the whole life of the application?
> > > >
> > > > In an application, the alignment of the cache can change if you have
> > > > any of:
> > > > - software queues (reassembly for instance)
> > > > - packet duplication (bridge, multicast)
> > > > - locally generated packets (keepalive, control protocol)
> > > > - pipeline to other cores
> > > >
> > > > Even with testpmd, which works by bulks of 32, I can see that the
> > > > size of the cache filling is not aligned to 32. Right after starting
> > > > the application, we already have this:
> > > >
> > > >   internal cache infos:
> > > >     cache_size=250
> > > >     cache_count[0]=231
> > > >
> > > > This is probably related to the hw rx ring size, the number of
> > > > queues, and the number of ports.
> > > >
> > > > The "250" default value for cache size in testpmd is questionable,
> > > > but with --mbcache=256, the behavior is similar.
> > > >
> > > > Also, when we transmit to a NIC, the mbufs are not returned
> > > > immediately to the pool; they may stay in the hw tx ring for some
> > > > time, which is a driver decision.
> > > >
> > > > After processing traffic on cores 8 and 24 with this testpmd, I get:
> > > >   cache_count[0]=231
> > > >   cache_count[8]=123
> > > >   cache_count[24]=122
> > > >
> > > > In my opinion, it is not realistic to think that the mempool cache
> > > > will remain aligned to cache lines. In these conditions, it looks
> > > > better to keep the structure packed to avoid wasting memory.
> > >
> > > I agree that it is a special use case to only access the mempool cache
> > > in bursts of 32 objects, so the accesses are always cache line aligned.
> > > (Generalized, the burst size need not be 32; any burst size that is a
> > > multiple of RTE_CACHE_LINE_SIZE/sizeof(void*), i.e. a multiple of 8 on
> > > a 64-bit architecture, will do.)
> >
> > Is there a real situation where it happens to always have read/write
> > accesses per bulks of 32? From what I see in my quick test, it is not
> > the case, even with testpmd.
> >
> > > Adding a hole of 52 bytes per mempool cache is nothing, considering
> > > that the mempool cache already uses 8 KB (RTE_MEMPOOL_CACHE_MAX_SIZE *
> > > 2 * sizeof(void*) = 1024 * 8 byte) for the objects.
> > >
> > > Also - assuming that memory allocations are cache line aligned - the
> > > 52 bytes of unused memory cannot be used regardless of whether they
> > > are before or after the objects. Instead of having 52 B unused after
> > > the objects, we might as well have a hole of 52 B unused before the
> > > objects. In other words: There is really no downside to this.
> >
> > Correct, the memory waste argument to nack the patch is invalid.
> >
> > > Jerin also presented a separate argument for moving the objects to
> > > another cache line than the len field: the risk of a "load-after-store
> > > stall" when loading the len field after storing objects in cache
> > > line0 [1].
> > >
> > > [1]: http://inbox.dpdk.org/dev/CALBAE1P4zFYdLwoQukn5Q-V-ntvc_ubwmwjhav2uvbxqryt...@mail.gmail.com/
> >
> > I'll be prudent about this justification without numbers. The case where
> > we access the objects of the first cache line (among several KB) is
> > maybe not that frequent.
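To quantify the point about fill levels, here is a small, stand-alone helper (hypothetical, not DPDK code; the offsets 16 and 64 are assumptions standing in for the packed and aligned layouts) that counts how many 64 B lines a 32-object get touches at a given cache fill level. Only fill levels that are a multiple of 8 see the 5-to-4 reduction; the testpmd counts above (231, 123, 122) do not.

#include <stdio.h>

#define LINE 64
#define PTR  8	/* sizeof(void *) on a 64-bit architecture */

/* Count the distinct 64 B cache lines covered by reading 'burst' pointers
 * from the top of the cache, i.e. objs[len - burst] .. objs[len - 1], when
 * objs[] starts 'objs_off' bytes into the cache-line-aligned structure. */
static unsigned int
lines_touched(unsigned int objs_off, unsigned int len, unsigned int burst)
{
	unsigned int first = objs_off + (len - burst) * PTR;
	unsigned int last  = objs_off + len * PTR - 1;

	return last / LINE - first / LINE + 1;
}

int main(void)
{
	/* Compare objs[] at offset 16 (packed) and offset 64 (aligned). */
	unsigned int fills[] = { 32, 64, 231, 123, 122 };

	for (unsigned int i = 0; i < sizeof(fills) / sizeof(fills[0]); i++)
		printf("len=%3u: packed=%u lines, aligned=%u lines\n",
		       fills[i],
		       lines_touched(16, fills[i], 32),
		       lines_touched(64, fills[i], 32));
	return 0;
}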
> >
> > > A new idea just popped into my head: The hot debug statistics
> > > counters (put_bulk, put_objs, get_success_bulk, get_success_objs)
> > > could be moved to this free space, reducing the need to touch another
> > > cache line for debug counters. I haven't thought this idea through
> > > yet; it might conflict with Jerin's comment.
> >
> > Yes, but since the stats are only enabled when RTE_LIBRTE_MEMPOOL_DEBUG
> > is set, it won't have any impact on non-debug builds.
>
> Correct, but I do expect that it would reduce the performance cost of
> using RTE_LIBRTE_MEMPOOL_DEBUG. I'll provide such a patch shortly.
>
> > Honestly, I find it hard to convince myself that it is a real
> > optimization. I don't see any reason why it would be slower though. So
> > since we already broke the mempool cache struct ABI in a previous
> > commit, and since it won't consume more memory, I'm ok to include that
> > patch.
>
> I don't know if there are any such applications now, and you are probably
> right that there are not. But this patch opens a road towards it.
>
> Acked-by ?
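As a rough illustration of that idea (purely a sketch under assumed names: MEMPOOL_DEBUG and the counter fields below stand in for RTE_LIBRTE_MEMPOOL_DEBUG and the real per-lcore statistics, and this is not the actual DPDK definition), the debug counters could live in the padding that the alignment creates between len and objs:

#include <stdint.h>

#define CACHE_LINE_SIZE 64
#define CACHE_MAX_SIZE 512	/* stand-in for RTE_MEMPOOL_CACHE_MAX_SIZE */

/* Hypothetical layout: with objs[] aligned to a cache line, the ~52 B
 * between len and objs[] are otherwise wasted.  In a debug build the hot
 * counters could occupy that space, so updating them touches the same
 * cache line as len rather than a separate statistics structure.  Note
 * that this keeps stores next to len, which is where it may interact with
 * Jerin's load-after-store concern mentioned above. */
struct cache_with_stats {
	uint32_t size;        /* Size of the cache */
	uint32_t flushthresh; /* Threshold before excess elements are flushed */
	uint32_t len;         /* Current cache count */
#ifdef MEMPOOL_DEBUG
	uint64_t put_bulk;
	uint64_t put_objs;
	uint64_t get_success_bulk;
	uint64_t get_success_objs;
#endif
	/* The per-member alignment absorbs the fields above as padding. */
	void *objs[CACHE_MAX_SIZE * 2] __attribute__((aligned(CACHE_LINE_SIZE)));
} __attribute__((aligned(CACHE_LINE_SIZE)));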
Acked-by: Olivier Matz <olivier.m...@6wind.com>

Thanks Morten

>
> > It would be great to have numbers to put some weight in the balance.
>
> Yes, it would also be great if drivers didn't copy-paste code from the
> mempool library, so the performance effect of modifications in the
> mempool library would be reflected in such tests.
>
> > > >
> > > > Olivier
> > > >
> > > > >
> > > > > Credits go to Olivier Matz for the nice ASCII graphics.
> > > > >
> > > > > Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
> > > > > ---
> > > > >  lib/mempool/rte_mempool.h | 6 ++++--
> > > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
> > > > > index 1f5707f46a..3725a72951 100644
> > > > > --- a/lib/mempool/rte_mempool.h
> > > > > +++ b/lib/mempool/rte_mempool.h
> > > > > @@ -86,11 +86,13 @@ struct rte_mempool_cache {
> > > > >  	uint32_t size; /**< Size of the cache */
> > > > >  	uint32_t flushthresh; /**< Threshold before we flush excess elements */
> > > > >  	uint32_t len; /**< Current cache count */
> > > > > -	/*
> > > > > +	/**
> > > > > +	 * Cache objects
> > > > > +	 *
> > > > >  	 * Cache is allocated to this size to allow it to overflow in certain
> > > > >  	 * cases to avoid needless emptying of cache.
> > > > >  	 */
> > > > > -	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2]; /**< Cache objects */
> > > > > +	void *objs[RTE_MEMPOOL_CACHE_MAX_SIZE * 2] __rte_cache_aligned;
> > > > >  } __rte_cache_aligned;
> > > > >
> > > > >  /**
> > > > > --
> > > > > 2.17.1
> > > > >
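For context, the workload the patch targets looks roughly like the sketch below (a hedged example rather than code from the thread: the processing loop and the fixed burst size of 32 are assumptions), where every get and put moves exactly 32 objects through the per-lcore cache:

#include <rte_mempool.h>

#define BURST 32

/* Get and put objects in fixed bursts of 32 so that, as long as the cache
 * fill level stays a multiple of 8 pointers, each access covers whole
 * cache lines of objs[].  As discussed above, real applications rarely
 * maintain this invariant. */
static void
process_bursts(struct rte_mempool *mp, unsigned int iterations)
{
	void *objs[BURST];

	for (unsigned int i = 0; i < iterations; i++) {
		if (rte_mempool_get_bulk(mp, objs, BURST) != 0)
			continue; /* not enough objects available */

		/* ... use the 32 objects ... */

		rte_mempool_put_bulk(mp, objs, BURST);
	}
}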