<snip>
> > > > > > > From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> > > > > > > Sent: Sunday, 6 November 2022 00.11
> > > > > > >
> > > > > > > + Akshitha, she is working on similar patch
> > > > > > >
> > > > > > > Few comments inline
> > > > > > >
> > > > > > > > From: Morten Brørup <m...@smartsharesystems.com>
> > > > > > > > Sent: Saturday, November 5, 2022 8:40 AM
> > > > > > > >
> > > > > > > > Zero-copy access to the mempool cache is beneficial for PMD
> > > > > > > > performance, and must be provided by the mempool library to
> > > > > > > > fix [Bug 1052] without a performance regression.
> > > > > > > >
> > > > > > > > [Bug 1052]: https://bugs.dpdk.org/show_bug.cgi?id=1052
> > > > > > > >
> > > > > > > > This RFC offers a conceptual zero-copy put function, where the
> > > > > > > > application promises to store some objects, and in return gets
> > > > > > > > an address where to store them.
> > > > > > > >
> > > > > > > > I would like some early feedback.
> > > > > > > >
> > > > > > > > Notes:
> > > > > > > > * Allowing the 'cache' parameter to be NULL, and getting it
> > > > > > > >   from the mempool instead, was inspired by
> > > > > > > >   rte_mempool_cache_flush().
> > > > > > > I am not sure why the 'cache' parameter is required for this API.
> > > > > > > This API should take the mem pool as the parameter.
> > > > > > >
> > > > > > > We have based our API on 'rte_mempool_do_generic_put' and removed
> > > > > > > the 'cache' parameter.
> > > > > >
> > > > > > I thoroughly considered omitting the 'cache' parameter, but
> > > > > > included it for two reasons:
> > > > > >
> > > > > > 1. The function is a "mempool cache" function (i.e. primarily
> > > > > > working on the mempool cache), not a "mempool" function.
> > > > > >
> > > > > > So it is appropriate to have a pointer directly to the structure it
> > > > > > is working on. Following this through, I also made 'cache' the
> > > > > > first parameter and 'mp' the second, like in
> > > > > > rte_mempool_cache_flush().
> > > > > I am wondering if the PMD should be aware of the cache or not. For
> > > > > ex: in the case of pipeline mode, the RX and TX side of the PMD are
> > > > > running on different cores.
> > > >
> > > > In that example, the PMD can store two cache pointers, one for each of
> > > > the RX and TX side.
> > > I did not understand this. If RX core and TX core have their own
> > > per-core caches the logic would not work. For ex: the RX core cache
> > > would not get filled.
> > >
> > > In the case of pipeline mode, there will not be a per-core cache. The
> > > buffers would be allocated and freed from a global ring or a global
> > > lockless stack.
> >
> > Aha... Now I understand what you mean: You are referring to use cases
> > where the mempool is configured to *not* have a mempool cache.
> >
> > For a mempool without a mempool cache, the proposed "mempool cache"
> > zero-copy functions can obviously not be used.
> >
> > We need "mempool" zero-copy functions for the mempools that have no
> > mempool cache.
> >
> > However, those functions depend on the mempool's underlying backing
> > store.
> >
> > E.g. zero-copy access to a ring has certain requirements [1].
> >
> > [1]: http://doc.dpdk.org/guides/prog_guide/ring_lib.html#ring-peek-zero-copy-api
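
To make those ring requirements concrete, the sketch below shows how the
peek-style zero-copy enqueue referenced in [1] is used: space is reserved in
the ring, the caller writes the object pointers in place, and the enqueue is
then committed. This is an illustration only; the names used here
(rte_ring_enqueue_zc_burst_start(), rte_ring_enqueue_zc_finish(),
struct rte_ring_zc_data with ptr1/ptr2/n1) should be verified against
rte_ring_peek_zc.h, and these functions are only usable for rings created
with the synchronization modes that support zero-copy (and may be marked
experimental).

#include <rte_ring.h>
#include <rte_ring_peek_zc.h>

/* Illustrative zero-copy enqueue of 'n' object pointers onto ring 'r',
 * per the ring peek zero-copy API referenced in [1]. Returns the number
 * of objects actually enqueued.
 */
static inline unsigned int
ring_zc_enqueue_objs(struct rte_ring *r, void * const *objs, unsigned int n)
{
	struct rte_ring_zc_data zcd;
	unsigned int m, i;

	/* Reserve up to 'n' slots in the ring. */
	m = rte_ring_enqueue_zc_burst_start(r, n, &zcd, NULL);
	if (m == 0)
		return 0;

	/* The reserved space may wrap around the end of the ring:
	 * ptr1/n1 describe the first chunk, ptr2 the remainder (if any). */
	for (i = 0; i < m && i < zcd.n1; i++)
		((void **)zcd.ptr1)[i] = objs[i];
	for (; i < m; i++)
		((void **)zcd.ptr2)[i - zcd.n1] = objs[i];

	/* Commit the enqueue and make the objects visible to consumers. */
	rte_ring_enqueue_zc_finish(r, m);
	return m;
}

The ptr1/ptr2 split is exactly the wrap-around limitation discussed next:
the reserved space is not guaranteed to be one contiguous array.
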
> > For a stack, I think it is possible to locklessly zero-copy pop objects.
> > But it is impossible to locklessly zero-copy push elements to a stack;
> > another thread can race to pop some objects from the stack before the
> > pushing thread has finished writing them into the stack.
> >
> > Furthermore, the ring zero-copy get function cannot return a consecutive
> > array of objects when wrapping, and PMD functions using vector
> > instructions usually rely on handling chunks of e.g. 8 objects.
> >
> > Just for a second, let me theorize into the absurd: Even worse, if a
> > mempool's underlying backing store does not use an array of pointers as
> > its internal storage structure, it is impossible to use a pointer to an
> > array of pointers for zero-copy transactions. E.g. if the backing store
> > uses a list or a tree structure for its storage, a pointer to somewhere
> > in the list or tree structure is not an array of object pointers.
> >
> > Anyway, we could consider designing a generic API for zero-copy mempool
> > get/put; but it should be compatible with all underlying backing stores -
> > or return failure, so the PMD can fall back to the standard functions, if
> > the mempool is in a state where zero-copy access to a contiguous burst
> > cannot be provided. E.g. zero-copy get from a ring can return failure
> > when zero-copy access to the ring is temporarily unavailable due to being
> > at a point where it would wrap.
> >
> > Here is a conceptual proposal for such an API.
> >
> > /* Mempool zero-copy transaction state. Opaque outside the mempool API. */
> > struct rte_mempool_zc_transaction_state {
> > 	char opaque[RTE_CACHE_LINE_SIZE];
> > };
> >
> > /** Start zero-copy get/put bulk transaction.
> >  *
> >  * @param[in] mp
> >  *   Pointer to the mempool.
> >  * @param[out] obj_table_ptr
> >  *   Where to store the pointer to
> >  *   the zero-copy array of objects in the mempool.
> >  * @param[in] n
> >  *   Number of objects in the transaction.
> >  * @param[in] cache
> >  *   Pointer to the mempool cache. May be NULL if unknown.
> >  * @param[out] transaction
> >  *   Where to store the opaque transaction information.
> >  *   Used internally by the mempool library.
> >  * @return
> >  *   - 1: Transaction completed;
> >  *        '_finish' must not be called.
> >  *   - 0: Transaction started;
> >  *        '_finish' must be called to complete the transaction.
> >  *   - <0: Error; failure code.
> >  */
> > static __rte_always_inline int
> > rte_mempool_get/put_zc_bulk_start(
> > 	struct rte_mempool *mp,
> > 	void ***obj_table_ptr,
> > 	unsigned int n,
> > 	struct rte_mempool_cache *cache,
> > 	rte_mempool_zc_transaction_state *transaction);
> >
> > /** Finish zero-copy get/put bulk transaction.
> >  *
> >  * @param[in] mp
> >  *   Pointer to mempool.
> >  * @param[in] obj_table_ptr
> >  *   Pointer to the zero-copy array of objects in the mempool,
> >  *   returned by the 'start' function.
> >  * @param[in] n
> >  *   Number of objects in the transaction.
> >  *   Must be the same as for the 'start' function.
> >  * @param[in] transaction
> >  *   Opaque transaction information,
> >  *   returned by the 'start' function.
> >  *   Used internally by the mempool library.
> >  */
> > static __rte_always_inline void
> > rte_mempool_get/put_zc_bulk_finish(
> > 	struct rte_mempool *mp,
> > 	void **obj_table,
> > 	unsigned int n,
> > 	rte_mempool_zc_transaction_state *transaction);
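
For illustration, a PMD TX free path using the put variant of the conceptual
proposal above could look roughly like the sketch below. This is
hypothetical: rte_mempool_put_zc_bulk_start()/_finish() and the transaction
state type exist only in the proposal, and the return-value handling simply
follows the comments above; only the fallback rte_mempool_put_bulk() is an
existing mempool API.

#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Hypothetical usage sketch of the conceptual zero-copy put transaction.
 * Assumes the declarations proposed above are available.
 */
static inline void
tx_free_bufs_zc(struct rte_mempool *mp, struct rte_mempool_cache *cache,
		struct rte_mbuf **pkts, unsigned int n)
{
	struct rte_mempool_zc_transaction_state zc;
	void **slots;
	unsigned int i;
	int ret;

	ret = rte_mempool_put_zc_bulk_start(mp, &slots, n, cache, &zc);
	if (ret < 0) {
		/* Zero-copy access not possible right now; fall back to the
		 * standard (copying) bulk put. */
		rte_mempool_put_bulk(mp, (void **)pkts, n);
		return;
	}

	/* Write the object pointers directly into the mempool's storage. */
	for (i = 0; i < n; i++)
		slots[i] = pkts[i];

	/* Per the proposed return values: 1 means the transaction already
	 * completed at start and '_finish' must not be called. */
	if (ret == 0)
		rte_mempool_put_zc_bulk_finish(mp, slots, n, &zc);
}
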
> >
> > Note that these are *bulk* functions, so 'n' has to remain the same for
> > a 'finish' call as it was for the 'start' call of a transaction.
> >
> > And then the underlying backing stores would need to provide callbacks
> > that implement these functions, if they offer zero-copy functionality.
> >
> > The mempool implementation of these could start by checking for a
> > mempool cache, and use the "mempool cache" zero-copy if present.
> >
> > Some internal state information (from the mempool library or the
> > underlying mempool backing store) may need to be carried over from the
> > 'start' to the 'finish' function, so I have added a transaction state
> > parameter. The transaction state must be held by the application for
> > thread safety reasons. Think of this like the 'flags' parameter to the
> > Linux kernel's spin_lock_irqsave/irqrestore() functions.
> >
> > We could omit the 'obj_table' and 'n' parameters from the 'finish'
> > functions and store them in the transaction state if needed; but we
> > might possibly achieve higher performance by passing them as parameters
> > instead.
> >
> > > > And if the PMD is unaware of the cache pointer, it can look it up at
> > > > runtime using rte_lcore_id(), like it does in the current Intel PMDs.
> > > >
> > > > > However, since the rte_mempool_cache_flush API is provided, maybe
> > > > > that decision is already done? Interestingly,
> > > > > rte_mempool_cache_flush is called by just a single PMD.
> > > >
> > > > I intentionally aligned this RFC with rte_mempool_cache_flush() to
> > > > maintain consistency.
> > > >
> > > > However, the API is not set in stone. It should always be acceptable
> > > > to consider improved alternatives.
> > > >
> > > > > So, the question is, should we allow zero-copy only for per-core
> > > > > cache or for other cases as well.
> > > >
> > > > I suppose that the mempool library was designed to have a mempool
> > > > associated with exactly one mempool cache per core. (Alternatively,
> > > > the mempool can be configured with no mempool caches at all.)
> > > >
> > > > We should probably stay loyal to that design concept, and only allow
> > > > zero-copy for per-core cache.
> > > >
> > > > If you can come up with an example of the opposite, I would like to
> > > > explore that option too... I can't think of a good example myself,
> > > > and perhaps I'm overlooking a relevant use case.
> > > The use case I am talking about is the pipeline mode as I mentioned
> > > above. Let me know if you agree.
> >
> > I see what you mean, and I don't object. :-)
> >
> > However, I still think the "mempool cache" zero-copy functions could be
> > useful.
> >
> > They would be needed for the generic "mempool" zero-copy functions
> > anyway.
> >
> > And the "mempool cache" zero-copy functions are much simpler to design,
> > implement and use than the "mempool" zero-copy functions, so it is a
> > good first step.
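
A side note on the cache-pointer discussion above: the runtime lookup via
rte_lcore_id(), as done in the Intel PMDs, is already available through the
existing rte_mempool_default_cache(). A minimal sketch (the helper name is
made up for illustration):

#include <rte_lcore.h>
#include <rte_mempool.h>

/* Return the calling lcore's cache for 'mp', or NULL if the mempool was
 * created without caches or the caller is not an EAL thread. A NULL result
 * means zero-copy cache access cannot be used and the caller should fall
 * back to the regular mempool get/put functions.
 */
static inline struct rte_mempool_cache *
pmd_mempool_cache(struct rte_mempool *mp)
{
	unsigned int lcore_id = rte_lcore_id();

	if (lcore_id == LCORE_ID_ANY)
		return NULL;
	return rte_mempool_default_cache(mp, lcore_id);
}
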
>
> I would think that even in pipeline mode applications, a mempool cache
> would still be very useful, as it would likely reduce the number of
> accesses to the underlying shared ring or stack resources.
>
> For example, if, as is common, the application works in bursts of up to
> 32, but the mempool cache was configured as 128, it means that the TX
> side would flush buffers to the shared pool at most every 4 bursts, and
> likely less frequently than that due to the bursts themselves not always
> being the full 32. Since accesses to the underlying ring and stack tend
> to be slow due to locking or atomic operations, the more you can reduce
> the accesses the better.
>
> Therefore, I would very much consider use of a mempool without a cache
> as an edge case - one we need to support, but not optimize for - since
> mempool accesses without a cache would already be rather slow.

Understood. Looks like supporting APIs for per-core cache is enough for
now. I do not see a lot of advantages for a mempool without a cache.
> My 2c here.
> /Bruce
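
To put numbers on Bruce's example: the sketch below creates a standard
pktmbuf pool with a 128-object per-lcore cache. The values are illustrative
only (taken from the example above); rte_pktmbuf_pool_create() and
RTE_MBUF_DEFAULT_BUF_SIZE are existing DPDK symbols, everything else is
made up.

#include <rte_mbuf.h>
#include <rte_mempool.h>

#define MAX_PKT_BURST	32	/* typical PMD burst size, per the example */
#define MBUF_CACHE_SIZE	128	/* per-lcore mempool cache, per the example */

/* With a 128-object cache and 32-object bursts, the TX path touches the
 * shared backing store (ring/stack) at most once every 128/32 = 4 bursts,
 * and less often when bursts are not full.
 */
static struct rte_mempool *
create_pktmbuf_pool(unsigned int nb_mbufs, int socket_id)
{
	return rte_pktmbuf_pool_create("mbuf_pool", nb_mbufs, MBUF_CACHE_SIZE,
			0, RTE_MBUF_DEFAULT_BUF_SIZE, socket_id);
}
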