11/10/2024 10:04, Mattias Rönnblom:
> On 2024-10-10 23:24, Thomas Monjalon wrote:
> > Hello,
> >
> > This new feature looks to bring something interesting to DPDK.
> > There was a good amount of discussion and review,
> > and there is a real effort of documentation.
> >
> > However, some choices done in this implementation
> > were not explained or advertised enough in the documentation,
> > in my opinion.
>
> Are those of relevance to the API user?
I think it helps to understand when we should use this API.
Such design explanation may come in the prog guide RST file.

> > I think the first thing to add is an explanation of the memory layout.
> > Maybe an SVG drawing would help to show how it is stored.
>
> That would be helpful to someone wanting to understand the internals.
> But where should that go? If it's put in the API, it will also obscure
> the *actual* API documentation.

Of course not in the API doc.
I'm talking about moving a lot of explanations into the prog guide,
and adding a bit more about the layout.

> I have some drawings already, and I agree they are very helpful - both
> in explaining how things work, and in making obvious why the memory layout
> resulting from the use of lcore variables is superior to that of the
> lcore id-indexed static array approach.

Cool, please add some in the prog guide.

> > We also need to explain why it is not using rte_malloc.
> >
> > Also please could you re-read the doc and comments in detail?
> > I think some words are missing and there are typos.
> > While at it, please allow for easy update of the text
> > by starting each sentence on a new line.
> > Breaking lines logically is better for future patches.
> > One more piece of advice: avoid very long sentences.
>
> I've gone through the documentation and will post a new patch set.

OK thanks.

> There's been a lot of comments and discussion on this patch set. Did you
> have anything in particular in mind?

Nothing more than what I raised in this review.

> > Do you have benchmark results of the modules using such variables
> > (power, random, service)?
> > It would be interesting to compare time efficiency and memory usage
> > before/after, with different numbers of threads.
>
> I have the dummy modules of test_lcore_var_perf.c, which show the
> performance benefits as the number of modules using lcore variables
> increases.
>
> That said, the gains are hard to quantify with micro benchmarks, and for
> real-world performance, one really has to start using the facility at
> scale before anything interesting may happen.
>
> Keep in mind, however, that while this is new to DPDK, similar facilities
> already exist in your favorite UN*X kernel. The implementation is
> different, but I think it's accurate to say the goal and the effects
> should be the same.
>
> One can also run the perf autotest for RTE random, but such tests only
> show lcore variables don't make things significantly worse when the L1
> cache is essentially unused. (In fact, the lcore variable-enabled
> rte_random.c somewhat counter-intuitively generates a 64-bit number 1
> TSC cycle faster than the old version on my system.)
>
> Just to be clear: it's the footprint in the core-private caches we are
> attempting to reduce.

OK

> > 10/10/2024 16:21, Mattias Rönnblom:
> >> Introduce DPDK per-lcore id variables, or lcore variables for short.
> >>
> >> An lcore variable has one value for every current and future lcore
> >> id-equipped thread.
> >
> > I find it difficult to read "lcore id-equipped thread".
> > Can we just say "DPDK thread"?
>
> Sure, if you point me to a definition of what a DPDK thread is.
>
> I can think of at least four potential definitions:
> * An EAL thread
> * An EAL thread or a registered non-EAL thread
> * Any thread calling into DPDK APIs
> * Any thread living in a DPDK process

OK I understand your point.
If we move the design explanations into the prog guide,
we can explain this point in the introduction of the chapter.
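By the way, for the prog guide, a small before/after snippet next to the
drawing would make the layout benefit concrete.
A minimal sketch of what I have in mind (the "foo" module and its counters
are purely illustrative, and I am assuming the handle macro takes the value
type plus the handle name, as in this patch set):

#include <stdalign.h>
#include <stdint.h>

#include <rte_common.h>
#include <rte_lcore.h>
#include <rte_lcore_var.h>  /* header name as in this patch set */

/* Conventional approach: per-lcore state in a static, lcore id-indexed
 * array. Each entry is padded to a full cache line to avoid false
 * sharing, so two 8-byte counters still cost RTE_MAX_LCORE cache lines
 * of BSS, one mostly-empty cache line per lcore id. */
static struct {
        alignas(RTE_CACHE_LINE_SIZE) uint64_t enqueued;
        uint64_t dropped;
} foo_array_stats[RTE_MAX_LCORE];

/* Lcore variable approach: only the handle is static. The values for
 * all lcore ids are packed in the lcore variable buffer, so small
 * objects owned by the same lcore id (including those of other
 * modules) share cache lines, and no per-entry padding is needed. */
struct foo_stats {
        uint64_t enqueued;
        uint64_t dropped;
};

static RTE_LCORE_VAR_HANDLE(struct foo_stats, foo_lvar_stats);

RTE_LCORE_VAR_INIT(foo_lvar_stats);

The drawing could then show how the second form packs the per-lcore values
of many such modules into the same per-lcore-id region of the buffer.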
[...]

> >> +An application, a DPDK library or PMD may keep opt to keep per-thread
> >> +state.
> >
> > I don't understand this sentence.
>
> Which part is unclear?

"keep opt to keep per-thread"
What do you mean?

[...]

> >> +Variables with thread-local storage are allocated at the time of
> >> +thread creation, and exists until the thread terminates, for every
> >> +thread in the process. Only very small object should be allocated in
> >> +TLS, since large TLS objects significantly slows down thread creation
> >> +and may needlessly increase memory footprint for application that make
> >> +extensive use of unregistered threads.
> >
> > I don't understand the relation with non-DPDK threads.
>
> __thread isn't just for "DPDK threads". It will allocate memory on all
> threads in the process.

OK. It may be good to add this as a note.

> > [...]
> >> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> >
> > With #define RTE_MAX_LCORE_VAR 1048576,
> > LCORE_BUFFER_SIZE can be 100MB, right?
>
> Sure. Unless you mlock the memory, it won't result in the DPDK process
> having 100MB worth of mostly-unused resident memory (RSS, in Linux
> speak). It would, were we to use huge pages and thus effectively disable
> demand paging.
>
> This is similar to how thread stacks generally work, where you often get
> a fairly sizable stack (e.g., 2MB) but as long as you don't use all of
> it, most of the pages won't be resident.
>
> If you want to guard against such mlocked scenarios, you could consider
> lowering the max variable size. You could argue it's strange to have a
> large RTE_MAX_LCORE_VAR and yet tell the API user to only use it for
> small, often-used blocks of memory.
>
> If RTE_MAX_LCORE_VAR should have a different value, what should it be?

That's fine.

> >> +static void *lcore_buffer;
> >
> > It is the last buffer for all lcores.
> > The name suggests it is one single buffer per lcore.
> > What about "last_buffer" or "current_buffer"?
>
> Would "value_buffer" be better? Or "values_buffer", although that sounds
> awkward.

Or "current_value_buffer"?

> I agree lcore_buffer is very generic.
>
> The buffer holds values for all lcore ids, for one or (usually many)
> more lcore variables.

So you don't need to mention "lcore" in this variable name.
The most important point is that it is the last buffer allocated, IMHO.

> >> +static size_t offset = RTE_MAX_LCORE_VAR;
> >
> > A comment may be useful for this value: it triggers the first alloc?
>
> Yes. I will add a comment.

> >> +
> >> +static void *
> >> +lcore_var_alloc(size_t size, size_t align)
> >> +{
> >> +        void *handle;
> >> +        unsigned int lcore_id;
> >> +        void *value;
> >> +
> >> +        offset = RTE_ALIGN_CEIL(offset, align);
> >> +
> >> +        if (offset + size > RTE_MAX_LCORE_VAR) {
> >> +#ifdef RTE_EXEC_ENV_WINDOWS
> >> +                lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> >> +                                               RTE_CACHE_LINE_SIZE);
> >> +#else
> >> +                lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> >> +                                             LCORE_BUFFER_SIZE);
> >> +#endif
> >> +                RTE_VERIFY(lcore_buffer != NULL);
> >
> > Please no panic in a lib.
> > You can return NULL.
>
> One could, but it would be at a great cost to the API user.
>
> Something is seriously broken if these kinds of allocations fail
> (considering when they occur and what size they are), just like
> something is seriously broken if the kernel fails (or is unwilling)
> to allocate pages used by static lcore id-indexed arrays.

As said in another email, we may return NULL in this function
and keep RTE_VERIFY in the declaration macros, for ease of use.
So the user has a choice between an API which returns an error
and a simpler one based on macros.
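To be concrete about the split I have in mind, a rough sketch
(the macro name and signature here are approximate, not taken from the patch):

#include <stddef.h>

#include <rte_debug.h>  /* RTE_VERIFY() */

/* The allocation function reports failure to the caller instead of
 * panicking, for the (rare) users calling it directly at runtime. */
void *rte_lcore_var_alloc(size_t size, size_t align); /* may return NULL */

/* The declaration macros keep today's "cannot fail" convenience:
 * an allocation failure during early initialization stays fatal. */
#define RTE_LCORE_VAR_ALLOC_SIZE_ALIGN(handle, size, align) \
        do { \
                (handle) = rte_lcore_var_alloc(size, align); \
                RTE_VERIFY((handle) != NULL); \
        } while (0)

That way, libraries allocating at constructor time keep the terse form,
while an application allocating later can check for NULL.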
[...]

> >> +#ifndef _RTE_LCORE_VAR_H_
> >> +#define _RTE_LCORE_VAR_H_
> >
> > Really we don't need the first and last underscores,
> > but it's a detail.
>
> I just follow the DPDK conventions here.
>
> I agree the conventions are wrong.

Such conventions are not consistent. Let's do the right thing.

> >> +
> >> +/**
> >> + * @file
> >> + *
> >> + * RTE Lcore variables
> >
> > Please don't say "RTE", it is just a prefix.
>
> OK.
>
> I just follow the DPDK conventions here as well, but sure, I'll change it.

Not really a convention.

> > You can replace it with "DPDK" if you really want to be specific.
> >
> >> + *
> >> + * This API provides a mechanism to create and access per-lcore id
> >> + * variables in a space- and cycle-efficient manner.
> >> + *
> >> + * A per-lcore id variable (or lcore variable for short) has one value
> >> + * for each EAL thread and registered non-EAL thread. There is one
> >> + * instance for each current and future lcore id-equipped thread, with
> >> + * a total of RTE_MAX_LCORE instances. The value of an lcore variable
> >> + * for a particular lcore id is independent from other values (for
> >> + * other lcore ids) within the same lcore variable.
> >> + *
> >> + * In order to access the values of an lcore variable, a handle is
> >> + * used. The type of the handle is a pointer to the value's type
> >> + * (e.g., for an @c uint32_t lcore variable, the handle is a
> >> + * <code>uint32_t *</code>. The handle type is used to inform the
> >> + * access macros the type of the values. A handle may be passed
> >> + * between modules and threads just like any pointer, but its value
> >> + * must be treated as a an opaque identifier. An allocated handle
> >> + * never has the value NULL.
> >
> > Most of the explanations here would be better hosted in the prog guide.
> > The Doxygen API is better suited for short and direct explanations.
>
> Yeah, maybe. Reworking this to the programming guide format and having
> that reviewed is a sizable undertaking though.

It is mostly a matter of moving text.
I'm on it, I can review quickly.

> >> + *
> >> + * @b Creation
> >> + *
> >> + * An lcore variable is created in two steps:
> >> + *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
> >> + *  2. Allocate lcore variable storage and initialize the handle with
> >> + *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
> >> + *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs the time of
> >
> > *at* the time
> >
> >> + *     module initialization, but may be done at any time.
> >
> > You mean it does not depend on EAL initialization?
>
> Lcore variables may be used prior to any other parts of the EAL having
> been initialized.

Please make it explicit.

> >> + *
> >> + * An lcore variable is not tied to the owning thread's lifetime. It's
> >> + * available for use by any thread immediately after having been
> >> + * allocated, and continues to be available throughout the lifetime of
> >> + * the EAL.
> >> + *
> >> + * Lcore variables cannot and need not be freed.
> >
> > I'm curious about that.
> > If EAL is closed, and the application continues its life,
> > then we want all this memory to be cleaned as well.
> > Do you know rte_eal_cleanup()?
>
> I think the primary reason you would like to free the buffers is to
> avoid false positives from tools like valgrind memcheck (if anyone
> managed to get that working with DPDK).
> rte_eal_cleanup() freeing the buffers and resetting the offset would
> make sense. That however would require the buffers to be tracked (e.g.,
> as a linked list).
>
> From a footprint point of view, TLS allocations and static arrays also
> aren't freed by rte_eal_cleanup().

They are not dynamic, as this one is.
I still think it is required.
Think about an application starting and stopping some DPDK modules:
it would be a serious leak.

> >> + * @b Access
> >> + *
> >> + * The value of any lcore variable for any lcore id may be accessed
> >> + * from any thread (including unregistered threads), but it should
> >> + * only be *frequently* read from or written to by the owner.
> >
> > Would be interesting to explain why.
>
> This is intended to be brief and false sharing is mentioned elsewhere.
>
> >> + *
> >> + * Values of the same lcore variable but owned by two different lcore
> >> + * ids may be frequently read or written by the owners without risking
> >> + * false sharing.
> >
> > Again you could explain why if you explained the storage layout.
> > What is the minimum object size to avoid false sharing?
>
> Your objects may be as small as you want, and you still do not risk
> false sharing. All objects for a particular lcore id are grouped
> together, spatially.

[...]

> >> + * are allocated from the libc heap. Heap allocation failures are
> >> + * treated as fatal.
> >
> > Why not handle it as an error, so the app has a chance to clean up
> > before the crash?
>
> Because you don't want to put the burden on the user (app or
> DPDK-internal) to attempt to clean up such failures, which in practice
> will never occur, and in case they do, they are just among several such
> early-memory-allocation failures where the application code has no say
> in what should occur.
>
> What happens if the TLS allocations are so large, the main thread can't
> be created?
>
> What happens if the BSS section is so large (because of all our
> RTE_MAX_LCORE-sized arrays) that its pages can't be made resident in
> memory?
>
> Lcore variables aren't a dynamic allocation facility.

I understand that and I agree.
In case someone is using it as a dynamic facility via the function,
can we offer them a NULL return?

[...]

> >> +/**
> >> + * Allocate space for an lcore variable, and initialize its handle.
> >> + *
> >> + * The values of the lcore variable are initialized to zero.
> >
> > The lcore variables are initialized to zero, not the values.
>
> "The lcore variables are initialized to zero" is the same as "The lcore
> variables' values are initialized to zero" in my world, since the only
> thing that can be initialized in an lcore variable is its values (or
> "value instances" or just "instances", not sure I'm consistent here).

OK

> > Don't you mention 0 in align?
>
> I don't understand the question. Are you asking why objects are
> worst-case aligned when RTE_LCORE_VAR_ALLOC_SIZE() is used? Rather than
> naturally aligned?

No, I just mean that the 0 align value is not documented here.

> Good question, in that case. I guess it would make more sense if they
> were naturally aligned. I just thought in terms of malloc() semantics,
> but maybe that's wrong.

[...]

> >> +#define RTE_LCORE_VAR_INIT(name) \
> >> +        RTE_INIT(rte_lcore_var_init_ ## name) \
> >> +        { \
> >> +                RTE_LCORE_VAR_ALLOC(name); \
> >> +        }
> >
> > I don't get the need for RTE_INIT macros.
>
> Check rte_power_intrinsics.c

I'll check later.

> I agree it's not obvious they are worth the API clutter.
>
> > It does not cover RTE_INIT_PRIO and anyway
> > another RTE_INIT is probably already there in the module.
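To illustrate the RTE_INIT point, a sketch (a purely hypothetical "foo"
module; the handle macro signature is assumed, not taken from the patch):

#include <rte_common.h>
#include <rte_lcore_var.h>

struct foo_state {
        unsigned int level;
};

/* Variant 1: the convenience macro emits its own constructor. */
static RTE_LCORE_VAR_HANDLE(struct foo_state, foo_states);
RTE_LCORE_VAR_INIT(foo_states);

/* Variant 2: a module which already has an RTE_INIT (or RTE_INIT_PRIO)
 * constructor can allocate the variable there and skip the macro. */
static RTE_LCORE_VAR_HANDLE(struct foo_state, bar_states);

RTE_INIT(foo_module_init)
{
        RTE_LCORE_VAR_ALLOC(bar_states);
        /* ... other module initialization ... */
}

So the macro mostly saves a couple of lines in modules which have no
constructor yet.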
> >> + */
> >> +static inline void *
> >> +rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
> >
> > What a long name!
> > What about rte_lcore_var() ?
>
> It's long but consistent with the rest of the API.
>
> This is not a function you will see called often in API user code.
> Most will use the access macros.

I let you discuss naming with Morten.
It seems he agrees with me about making it short.
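For the record, the place I expect this accessor to show up in user code is
cross-lcore inspection, e.g. summing per-lcore counters when dumping
statistics. A sketch (hypothetical counter; handle macro assumed as above;
memory ordering ignored for brevity):

#include <stdint.h>

#include <rte_lcore.h>
#include <rte_lcore_var.h>

static RTE_LCORE_VAR_HANDLE(uint64_t, rx_packets);
RTE_LCORE_VAR_INIT(rx_packets);

/* Sum the counter over all lcore ids. This is one of the few spots
 * where the explicit lcore-id accessor appears; fast-path code would
 * use the owner-only access macros instead. */
static uint64_t
total_rx_packets(void)
{
        uint64_t total = 0;
        unsigned int lcore_id;

        for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
                const uint64_t *cnt =
                        rte_lcore_var_lcore_ptr(lcore_id, rx_packets);

                total += *cnt;
        }

        return total;
}

Whatever the final name, it will mostly appear in this kind of slow-path
code, so a shorter one would not hurt.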