11/10/2024 10:04, Mattias Rönnblom:
> On 2024-10-10 23:24, Thomas Monjalon wrote:
> > Hello,
> > 
> > This new feature looks to bring something interesting to DPDK.
> > There was a good amount of discussion and review,
> > and there is a real effort of documentation.
> > 
> > However, some choices done in this implementation
> > were not explained or advertised enough in the documentation,
> > in my opinion.
> > 
> 
> Are those of relevance to the API user?

I think it helps to understand when we should use this API.
Such a design explanation may go in the prog guide RST file.


> > I think the first thing to add is an explanation of the memory layout.
> > Maybe that a SVG drawing would help to show how it is stored.
> 
> That would be helpful to someone wanting to understand the internals. 
> But where should that go? If it's put in the API, it will also obscure 
> the *actual* API documentation.

Of course not in API doc.
I'm talking about moving a lot of explanations into the prog guide,
and adding a bit more about the layout.

> I have some drawings already, and I agree they are very helpful - both 
> in explaining how things work, and making obvious why the memory layout 
> resulting from the use of lcore variables are superior to that of the 
> lcore id-index static array approach.

Cool, please add some in the prog guide.

> > We also need to explain why it is not using rte_malloc.
> > 
> > Also please could you re-read the doc and comments in detail?
> > I think some words are missing and there are typos.
> > While at it, please allow for easy update of the text
> > by starting each sentence on a new line.
> > Breaking lines logically is better for future patches.
> > One more advice: avoid very long sentences.
> 
> I've gone through the documentation and will post a new patch set.

OK thanks.

> There's been a lot of comments and discussion on this patch set. Did you 
> have anything in particular in mind?

Nothing more than what I raised in this review.


> > Do you have benchmarks results of the modules using such variables
> > (power, random, service)?
> > It would be interesting to compare time efficiency and memory usage
> > before/after, with different number of threads.
> > 
> 
> I have the dummy modules of test_lcore_var_perf.c, which show the 
> performance benefits as the number of modules using lcore variables 
> increases.
> 
> That said, the gains are hard to quantify with micro benchmarks, and for 
> real-world performance, one really has to start using the facility at 
> scale before anything interesting may happen.
> 
> Keep in mind however, that while this is new to DPDK, similar facilities 
> already exist in your favorite UN*X kernel. The implementation is 
> different, but I think it's accurate to say the goal and the effects 
> should be the same.
> 
> One can also run the perf autotest for RTE random, but such tests only 
> show lcore variables don't make things significantly worse when the L1 
> cache is essentially unused. (In fact, the lcore variable-enabled 
> rte_random.c somewhat counter-intuitively generates a 64-bit number 1 
> TSC cycle faster than the old version on my system.)
> 
> Just to be clear: it's the footprint in the core-private caches we are 
> attempting to reduce.

OK


> > 10/10/2024 16:21, Mattias Rönnblom:
> >> Introduce DPDK per-lcore id variables, or lcore variables for short.
> >>
> >> An lcore variable has one value for every current and future lcore
> >> id-equipped thread.
> > 
> > I find it difficult to read "lcore id-equipped thread".
> > Can we just say "DPDK thread"?
> 
> Sure, if you point me to a definition of what a DPDK thread is.
> 
> I can think of at least four potential definitions
> * An EAL thread
> * An EAL thread or a registered non-EAL thread
> * Any thread calling into DPDK APIs
> * Any thread living in a DPDK process

OK I understand your point.
If we move the design explanations in the prog guide,
we can explain this point in the introduction of the chapter.


> > [...]
> >> +An application, a DPDK library or PMD may keep opt to keep per-thread
> >> +state.
> > 
> > I don't understand this sentence.
> 
> Which part is unclear?

"keep opt to keep per-thread"
What do you mean?


[...]
> >> +Variables with thread-local storage are allocated at the time of
> >> +thread creation, and exist until the thread terminates, for every
> >> +thread in the process. Only very small objects should be allocated in
> >> +TLS, since large TLS objects significantly slow down thread creation
> >> +and may needlessly increase the memory footprint of applications that
> >> +make extensive use of unregistered threads.
> > 
> > I don't understand the relation with non-DPDK threads.
> 
> __thread isn't just for "DPDK threads". It will allocate memory on all 
> threads in the process.

OK
May be good to add as a note.


> > [...]
> >> +#define LCORE_BUFFER_SIZE (RTE_MAX_LCORE_VAR * RTE_MAX_LCORE)
> > 
> > With #define RTE_MAX_LCORE_VAR 1048576,
> > LCORE_BUFFER_SIZE can be 100MB, right?
> > 
> 
> Sure. Unless you mlock the memory, it won't result in the DPDK process 
> having 100MB worth of mostly-unused resident memory (RSS, in Linux 
> speak). It would, would we use huge pages and thus effectively disabled 
> demand paging.
> 
> This is similar to how thread stacks generally work, where you often get 
> a fairly sizable stack (e.g., 2MB) but as long as you don't use all of 
> it, most of the pages won't be resident.
> 
> If you want to guard against such mlocked scenarios, you could consider 
> lowering the max variable size. You could argue it's strange to have a 
> large RTE_MAX_LCORE_VAR and yet tell the API user to only use it for 
> small, often-used blocks of memory.
> 
> If RTE_MAX_LCORE_VAR should have a different value, what should it be?

That's fine.


> >> +static void *lcore_buffer;
> > 
> > It is the last buffer for all lcores.
> > The name suggests it is one single buffer per lcore.
> > What about "last_buffer" or "current_buffer"?
> 
> Would "value_buffer" be better? Or "values_buffer", although that sounds 
> awkward. "current_value_buffer".
> 
> I agree lcore_buffer is very generic.
> 
> The buffer holds values for all lcore ids, for one or (usually many) 
> more lcore variables.

So you don't need to mention "lcore" in this variable.
The most important thing is that it is the last buffer allocated, IMHO.


> >> +static size_t offset = RTE_MAX_LCORE_VAR;
> > 
> > A comment may be useful for this value: it triggers the first alloc?
> 
> Yes. I will add a comment.
> 
> >> +
> >> +static void *
> >> +lcore_var_alloc(size_t size, size_t align)
> >> +{
> >> +  void *handle;
> >> +  unsigned int lcore_id;
> >> +  void *value;
> >> +
> >> +  offset = RTE_ALIGN_CEIL(offset, align);
> >> +
> >> +  if (offset + size > RTE_MAX_LCORE_VAR) {
> >> +#ifdef RTE_EXEC_ENV_WINDOWS
> >> +          lcore_buffer = _aligned_malloc(LCORE_BUFFER_SIZE,
> >> +                                         RTE_CACHE_LINE_SIZE);
> >> +#else
> >> +          lcore_buffer = aligned_alloc(RTE_CACHE_LINE_SIZE,
> >> +                                       LCORE_BUFFER_SIZE);
> >> +#endif
> >> +          RTE_VERIFY(lcore_buffer != NULL);
> > 
> > Please no panic in a lib.
> > You can return NULL.
> 
> One could, but it would be a great cost to the API user.
> 
> Something is seriously broken if these kinds of allocations fail 
> (considering when they occur and what size they are), just like 
> something is seriously broken if the kernel fails (or is unwilling to) 
> allocate pages used by static lcore id index arrays.

As said in another email,
we may return NULL in this function,
and keep RTE_VERIFY in the declaration macros for ease of use.
So the user has the choice between an API which returns an error
and a simpler, macro-based one.


> > [...]
> >> +#ifndef _RTE_LCORE_VAR_H_
> >> +#define _RTE_LCORE_VAR_H_
> > 
> > Really we don't need the first and last underscores,
> > but it's a detail.
> 
> I just follow the DPDK conventions here.
> 
> I agree the conventions are wrong.

Such conventions are not consistent. Let's do the right thing.

> >> +
> >> +/**
> >> + * @file
> >> + *
> >> + * RTE Lcore variables
> > 
> > Please don't say "RTE", it is just a prefix.
> 
> OK.
> 
> I just follow the DPDK conventions here as well, but sure, I'll change it.

Not really a convention.

> > You can replace it with "DPDK" if you really want to be specific.
> > 
> >> + *
> >> + * This API provides a mechanism to create and access per-lcore id
> >> + * variables in a space- and cycle-efficient manner.
> >> + *
> >> + * A per-lcore id variable (or lcore variable for short) has one value
> >> + * for each EAL thread and registered non-EAL thread. There is one
> >> + * instance for each current and future lcore id-equipped thread, with
> >> + * a total of RTE_MAX_LCORE instances. The value of an lcore variable
> >> + * for a particular lcore id is independent from other values (for
> >> + * other lcore ids) within the same lcore variable.
> >> + *
> >> + * In order to access the values of an lcore variable, a handle is
> >> + * used. The type of the handle is a pointer to the value's type
> >> + * (e.g., for an @c uint32_t lcore variable, the handle is a
> >> + * <code>uint32_t *</code>). The handle type is used to inform the
> >> + * access macros the type of the values. A handle may be passed
> >> + * between modules and threads just like any pointer, but its value
> >> + * must be treated as an opaque identifier. An allocated handle
> >> + * never has the value NULL.
> > 
> > Most of the explanations here would be better hosted in the prog guide.
> > The Doxygen API is better suited for short and direct explanations.
> 
> Yeah, maybe. Reworking this to the programming guide format and having 
> that reviewed is a sizable undertaking though.

It is mostly a matter of moving text.
I'm on it, I can review quickly.

> >> + *
> >> + * @b Creation
> >> + *
> >> + * An lcore variable is created in two steps:
> >> + *  1. Define an lcore variable handle by using @ref RTE_LCORE_VAR_HANDLE.
> >> + *  2. Allocate lcore variable storage and initialize the handle with
> >> + *     a unique identifier by @ref RTE_LCORE_VAR_ALLOC or
> >> + *     @ref RTE_LCORE_VAR_INIT. Allocation generally occurs the time of
> > 
> > *at* the time
> > 
> >> + *     module initialization, but may be done at any time.
> > 
> > You mean it does not depend on EAL initialization?
> 
> Lcore variables may be used prior to any other parts of the EAL have 
> been initialized.

Please make it explicit.

> >> + *
> >> + * An lcore variable is not tied to the owning thread's lifetime. It's
> >> + * available for use by any thread immediately after having been
> >> + * allocated, and continues to be available throughout the lifetime of
> >> + * the EAL.
> >> + *
> >> + * Lcore variables cannot and need not be freed.
> > 
> > I'm curious about that.
> > If EAL is closed, and the application continues its life,
> > then we want all this memory to be cleaned as well.
> > Do you know rte_eal_cleanup()?
> 
> I think the primary reason you would like to free the buffers is to 
> avoid false positives from tools like valgrind memcheck (if anyone 
> managed to get that working with DPDK).
> 
> rte_eal_cleanup() freeing the buffers and resetting the offset would 
> make sense. That however would require the buffers to be tracked (e.g., 
> as a linked list).
> 
>  From a footprint point of view, TLS allocations and static arrays also 
> aren't freed by rte_eal_cleanup().

They are not dynamic like this one.

I still think it is required.
Think about an application starting and stopping some DPDK modules:
it would be a serious memory leak.


> >> + * @b Access
> >> + *
> >> + * The value of any lcore variable for any lcore id may be accessed
> >> + * from any thread (including unregistered threads), but it should
> >> + * only be *frequently* read from or written to by the owner.
> > 
> > Would be interesting to explain why.
> 
> This is intended to be brief and false sharing is mentioned elsewhere.
> 
> >> + *
> >> + * Values of the same lcore variable but owned by two different lcore
> >> + * ids may be frequently read or written by the owners without risking
> >> + * false sharing.
> > 
> > Again you could explain why if you explained the storage layout.
> > What is the minimum object size to avoid false sharing?
> 
> Your objects may be as small as you want, and you still do not risk 
> false sharing. All objects for a particular lcore id are grouped 
> together, spatially.

[...]
> >> + * are allocated from the libc heap. Heap allocation failures are
> >> + * treated as fatal.
> > 
> > Why not handling as an error, so the app has a chance to cleanup before 
> > crash?
> > 
> 
> Because you don't want to put the burden on the user (app or 
> DPDK-internal) to attempt to clean up such failures, which in practice 
> will never occur, and in case they do, they are just among several such 
> early-memory-allocation failures where the application code has no say 
> in what should occur.
> 
> What happens if the TLS allocations are so large, the main thread can't 
> be created?
> 
> What happens if the BSS section is so large (because of all our 
> RTE_MAX_LCORE-sized arrays) so its pages can't be made resident in memory?
> 
> Lcore variables aren't a dynamic allocation facility.

I understand that and I agree.
In case someone is using it as a dynamic facility with the function,
can we offer them a NULL return?

[...]
> >> +/**
> >> + * Allocate space for an lcore variable, and initialize its handle.
> >> + *
> >> + * The values of the lcore variable are initialized to zero.
> > 
> > The lcore variables are initialized to zero, not the values.
> > 
> 
> "The lcore variables are initialized to zero" is the same as "The lcore 
> variables' values are initialized to zero" in my world, since the only 
> thing that can be initialized in a lcore variable is its values (or 
> "value instances" or just "instances", not sure I'm consistent here).

OK

> > Don't you mention 0 in align?
> 
> I don't understand the question. Are you asking why objects are 
> worst-case aligned when RTE_LCORE_VAR_ALLOC_SIZE() is used? Rather than 
> naturally aligned?

No, I just mean that an align value of 0 is not documented here.

> Good question, in that case. I guess it would make more sense if they 
> were naturally aligned. I just thought in terms of malloc() semantics, 
> but maybe that's wrong.

[...]
> >> +#define RTE_LCORE_VAR_INIT(name)                                  \
> >> +  RTE_INIT(rte_lcore_var_init_ ## name)                           \
> >> +  {                                                               \
> >> +          RTE_LCORE_VAR_ALLOC(name);                              \
> >> +  }
> > 
> > I don't get the need for RTE_INIT macros.
> 
> Check rte_power_intrinsics.c

I'll check later.

> I agree it's not obvious they are worth the API clutter.
> 
> > It does not cover RTE_INIT_PRIO and anyway
> > another RTE_INIT is probably already there in the module.


> >> + */
> >> +static inline void *
> >> +rte_lcore_var_lcore_ptr(unsigned int lcore_id, void *handle)
> > 
> > What a long name!
> > What about rte_lcore_var() ?
> > 
> 
> It's long but consistent with the rest of the API.
> 
> This is not a function you will see called often in API user code. 
> Most will use the access macros.

I let you discuss naming with Morten.
It seems he agrees with me about making it short.


