On 2024-12-09 18:40, David Marchand wrote:
On Mon, Dec 9, 2024 at 4:39 PM Mattias Rönnblom <hof...@lysator.liu.se> wrote:
On 2024-12-09 12:03, David Marchand wrote:
On Fri, Dec 6, 2024 at 12:02 PM Mattias Rönnblom <hof...@lysator.liu.se> wrote:
On 2024-12-05 18:57, David Marchand wrote:
As I had reported in rc2, the lcore variables allocation has a
noticeable impact on applications consuming DPDK, even when such
applications do not actually use DPDK, or do not use the features
associated with some lcore variables.

While the amount was reduced in a rush before rc2, there are still
cases where the increased memory footprint is noticeable, such as in
scaling tests.
See https://bugs.launchpad.net/ubuntu/+source/dpdk/+bug/2090931


What this bug report fails to mention is that it only affects
applications using locked memory.

- By locked memory, are you referring to mlock() and friends?
No ovsdb binary calls them, only the datapath cares about mlocking.


- At a minimum, I understand the lcore var change introduced an
increase in memory of 4kB * 128 (getpagesize() * RTE_MAX_LCORE),
since rte_lcore_var_alloc() calls memset() of the lcore var size, for
every lcore.


Yes, that is my understanding. It's also consistent with the
measurements I've posted on this list.
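
To make the mechanics concrete, here is a minimal sketch of what such
an allocator does; names and sizes are illustrative, not the actual
<rte_lcore_var.h> internals:

#include <stdlib.h>
#include <string.h>

#define MAX_LCORES 128              /* stands in for RTE_MAX_LCORE */
#define PER_LCORE_SIZE (128 * 1024) /* per-lcore slice of the buffer */

static char *lcore_var_buf; /* one heap buffer with MAX_LCORES slices */
static size_t next_offset;  /* bump allocator within each slice */

static void *
sketch_lcore_var_alloc(size_t size)
{
        size_t offset;
        unsigned int lcore;

        if (lcore_var_buf == NULL) /* error handling omitted in sketch */
                lcore_var_buf = malloc((size_t)MAX_LCORES * PER_LCORE_SIZE);

        offset = next_offset;
        next_offset += size;

        /* Zeroing the new variable in every lcore's slice dirties at
         * least one page per lcore, even for a tiny variable: hence
         * the getpagesize() * 128 RSS increase mentioned above. */
        for (lcore = 0; lcore < MAX_LCORES; lcore++)
                memset(lcore_var_buf + (size_t)lcore * PER_LCORE_SIZE +
                       offset, 0, size);

        /* lcore N's instance sits at a fixed stride from the handle. */
        return lcore_var_buf + offset;
}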

In this unit test, where 1000 processes are kept alive in parallel,
this means memory consumption increased by 512kB * 1000, so ~500MB at
least.
This amount of memory is probably significant in a resource-constrained
env like the (Ubuntu) CI.



I wouldn't expect thousands of concurrent processes in a
resource-constrained system. Sounds wasteful indeed. But sure, there may
well be scenarios where this makes sense.

- I went and traced this unit test on my laptop by monitoring
kmem:mm_page_alloc, though there may be a better metric when it comes
to memory consumption.

# dir=build; perf stat -e kmem:mm_page_alloc -- tests/testsuite -C $dir/tests \
    AUTOTEST_PATH=$dir/utilities:$dir/vswitchd:$dir/ovsdb:$dir/vtep:$dir/tests:$dir/ipsec:: \
    2154

Which gives:
- 1 635 489      kmem:mm_page_alloc for v23.11
- 5 777 043      kmem:mm_page_alloc for v24.11


Interesting. What is vm.overcommit_memory set to?

# cat /proc/sys/vm/overcommit_memory
0

And I am not sure what is being used in Ubuntu CI.

But the problem is, in the end, simpler.

[snip]


There is a ~4M difference in page allocations, where I would expect
~128k (128 pages per process * 1000 processes).
So something more happens than a simple page allocation per lcore,
though I fail to understand what.

Isolating the perf events for one process of this huge test, I counted
4878 page alloc calls.
Of them, 4108 had rte_lcore_var_alloc in their call stack, which is
unexpected.

After spending some time reading glibc, I noticed alloc_perturb().
*Bingo*, I remembered that OVS unit tests are run with MALLOC_PERTURB_
(=165, after double-checking the OVS sources).

"""
Tunable: glibc.malloc.perturb

This tunable supersedes the MALLOC_PERTURB_ environment variable and
is identical in features.

If set to a non-zero value, memory blocks are initialized with values
depending on some low order bits of this tunable when they are
allocated (except when allocated by calloc) and freed. This can be
used to debug the use of uninitialized or freed heap memory. Note that
this option does not guarantee that the freed block will have any
specific values. It only guarantees that the content the block had
before it was freed will be overwritten.

The default value of this tunable is ‘0’.
"""


OK, excellent work, detective. :)

Do you have a workaround for this issue, so that this test suite will work with vanilla DPDK 24.11? I guess OVS wants to keep the MALLOC_PERTURB_ setting.

The fix you've suggested will solve this issue for the no-DPDK-usage case. I'm guessing allocating the first lcore var block off the BSS (e.g., via a static variable) would as well, in addition to solving similar cases where there is "light" DPDK usage (i.e., rte_eal_init() is called, but with no real app).
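
A rough sketch of that idea, with made-up names and sizes (the real
block would be RTE_MAX_LCORE times the per-lcore buffer size):

#include <stdbool.h>
#include <stdlib.h>

#define MAX_LCORES 128
#define BLOCK_SIZE (MAX_LCORES * 128 * 1024)

/* Lives in BSS: mapped zero-fill-on-demand, so pages for lcores that
 * are never written stay unallocated, and since the block does not
 * come from malloc(), glibc's MALLOC_PERTURB_ logic never sees it. */
static char initial_block[BLOCK_SIZE];

static void *
alloc_block(void)
{
        static bool initial_used;

        if (!initial_used) {
                initial_used = true;
                return initial_block; /* no heap allocation, no perturb */
        }
        /* Any later blocks (rare in practice) may still come from the
         * heap, and remain subject to perturb. */
        return malloc(BLOCK_SIZE);
}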

Now, reproducing this outside of the test suite:

$ perf stat -e kmem:mm_page_alloc -- ./build/ovsdb/ovsdb-client --help >/dev/null
  Performance counter stats for './build/ovsdb/ovsdb-client --help':
                810      kmem:mm_page_alloc
        0,003277941 seconds time elapsed
        0,003260000 seconds user
        0,000000000 seconds sys

$ MALLOC_PERTURB_=165 perf stat -e kmem:mm_page_alloc --
./build/ovsdb/ovsdb-client --help >/dev/null
  Performance counter stats for './build/ovsdb/ovsdb-client --help':
              4 789      kmem:mm_page_alloc
        0,008766171 seconds time elapsed
        0,000976000 seconds user
        0,007794000 seconds sys

So the issue is not triggered by mlock'd memory, but by the whole
buffer of 16M for lcore variables being touched by a glibc debugging
feature.
And in Ubuntu CI, it translated to requesting 16G.
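
The effect is easy to demonstrate outside of OVS with a standalone
program (my illustration, not code from the report):

#include <stdlib.h>

/* Allocate 16MB and exit without ever touching it. Run under
 * "perf stat -e kmem:mm_page_alloc": with MALLOC_PERTURB_=165 in the
 * environment, glibc's alloc_perturb() writes the pattern over the
 * whole block at malloc() time, faulting in all ~4096 pages; without
 * it, almost none of them are allocated. */
int
main(void)
{
        char *p = malloc(16 * 1024 * 1024);

        if (p == NULL)
                return 1;
        /* The program never writes to p, yet with perturb enabled the
         * pages are already resident. */
        free(p);
        return 0;
}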



Btw, just focusing on lcore var, I did two more tests:
- 1 606 998      kmem:mm_page_alloc for v24.11 + revert all lcore var changes.
- 1 634 606      kmem:mm_page_alloc for v24.11 + current series with
postponed allocations.



If one moves initialization to shared object constructors (from having
had it happen at some later time), initialization code that would
otherwise never run at all (e.g., when DPDK is not used) is now
executed, and those code pages will increase RSS. That might well hurt
more than the lcore variable memory itself, depending on how much code
is run.

However, such read-only pages can be replaced with something more useful
if the system is under memory pressure, so they aren't really a big
issue as far as (real) memory footprint is concerned.

Just linking to DPDK (and its dependencies) already came with a 1-7 MB
RSS penalty, prior to lcore variables. I wonder how much of that goes
away if all RTE_INIT() type constructors are removed.

Regardless of the RSS change, completely removing constructors is not simple.
Postponing *all* existing constructors in DPDK code would be an ABI
breakage, as RTE_INIT has a priority notion, and application
callbacks using RTE_INIT may rely on it.

Agreed.
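
To illustrate the ordering contract (RTE_INIT and RTE_INIT_PRIO are
the real macros from rte_common.h; the constructors here are made up):

#include <rte_common.h>

/* Constructors declared with RTE_INIT_PRIO() run in priority order
 * (LOG, BUS, CLASS, ...) before plain RTE_INIT() ones, which use
 * RTE_PRIORITY_LAST. */
RTE_INIT_PRIO(register_my_bus, BUS) /* runs at RTE_PRIORITY_BUS */
{
        /* e.g., register a bus so later constructors can scan it */
}

RTE_INIT(app_callback) /* runs at RTE_PRIORITY_LAST */
{
        /* an application callback may rely on the bus above already
         * being registered; postponing only some constructors would
         * silently break that ordering */
}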

Just deferring "unprioritised" constructors would be doable on paper,
but the location in rte_eal_init where those would be deferred to
would have to be carefully evaluated (with -d plugins in mind).



It seems to me that a reworking of this area should have a bigger scope than just addressing this issue.

RTE_INIT() should probably be deprecated, and DPDK shouldn't encourage the use of shared-object level constructors.

For dynamically loaded modules (-d), there needs to be some kind of replacement, serving the same function.

There should probably be some way to hook into the initialization process (available also for apps), which should all happen at rte_eal_init() (or later).

Does the priority concept make sense? At least conceptually, the initialization should be based on a dependency graph (DAG).

You could reduce the priorities to a number of named stages (just like in FreeBSD or Linux). A minor tweak to the current model. However, in DPDK, it would be useful if a generic facility could be used by apps, and thus the number and names of the stages are open ended (unlike the UNIX kernels').
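
A sketch of what such open-ended stages could look like; this is an
entirely hypothetical API, nothing like it exists in EAL today:

#include <string.h>

#define MAX_CBS 64

struct init_cb {
        const char *stage; /* open-ended stage name, e.g. "bus", "app" */
        void (*fn)(void);
};

static struct init_cb cbs[MAX_CBS];
static unsigned int num_cbs;

static void
init_register(const char *stage, void (*fn)(void))
{
        /* bounds check omitted in sketch */
        cbs[num_cbs++] = (struct init_cb){ stage, fn };
}

/* rte_eal_init() (or the app, for app-defined stages) would run the
 * registered callbacks stage by stage, in a declared order. */
static void
run_stages(const char *const order[], unsigned int num_stages)
{
        for (unsigned int i = 0; i < num_stages; i++)
                for (unsigned int j = 0; j < num_cbs; j++)
                        if (strcmp(cbs[j].stage, order[i]) == 0)
                                cbs[j].fn();
}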

You could rely on explicit initialization alone, where each module initializes its dependencies. That would lead to repeated init function calls on the same module, unless there's some init framework help from EAL to prevent that (see the sketch below). Overall, that would lead to more code, where various higher-level modules need to initialize many dependencies.
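
The repeated-call problem could be contained with once-guards, which
is the kind of framework help EAL might provide; a minimal sketch with
made-up module names:

#include <stdbool.h>

static bool timer_inited;

static int
timer_module_init(void)
{
        if (timer_inited)
                return 0; /* already initialized by another dependent */
        timer_inited = true;
        /* ...actual timer subsystem setup... */
        return 0;
}

static int
eventdev_module_init(void)
{
        int rc = timer_module_init(); /* explicitly pull in dependency */

        if (rc != 0)
                return rc;
        /* ...eventdev setup... */
        return 0;
}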

Maybe the DAG is available on the build (meson) level, and thus the code can be generated out of that?

Some random thoughts.
