On 1/8/2024 2:49 AM, Jaroslav Pulchart wrote:
Hello
First, thank you for your work trying to chase this!
I would like to report a regression triggered by a recent change in the
Intel ice Ethernet driver in the 6.6.9 Linux kernel. The problem was
bisected to commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f ("ice: alter
feature support check for SRIOV and LAG") and was originally reported as
part of the
https://lore.kernel.org/linux-mm/cak8ffz4dy+gtba40pm7nn5xchy+51w3sfxpqkqpqaksxyyx...@mail.gmail.com/T/#m5217c62beb03b3bc75d7dd4b1d9bab64a3e68826
thread.
I think that's a bad bisect. There is no reason I can see for that
change to cause a continuous or large leak; it really doesn't make any
sense. Reverting it consistently helps? You're not just rewinding the
tree back to that point, right, just running 6.6.9 without that patch?
(Sorry for being pedantic, just trying to be certain.)
Reverting just the single bisected commit consistently helps for >=
6.6.9 as well as for the current 6.7.
We cannot use any newer kernel without reverting it because of this
extra memory utilization.
However, with that patch in place we see that more NUMA nodes are left
with so little free memory that the system is constantly reclaiming; it
looks like something inside the kernel has eaten all the memory. This is
the case right after system start as well.
I'm reporting it here as it is a different problem than the original
thread. The commit introduces a low-memory problem on each NUMA node
of the first socket (node0 .. node3 in our case) and causes constant
kswapd* 100% CPU usage. See attached 6.6.9-kswapd_usage.png. The low
memory issue is nicely visible in "numastat -m", see attached files:
* numastat_m-6.6.10_28GB_HP_ice_revert.txt (>= 6.6.9 with the ice commit reverted)
* numastat_m-6.6.10_28GB_HP_no_revert.txt (>= 6.6.9 vanilla)
The server is "fresh" (just after reboot), without any application load running.
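In case it is useful for reproducing this, the low-memory/kswapd
behaviour can be watched with standard tools roughly like this (a
minimal sketch, nothing ice-specific assumed):

$ numastat -m | grep -E 'Node|MemFree'                 # per-NUMA-node free memory (MB)
$ ps -eo pid,comm,pcpu | grep kswapd                   # kswapd threads and their CPU usage
$ grep -E 'pgscan_kswapd|pgsteal_kswapd' /proc/vmstat  # reclaim counters growing over time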
OK, so the initial allocations on your system are running it out
of memory.
Are you running jumbo frames on your ethernet interfaces?
Yes, we are (MTU 9000).
Do you have /proc/slabinfo output from working/non-working boot?
Yes, I have a complete sos report, so I can pick up files from there.
See attached:
slabinfo.vanila (non-working)
slabinfo.reverted (working)
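If it helps, this is roughly how the two snapshots can be compared to
see which slab caches account for the extra memory (a sketch assuming
the usual slabinfo 2.1 column order: name, active_objs, num_objs,
objsize, ...):

$ awk 'NR>2 {print $1, $3*$4}' slabinfo.vanila   | sort -k2 -rn | head -20 > top.vanila
$ awk 'NR>2 {print $1, $3*$4}' slabinfo.reverted | sort -k2 -rn | head -20 > top.reverted
$ diff top.reverted top.vanila    # caches whose footprint (num_objs * objsize, bytes) differs the most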
$ grep MemFree numastat_m-6.6.10_28GB_HP_ice_revert.txt numastat_m-6.6.10_28GB_HP_no_revert.txt
numastat_m-6.6.10_28GB_HP_ice_revert.txt:MemFree    2756.89  2754.86   100.39  2278.43
    <- ice fix reverted: ~2 GB free per NUMA node (except one), like before == no issue
numastat_m-6.6.10_28GB_HP_ice_revert.txt:MemFree    3551.29  1530.52  2212.04  3488.09
...
numastat_m-6.6.10_28GB_HP_no_revert.txt:MemFree      127.52    66.49   120.23   263.47
    <- ice fix present: only a few hundred MB (or less) free per node, this is what causes the kswapd utilization!
numastat_m-6.6.10_28GB_HP_no_revert.txt:MemFree     3322.18  3134.47   195.55   879.17
...
Any hints on how to debug what is actually occupying all that memory,
and ideally a fix for the problem, would be very welcome. We can provide
testing and more reports if needed to analyze the issue. We have
reverted commit fc4d6d136d42fab207b3ce20a8ebfd61a13f931f as a workaround
until a proper fix is known.
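For completeness, the workaround is nothing more than rebuilding the
kernel with that one commit reverted, roughly (a sketch, assuming a
kernel git tree that contains the commit):

$ git revert fc4d6d136d42fab207b3ce20a8ebfd61a13f931f   # "ice: alter feature support check for SRIOV and LAG"
$ make -j"$(nproc)" && sudo make modules_install install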
My first suspicion is that we're contributing to the problem by running
out of memory for receive descriptors.
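A rough way to gauge how much memory the RX rings alone could pin down
(a sketch; eth0 stands for whichever port you check, and the per-buffer
size is only an assumed upper bound for MTU 9000):

$ ethtool -g eth0    # current/maximum RX and TX ring sizes (descriptors per queue)
$ ethtool -l eth0    # current/maximum number of combined queues
# rough upper bound: ports * queues * rx_ring_size * rx_buffer_size,
# e.g. 4 ports * 64 queues * 2048 descriptors * ~4 KB  =~  2 GB of receive buffers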
Can we see the ethtool -S stats from the freshly booted system that's
running out of memory or doing OOM? Also, all the standard debugging
info (at least once, please): devlink dev info, any other configuration
specifics? What is the networking config (bonding? anything else?)
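Collecting it with something like the following would be enough (a
sketch; eth0/bond0 are placeholders for the actual port and bond names):

$ ethtool -S eth0 > ethtool_-S_eth0.txt   # per-port statistics, one file per port
$ devlink dev info                        # driver/firmware versions for each PF
$ ip -d link show                         # MTU, bonding, VLAN details
$ cat /proc/net/bonding/bond0             # LACP bond state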
The system does not OOM; once applications start using it, the four
kswapd processes (one per NUMA node of the first CPU socket) run
continuously at 100% CPU, all doing swap in/out, because of the "low
memory" condition.
We have two 25G 2P E810-XXV adapters. The first port of each (em1 +
p3p1) is connected and they are bonded using LACP. The second ports
(em2 and p3p2) are unused.
See attached files for the working case:
ethtool_-S_em1.reverted
ethtool_-S_em2.reverted
ethtool_-S_p3p1.reverted
ethtool_-S_p3p2.reverted
See attached files for the non-working case:
ethtool_-S_em1.vanila
ethtool_-S_em2.vanila
ethtool_-S_p3p1.vanila
ethtool_-S_p3p2.vanila
Do you have a bugzilla.kernel.org bug yet where you can upload larger
files like dmesg and others?
I do not have one yet; I will create a new one and ping you then.
Also, I'm curious whether your problem goes away if you reduce the
number of queues per port, e.g. "ethtool -L eth0 combined 4"?
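For example (a sketch; eth0 is a placeholder for each port, and the
setting does not persist across reboots):

$ ethtool -l eth0               # show current/maximum combined queue count
$ ethtool -L eth0 combined 4    # reduce to 4 combined queues, then retest memory usage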
I will try and give you feedback soon.
You also said something about reproducing when launching / destroying
virtual machines with VF passthrough?
The memory usage is there from boot, without any VMs running. The issue
is that the host is left with too little memory for itself, and kswapd
kicks in once we start using the host by launching VMs.
Can you reproduce the issue without starting qemu (just doing bare-metal
SR-IOV instance creation/destruction via
/sys/class/net/eth0/device/sriov_numvfs)?
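That is, roughly (a sketch; eth0 is a placeholder for the PF being tested):

# create 4 VFs on the PF, then destroy them again
$ echo 4 > /sys/class/net/eth0/device/sriov_numvfs
$ echo 0 > /sys/class/net/eth0/device/sriov_numvfs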
Yes, we can reproduce it without qemu running; the extra memory usage
is there from boot onward and does not depend on any running VM.
We do not use SR-IOV.
Thanks
Thanks,
Jaroslav Pulchart