On Sat, Jul 5, 2025 at 9:01, Jaroslav Pulchart
<[email protected]> wrote:
>
> > On Mon, Apr 14, 2025 at 06:29:01PM +0200, Jaroslav Pulchart wrote:
> > > Hello,
> > >
> > > While investigating increased memory usage after upgrading our
> > > host/hypervisor servers from Linux kernel 6.12.y to 6.13.y, I observed
> > > a regression in available memory per NUMA node. Our servers allocate
> > > 60GB of each NUMA node’s 64GB of RAM to HugePages for VMs, leaving 4GB
> > > for the host OS.
> > >
> > > After the upgrade, we noticed approximately 500MB less free RAM on
> > > NUMA nodes 0 and 2 compared to 6.12.y, even with no VMs running (just
> > > the host OS after reboot). These nodes host Intel 810-XXV NICs. Here's
> > > a snapshot of the NUMA stats on vanilla 6.13.y:
> > >
> > >      NUMA nodes:  0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15
> > >      HPFreeGiB:   60    60    60    60    60    60    60    60    60    60    60    60    60    60    60    60
> > >      MemTotal:    64989 65470 65470 65470 65470 65470 65470 65453 65470 65470 65470 65470 65470 65470 65470 65462
> > >      MemFree:     2793  3559  3150  3438  3616  3722  3520  3547  3547  3536  3506  3452  3440  3489  3607  3729
> > >
> > > We traced the issue to commit 492a044508ad13a490a24c66f311339bf891cb5f
> > > "ice: Add support for persistent NAPI config".
> > >
> > > We limit the number of channels on each NIC to the number of local
> > > NUMA cores, or fewer for unused interfaces (down from the ridiculous
> > > default of 96), for example:
> > >    ethtool -L em1 combined 6       # active port; from 96
> > >    ethtool -L p3p2 combined 2      # unused port; from 96
> > >
> > > This typically aligns memory use with local CPUs and keeps NUMA-local
> > > memory usage within expected limits. However, starting with kernel
> > > 6.13.y and this commit, the high memory usage by the ICE driver
> > > persists regardless of reduced channel configuration.
> > >
> > > Reverting the commit restores expected memory availability on nodes 0
> > > and 2. Below are stats from 6.13.y with the commit reverted:
> > >     NUMA nodes:  0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15
> > >     HPFreeGiB:   60    60    60    60    60    60    60    60    60    60    60    60    60    60    60    60
> > >     MemTotal:    64989 65470 65470 65470 65470 65470 65470 65453 65470 65470 65470 65470 65470 65470 65470 65462
> > >     MemFree:     3208  3765  3668  3507  3811  3727  3812  3546  3676  3596 ...
> > >
> > > This brings nodes 0 and 2 back to ~3.5GB free RAM, similar to kernel
> > > 6.12.y, and avoids swap pressure and memory exhaustion when running
> > > services and VMs.
> > >
> > > I also do not see any practical benefit in persisting the channel
> > > memory allocation. After a fresh server reboot, channels are not
> > > explicitly configured, and the system will not automatically resize
> > > them back to a higher count unless manually set again. Therefore,
> > > retaining the previous memory footprint appears unnecessary and
> > > potentially harmful in memory-constrained environments.
> > >
> > > Best regards,
> > > Jaroslav Pulchart
> > >
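
The "approximately 500MB" figure can be checked directly against the two MemFree rows quoted above. The following is only a sketch: the values are copied from those snapshots (nodes 0 to 9 only, because the reverted-kernel row is truncated after node 9), the unit is assumed to be MB as in the surrounding text, and the names `vanilla`, `reverted` and `memfree_deltas` are made up for illustration.

```python
# MemFree per NUMA node, copied from the two snapshots quoted above.
# Only nodes 0-9 are listed: the reverted-kernel row is cut off after node 9.
vanilla = [2793, 3559, 3150, 3438, 3616, 3722, 3520, 3547, 3547, 3536]
reverted = [3208, 3765, 3668, 3507, 3811, 3727, 3812, 3546, 3676, 3596]


def memfree_deltas(before, after):
    """Return the per-node difference (after - before)."""
    if len(before) != len(after):
        raise ValueError("snapshots must cover the same nodes")
    return [a - b for b, a in zip(before, after)]


deltas = memfree_deltas(vanilla, reverted)
for node, delta in enumerate(deltas):
    # Unit assumed to be MB, matching the figures used in the report.
    print(f"node {node}: {delta:+d} MB")
```

Both snapshots are single samples taken after a reboot, so small differences on nodes without a NIC may be ordinary run-to-run variation rather than an effect of the commit.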
> >
> >
> > Hello Jaroslav,
> >
> > I have just sent a series for converting the Rx path of the ice driver
> > to use the Page Pool.
> > We suspect it may help with the memory consumption issue since it removes
> > the problematic code and delegates some memory management to the generic
> > code.
> >
> > Could you please give it a try and check whether it helps with your issue?
> > The link to the series: 
> > https://lore.kernel.org/intel-wired-lan/[email protected]/
>
> I can try it; however, the patch does not apply as-is on 6.15.y:
> $ git am ~/ice-convert-Rx-path-to-Page-Pool.patch
> Applying: ice: remove legacy Rx and construct SKB
> Applying: ice: drop page splitting and recycling
> error: patch failed: drivers/net/ethernet/intel/ice/ice_txrx.h:480
> error: drivers/net/ethernet/intel/ice/ice_txrx.h: patch does not apply
> Patch failed at 0002 ice: drop page splitting and recycling
> hint: Use 'git am --show-current-patch=diff' to see the failed patch
> hint: When you have resolved this problem, run "git am --continue".
> hint: If you prefer to skip this patch, run "git am --skip" instead.
> hint: To restore the original branch and stop patching, run "git am --abort".
> hint: Disable this message with "git config set advice.mergeConflict false"
>

My colleague and I have applied the missing bits and have it building
on 6.15.5 (note that we had to disable CONFIG_MEM_ALLOC_PROFILING, or
the kernel won’t boot). The patches we used are:

0001-libeth-convert-to-netmem.patch
0002-libeth-support-native-XDP-and-register-memory-model.patch
0003-libeth-xdp-add-XDP_TX-buffers-sending.patch
0004-libeth-xdp-add-.ndo_xdp_xmit-helpers.patch
0005-libeth-xdp-add-XDPSQE-completion-helpers.patch
0006-libeth-xdp-add-XDPSQ-locking-helpers.patch
0007-libeth-xdp-add-XDPSQ-cleanup-timers.patch
0008-libeth-xdp-add-helpers-for-preparing-processing-libe.patch
0009-libeth-xdp-add-XDP-prog-run-and-verdict-result-handl.patch
0010-libeth-xdp-add-templates-for-building-driver-side-ca.patch
0011-libeth-xdp-add-RSS-hash-hint-and-XDP-features-setup-.patch
0012-libeth-xsk-add-XSk-XDP_TX-sending-helpers.patch
0013-libeth-xsk-add-XSk-xmit-functions.patch
0014-libeth-xsk-add-XSk-Rx-processing-support.patch
0015-libeth-xsk-add-XSkFQ-refill-and-XSk-wakeup-helpers.patch
0016-libeth-xdp-xsk-access-adjacent-u32s-as-u64-where-app.patch
0017-ice-add-a-separate-Rx-handler-for-flow-director-comm.patch
0018-ice-remove-legacy-Rx-and-construct-SKB.patch
0019-ice-drop-page-splitting-and-recycling.patch
0020-ice-switch-to-Page-Pool.patch

Unfortunately, the new setup crashes after VMs are started. Here’s the
oops trace:

[   82.816544] tun: Universal TUN/TAP device driver, 1.6
[   82.823923] tap2c2b8dfc-91: entered promiscuous mode
[   82.848913] tapa92181fc-b5: entered promiscuous mode
[   84.030527] tap54ab9888-90: entered promiscuous mode
[   84.043251] tap89f4f7ae-d1: entered promiscuous mode
[   85.768578] tapf1e9f4f9-17: entered promiscuous mode
[   85.780372] tap72c64909-77: entered promiscuous mode
[   87.580455] tape1b2d2dd-bc: entered promiscuous mode
[   87.593224] tap34fb2668-4a: entered promiscuous mode
[  150.406899] Oops: general protection fault, probably for
non-canonical address 0xffff3b95e757d5a0: 0000 [#1] SMP NOPTI
[  150.417626] CPU: 4 UID: 0 PID: 0 Comm: swapper/4 Tainted: G
   E       6.15.5-1.gdc+ice.el9.x86_64 #1 PREEMPT(lazy)
[  150.428845] Tainted: [E]=UNSIGNED_MODULE
[  150.432773] Hardware name: Dell Inc. PowerEdge R7525/0H3K7P, BIOS
2.19.0 03/07/2025
[  150.440432] RIP: 0010:page_pool_put_unrefed_netmem+0xe2/0x250
[  150.446186] Code: 18 48 85 d2 0f 84 58 ff ff ff 8b 52 2c 4c 89 e7
39 d0 41 0f 94 c5 e8 0d f2 ff ff 84 c0 0f 85 4f ff ff ff 48 8b 85 60
06 00 00 <65> 48 ff 40 20 5b 4c 89 e6 48 89 ef 5d 41 5c 41 5d e9 f8 fa
ff ff
[  150.464947] RSP: 0018:ffffbc4a003fcd18 EFLAGS: 00010246
[  150.470173] RAX: ffff9dcabfc37580 RBX: 00000000ffffffff RCX: 0000000000000000
[  150.477496] RDX: 0000000000000000 RSI: fffff2ec441924c0 RDI: fffff2ec441924c0
[  150.484773] RBP: ffff9dcabfc36f20 R08: ffff9dc330536d20 R09: 0000000000551618
[  150.492045] R10: 0000000000000000 R11: 0000000000000f82 R12: fffff2ec441924c0
[  150.499317] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000001b69
[  150.506584] FS:  0000000000000000(0000) GS:ffff9dcb27946000(0000)
knlGS:0000000000000000
[  150.514806] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  150.520677] CR2: 00007f82d00041b8 CR3: 000000012bcab00a CR4: 0000000000770ef0
[  150.527937] PKRU: 55555554
[  150.530770] Call Trace:
[  150.533342]  <IRQ>
[  150.535484]  ice_clean_rx_irq+0x288/0x530 [ice]
[  150.540171]  ? sched_balance_find_src_group+0x13f/0x210
[  150.545521]  ? ice_clean_tx_irq+0x18f/0x3a0 [ice]
[  150.550373]  ice_napi_poll+0xe2/0x290 [ice]
[  150.554709]  __napi_poll+0x27/0x1e0
[  150.558323]  net_rx_action+0x1d3/0x3f0
[  150.562194]  ? __napi_schedule+0x8e/0xb0
[  150.566239]  ? sched_clock+0xc/0x30
[  150.569852]  ? sched_clock_cpu+0xb/0x190
[  150.573897]  handle_softirqs+0xd0/0x2b0
[  150.577858]  __irq_exit_rcu+0xcd/0xf0
[  150.581636]  common_interrupt+0x7f/0xa0
[  150.585601]  </IRQ>
[  150.587826]  <TASK>
[  150.590049]  asm_common_interrupt+0x22/0x40
[  150.594352] RIP: 0010:flush_smp_call_function_queue+0x39/0x50
[  150.600218] Code: 80 c0 bb 2e 98 48 85 c0 74 31 53 9c 5b fa bf 01
00 00 00 e8 49 f5 ff ff 65 66 83 3d 58 af 90 02 00 75 0c 80 e7 02 74
01 fb 5b <c3> cc cc cc cc e8 8d 1d f1 ff 80 e7 02 74 f0 eb ed c3 cc cc
cc cc
[  150.619204] RSP: 0018:ffffbc4a001e7ed8 EFLAGS: 00000202
[  150.624550] RAX: 0000000000000000 RBX: ffff9dc2c0088000 RCX: 00000000000f4240
[  150.631806] RDX: 0000000000007f0c RSI: 0000000000000008 RDI: ffff9dcabfc30880
[  150.639057] RBP: 0000000000000004 R08: 0000000000000008 R09: ffff9dcabfc311e8
[  150.646314] R10: ffff9dcabfc1fd80 R11: 0000000000000004 R12: ffff9dc2c1e64400
[  150.653569] R13: ffffffff978da0e0 R14: 0000000000000001 R15: 0000000000000000
[  150.660829]  do_idle+0x13a/0x200
[  150.664186]  cpu_startup_entry+0x25/0x30
[  150.668241]  start_secondary+0x114/0x140
[  150.672292]  common_startup_64+0x13e/0x141
[  150.676525]  </TASK>
[  150.678840] Modules linked in: target_core_user(E) uio(E)
target_core_pscsi(E) target_core_file(E) target_core_iblock(E)
nf_conntrack_netlink(E) vhost_net(E) vhost(E) vhost_iotlb(E) tap(E)
tun(E) rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E)
nfs(E) lockd(E) grace(E) netfs(E) netconsole(E)
scsi_transport_iscsi(E) sch_ingress(E) iscsi_target_mod(E)
target_core_mod(E) 8021q(E) garp(E) mrp(E) bonding(E) tls(E)
nfnetlink_cttimeout(E) nfnetlink(E) openvswitch(E) nf_conncount(E)
nf_nat(E) psample(E) ib_core(E) binfmt_misc(E) dell_rbu(E) sunrpc(E)
vfat(E) fat(E) dm_service_time(E) dm_multipath(E) amd_atl(E)
intel_rapl_msr(E) intel_rapl_common(E) amd64_edac(E) ipmi_ssif(E)
edac_mce_amd(E) kvm_amd(E) kvm(E) dell_pc(E) platform_profile(E)
dell_smbios(E) dcdbas(E) mgag200(E) irqbypass(E)
dell_wmi_descriptor(E) wmi_bmof(E) i2c_algo_bit(E) rapl(E)
acpi_cpufreq(E) ptdma(E) i2c_piix4(E) acpi_power_meter(E) ipmi_si(E)
k10temp(E) i2c_smbus(E) acpi_ipmi(E) wmi(E) ipmi_devintf(E)
ipmi_msghandler(E) tcp_bbr(E) fuse(E) zram(E)
[  150.678894]  lz4hc_compress(E) lz4_compress(E) zstd_compress(E)
ext4(E) crc16(E) mbcache(E) jbd2(E) dm_crypt(E) sd_mod(E) sg(E) ice(E)
ahci(E) polyval_clmulni(E) libie(E) libeth_xdp(E) polyval_generic(E)
libahci(E) libeth(E) ghash_clmulni_intel(E) sha512_ssse3(E) libata(E)
ccp(E) megaraid_sas(E) gnss(E) sp5100_tco(E) dm_mirror(E)
dm_region_hash(E) dm_log(E) dm_mod(E) nf_conntrack(E)
nf_defrag_ipv6(E) nf_defrag_ipv4(E) br_netfilter(E) bridge(E) stp(E)
llc(E)
[  150.770112] Unloaded tainted modules: fmpm(E):1 fjes(E):2 padlock_aes(E):2
[  150.818140] ---[ end trace 0000000000000000 ]---
[  150.913536] pstore: backend (erst) writing error (-22)
[  150.918850] RIP: 0010:page_pool_put_unrefed_netmem+0xe2/0x250
[  150.924764] Code: 18 48 85 d2 0f 84 58 ff ff ff 8b 52 2c 4c 89 e7
39 d0 41 0f 94 c5 e8 0d f2 ff ff 84 c0 0f 85 4f ff ff ff 48 8b 85 60
06 00 00 <65> 48 ff 40 20 5b 4c 89 e6 48 89 ef 5d 41 5c 41 5d e9 f8 fa
ff ff
[  150.943854] RSP: 0018:ffffbc4a003fcd18 EFLAGS: 00010246
[  150.949245] RAX: ffff9dcabfc37580 RBX: 00000000ffffffff RCX: 0000000000000000
[  150.956556] RDX: 0000000000000000 RSI: fffff2ec441924c0 RDI: fffff2ec441924c0
[  150.963860] RBP: ffff9dcabfc36f20 R08: ffff9dc330536d20 R09: 0000000000551618
[  150.971166] R10: 0000000000000000 R11: 0000000000000f82 R12: fffff2ec441924c0
[  150.978475] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000001b69
[  150.985782] FS:  0000000000000000(0000) GS:ffff9dcb27946000(0000)
knlGS:0000000000000000
[  150.994036] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  150.999958] CR2: 00007f82d00041b8 CR3: 000000012bcab00a CR4: 0000000000770ef0
[  151.007270] PKRU: 55555554
[  151.010151] Kernel panic - not syncing: Fatal exception in interrupt
[  151.488873] Kernel Offset: 0x14600000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  151.581163] ---[ end Kernel panic - not syncing: Fatal exception in
interrupt ]---
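
A quick sanity check on the register dump, offered as a sketch rather than a diagnosis. Assuming the faulting bytes `<65> 48 ff 40 20` decode to a GS-relative 64-bit increment at `0x20(%rax)` (my reading of the Code: line, not something the trace states), the effective address would be GS base + RAX + 0x20, truncated to 64 bits. The register values below are copied from the oops; `is_canonical` and the 48-bit address width are assumptions for illustration (CR4 in the dump does not have the LA57 bit set).

```python
MASK64 = (1 << 64) - 1


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) of addr are all equal (x86-64 canonical form)."""
    top = addr >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


# Values copied from the oops above.
gs_base = 0xFFFF9DCB27946000  # GS: field
rax = 0xFFFF9DCABFC37580      # RAX at the fault
rbp = 0xFFFF9DCABFC36F20      # RBP at the fault
fault = 0xFFFF3B95E757D5A0    # "non-canonical address" reported by the kernel

# Assumed addressing mode of the faulting instruction: %gs:0x20(%rax).
effective = (gs_base + rax + 0x20) & MASK64

print(hex(effective), effective == fault)
print("fault canonical:", is_canonical(fault))
print("RAX canonical:  ", is_canonical(rax))
# The preceding bytes (48 8b 85 60 06 00 00) look like a load of RAX from
# 0x660(%rbp); the loaded value happens to equal RBP + 0x660 itself.
print("RAX == RBP + 0x660:", rax == rbp + 0x660)
```

If that decoding is right, the reported address is exactly GS base + RAX + 0x20, which would mean RAX held an ordinary kernel pointer where a per-CPU offset was expected. Whether that comes from the backport or from the series itself is not something the trace alone can tell.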

> >
> > Thanks,
> > Michal
> >
