Greetings!!!

IBM CI has reported a kernel warnings while running CPU hot plug operation on IBM Power9 system.


Command to reproduce the issue:

drmgr -c cpu -r -q 1


Git Bisect is pointing to below commit as the first bad commit.


4ae8d9aa9f9dc7137ea5e564d79c5aa5af1bc45c


Traces:


[  464.306613] ------------[ cut here ]------------
[  464.306628] WARNING: CPU: 0 PID: 0 at kernel/sched/cpudeadline.c:219 cpudl_set+0x58/0x170 [  464.306641] Modules linked in: rpadlpar_io(E) rpaphp(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) bonding(E) nft_ct(E) tls(E) rfkill(E) nft_chain_nat(E) ip_set(E) hvcs(E) ibmveth(E) pseries_rng(E) hvcserver(E) vmx_crypto(E) sg(E) dm_multipath(E) drm(E) dm_mod(E) fuse(E) drm_panel_orientation_quirks(E) ext4(E) crc16(E) mbcache(E) jbd2(E) sr_mod(E) sd_mod(E) cdrom(E) ibmvscsi(E) scsi_transport_srp(E) [  464.306703] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G       E       6.17.0-gfd94619c4336 #1 VOLUNTARY
[  464.306711] Tainted: [E]=UNSIGNED_MODULE
[  464.306714] Hardware name: IBM,8375-42A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW950.80 (VL950_131) hv:phyp pSeries [  464.306720] NIP:  c0000000002b6ed8 LR: c0000000002b7cb8 CTR: c0000000002b7df0 [  464.306725] REGS: c000000002c2f5d0 TRAP: 0700   Tainted: G     E        (6.17.0-gfd94619c4336) [  464.306730] MSR:  8000000000021033 <SF,ME,IR,DR,RI,LE>  CR: 22000228  XER: 00000000
[  464.306743] CFAR: c0000000002b726c IRQMASK: 3
[  464.306743] GPR00: c0000000002b7cb8 c000000002c2f870 c000000001df8100 c000000002d6a710 [  464.306743] GPR04: 000000000000001e 0000006c566f51e0 0000000000000000 c000000002d6adb0 [  464.306743] GPR08: 00000000ffffffff 0000000000000001 c000000002cac488 0000000000000000 [  464.306743] GPR12: c0000000030a7000 c000000002fa0000 0000000000000000 0000000000000000 [  464.306743] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [  464.306743] GPR20: c0000009e940ac20 0000006c1aa50360 0000000000000001 0000000000000002 [  464.306743] GPR24: 0000000000000000 0000000000000000 0000000000000003 c0000009e940ab80 [  464.306743] GPR28: 000000000000001e 0000006c566f51e0 c000000002d6a710 000000000000001e
[  464.306804] NIP [c0000000002b6ed8] cpudl_set+0x58/0x170
[  464.306809] LR [c0000000002b7cb8] dl_server_timer+0x168/0x2a0
[  464.306815] Call Trace:
[  464.306818] [c000000002c2f870] [c000000002c2f8c0] init_stack+0x78c0/0x8000 (unreliable) [  464.306828] [c000000002c2f8c0] [c0000000002b7cb8] dl_server_timer+0x168/0x2a0 [  464.306835] [c000000002c2f920] [c00000000034df84] __hrtimer_run_queues+0x1a4/0x390 [  464.306842] [c000000002c2f9b0] [c00000000034f624] hrtimer_interrupt+0x124/0x300 [  464.306849] [c000000002c2fa60] [c00000000002a230] timer_interrupt+0x140/0x320 [  464.306856] [c000000002c2fac0] [c000000000009ffc] decrementer_common_virt+0x28c/0x290
[  464.306865] ---- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x2c
[  464.306872] NIP:  c0000000001b75d8 LR: c0000000001bf274 CTR: 0000000000000000 [  464.306877] REGS: c000000002c2faf0 TRAP: 0900   Tainted: G     E        (6.17.0-gfd94619c4336) [  464.306882] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24000228  XER: 20040000
[  464.306897] CFAR: 0000000000000000 IRQMASK: 0
[  464.306897] GPR00: 0000000000000000 c000000002c2fd90 c000000001df8100 0000000000000000 [  464.306897] GPR04: 0000000000000010 000000002c000040 0000000000000002 0000000000000040 [  464.306897] GPR08: 0000000000000000 0000000000000310 0000000000000031 0000000000000000 [  464.306897] GPR12: 00000000d02f71f1 c000000002fa0000 0000000000000000 0000000000000000 [  464.306897] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [  464.306897] GPR20: 0000000000c00000 0000000000000008 0000000000000000 0000000000000000 [  464.306897] GPR24: 0000000000000000 c000000000000000 c00000000a6e0000 c000000002cad0c0 [  464.306897] GPR28: 0000000000000001 c0000000022418e0 c0000000022418e8 c0000000022418e0
[  464.306956] NIP [c0000000001b75d8] plpar_hcall_norets_notrace+0x18/0x2c
[  464.306962] LR [c0000000001bf274] pseries_lpar_idle.part.0+0x74/0x160
[  464.306967] ---- interrupt: 900
[  464.306970] [c000000002c2fd90] [c0000009e940b3b0] 0xc0000009e940b3b0 (unreliable) [  464.306984] [c000000002c2fe10] [c0000000000212fc] arch_cpu_idle+0x4c/0x110 [  464.306993] [c000000002c2fe30] [c00000000134ddd0] default_idle_call+0x50/0x140 [  464.307001] [c000000002c2fe50] [c0000000002b4fdc] cpuidle_idle_call+0x1ac/0x240
[  464.307007] [c000000002c2fea0] [c0000000002b5164] do_idle+0xf4/0x1a0
[  464.307013] [c000000002c2fef0] [c0000000002b5498] cpu_startup_entry+0x48/0x50
[  464.307020] [c000000002c2ff20] [c0000000000113cc] rest_init+0xec/0xf0
[  464.307026] [c000000002c2ff50] [c0000000020052e0] do_initcalls+0x0/0x18c
[  464.307034] [c000000002c2ffe0] [c00000000000ea9c] start_here_common+0x1c/0x20 [  464.307040] Code: 549c06be 7c9f2378 7cbd2b78 7c7e1b78 39494388 5489e8f8 f8010010 f821ffb1 7d2a482a 7d29e436 552907fe 69290001 <0b090000> 490a428d 60000000 e93e0010
[  464.307060] ---[ end trace 0000000000000000 ]---
[  464.736380] ------------[ cut here ]------------
[  464.736397] WARNING: CPU: 0 PID: 0 at kernel/sched/cpudeadline.c:219 cpudl_set+0x58/0x170 [  464.736408] Modules linked in: rpadlpar_io(E) rpaphp(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) bonding(E) nft_ct(E) tls(E) rfkill(E) nft_chain_nat(E) ip_set(E) hvcs(E) ibmveth(E) pseries_rng(E) hvcserver(E) vmx_crypto(E) sg(E) dm_multipath(E) drm(E) dm_mod(E) fuse(E) drm_panel_orientation_quirks(E) ext4(E) crc16(E) mbcache(E) jbd2(E) sr_mod(E) sd_mod(E) cdrom(E) ibmvscsi(E) scsi_transport_srp(E) [  464.736468] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G   W   E       6.17.0-gfd94619c4336 #1 VOLUNTARY
[  464.736476] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE
[  464.736480] Hardware name: IBM,8375-42A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW950.80 (VL950_131) hv:phyp pSeries [  464.736486] NIP:  c0000000002b6ed8 LR: c0000000002b7cb8 CTR: c0000000002b7df0 [  464.736491] REGS: c000000002c2f4f0 TRAP: 0700   Tainted: G W   E        (6.17.0-gfd94619c4336) [  464.736497] MSR:  8000000000021033 <SF,ME,IR,DR,RI,LE>  CR: 22000424  XER: 00000000
[  464.736509] CFAR: c0000000002b726c IRQMASK: 3
[  464.736509] GPR00: c0000000002b7cb8 c000000002c2f790 c000000001df8100 c000000002d6a710 [  464.736509] GPR04: 000000000000001f 0000006c700d1304 0000000000000000 c000000002d6adb0 [  464.736509] GPR08: 00000000ffffffff 0000000000000001 c000000002cac488 0000000000000000 [  464.736509] GPR12: c0000000030a7000 c000000002fa0000 0000000000000000 0000000000000000 [  464.736509] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [  464.736509] GPR20: c0000009e940ac20 0000006c3442c73b 0000000000000001 0000000000000002 [  464.736509] GPR24: 0000000000000000 0000000000000000 0000000000000003 c0000009e940ab80 [  464.736509] GPR28: 000000000000001f 0000006c700d1304 c000000002d6a710 000000000000001f
[  464.736569] NIP [c0000000002b6ed8] cpudl_set+0x58/0x170
[  464.736574] LR [c0000000002b7cb8] dl_server_timer+0x168/0x2a0
[  464.736580] Call Trace:
[  464.736582] [c000000002c2f790] [c000000002c2f7e0] init_stack+0x77e0/0x8000 (unreliable) [  464.736592] [c000000002c2f7e0] [c0000000002b7cb8] dl_server_timer+0x168/0x2a0 [  464.736599] [c000000002c2f840] [c00000000034df84] __hrtimer_run_queues+0x1a4/0x390 [  464.736606] [c000000002c2f8d0] [c00000000034f624] hrtimer_interrupt+0x124/0x300 [  464.736613] [c000000002c2f980] [c00000000002a230] timer_interrupt+0x140/0x320 [  464.736620] [c000000002c2f9e0] [c000000000009ffc] decrementer_common_virt+0x28c/0x290
[  464.736627] ---- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x2c
[  464.736634] NIP:  c0000000001b75d8 LR: c00000000134dfe8 CTR: 0000000000000000 [  464.736638] REGS: c000000002c2fa10 TRAP: 0900   Tainted: G W   E        (6.17.0-gfd94619c4336) [  464.736644] MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 22000424  XER: 20040000
[  464.736659] CFAR: 0000000000000000 IRQMASK: 0
[  464.736659] GPR00: 0000000000000000 c000000002c2fcb0 c000000001df8100 0000000000000000 [  464.736659] GPR04: 0000000000000010 000000002c000040 0000000000000002 0000000000000040 [  464.736659] GPR08: 0000000000000000 0000000000000290 0000000000000029 0000000000000000 [  464.736659] GPR12: 00000000d02f74a9 c000000002fa0000 0000000000000000 0000000000000000 [  464.736659] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [  464.736659] GPR20: 0000000000c00000 0000000000000008 0000000000000000 0000000000000000 [  464.736659] GPR24: 0000000000000000 0000000000000000 0000006c3469239a 0000000000000001 [  464.736659] GPR28: c0000009e9419cc0 0000000000000001 c0000000022418e0 c0000000022418e8
[  464.736717] NIP [c0000000001b75d8] plpar_hcall_norets_notrace+0x18/0x2c
[  464.736723] LR [c00000000134dfe8] check_and_cede_processor+0x48/0x60
[  464.736730] ---- interrupt: 900
[  464.736733] [c000000002c2fcb0] [c0000000026a1080] init_task+0x0/0x1d00 (unreliable) [  464.736741] [c000000002c2fd10] [c00000000134e210] shared_cede_loop+0x70/0x170 [  464.736748] [c000000002c2fd50] [c00000000134d830] cpuidle_enter_state+0x2b0/0x648
[  464.736756] [c000000002c2fdf0] [c000000000e09f70] cpuidle_enter+0x50/0x80
[  464.736764] [c000000002c2fe30] [c0000000002ad868] call_cpuidle+0x48/0x90
[  464.736772] [c000000002c2fe50] [c0000000002b4f94] cpuidle_idle_call+0x164/0x240
[  464.736779] [c000000002c2fea0] [c0000000002b5164] do_idle+0xf4/0x1a0
[  464.736785] [c000000002c2fef0] [c0000000002b549c] cpu_startup_entry+0x4c/0x50
[  464.736791] [c000000002c2ff20] [c0000000000113cc] rest_init+0xec/0xf0
[  464.736797] [c000000002c2ff50] [c0000000020052e0] do_initcalls+0x0/0x18c
[  464.736804] [c000000002c2ffe0] [c00000000000ea9c] start_here_common+0x1c/0x20 [  464.736810] Code: 549c06be 7c9f2378 7cbd2b78 7c7e1b78 39494388 5489e8f8 f8010010 f821ffb1 7d2a482a 7d29e436 552907fe 69290001 <0b090000> 490a428d 60000000 e93e0010
[  464.736831] ---[ end trace 0000000000000000 ]---
[  493.843328] Non-volatile memory driver v1.3



Git Bisect logs:

git bisect bad
4ae8d9aa9f9dc7137ea5e564d79c5aa5af1bc45c is the first bad commit
commit 4ae8d9aa9f9dc7137ea5e564d79c5aa5af1bc45c (HEAD)
Author: Peter Zijlstra <[email protected]>
Date:   Tue Sep 16 23:02:41 2025 +0200

    sched/deadline: Fix dl_server getting stuck

    John found it was easy to hit lockup warnings when running locktorture
    on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
    ("sched/deadline: Less agressive dl_server handling").

    While debugging it seems there is a chance where we end up with the
    dl_server dequeued, with dl_se->dl_server_active. This causes
    dl_server_start() to return without enqueueing the dl_server, thus it
    fails to run when RT tasks starve the cpu.

    When this happens, dl_server_timer() catches the
    '!dl_se->server_has_tasks(dl_se)' case, which then calls
    replenish_dl_entity() and dl_server_stopped() and finally return
    HRTIMER_NO_RESTART.

    This ends in no new timer and also no enqueue, leaving the dl_server
    'dead', allowing starvation.

    What should have happened is for the bandwidth timer to start the
    zero-laxity timer, which in turn would enqueue the dl_server and cause
    dl_se->server_pick_task() to be called -- which will stop the
    dl_server if no fair tasks are observed for a whole period.

    IOW, it is totally irrelevant if there are fair tasks at the moment of
    bandwidth refresh.

    This removes all dl_se->server_has_tasks() users, so remove the whole
    thing.

    Fixes: cccb45d7c4295 ("sched/deadline: Less agressive dl_server handling")
    Reported-by: John Stultz <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
    Tested-by: John Stultz <[email protected]>

 include/linux/sched.h   |  1 -
 kernel/sched/deadline.c | 12 +-----------
 kernel/sched/fair.c     |  7 +------
 kernel/sched/sched.h    |  4 ----
 4 files changed, 2 insertions(+), 22 deletions(-)

# git bisect log
git bisect start
# status: waiting for both good and bad commits
# good: [038d61fd642278bab63ee8ef722c50d10ab01e8f] Linux 6.16
git bisect good 038d61fd642278bab63ee8ef722c50d10ab01e8f
# status: waiting for bad commit, 1 good commit known
# bad: [c746c3b5169831d7fb032a1051d8b45592ae8d78] Merge tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
git bisect bad c746c3b5169831d7fb032a1051d8b45592ae8d78
# good: [e25079858627916b22c4a789005a90a9fae808d8] Merge branch 'net-better-drop-accounting'
git bisect good e25079858627916b22c4a789005a90a9fae808d8
# bad: [05a54fa773284d1a7923cdfdd8f0c8dabb98bd26] Merge tag 'sound-6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect bad 05a54fa773284d1a7923cdfdd8f0c8dabb98bd26
# bad: [ae28ed4578e6d5a481e39c5a9827f27048661fdd] Merge tag 'bpf-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
git bisect bad ae28ed4578e6d5a481e39c5a9827f27048661fdd
# bad: [6855f06042ae8d134f96c63feb5dfb3943c6d789] Merge tag 'i2c-for-6.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
git bisect bad 6855f06042ae8d134f96c63feb5dfb3943c6d789
# good: [3d1e36499e02457f8de0edc9d87783cce97e8677] Merge tag 'gpio-fixes-for-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
git bisect good 3d1e36499e02457f8de0edc9d87783cce97e8677
# good: [86cc796e5e9bff0c3993607f4301b8188095516c] Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
git bisect good 86cc796e5e9bff0c3993607f4301b8188095516c
# good: [f975f08c2e899ae2484407d7bba6bb7f8b6d9d40] Merge tag 'for-6.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
git bisect good f975f08c2e899ae2484407d7bba6bb7f8b6d9d40
# good: [4ff71af020ae59ae2d83b174646fc2ad9fcd4dc4] Merge tag 'net-6.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect good 4ff71af020ae59ae2d83b174646fc2ad9fcd4dc4
# good: [f26a24662cd2875f82029e28879a20cea212214c] Merge tag 'v6.17rc7-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
git bisect good f26a24662cd2875f82029e28879a20cea212214c
# bad: [51a24b7deaae5c3561965f5b4b27bb9d686add1c] Merge tag 'trace-tools-v6.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
git bisect bad 51a24b7deaae5c3561965f5b4b27bb9d686add1c
# bad: [083fc6d7fa0d974a3663b97c8b0466737a544236] Merge tag 'sched-urgent-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 083fc6d7fa0d974a3663b97c8b0466737a544236
# good: [2cea0ed9796381b142f46bd8de97bb6b54b1df61] Merge tag 'locking-urgent-2025-09-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 2cea0ed9796381b142f46bd8de97bb6b54b1df61
# bad: [a3a70caf7906708bf9bbc80018752a6b36543808] sched/deadline: Fix dl_server behaviour
git bisect bad a3a70caf7906708bf9bbc80018752a6b36543808
# bad: [4ae8d9aa9f9dc7137ea5e564d79c5aa5af1bc45c] sched/deadline: Fix dl_server getting stuck
git bisect bad 4ae8d9aa9f9dc7137ea5e564d79c5aa5af1bc45c
# first bad commit: [4ae8d9aa9f9dc7137ea5e564d79c5aa5af1bc45c] sched/deadline: Fix dl_server getting stuck


If you happen to fix this, please add below tag.


Reported-by: Venkat Rao Bagalkote <[email protected]>


Regards,

Venkat.




Reply via email to