On Mon, Jun 2, 2025 at 9:58 PM Rafael J. Wysocki <raf...@kernel.org> wrote: > > On Mon, Jun 2, 2025 at 5:22 PM Mario Limonciello <supe...@kernel.org> wrote: > > > > On 6/2/2025 9:29 AM, Rafael J. Wysocki wrote: > > > On Mon, Jun 2, 2025 at 2:11 PM Chris Bainbridge > > > <chris.bainbri...@gmail.com> wrote: > > >> > > >> On Fri, Mar 14, 2025 at 02:13:53PM +0100, Rafael J. Wysocki wrote: > > >>> From: Rafael J. Wysocki <rafael.j.wyso...@intel.com> > > >>> > > >>> In analogy with the previous change affecting the resume path, > > >>> make device_suspend() start the async suspend of the device's parent > > >>> after the device itself has been processed and make dpm_suspend() start > > >>> processing "async" leaf devices (that is, devices without children) > > >>> upfront so they don't need to wait for the "sync" devices they don't > > >>> depend on. > > >>> > > >>> On the Dell XPS13 9360 in my office, this change reduces the total > > >>> duration of device suspend by approximately 100 ms (over 20%). > > >>> > > >>> Suggested-by: Saravana Kannan <sarava...@google.com> > > >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com> > > >> > > >> This commit results in memory corruption on suspend/resume with short > > >> suspend duration. > > > > > > What do you mean by short? > > > > The tool he used will program a timer to wake up the system. > > The time he input was programmed for a cycle that was short enough that > > the suspend entry didn't finish and it triggered an aborted suspend. > > So it crashes during resume when the preceding suspend has been aborted IIUC. > > The exact crash mechanism is still unclear to me though. > > > > > > >> Laptop appears to hang and crash is logged to pstore. > > > > > > Interesting that this is only happening on one system. > > > > > > Thanks for the report anyway, I'll look at this shortly. > > > > > >> To reproduce: `amd_s2idle.py --log log --duration 1 --wait 4 --count 30` > > >> > > >> I have reproduced this both with and without Mario's recent suspend fix > > >> https://lore.kernel.org/amd-gfx/20250602014432.3538345-1-supe...@kernel.org/T/#t > > >> > > >> Pstore log (with Mario's fix): > > >> > > >> <6>[ 194.209939] PM: suspend entry (s2idle) > > >> <6>[ 194.409450] Filesystems sync: 0.199 seconds > > >> <6>[ 194.409756] Freezing user space processes > > >> <6>[ 194.411374] Freezing user space processes completed (elapsed 0.001 > > >> seconds) > > >> <6>[ 194.411377] OOM killer disabled. > > >> <6>[ 194.411378] Freezing remaining freezable tasks > > >> <6>[ 194.412517] Freezing remaining freezable tasks completed (elapsed > > >> 0.001 seconds) > > >> <6>[ 194.412520] printk: Suspending console(s) (use no_console_suspend > > >> to debug) > > >> <7>[ 194.663906] PM: suspend of devices aborted after 0.260 msecs > > >> <7>[ 194.663911] PM: start suspend of devices aborted after 251.365 > > >> msecs > > >> <3>[ 194.663913] PM: Some devices failed to suspend, or early wake > > >> event detected > > >> <4>[ 194.663975] i2c i2c-3: Unbalanced pm_runtime_enable! > > >> <4>[ 194.663989] ee1004 3-0050: Attempt to enable runtime PM when it is > > >> blocked > > >> Oops#1 Part6 > > >> <4>[ 194.663991] ee1004 3-0051: Attempt to enable runtime PM when it is > > >> blocked > > >> <4>[ 194.663992] CPU: 5 UID: 0 PID: 121 Comm: kworker/u64:10 Not > > >> tainted 6.15.0-rc1-00006-g032a79431b1c #425 PREEMPT(voluntary) > > >> <4>[ 194.663994] Hardware name: HP HP Pavilion Aero Laptop > > >> 13-be0xxx/8916, BIOS F.17 12/18/2024 > > >> <4>[ 194.663996] Workqueue: async async_run_entry_fn > > >> <4>[ 194.663998] slab kmalloc-2k > > >> <4>[ 194.664000] > > >> <4>[ 194.664000] start ffff99bbe24ac800 pointer offset 408 > > >> <4>[ 194.664001] Call Trace: > > >> <4>[ 194.664002] size 2048 > > >> <3>[ 194.664003] list_add corruption. prev->next should be next > > >> (ffffffff9da75c60), but was ffff99bbd1d94790. (prev=ffff99bbe24ac998). > > >> <4>[ 194.664003] <TASK> > > >> <4>[ 194.664007] dump_stack_lvl+0x6e/0x90 > > >> <4>[ 194.664011] ------------[ cut here ]------------ > > >> <4>[ 194.664011] pm_runtime_enable.cold+0x28/0x48 > > >> <2>[ 194.664011] kernel BUG at lib/list_debug.c:32! > > >> <4>[ 194.664013] device_resume+0x47/0x200 > > >> <4>[ 194.664016] Oops: invalid opcode: 0000 [#1] SMP > > >> <4>[ 194.664017] async_resume+0x1d/0x30 > > >> <4>[ 194.664018] CPU: 2 UID: 0 PID: 2505 Comm: amd_s2idle.py Not > > >> tainted 6.15.0-rc1-00006-g032a79431b1c #425 PREEMPT(voluntary) > > >> <4>[ 194.664019] async_run_entry_fn+0x2e/0x130 > > >> <4>[ 194.664020] Hardware name: HP HP Pavilion Aero Laptop > > >> 13-be0xxx/8916, BIOS F.17 12/18/2024 > > >> <4>[ 194.664021] RIP: 0010:__list_add_valid_or_report+0x90/0xa0 > > >> <4>[ 194.664022] process_one_work+0x22b/0x5b0 > > >> <4>[ 194.664024] Code: e4 8a ff 0f 0b 48 89 f7 48 89 34 24 e8 49 57 c6 > > >> ff 48 8b 34 24 48 c7 c7 70 d1 64 9d 48 8b 16 48 89 f1 48 89 de e8 00 e4 > > >> 8a ff <0f> 0b 90 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 41 54 > > >> Oops#1 Part5 > > >> <4>[ 194.664025] RSP: 0018:ffffc09a45dafb20 EFLAGS: 00010246 > > >> <4>[ 194.664026] worker_thread+0x1da/0x3d0 > > >> <4>[ 194.664027] RAX: 0000000000000075 RBX: ffffffff9da75c60 RCX: > > >> 0000000000000027 > > >> <4>[ 194.664028] RDX: 0000000000000000 RSI: 0000000000000001 RDI: > > >> ffff99becd11de40 > > >> <4>[ 194.664029] RBP: ffffffff9da74c00 R08: 0000000000000000 R09: > > >> 0000000000000000 > > >> <4>[ 194.664029] R10: 0000000000000000 R11: 0000000000000003 R12: > > >> 0000000000000010 > > >> <4>[ 194.664029] ? bh_worker+0x260/0x260 > > >> <4>[ 194.664030] R13: 0000002990e47f3d R14: ffff99bbe24ac998 R15: > > >> ffff99bbe0b67620 > > >> <4>[ 194.664031] FS: 00007fe534bfc080(0000) GS:ffff99bf2ee50000(0000) > > >> knlGS:0000000000000000 > > >> <4>[ 194.664031] kthread+0x10a/0x250 > > >> <4>[ 194.664032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > >> <4>[ 194.664033] CR2: 000055fdfaf910b8 CR3: 000000010d4a9000 CR4: > > >> 0000000000f50ef0 > > >> <4>[ 194.664034] ? kthreads_online_cpu+0x130/0x130 > > >> <4>[ 194.664034] PKRU: 55555554 > > >> <4>[ 194.664035] Call Trace: > > >> <4>[ 194.664036] <TASK> > > >> <4>[ 194.664036] ret_from_fork+0x31/0x50 > > >> <4>[ 194.664037] dpm_resume+0x139/0x350 > > >> <4>[ 194.664039] ? kthreads_online_cpu+0x130/0x130 > > >> <4>[ 194.664041] dpm_resume_end+0x11/0x20 > > >> <4>[ 194.664040] ret_from_fork_asm+0x11/0x20 > > >> <4>[ 194.664042] suspend_devices_and_enter+0x18e/0x9f0 > > >> <4>[ 194.664045] </TASK> > > >> <4>[ 194.664046] pm_suspend.cold+0x22f/0x28f > > >> <4>[ 194.664046] CPU: 4 UID: 0 PID: 115 Comm: kworker/u64:4 Not tainted > > >> 6.15.0-rc1-00006-g032a79431b1c #425 PREEMPT(voluntary) > > >> Oops#1 Part4 > > >> <4>[ 194.664048] state_store+0x6c/0xd0 > > >> <4>[ 194.664049] Hardware name: HP HP Pavilion Aero Laptop > > >> 13-be0xxx/8916, BIOS F.17 12/18/2024 > > >> <4>[ 194.664050] Workqueue: async async_run_entry_fn > > >> <4>[ 194.664051] kernfs_fop_write_iter+0x194/0x250 > > >> <4>[ 194.664052] Call Trace: > > >> <4>[ 194.664052] <TASK> > > >> <4>[ 194.664053] dump_stack_lvl+0x6e/0x90 > > >> <4>[ 194.664054] vfs_write+0x2ac/0x550 > > >> <4>[ 194.664055] pm_runtime_enable.cold+0x28/0x48 > > >> <4>[ 194.664057] device_resume+0x47/0x200 > > >> <4>[ 194.664058] ksys_write+0x71/0xe0 > > >> <4>[ 194.664060] async_resume+0x1d/0x30 > > >> <4>[ 194.664060] do_syscall_64+0x95/0x1a0 > > >> <4>[ 194.664062] async_run_entry_fn+0x2e/0x130 > > >> <4>[ 194.664062] ? lockdep_sys_exit+0x1e/0x90 > > >> <4>[ 194.664064] process_one_work+0x22b/0x5b0 > > >> <4>[ 194.664064] ? trace_hardirqs_on_prepare+0x77/0xa0 > > >> <4>[ 194.664066] ? syscall_exit_to_user_mode+0xb1/0x280 > > >> <4>[ 194.664067] worker_thread+0x1da/0x3d0 > > >> <4>[ 194.664068] ? __mutex_lock+0xdb/0xed0 > > >> <4>[ 194.664070] ? __mutex_lock+0xafb/0xed0 > > >> <4>[ 194.664070] ? bh_worker+0x260/0x260 > > >> <4>[ 194.664072] ? kernfs_fop_llseek+0x35/0xd0 > > >> <4>[ 194.664072] kthread+0x10a/0x250 > > >> <4>[ 194.664073] ? lock_release+0x1ff/0x2a0 > > >> <4>[ 194.664074] ? kthreads_online_cpu+0x130/0x130 > > >> <4>[ 194.664075] ? lock_acquire+0x270/0x2d0 > > >> <4>[ 194.664076] ret_from_fork+0x31/0x50 > > >> <4>[ 194.664077] ? __mutex_unlock_slowpath+0x3c/0x2c0 > > >> <4>[ 194.664078] ? kthreads_online_cpu+0x130/0x130 > > >> <4>[ 194.664079] ? kernfs_fop_llseek+0x77/0xd0 > > >> <4>[ 194.664079] ret_from_fork_asm+0x11/0x20 > > >> <4>[ 194.664081] ? lockdep_sys_exit+0x1e/0x90 > > >> <4>[ 194.664082] ? trace_hardirqs_on_prepare+0x77/0xa0 > > >> Oops#1 Part3 > > >> <4>[ 194.664084] </TASK> > > >> <4>[ 194.664084] ? syscall_exit_to_user_mode+0xb1/0x280 > > >> <4>[ 194.664086] ? do_syscall_64+0xa1/0x1a0 > > >> <4>[ 194.664086] uvcvideo 1-3:1.0: Unbalanced pm_runtime_enable! > > So it looks like this device is resumed even though it has not been suspended.
Or it is resumed for the second time in a row during the same transition. I think I know what is going on and the bug is not in the commit in question. There is a race between dpm_async_resume_children() and the first loop in dpm_resume() which fortunately is only fatal when the preceding suspend transition is aborted. Namely, that loop can call dpm_clear_async_state() for a device after dpm_async_with_cleanup() has run for it, so power.work_in_progress gets cleared and __dpm_async() will allow an async work to be scheduled for the same device once again. Chris, please check if the attached patch helps. I'm going to post it as a fix anyway later today, but it would be good to verify that it is sufficient. Thanks!
--- drivers/base/power/main.c | 7 +++++++ 1 file changed, 7 insertions(+) --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -638,6 +638,13 @@ static void dpm_async_resume_children(struct device *dev, async_func_t func) { /* + * Prevent racing with dpm_clear_async_state() during initial list + * walks in dpm_noirq_resume_devices(), dpm_resume_early(), and + * dpm_resume(). + */ + guard(mutex)(&dpm_list_mtx); + + /* * Start processing "async" children of the device unless it's been * started already for them. *