On 26.09.25 17:26, Janusz Krzysztofik wrote:
> A timer that expires a vgem fence automatically in 10 seconds is now
> released with timer_delete_sync() from fence->ops.release() called on last
> dma_fence_put().  In some scenarios, it can run in IRQ context, which is
> not safe unless TIMER_IRQSAFE is used.  One potentially risky scenario was
> demonstrated in Intel DRM CI trybot, BAT run on machine bat-adlp-6, while
> working on new IGT subtests syncobj_timeline@stress-* as user space
> replacements of some problematic test cases of a dma-fence-chain selftest
> [1].
> 
> [117.004338] ================================
> [117.004340] WARNING: inconsistent lock state
> [117.004342] 6.17.0-rc7-CI_DRM_17270-g7644974e648c+ #1 Tainted: G S   U
> [117.004346] --------------------------------
> [117.004347] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
> [117.004349] swapper/0/0 [HC1[1]:SC1[1]:HE0:SE0] takes:
> [117.004352] ffff888138f86aa8 ((&fence->timer)){?.-.}-{0:0}, at: 
> __timer_delete_sync+0x4b/0x190
> [117.004361] {HARDIRQ-ON-W} state was registered at:
> [117.004363]   lock_acquire+0xc4/0x2e0
> [117.004366]   call_timer_fn+0x80/0x2a0
> [117.004368]   __run_timers+0x231/0x310
> [117.004370]   run_timer_softirq+0x76/0xe0
> [117.004372]   handle_softirqs+0xd4/0x4d0
> [117.004375]   __irq_exit_rcu+0x13f/0x160
> [117.004377]   irq_exit_rcu+0xe/0x20
> [117.004379]   sysvec_apic_timer_interrupt+0xa0/0xc0
> [117.004382]   asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [117.004385]   cpuidle_enter_state+0x12b/0x8a0
> [117.004388]   cpuidle_enter+0x2e/0x50
> [117.004393]   call_cpuidle+0x22/0x60
> [117.004395]   do_idle+0x1fd/0x260
> [117.004398]   cpu_startup_entry+0x29/0x30
> [117.004401]   start_secondary+0x12d/0x160
> [117.004404]   common_startup_64+0x13e/0x141
> [117.004407] irq event stamp: 2282669
> [117.004409] hardirqs last  enabled at (2282668): [<ffffffff8289db71>] 
> _raw_spin_unlock_irqrestore+0x51/0x80
> [117.004414] hardirqs last disabled at (2282669): [<ffffffff82882021>] 
> sysvec_irq_work+0x11/0xc0
> [117.004419] softirqs last  enabled at (2254702): [<ffffffff8289fd00>] 
> __do_softirq+0x10/0x18
> [117.004423] softirqs last disabled at (2254725): [<ffffffff813d4ddf>] 
> __irq_exit_rcu+0x13f/0x160
> [117.004426]
> other info that might help us debug this:
> [117.004429]  Possible unsafe locking scenario:
> [117.004432]        CPU0
> [117.004433]        ----
> [117.004434]   lock((&fence->timer));
> [117.004436]   <Interrupt>
> [117.004438]     lock((&fence->timer));
> [117.004440]
>  *** DEADLOCK ***
> [117.004443] 1 lock held by swapper/0/0:
> [117.004445]  #0: ffffc90000003d50 ((&fence->timer)){?.-.}-{0:0}, at: 
> call_timer_fn+0x7a/0x2a0
> [117.004450]
> stack backtrace:
> [117.004453] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G S   U            
>   6.17.0-rc7-CI_DRM_17270-g7644974e648c+ #1 PREEMPT(voluntary)
> [117.004455] Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER
> [117.004455] Hardware name: Intel Corporation Alder Lake Client 
> Platform/AlderLake-P DDR4 RVP, BIOS RPLPFWI1.R00.4035.A00.2301200723 
> 01/20/2023
> [117.004456] Call Trace:
> [117.004456]  <IRQ>
> [117.004457]  dump_stack_lvl+0x91/0xf0
> [117.004460]  dump_stack+0x10/0x20
> [117.004461]  print_usage_bug.part.0+0x260/0x360
> [117.004463]  mark_lock+0x76e/0x9c0
> [117.004465]  ? register_lock_class+0x48/0x4a0
> [117.004467]  __lock_acquire+0xbc3/0x2860
> [117.004469]  lock_acquire+0xc4/0x2e0
> [117.004470]  ? __timer_delete_sync+0x4b/0x190
> [117.004472]  ? __timer_delete_sync+0x4b/0x190
> [117.004473]  __timer_delete_sync+0x68/0x190
> [117.004474]  ? __timer_delete_sync+0x4b/0x190
> [117.004475]  timer_delete_sync+0x10/0x20
> [117.004476]  vgem_fence_release+0x19/0x30 [vgem]
> [117.004478]  dma_fence_release+0xc1/0x3b0
> [117.004480]  ? dma_fence_release+0xa1/0x3b0
> [117.004481]  dma_fence_chain_release+0xe7/0x130
> [117.004483]  dma_fence_release+0xc1/0x3b0
> [117.004484]  ? _raw_spin_unlock_irqrestore+0x27/0x80
> [117.004485]  dma_fence_chain_irq_work+0x59/0x80
> [117.004487]  irq_work_single+0x75/0xa0
> [117.004490]  irq_work_run_list+0x33/0x60
> [117.004491]  irq_work_run+0x18/0x40
> [117.004493]  __sysvec_irq_work+0x35/0x170
> [117.004494]  sysvec_irq_work+0x47/0xc0
> [117.004496]  asm_sysvec_irq_work+0x1b/0x20
> [117.004497] RIP: 0010:_raw_spin_unlock_irqrestore+0x57/0x80
> [117.004499] Code: 00 75 1c 65 ff 0d d9 34 68 01 74 20 5b 41 5c 5d 31 c0 31 
> d2 31 c9 31 f6 31 ff c3 cc cc cc cc e8 7f 9d d3 fe fb 0f 1f 44 00 00 <eb> d7 
> 0f 1f 44 00 00 5b 41 5c 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3
> [117.004499] RSP: 0018:ffffc90000003cf0 EFLAGS: 00000246
> [117.004500] RAX: 0000000000000000 RBX: ffff888155e94c40 RCX: 0000000000000000
> [117.004501] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [117.004502] RBP: ffffc90000003d00 R08: 0000000000000000 R09: 0000000000000000
> [117.004502] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000246
> [117.004502] R13: 0000000000000001 R14: 0000000000000246 R15: ffff888155e94c80
> [117.004506]  dma_fence_signal+0x49/0xb0
> [117.004507]  ? __pfx_vgem_fence_timeout+0x10/0x10 [vgem]
> [117.004508]  vgem_fence_timeout+0x12/0x20 [vgem]
> [117.004509]  call_timer_fn+0xa1/0x2a0
> [117.004512]  ? __pfx_vgem_fence_timeout+0x10/0x10 [vgem]
> [117.004513]  __run_timers+0x231/0x310
> [117.004514]  ? tmigr_handle_remote+0x2ac/0x560
> [117.004517]  timer_expire_remote+0x46/0x70
> [117.004518]  tmigr_handle_remote+0x433/0x560
> [117.004520]  ? __run_timers+0x239/0x310
> [117.004521]  ? run_timer_softirq+0x21/0xe0
> [117.004522]  ? lock_release+0xce/0x2a0
> [117.004524]  run_timer_softirq+0xcf/0xe0
> [117.004525]  handle_softirqs+0xd4/0x4d0
> [117.004526]  __irq_exit_rcu+0x13f/0x160
> [117.004527]  irq_exit_rcu+0xe/0x20
> [117.004528]  sysvec_apic_timer_interrupt+0xa0/0xc0
> [117.004529]  </IRQ>
> [117.004529]  <TASK>
> [117.004529]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [117.004530] RIP: 0010:cpuidle_enter_state+0x12b/0x8a0
> [117.004532] Code: 48 0f a3 05 97 ce 0e 01 0f 82 2e 03 00 00 31 ff e8 8a 41 
> bd fe 80 7d d0 00 0f 85 11 03 00 00 e8 8b 06 d5 fe fb 0f 1f 44 00 00 <45> 85 
> f6 0f 88 67 02 00 00 4d 63 ee 49 83 fd 0a 0f 83 34 06 00 00
> [117.004532] RSP: 0018:ffffffff83403d88 EFLAGS: 00000246
> [117.004533] RAX: 0000000000000000 RBX: ffff88888f046440 RCX: 0000000000000000
> [117.004533] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [117.004534] RBP: ffffffff83403dd8 R08: 0000000000000000 R09: 0000000000000000
> [117.004534] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff837cbe80
> [117.004534] R13: 0000000000000004 R14: 0000000000000004 R15: 0000001ad1df466b
> [117.004537]  ? cpuidle_enter_state+0x125/0x8a0
> [117.004538]  ? sched_clock_noinstr+0x9/0x10
> [117.004540]  cpuidle_enter+0x2e/0x50
> [117.004542]  call_cpuidle+0x22/0x60
> [117.004542]  do_idle+0x1fd/0x260
> [117.004544]  cpu_startup_entry+0x29/0x30
> [117.004546]  rest_init+0x104/0x200
> [117.004548]  start_kernel+0x93d/0xbd0
> [117.004550]  ? load_ucode_intel_bsp+0x2a/0x90
> [117.004551]  ? sme_unmap_bootdata+0x14/0x80
> [117.004554]  x86_64_start_reservations+0x18/0x30
> [117.004555]  x86_64_start_kernel+0xfd/0x150
> [117.004556]  ? soft_restart_cpu+0x14/0x14
> [117.004558]  common_startup_64+0x13e/0x141
> [117.004560]  </TASK>
> [117.004565] ------------[ cut here ]------------
> [117.004692] WARNING: CPU: 0 PID: 0 at kernel/time/timer.c:1610 
> __timer_delete_sync+0x126/0x190
> [117.004697] Modules linked in: vgem snd_hda_codec_intelhdmi 
> snd_hda_codec_hdmi i915 prime_numbers ttm drm_buddy drm_display_helper cec 
> rc_core i2c_algo_bit hid_sensor_custom hid_sensor_hub hid_generic 
> intel_ishtp_hid hid intel_uncore_frequency intel_uncore_frequency_common 
> x86_pkg_temp_thermal intel_powerclamp cmdlinepart ee1004 r8153_ecm spi_nor 
> coretemp cdc_ether mei_pxp mei_hdcp usbnet mtd intel_rapl_msr wmi_bmof 
> kvm_intel snd_hda_intel snd_intel_dspcfg processor_thermal_device_pci kvm 
> snd_hda_codec processor_thermal_device irqbypass processor_thermal_wt_hint 
> polyval_clmulni platform_temperature_control snd_hda_core ghash_clmulni_intel 
> processor_thermal_rfim spi_pxa2xx_platform snd_hwdep aesni_intel 
> processor_thermal_rapl dw_dmac snd_pcm dw_dmac_core intel_rapl_common r8152 
> rapl mii intel_cstate spi_pxa2xx_core i2c_i801 processor_thermal_wt_req 
> snd_timer i2c_mux mei_me intel_ish_ipc processor_thermal_power_floor e1000e 
> snd i2c_smbus spi_intel_pci processor_thermal_mbox mei soundcore intel_ishtp 
> thunderbolt idma64
> [117.004733]  spi_intel int340x_thermal_zone igen6_edac binfmt_misc 
> intel_skl_int3472_tps68470 intel_pmc_core tps68470_regulator video 
> clk_tps68470 pmt_telemetry pmt_discovery nls_iso8859_1 pmt_class 
> intel_pmc_ssram_telemetry intel_skl_int3472_discrete int3400_thermal 
> intel_hid intel_skl_int3472_common acpi_thermal_rel intel_vsec wmi 
> pinctrl_tigerlake acpi_tad sparse_keymap acpi_pad dm_multipath msr 
> nvme_fabrics fuse efi_pstore nfnetlink autofs4
> [117.004782] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G S   U            
>   6.17.0-rc7-CI_DRM_17270-g7644974e648c+ #1 PREEMPT(voluntary)
> [117.004787] Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER
> [117.004789] Hardware name: Intel Corporation Alder Lake Client 
> Platform/AlderLake-P DDR4 RVP, BIOS RPLPFWI1.R00.4035.A00.2301200723 
> 01/20/2023
> [117.004793] RIP: 0010:__timer_delete_sync+0x126/0x190
> [117.004795] Code: 31 c0 45 31 c9 c3 cc cc cc cc 48 8b 75 d0 45 84 f6 74 63 
> 49 c7 45 18 00 00 00 00 48 89 c7 e8 51 46 39 01 f3 90 e9 66 ff ff ff <0f> 0b 
> e9 5f ff ff ff e8 ee e4 0c 00 49 8d 5d 28 45 31 c9 31 c9 4c
> [117.004801] RSP: 0018:ffffc90000003a40 EFLAGS: 00010046
> [117.004804] RAX: ffffffff815093fb RBX: ffff888138f86aa8 RCX: 0000000000000000
> [117.004807] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [117.004809] RBP: ffffc90000003a70 R08: 0000000000000000 R09: 0000000000000000
> [117.004812] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff815093fb
> [117.004814] R13: ffff888138f86a80 R14: 0000000000000000 R15: 0000000000000000
> [117.004817] FS:  0000000000000000(0000) GS:ffff88890b0f7000(0000) 
> knlGS:0000000000000000
> [117.004820] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [117.004823] CR2: 00005db8131eb7f0 CR3: 0000000003448000 CR4: 0000000000f52ef0
> [117.004826] PKRU: 55555554
> [117.004827] Call Trace:
> [117.004829]  <IRQ>
> [117.004831]  timer_delete_sync+0x10/0x20
> [117.004833]  vgem_fence_release+0x19/0x30 [vgem]
> [117.004836]  dma_fence_release+0xc1/0x3b0
> [117.004838]  ? dma_fence_release+0xa1/0x3b0
> [117.004841]  dma_fence_chain_release+0xe7/0x130
> [117.004844]  dma_fence_release+0xc1/0x3b0
> [117.004847]  ? _raw_spin_unlock_irqrestore+0x27/0x80
> [117.004850]  dma_fence_chain_irq_work+0x59/0x80
> [117.004853]  irq_work_single+0x75/0xa0
> [117.004857]  irq_work_run_list+0x33/0x60
> [117.004860]  irq_work_run+0x18/0x40
> [117.004863]  __sysvec_irq_work+0x35/0x170
> [117.004865]  sysvec_irq_work+0x47/0xc0
> [117.004868]  asm_sysvec_irq_work+0x1b/0x20
> [117.004871] RIP: 0010:_raw_spin_unlock_irqrestore+0x57/0x80
> [117.004874] Code: 00 75 1c 65 ff 0d d9 34 68 01 74 20 5b 41 5c 5d 31 c0 31 
> d2 31 c9 31 f6 31 ff c3 cc cc cc cc e8 7f 9d d3 fe fb 0f 1f 44 00 00 <eb> d7 
> 0f 1f 44 00 00 5b 41 5c 5d 31 c0 31 d2 31 c9 31 f6 31 ff c3
> [117.004879] RSP: 0018:ffffc90000003cf0 EFLAGS: 00000246
> [117.004882] RAX: 0000000000000000 RBX: ffff888155e94c40 RCX: 0000000000000000
> [117.004884] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [117.004887] RBP: ffffc90000003d00 R08: 0000000000000000 R09: 0000000000000000
> [117.004890] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000246
> [117.004892] R13: 0000000000000001 R14: 0000000000000246 R15: ffff888155e94c80
> [117.004897]  dma_fence_signal+0x49/0xb0
> [117.004899]  ? __pfx_vgem_fence_timeout+0x10/0x10 [vgem]
> [117.004902]  vgem_fence_timeout+0x12/0x20 [vgem]
> [117.004904]  call_timer_fn+0xa1/0x2a0
> [117.004908]  ? __pfx_vgem_fence_timeout+0x10/0x10 [vgem]
> [117.004910]  __run_timers+0x231/0x310
> [117.004913]  ? tmigr_handle_remote+0x2ac/0x560
> [117.004917]  timer_expire_remote+0x46/0x70
> [117.004919]  tmigr_handle_remote+0x433/0x560
> [117.004923]  ? __run_timers+0x239/0x310
> [117.004925]  ? run_timer_softirq+0x21/0xe0
> [117.004928]  ? lock_release+0xce/0x2a0
> [117.004931]  run_timer_softirq+0xcf/0xe0
> [117.004933]  handle_softirqs+0xd4/0x4d0
> [117.004936]  __irq_exit_rcu+0x13f/0x160
> [117.004938]  irq_exit_rcu+0xe/0x20
> [117.004940]  sysvec_apic_timer_interrupt+0xa0/0xc0
> [117.004943]  </IRQ>
> [117.004944]  <TASK>
> [117.004946]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
> [117.004949] RIP: 0010:cpuidle_enter_state+0x12b/0x8a0
> [117.004953] Code: 48 0f a3 05 97 ce 0e 01 0f 82 2e 03 00 00 31 ff e8 8a 41 
> bd fe 80 7d d0 00 0f 85 11 03 00 00 e8 8b 06 d5 fe fb 0f 1f 44 00 00 <45> 85 
> f6 0f 88 67 02 00 00 4d 63 ee 49 83 fd 0a 0f 83 34 06 00 00
> [117.004961] RSP: 0018:ffffffff83403d88 EFLAGS: 00000246
> [117.004963] RAX: 0000000000000000 RBX: ffff88888f046440 RCX: 0000000000000000
> [117.004966] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> [117.004968] RBP: ffffffff83403dd8 R08: 0000000000000000 R09: 0000000000000000
> [117.004971] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff837cbe80
> [117.004974] R13: 0000000000000004 R14: 0000000000000004 R15: 0000001ad1df466b
> [117.004978]  ? cpuidle_enter_state+0x125/0x8a0
> [117.004981]  ? sched_clock_noinstr+0x9/0x10
> [117.004985]  cpuidle_enter+0x2e/0x50
> [117.004989]  call_cpuidle+0x22/0x60
> [117.004991]  do_idle+0x1fd/0x260
> [117.005001]  cpu_startup_entry+0x29/0x30
> [117.005004]  rest_init+0x104/0x200
> [117.005008]  start_kernel+0x93d/0xbd0
> [117.005011]  ? load_ucode_intel_bsp+0x2a/0x90
> [117.005014]  ? sme_unmap_bootdata+0x14/0x80
> [117.005017]  x86_64_start_reservations+0x18/0x30
> [117.005020]  x86_64_start_kernel+0xfd/0x150
> [117.005023]  ? soft_restart_cpu+0x14/0x14
> [117.005026]  common_startup_64+0x13e/0x141
> [117.005030]  </TASK>
> [117.005032] irq event stamp: 2282669
> [117.005034] hardirqs last  enabled at (2282668): [<ffffffff8289db71>] 
> _raw_spin_unlock_irqrestore+0x51/0x80
> [117.005038] hardirqs last disabled at (2282669): [<ffffffff82882021>] 
> sysvec_irq_work+0x11/0xc0
> [117.005043] softirqs last  enabled at (2254702): [<ffffffff8289fd00>] 
> __do_softirq+0x10/0x18
> [117.005047] softirqs last disabled at (2254725): [<ffffffff813d4ddf>] 
> __irq_exit_rcu+0x13f/0x160
> [117.005051] ---[ end trace 0000000000000000 ]---
> 
> Make the timer IRQ safe.
> 
> [1] https://patchwork.freedesktop.org/series/154987/#rev2
> 
> Fixes: 4077798484459 ("drm/vgem: Attach sw fences to exported vGEM dma-buf 
> (ioctl)")
> Signed-off-by: Janusz Krzysztofik <[email protected]>

Reviewed-by: Christian König <[email protected]>

We should also consider to lower the timeout massively. This needs to be in the 
range of 100m-1s to not cause the same trouble as the sw_sync framework.

10seconds is just way to long for that.

Regards,
Christian.

> ---
>  drivers/gpu/drm/vgem/vgem_fence.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/vgem/vgem_fence.c 
> b/drivers/gpu/drm/vgem/vgem_fence.c
> index fd76730fd38c0..07db319c3d7f9 100644
> --- a/drivers/gpu/drm/vgem/vgem_fence.c
> +++ b/drivers/gpu/drm/vgem/vgem_fence.c
> @@ -79,7 +79,7 @@ static struct dma_fence *vgem_fence_create(struct vgem_file 
> *vfile,
>       dma_fence_init(&fence->base, &vgem_fence_ops, &fence->lock,
>                      dma_fence_context_alloc(1), 1);
>  
> -     timer_setup(&fence->timer, vgem_fence_timeout, 0);
> +     timer_setup(&fence->timer, vgem_fence_timeout, TIMER_IRQSAFE);
>  
>       /* We force the fence to expire within 10s to prevent driver hangs */
>       mod_timer(&fence->timer, jiffies + VGEM_FENCE_TIMEOUT);

Reply via email to