On Thursday, 18 May 2023 12:52:24 CEST David Reviejo wrote: > Seems to be an amdgpu bug introduced two or three kernel releases ago, as > you can see googling around; for example here: > > https://bugzilla.redhat.com/show_bug.cgi?id=2191739 > > or here: > > https://gitlab.freedesktop.org/drm/amd/-/issues/2522 > > If it's this bug, the fix seems to be in yesterday's latest upstream > kernel update: 6.3.3 (and 6.1.29 for the stable longterm).
I _think_ I got the right commit for the 6.3 branch attached. Para 4.5(.2) of the Debian Kernel Handbook describes how to test a simple patch: https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official Could you try that and see whether it indeed fixes your issue?
>From c5123c193696bf97fdf259c825ebfac517b54e44 Mon Sep 17 00:00:00 2001 From: Guchun Chen <guchun.c...@amd.com> Date: Sat, 6 May 2023 16:52:59 +0800 Subject: drm/amdgpu: disable sdma ecc irq only when sdma RAS is enabled in suspend commit 8b229ada2669b74fdae06c83fbfda5a5a99fc253 upstream. sdma_v4_0_ip is shared on a few asics, but in sdma_v4_0_hw_fini, driver unconditionally disables ecc_irq which is only enabled on those asics enabling sdma ecc. This will introduce a warning in suspend cycle on those chips with sdma ip v4.0, while without sdma ecc. So this patch correct this. [ 7283.166354] RIP: 0010:amdgpu_irq_put+0x45/0x70 [amdgpu] [ 7283.167001] RSP: 0018:ffff9a5fc3967d08 EFLAGS: 00010246 [ 7283.167019] RAX: ffff98d88afd3770 RBX: 0000000000000001 RCX: 0000000000000000 [ 7283.167023] RDX: 0000000000000000 RSI: ffff98d89da30390 RDI: ffff98d89da20000 [ 7283.167025] RBP: ffff98d89da20000 R08: 0000000000036838 R09: 0000000000000006 [ 7283.167028] R10: ffffd5764243c008 R11: 0000000000000000 R12: ffff98d89da30390 [ 7283.167030] R13: ffff98d89da38978 R14: ffffffff999ae15a R15: ffff98d880130105 [ 7283.167032] FS: 0000000000000000(0000) GS:ffff98d996f00000(0000) knlGS:0000000000000000 [ 7283.167036] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7283.167039] CR2: 00000000f7a9d178 CR3: 00000001c42ea000 CR4: 00000000003506e0 [ 7283.167041] Call Trace: [ 7283.167046] <TASK> [ 7283.167048] sdma_v4_0_hw_fini+0x38/0xa0 [amdgpu] [ 7283.167704] amdgpu_device_ip_suspend_phase2+0x101/0x1a0 [amdgpu] [ 7283.168296] amdgpu_device_suspend+0x103/0x180 [amdgpu] [ 7283.168875] amdgpu_pmops_freeze+0x21/0x60 [amdgpu] [ 7283.169464] pci_pm_freeze+0x54/0xc0 Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2522 Signed-off-by: Guchun Chen <guchun.c...@amd.com> Reviewed-by: Tao Zhou <tao.zh...@amd.com> Signed-off-by: Alex Deucher <alexander.deuc...@amd.com> Cc: sta...@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gre...@linuxfoundation.org> --- drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c index b5affba221569..8b8ddf0502661 100644 --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c @@ -1903,9 +1903,11 @@ static int sdma_v4_0_hw_fini(void *handle) return 0; } - for (i = 0; i < adev->sdma.num_instances; i++) { - amdgpu_irq_put(adev, &adev->sdma.ecc_irq, - AMDGPU_SDMA_IRQ_INSTANCE0 + i); + if (amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__SDMA)) { + for (i = 0; i < adev->sdma.num_instances; i++) { + amdgpu_irq_put(adev, &adev->sdma.ecc_irq, + AMDGPU_SDMA_IRQ_INSTANCE0 + i); + } } sdma_v4_0_ctx_switch_enable(adev, false); -- cgit
signature.asc
Description: This is a digitally signed message part.