RE: [PATCH] drm/amdgpu: drop drm_firmware_drivers_only()
Reviewed-by: Kent Russell > -Original Message- > From: amd-gfx On Behalf Of Alex > Deucher > Sent: Thursday, March 13, 2025 5:05 PM > To: amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander > Subject: [PATCH] drm/amdgpu: drop drm_firmware_drivers_only() > > There are a number of systems and cloud providers out there > that have nomodeset hardcoded in their kernel parameters > to block nouveau for the nvidia driver. This prevents the > amdgpu driver from loading. Unfortunately the end user cannot > easily change this. The preferred way to block modules from > loading is to use modprobe.blacklist=. That is what > providers should be using to block specific drivers. > > Drop the check to allow the driver to load even when nomodeset > is specified on the kernel command line. > > Signed-off-by: Alex Deucher > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > index 3b34fdd105937..dd86661153582 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c > @@ -3022,9 +3022,6 @@ static int __init amdgpu_init(void) > { > int r; > > - if (drm_firmware_drivers_only()) > - return -EINVAL; > - > r = amdgpu_sync_init(); > if (r) > goto error_sync; > -- > 2.48.1
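For context, the two mechanisms the patch description contrasts look roughly like this (an illustration, not part of the patch): a provider that only wants to keep nouveau away can pass

    modprobe.blacklist=nouveau

on the kernel command line, or ship a modprobe configuration file (for example /etc/modprobe.d/blacklist-nouveau.conf) containing

    blacklist nouveau

whereas nomodeset makes drm_firmware_drivers_only() return true for every driver that checks it, which is exactly the check this patch removes from amdgpu_init().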
Re: [PATCH 5/8] drm/amdgpu: rework how the cleaner shader is emitted v3
On 2/18/2025 9:43 PM, Christian König wrote: Instead of emitting the cleaner shader for every job which has the enforce_isolation flag set only emit it for the first submission from every client. v2: add missing NULL check v3: fix another NULL pointer deref Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 27 -- 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index c9c48b782ec1..5323dba2d688 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -643,6 +643,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job, bool need_pipe_sync) { struct amdgpu_device *adev = ring->adev; + struct amdgpu_isolation *isolation = &adev->isolation[ring->xcp_id]; unsigned vmhub = ring->vm_hub; struct amdgpu_vmid_mgr *id_mgr = &adev->vm_manager.id_mgr[vmhub]; struct amdgpu_vmid *id = &id_mgr->ids[job->vmid]; @@ -650,8 +651,9 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job, bool gds_switch_needed = ring->funcs->emit_gds_switch && job->gds_switch_needed; bool vm_flush_needed = job->vm_needs_flush; - struct dma_fence *fence = NULL; + bool cleaner_shader_needed = false; bool pasid_mapping_needed = false; + struct dma_fence *fence = NULL; unsigned int patch; int r; @@ -674,8 +676,12 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job, pasid_mapping_needed &= adev->gmc.gmc_funcs->emit_pasid_mapping && ring->funcs->emit_wreg; + cleaner_shader_needed = adev->gfx.enable_cleaner_shader && + ring->funcs->emit_cleaner_shader && job->base.s_fence && + &job->base.s_fence->scheduled == isolation->spearhead; + if (!vm_flush_needed && !gds_switch_needed && !need_pipe_sync && - !(job->enforce_isolation && !job->vmid)) + !cleaner_shader_needed) return 0; amdgpu_ring_ib_begin(ring); @@ -686,9 +692,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job, if (need_pipe_sync) amdgpu_ring_emit_pipeline_sync(ring); - if (adev->gfx.enable_cleaner_shader && - ring->funcs->emit_cleaner_shader && - job->enforce_isolation) + if (cleaner_shader_needed) ring->funcs->emit_cleaner_shader(ring); if (vm_flush_needed) { @@ -710,7 +714,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job, job->oa_size); } - if (vm_flush_needed || pasid_mapping_needed) { + if (vm_flush_needed || pasid_mapping_needed || cleaner_shader_needed) { r = amdgpu_fence_emit(ring, &fence, NULL, 0); if (r) return r; @@ -732,6 +736,17 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job, id->pasid_mapping = dma_fence_get(fence); mutex_unlock(&id_mgr->lock); } + + /* +* Make sure that all other submissions wait for the cleaner shader to +* finish before we push them to the HW. +*/ + if (cleaner_shader_needed) { + mutex_lock(&adev->enforce_isolation_mutex); Should we need to ensure spearhead is valid before putting? if (isolation->spearhead) dma_fence_put(isolation->spearhead); + dma_fence_put(isolation->spearhead); + isolation->spearhead = dma_fence_get(fence); + mutex_unlock(&adev->enforce_isolation_mutex); + } dma_fence_put(fence); amdgpu_ring_patch_cond_exec(ring, patch);
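To make the fence handling above easier to follow in flattened form, here is a minimal standalone sketch of the spearhead update pattern; the roles of the field and the mutex mirror the patch, but the struct and helper below are illustrative, not driver code:

#include <linux/dma-fence.h>
#include <linux/mutex.h>

/* illustrative stand-in for the per-partition isolation state in the patch */
struct isolation_state {
	struct dma_fence *spearhead;	/* fence of the last cleaner shader flush */
};

static void isolation_update_spearhead(struct isolation_state *iso,
				       struct mutex *lock,
				       struct dma_fence *flush_fence)
{
	mutex_lock(lock);
	/* dma_fence_put() tolerates NULL, so the very first update needs no extra check */
	dma_fence_put(iso->spearhead);
	/* keep a reference for as long as the fence is cached here */
	iso->spearhead = dma_fence_get(flush_fence);
	mutex_unlock(lock);
}

Every later submission then only has to synchronize to isolation->spearhead before it is pushed to the hardware, which is what the comment in the hunk describes.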
Re: [PATCH 1/8] drm/amdgpu: grab an additional reference on the gang fence v2
On 3/7/2025 7:18 PM, Christian König wrote: We keep the gang submission fence around in adev, make sure that it stays alive. v2: fix memory leak on retry Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 198d29faa754..337543ec615c 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -6889,18 +6889,26 @@ struct dma_fence *amdgpu_device_switch_gang(struct amdgpu_device *adev, { struct dma_fence *old = NULL; + dma_fence_get(gang); do { dma_fence_put(old); old = amdgpu_device_get_gang(adev); if (old == gang) break; - if (!dma_fence_is_signaled(old)) + if (!dma_fence_is_signaled(old)) { Here, should we need to check ? // Check if old fence isn't signaled if (old && !dma_fence_is_signaled(old)) { + dma_fence_put(gang); return old; + } } while (cmpxchg((struct dma_fence __force **)&adev->gang_submit, old, gang) != old); + /* +* Drop it once for the exchanged reference in adev and once for the +* thread local reference acquired in amdgpu_device_get_gang(). +*/ + dma_fence_put(old); if (old) dma_fence_put(old); // Ensure to release old reference only if it is valid? dma_fence_put(old); return NULL; }
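Because the review comments above are interleaved with the diff, here is a cleaned-up sketch of the loop as the commit message describes it; get_slot() stands in for amdgpu_device_get_gang() and 'slot' for adev->gang_submit, both illustrative names, and the patch itself remains authoritative:

#include <linux/atomic.h>
#include <linux/dma-fence.h>

/* hypothetical helper standing in for amdgpu_device_get_gang() */
struct dma_fence *get_slot(struct dma_fence __rcu **slot);

static struct dma_fence *switch_gang_sketch(struct dma_fence __rcu **slot,
					    struct dma_fence *gang)
{
	struct dma_fence *old = NULL;

	dma_fence_get(gang);			/* reference that will live in *slot */
	do {
		dma_fence_put(old);
		old = get_slot(slot);		/* thread-local reference */
		if (old == gang)
			break;
		if (old && !dma_fence_is_signaled(old)) {
			dma_fence_put(gang);	/* the exchange is not going to happen */
			return old;		/* caller owns this reference */
		}
	} while (cmpxchg((struct dma_fence __force **)slot, old, gang) != old);

	/* one put for the reference that used to sit in *slot, one for the local one */
	dma_fence_put(old);
	dma_fence_put(old);
	return NULL;
}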
Re: commit 7ffb791423c7 breaks steam game
On 3/14/25 09:22, Bert Karwatzki wrote: > On Friday, 14.03.2025 at 08:54 +1100, Balbir Singh wrote: >> On 3/14/25 05:12, Bert Karwatzki wrote: >>> On Thursday, 13.03.2025 at 22:47 +1100, Balbir Singh wrote: Anyway, I think the nokaslr result is interesting, it seems like with nokaslr even the older kernels have problems with the game Could you confirm if with nokaslr >>> Now I've tested kernel 6.8.12 with nokaslr >>> 1. Only one single game stellaris is not working? 2. The entire laptop does not work? 3. Laptop works and other games work? Just one game is not working as >>> expected? >>> >>> >>> Stellaris is showing the input lag and the entire graphical user interface >>> shows >>> the same input lag as long as stellaris is running. >>> Civilization 6 shows the same input lag as stellaris, probably even worse. >>> Magic the Gathering: Arena (with wine) works normally. >>> Valheim also works normally. >>> Crusader Kings 2 works normally >>> Rogue Heroes: Ruins of Tasos (a Zelda lookalike) works normally. >>> Baldur's Gate I & II and Icewind Dale work normally. >>> >>> Also the input lag is only in the GUI, if I switch to a text console (ctrl >>> + alt >>> + Fn), input works normally even while the affected games are running. >>> >>> Games aside everything else (e.g. compiling kernels) seems to work with >>> nokaslr. >>> >> >> Would it be fair to assume that anything Xorg/Wayland is working fine >> when the game is not running, even with nokaslr? >> > Yes, Xorg (I'm normally using xfce4 as desktop) works fine. I also tested with > gnome using Xwayland, here the buggy behaviour also exists, with the addition > that mouse position is off, i.e. to click a button in the game you have to > click > somewhat above it. So the issue is narrowed down to just the games you've mentioned with nokaslr/patch? > >> +amd-gfx@lists.freedesktop.org to see if there are known issues with >> nokaslr and the games you mentioned. >> >> >> Balbir Singh >> >> PS: I came across an interesting link >> https://www.alex-ionescu.com/behind-windows-x64s-44-bit-memory-addressing-limit/ >> >> I think SLIST_HEADER is used by wine as well for user space and I am not sure >> if in this situation the game is hitting this scenario, but surprisingly the >> other >> games are not. This is assuming the game uses wine. I am not sure it's >> related, >> but the 44 bits caught my attention. > > Stellaris is a native linux game (x86_64), the one game (MTGA) I tested with > wine worked fine. > Thanks for the update! I wonder if there are any game logs. If you can look for any warnings/errors it might be interesting to see where the difference is coming from? > By the way, the warning > [ T8562] WARNING: CPU: 14 PID: 8562 at mm/slub.c:5028 > __kvmalloc_node_noprof+0x2fd/0x360 > that appeared in the dmesg I sent you is caused by the upgrade of mesa from > 25.0.0 to 25.0.1. (I'm still bisecting ...) > Thanks! Balbir
Re: [PATCH 5/8] drm/amdgpu: rework how the cleaner shader is emitted v3
On 3/7/2025 7:18 PM, Christian König wrote: Instead of emitting the cleaner shader for every job which has the enforce_isolation flag set only emit it for the first submission from every client. v2: add missing NULL check v3: fix another NULL pointer deref Signed-off-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 27 -- 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index ef4fe2df8398..dc10bea836db 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -643,6 +643,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job, bool need_pipe_sync) { struct amdgpu_device *adev = ring->adev; + struct amdgpu_isolation *isolation = &adev->isolation[ring->xcp_id]; unsigned vmhub = ring->vm_hub; struct amdgpu_vmid_mgr *id_mgr = &adev->vm_manager.id_mgr[vmhub]; struct amdgpu_vmid *id = &id_mgr->ids[job->vmid]; @@ -650,8 +651,9 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job, bool gds_switch_needed = ring->funcs->emit_gds_switch && job->gds_switch_needed; bool vm_flush_needed = job->vm_needs_flush; - struct dma_fence *fence = NULL; + bool cleaner_shader_needed = false; bool pasid_mapping_needed = false; + struct dma_fence *fence = NULL; unsigned int patch; int r; @@ -674,8 +676,12 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job, pasid_mapping_needed &= adev->gmc.gmc_funcs->emit_pasid_mapping && ring->funcs->emit_wreg; + cleaner_shader_needed = adev->gfx.enable_cleaner_shader && + ring->funcs->emit_cleaner_shader && job->base.s_fence && + &job->base.s_fence->scheduled == isolation->spearhead; + if (!vm_flush_needed && !gds_switch_needed && !need_pipe_sync && - !(job->enforce_isolation && !job->vmid)) + !cleaner_shader_needed) return 0; amdgpu_ring_ib_begin(ring); @@ -686,9 +692,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job, if (need_pipe_sync) amdgpu_ring_emit_pipeline_sync(ring); - if (adev->gfx.enable_cleaner_shader && - ring->funcs->emit_cleaner_shader && - job->enforce_isolation) + if (cleaner_shader_needed) Here should we also need to check, for ring->funcs->emit_cleaner_shader? if (cleaner_shader_needed && ring->funcs->emit_cleaner_shader) ring->funcs->emit_cleaner_shader(ring); if (vm_flush_needed) { @@ -710,7 +714,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job, job->oa_size); } - if (vm_flush_needed || pasid_mapping_needed) { + if (vm_flush_needed || pasid_mapping_needed || cleaner_shader_needed) { r = amdgpu_fence_emit(ring, &fence, NULL, 0); if (r) return r; @@ -732,6 +736,17 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job, id->pasid_mapping = dma_fence_get(fence); mutex_unlock(&id_mgr->lock); } + + /* +* Make sure that all other submissions wait for the cleaner shader to +* finish before we push them to the HW. +*/ + if (cleaner_shader_needed) { + mutex_lock(&adev->enforce_isolation_mutex); + dma_fence_put(isolation->spearhead); + isolation->spearhead = dma_fence_get(fence); + mutex_unlock(&adev->enforce_isolation_mutex); + } dma_fence_put(fence); amdgpu_ring_patch_cond_exec(ring, patch);
Re: [PATCH] drm/amd/display: avoid NPD when ASIC does not support DMUB
On 2025-03-13 07:29, Thadeu Lima de Souza Cascardo wrote: On Wed, Feb 05, 2025 at 10:06:38AM -0300, Thadeu Lima de Souza Cascardo wrote: ctx->dmub_srv will de NULL if the ASIC does not support DMUB, which is tested in dm_dmub_sw_init. However, it will be dereferenced in dmub_hw_lock_mgr_cmd if should_use_dmub_lock returns true. This has been the case since dmub support has been added for PSR1. This bug has landed on stable trees. Any chance for a review here? Thanks. Cascardo. Thanks for the ping and fix! Reviewed-by: Leo Li Fix this by checking for dmub_srv in should_use_dmub_lock. [ 37.440832] BUG: kernel NULL pointer dereference, address: 0058 [ 37.447808] #PF: supervisor read access in kernel mode [ 37.452959] #PF: error_code(0x) - not-present page [ 37.458112] PGD 0 P4D 0 [ 37.460662] Oops: Oops: [#1] PREEMPT SMP NOPTI [ 37.465553] CPU: 2 UID: 1000 PID: 1745 Comm: DrmThread Not tainted 6.14.0-rc1-3-gd62e938120f0 #23 99720e1cb1e0fc4773b8513150932a07de3c6e88 [ 37.478324] Hardware name: Google Morphius/Morphius, BIOS Google_Morphius.13434.858.0 10/26/2023 [ 37.487103] RIP: 0010:dmub_hw_lock_mgr_cmd+0x77/0xb0 [ 37.492074] Code: 44 24 0e 00 00 00 00 48 c7 04 24 45 00 00 0c 40 88 74 24 0d 0f b6 02 88 44 24 0c 8b 01 89 44 24 08 85 f6 75 05 c6 44 24 0e 01 <48> 8b 7f 58 48 89 e6 ba 01 00 00 00 e8 08 3c 2a 00 65 48 8b 04 5 [ 37.510822] RSP: 0018:969442853300 EFLAGS: 00010202 [ 37.516052] RAX: RBX: 92db0300 RCX: 969442853358 [ 37.523185] RDX: 969442853368 RSI: 0001 RDI: [ 37.530322] RBP: 0001 R08: 04a7 R09: 04a5 [ 37.537453] R10: 0476 R11: 0062 R12: 92db0ade8000 [ 37.544589] R13: 92da01180ae0 R14: 92da011802a8 R15: 92db0300 [ 37.551725] FS: 784a9cdfc6c0() GS:92db2af0() knlGS: [ 37.559814] CS: 0010 DS: ES: CR0: 80050033 [ 37.565562] CR2: 0058 CR3: 000112b1c000 CR4: 003506f0 [ 37.572697] Call Trace: [ 37.575152] [ 37.577258] ? __die_body+0x66/0xb0 [ 37.580756] ? page_fault_oops+0x3e7/0x4a0 [ 37.584861] ? exc_page_fault+0x3e/0xe0 [ 37.588706] ? exc_page_fault+0x5c/0xe0 [ 37.592550] ? asm_exc_page_fault+0x22/0x30 [ 37.596742] ? dmub_hw_lock_mgr_cmd+0x77/0xb0 [ 37.601107] dcn10_cursor_lock+0x1e1/0x240 [ 37.605211] program_cursor_attributes+0x81/0x190 [ 37.609923] commit_planes_for_stream+0x998/0x1ef0 [ 37.614722] update_planes_and_stream_v2+0x41e/0x5c0 [ 37.619703] dc_update_planes_and_stream+0x78/0x140 [ 37.624588] amdgpu_dm_atomic_commit_tail+0x4362/0x49f0 [ 37.629832] ? srso_return_thunk+0x5/0x5f [ 37.633847] ? mark_held_locks+0x6d/0xd0 [ 37.637774] ? _raw_spin_unlock_irq+0x24/0x50 [ 37.642135] ? srso_return_thunk+0x5/0x5f [ 37.646148] ? lockdep_hardirqs_on+0x95/0x150 [ 37.650510] ? srso_return_thunk+0x5/0x5f [ 37.654522] ? _raw_spin_unlock_irq+0x2f/0x50 [ 37.658883] ? srso_return_thunk+0x5/0x5f [ 37.662897] ? wait_for_common+0x186/0x1c0 [ 37.666998] ? srso_return_thunk+0x5/0x5f [ 37.671009] ? drm_crtc_next_vblank_start+0xc3/0x170 [ 37.675983] commit_tail+0xf5/0x1c0 [ 37.679478] drm_atomic_helper_commit+0x2a2/0x2b0 [ 37.684186] drm_atomic_commit+0xd6/0x100 [ 37.688199] ? __cfi___drm_printfn_info+0x10/0x10 [ 37.692911] drm_atomic_helper_update_plane+0xe5/0x130 [ 37.698054] drm_mode_cursor_common+0x501/0x670 [ 37.702600] ? __cfi_drm_mode_cursor_ioctl+0x10/0x10 [ 37.707572] drm_mode_cursor_ioctl+0x48/0x70 [ 37.711851] drm_ioctl_kernel+0xf2/0x150 [ 37.715781] drm_ioctl+0x363/0x590 [ 37.719189] ? __cfi_drm_mode_cursor_ioctl+0x10/0x10 [ 37.724165] amdgpu_drm_ioctl+0x41/0x80 [ 37.728013] __se_sys_ioctl+0x7f/0xd0 [ 37.731685] do_syscall_64+0x87/0x100 [ 37.735355] ? vma_end_read+0x12/0xe0 [ 37.739024] ? 
srso_return_thunk+0x5/0x5f [ 37.743041] ? find_held_lock+0x47/0xf0 [ 37.746884] ? vma_end_read+0x12/0xe0 [ 37.750552] ? srso_return_thunk+0x5/0x5f [ 37.754565] ? lock_release+0x1c4/0x2e0 [ 37.758406] ? vma_end_read+0x12/0xe0 [ 37.762079] ? exc_page_fault+0x84/0xe0 [ 37.765921] ? srso_return_thunk+0x5/0x5f [ 37.769938] ? lockdep_hardirqs_on+0x95/0x150 [ 37.774303] ? srso_return_thunk+0x5/0x5f [ 37.778317] ? exc_page_fault+0x84/0xe0 [ 37.782163] entry_SYSCALL_64_after_hwframe+0x55/0x5d [ 37.787218] RIP: 0033:0x784aa5ec3059 [ 37.790803] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1d 48 8b 45 c8 64 48 2b 04 25 28 00 0 [ 37.809553] RSP: 002b:784a9cdf90e0 EFLAGS: 0246 ORIG_RAX: 0010 [ 37.817121] RAX: ffda RBX: 784a9cdf917c RCX: 784aa5ec3059 [ 37.824256] RDX: 784a9cdf917c RSI: c01c64a3 RDI: 0
Re: [PATCH] drm/amd/display: avoid NPD when ASIC does not support DMUB
On Wed, Feb 05, 2025 at 10:06:38AM -0300, Thadeu Lima de Souza Cascardo wrote: > ctx->dmub_srv will de NULL if the ASIC does not support DMUB, which is > tested in dm_dmub_sw_init. > > However, it will be dereferenced in dmub_hw_lock_mgr_cmd if > should_use_dmub_lock returns true. > > This has been the case since dmub support has been added for PSR1. This bug has landed on stable trees. Any chance for a review here? Thanks. Cascardo. > > Fix this by checking for dmub_srv in should_use_dmub_lock. > > [ 37.440832] BUG: kernel NULL pointer dereference, address: 0058 > [ 37.447808] #PF: supervisor read access in kernel mode > [ 37.452959] #PF: error_code(0x) - not-present page > [ 37.458112] PGD 0 P4D 0 > [ 37.460662] Oops: Oops: [#1] PREEMPT SMP NOPTI > [ 37.465553] CPU: 2 UID: 1000 PID: 1745 Comm: DrmThread Not tainted > 6.14.0-rc1-3-gd62e938120f0 #23 99720e1cb1e0fc4773b8513150932a07de3c6e88 > [ 37.478324] Hardware name: Google Morphius/Morphius, BIOS > Google_Morphius.13434.858.0 10/26/2023 > [ 37.487103] RIP: 0010:dmub_hw_lock_mgr_cmd+0x77/0xb0 > [ 37.492074] Code: 44 24 0e 00 00 00 00 48 c7 04 24 45 00 00 0c 40 88 74 24 > 0d 0f b6 02 88 44 24 0c 8b 01 89 44 24 08 85 f6 75 05 c6 44 24 0e 01 <48> 8b > 7f 58 48 89 e6 ba 01 00 00 00 e8 08 3c 2a 00 65 48 8b 04 5 > [ 37.510822] RSP: 0018:969442853300 EFLAGS: 00010202 > [ 37.516052] RAX: RBX: 92db0300 RCX: > 969442853358 > [ 37.523185] RDX: 969442853368 RSI: 0001 RDI: > > [ 37.530322] RBP: 0001 R08: 04a7 R09: > 04a5 > [ 37.537453] R10: 0476 R11: 0062 R12: > 92db0ade8000 > [ 37.544589] R13: 92da01180ae0 R14: 92da011802a8 R15: > 92db0300 > [ 37.551725] FS: 784a9cdfc6c0() GS:92db2af0() > knlGS: > [ 37.559814] CS: 0010 DS: ES: CR0: 80050033 > [ 37.565562] CR2: 0058 CR3: 000112b1c000 CR4: > 003506f0 > [ 37.572697] Call Trace: > [ 37.575152] > [ 37.577258] ? __die_body+0x66/0xb0 > [ 37.580756] ? page_fault_oops+0x3e7/0x4a0 > [ 37.584861] ? exc_page_fault+0x3e/0xe0 > [ 37.588706] ? exc_page_fault+0x5c/0xe0 > [ 37.592550] ? asm_exc_page_fault+0x22/0x30 > [ 37.596742] ? dmub_hw_lock_mgr_cmd+0x77/0xb0 > [ 37.601107] dcn10_cursor_lock+0x1e1/0x240 > [ 37.605211] program_cursor_attributes+0x81/0x190 > [ 37.609923] commit_planes_for_stream+0x998/0x1ef0 > [ 37.614722] update_planes_and_stream_v2+0x41e/0x5c0 > [ 37.619703] dc_update_planes_and_stream+0x78/0x140 > [ 37.624588] amdgpu_dm_atomic_commit_tail+0x4362/0x49f0 > [ 37.629832] ? srso_return_thunk+0x5/0x5f > [ 37.633847] ? mark_held_locks+0x6d/0xd0 > [ 37.637774] ? _raw_spin_unlock_irq+0x24/0x50 > [ 37.642135] ? srso_return_thunk+0x5/0x5f > [ 37.646148] ? lockdep_hardirqs_on+0x95/0x150 > [ 37.650510] ? srso_return_thunk+0x5/0x5f > [ 37.654522] ? _raw_spin_unlock_irq+0x2f/0x50 > [ 37.658883] ? srso_return_thunk+0x5/0x5f > [ 37.662897] ? wait_for_common+0x186/0x1c0 > [ 37.666998] ? srso_return_thunk+0x5/0x5f > [ 37.671009] ? drm_crtc_next_vblank_start+0xc3/0x170 > [ 37.675983] commit_tail+0xf5/0x1c0 > [ 37.679478] drm_atomic_helper_commit+0x2a2/0x2b0 > [ 37.684186] drm_atomic_commit+0xd6/0x100 > [ 37.688199] ? __cfi___drm_printfn_info+0x10/0x10 > [ 37.692911] drm_atomic_helper_update_plane+0xe5/0x130 > [ 37.698054] drm_mode_cursor_common+0x501/0x670 > [ 37.702600] ? __cfi_drm_mode_cursor_ioctl+0x10/0x10 > [ 37.707572] drm_mode_cursor_ioctl+0x48/0x70 > [ 37.711851] drm_ioctl_kernel+0xf2/0x150 > [ 37.715781] drm_ioctl+0x363/0x590 > [ 37.719189] ? 
__cfi_drm_mode_cursor_ioctl+0x10/0x10 > [ 37.724165] amdgpu_drm_ioctl+0x41/0x80 > [ 37.728013] __se_sys_ioctl+0x7f/0xd0 > [ 37.731685] do_syscall_64+0x87/0x100 > [ 37.735355] ? vma_end_read+0x12/0xe0 > [ 37.739024] ? srso_return_thunk+0x5/0x5f > [ 37.743041] ? find_held_lock+0x47/0xf0 > [ 37.746884] ? vma_end_read+0x12/0xe0 > [ 37.750552] ? srso_return_thunk+0x5/0x5f > [ 37.754565] ? lock_release+0x1c4/0x2e0 > [ 37.758406] ? vma_end_read+0x12/0xe0 > [ 37.762079] ? exc_page_fault+0x84/0xe0 > [ 37.765921] ? srso_return_thunk+0x5/0x5f > [ 37.769938] ? lockdep_hardirqs_on+0x95/0x150 > [ 37.774303] ? srso_return_thunk+0x5/0x5f > [ 37.778317] ? exc_page_fault+0x84/0xe0 > [ 37.782163] entry_SYSCALL_64_after_hwframe+0x55/0x5d > [ 37.787218] RIP: 0033:0x784aa5ec3059 > [ 37.790803] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 > 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <41> 89 > c0 3d 00 f0 ff ff 77 1d 48 8b 45 c8 64 48 2b 04 25 28 00 0 > [ 37.809553] RSP: 002b:784a9cdf90e0 EFLAGS: 0246 ORIG_RAX: > 0010 > [ 37.817121] RAX: ffda RBX: 784a9cdf917c RCX:
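The fix itself is not quoted in this thread; a minimal sketch of the check the commit message describes could look like the following, assuming should_use_dmub_lock() receives the dc_link (an assumption here) and with the existing conditions elided:

/*
 * Sketch only: the real function lives in DC's DMUB hw-lock manager and
 * carries additional PSR/replay conditions.  The point is the early
 * bail-out when the ASIC never set up a DMUB service.
 */
static bool should_use_dmub_lock(struct dc_link *link)
{
	/* ASICs without DMUB support leave ctx->dmub_srv NULL (see dm_dmub_sw_init) */
	if (!link->ctx->dmub_srv)
		return false;

	/* ... existing checks continue here ... */
	return true;
}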
Re: [PATCH] drm/amdgpu/gfx12: correct cleanup of 'me' field with gfx_v12_0_me_fini()
Applied. Thanks! Alex On Wed, Mar 12, 2025 at 6:09 AM Wentao Liang wrote: > > In gfx_v12_0_cp_gfx_load_me_microcode_rs64(), gfx_v12_0_pfp_fini() is > incorrectly used to free 'me' field of 'gfx', since gfx_v12_0_pfp_fini() > can only release 'pfp' field of 'gfx'. The release function of 'me' field > should be gfx_v12_0_me_fini(). > > Fixes: 52cb80c12e8a ("drm/amdgpu: Add gfx v12_0 ip block support (v6)") > Cc: sta...@vger.kernel.org # 6.11+ > Signed-off-by: Wentao Liang > --- > drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c > b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c > index da327ab48a57..02bc2eddf0c0 100644 > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c > @@ -2413,7 +2413,7 @@ static int > gfx_v12_0_cp_gfx_load_me_microcode_rs64(struct amdgpu_device *adev) > (void **)&adev->gfx.me.me_fw_data_ptr); > if (r) { > dev_err(adev->dev, "(%d) failed to create me data bo\n", r); > - gfx_v12_0_pfp_fini(adev); > + gfx_v12_0_me_fini(adev); > return r; > } > > -- > 2.42.0.windows.2 >
Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks
I think as long as the locking is correct, the src shouldn't matter. You just need to stop the kernel queues (and save state) and evict the user queues (since HWS is responsible for saving the MQDs of the non-guilty user queues). If KFD detected the hang (e.g., queue eviction fails when called for memory pressure, etc.), we just need to make sure that it's ok for the sdma reset routine to call evict queues again even if it was already called (presumably it should exit early since the queues are already evicted). If KGD initiates the reset, it will call into KFD to evict queues. We just need to make sure the evict queues function we call just evicts the queues and doesn't also try and reset. Alex On Wed, Mar 12, 2025 at 5:24 AM Zhang, Jesse(Jie) wrote: > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > > > > From: Koenig, Christian > Sent: Wednesday, March 12, 2025 4:39 PM > To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org > Cc: Deucher, Alexander ; Kim, Jonathan > ; Zhu, Jiadong > Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by > removing dynamic callbacks > > > > Am 12.03.25 um 09:15 schrieb Zhang, Jesse(Jie): > > [SNIP9 > > - > > + gfx_ring->funcs->stop_queue(adev, instance_id); > > > > Yeah that starts to look good. Question here is who is calling > amdgpu_sdma_reset_engine()? > > > > If this call comes from engine specific code we might not need the > start/stop_queue callbacks all together. > > > > Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and > start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2. > > > Why would the KFD call this as well? Because it detects an issue with a SDMA > user queue If yes I would rather suggest that the KFD calls the reset > function of the paging queue. > > Since this reset function is specific to the SDMA HW generation anyway you > don't need those extra functions to abstract starting and stopping of the > queue for each HW generation. > > kfd can't call reset function directly, unless we add a parameter src to > distinguish kfd and kgd in reset function, like this: > > int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, int src ); > > As Alex said in another thread, > > We need to distinguish kfd and kgd in reset. > > If kfd triggers a reset, kgd must save healthy jobs and recover jobs after > reset. > > If kgd triggers a reset, kgd must abandon bad jobs after reset.(and perhaps > kfd needs to save its healthy jobs for recovery). > > > > If we can add a parameter, I am ok for that solution too. > > > > Additionally: > > For sdma6/7, when a queue reset fails, we may need to fall back to an engine > reset for a attempt. > > > > Thanks > > Jesse > > > Regards, > Christian. > > > > > > > Thanks > > Jesse > > > > Regards, > > Christian. 
> > > > /* Perform the SDMA reset for the specified instance */ > > ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id); > > if (ret) { > > @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, > uint32_t instance_id, b > > goto exit; > > } > > > > - /* Invoke all registered post_reset callbacks */ > > - list_for_each_entry(funcs, &adev->sdma.reset_callback_list, list) { > > - if (funcs->post_reset) { > > - ret = funcs->post_reset(adev, instance_id); > > - if (ret) { > > - dev_err(adev->dev, > > - "afterReset callback failed for instance %u: > %d\n", > > - instance_id, ret); > > - goto exit; > > - } > > - } > > - } > > + gfx_ring->funcs->start_queue(adev, instance_id); > > > > exit: > > /* Restart the scheduler's work queue for the GFX and page rings > > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > > b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > > index fd34dc138081..c1f7ccff9c4e 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > > @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs > sdma_v4_4_2_ring_funcs = { > > .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait, > > .emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper, > > .reset = sdma_v4_4_2_reset_queue, > > + .stop_queue = sdma_v4_4_2_stop_queue, > > + .start_queue = sdma_v4_4_2_restore_queue, > > .is_guilty = sdma_v4_4_2_ring_is_guilty, }; > > > > > >
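Condensing the hunks quoted in this thread, the simplified reset flow being discussed boils down to the sketch below (built against the amdgpu headers; the scheduler stop/restart and the error paths of the real function are omitted):

/*
 * Sketch of the engine reset path after dropping the dynamic callbacks:
 * stop the queue via the ring ops, reset through the SMU, restore the queue.
 */
static int sdma_reset_engine_sketch(struct amdgpu_device *adev,
				    struct amdgpu_ring *gfx_ring,
				    uint32_t instance_id)
{
	int ret;

	gfx_ring->funcs->stop_queue(adev, instance_id);

	/* Perform the SDMA reset for the specified instance */
	ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id);
	if (ret)
		return ret;

	gfx_ring->funcs->start_queue(adev, instance_id);
	return 0;
}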
Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks
Am 12.03.25 um 10:23 schrieb Zhang, Jesse(Jie): > > [AMD Official Use Only - AMD Internal Distribution Only] > > > > > > > *From:*Koenig, Christian > *Sent:* Wednesday, March 12, 2025 4:39 PM > *To:* Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org > *Cc:* Deucher, Alexander ; Kim, Jonathan > ; Zhu, Jiadong > *Subject:* Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by > removing dynamic callbacks > > > > Am 12.03.25 um 09:15 schrieb Zhang, Jesse(Jie): > > [SNIP9 > > - > > + gfx_ring->funcs->stop_queue(adev, instance_id); > > > > Yeah that starts to look good. Question here is who is calling > amdgpu_sdma_reset_engine()? > > > > If this call comes from engine specific code we might not need the > start/stop_queue callbacks all together. > > > > Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and > start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2. > > > Why would the KFD call this as well? Because it detects an issue with a SDMA > user queue If yes I would rather suggest that the KFD calls the reset > function of the paging queue. > > Since this reset function is specific to the SDMA HW generation anyway you > don't need those extra functions to abstract starting and stopping of the > queue for each HW generation. > > kfd can't call reset function directly, unless we add a parameter src to > distinguish kfd and kgd in reset function, like this: > > int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, */int src/* ); > > As Alex said in another thread, > > We need to distinguish kfd and kgd in reset. > > If kfd triggers a reset, kgd must save healthy jobs and recover jobs after > reset. > > If kgd triggers a reset, kgd must abandon bad jobs after reset.(and perhaps > kfd needs to save its healthy jobs for recovery). > I don't think the source of the reset should be relevant to the reset procedure. The source is basically just the first one who runs into a timeout, that can be both KFD and KGD. But the cause of the timeout is not necessary the one who signals that a timeout happens. So as far as I can see you should not have that as parameter either. > > > If we can add a parameter, I am ok for that solution too. > > > > Additionally: > > For sdma6/7, when a queue reset fails, we may need to fall back to an engine > reset for a attempt. > Yeah, but that should be trivial. Regards, Christian. > > > Thanks > > Jesse > > > Regards, > Christian. > > > > > > > Thanks > > Jesse > > > > Regards, > > Christian. 
> > > > /* Perform the SDMA reset for the specified instance */ > > ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id); > > if (ret) { > > @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct > amdgpu_device *adev, uint32_t instance_id, b > > goto exit; > > } > > > > - /* Invoke all registered post_reset callbacks */ > > - list_for_each_entry(funcs, &adev->sdma.reset_callback_list, > list) { > > - if (funcs->post_reset) { > > - ret = funcs->post_reset(adev, instance_id); > > - if (ret) { > > - dev_err(adev->dev, > > - "afterReset callback failed for > instance %u: %d\n", > > - instance_id, ret); > > - goto exit; > > - } > > - } > > - } > > + gfx_ring->funcs->start_queue(adev, instance_id); > > > > exit: > > /* Restart the scheduler's work queue for the GFX and page rings > > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > > b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > > index fd34dc138081..c1f7ccff9c4e 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c > > @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs > sdma_v4_4_2_ring_funcs = { > > .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait, > > .emit_reg_write_reg_wait = > amdgpu_ring_emit_reg_write_reg_wait_helper, > > .reset = sdma_v4_4_2_reset_queue, > > + .stop_queue = sdma_v4_4_2_stop_queue, > > + .start_queue = sdma_v4_4_2_restore_queue, > > .is_guilty = sdma_v4_4_2_ring_is_guilty, }; > > > > > > >
[PATCH 08/11] drm/amdgpu/gfx12: add support for disable_kq
Plumb in support for disabling kernel queues. v2: use ring counts per Felix' suggestion v3: fix stream fault handler, enable EOP interrupts v4: fix MEC interrupt offset (Sunil) Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 183 + 1 file changed, 125 insertions(+), 58 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c index 34cf187e72d9f..a99507e4fdb27 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c @@ -1421,11 +1421,13 @@ static int gfx_v12_0_sw_init(struct amdgpu_ip_block *ip_block) break; } - /* recalculate compute rings to use based on hardware configuration */ - num_compute_rings = (adev->gfx.mec.num_pipe_per_mec * -adev->gfx.mec.num_queue_per_pipe) / 2; - adev->gfx.num_compute_rings = min(adev->gfx.num_compute_rings, - num_compute_rings); + if (adev->gfx.num_compute_rings) { + /* recalculate compute rings to use based on hardware configuration */ + num_compute_rings = (adev->gfx.mec.num_pipe_per_mec * +adev->gfx.mec.num_queue_per_pipe) / 2; + adev->gfx.num_compute_rings = min(adev->gfx.num_compute_rings, + num_compute_rings); + } /* EOP Event */ r = amdgpu_irq_add_id(adev, SOC21_IH_CLIENTID_GRBM_CP, @@ -1471,37 +1473,41 @@ static int gfx_v12_0_sw_init(struct amdgpu_ip_block *ip_block) return r; } - /* set up the gfx ring */ - for (i = 0; i < adev->gfx.me.num_me; i++) { - for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) { - for (k = 0; k < adev->gfx.me.num_pipe_per_me; k++) { - if (!amdgpu_gfx_is_me_queue_enabled(adev, i, k, j)) - continue; - - r = gfx_v12_0_gfx_ring_init(adev, ring_id, - i, k, j); - if (r) - return r; - ring_id++; + if (adev->gfx.num_gfx_rings) { + /* set up the gfx ring */ + for (i = 0; i < adev->gfx.me.num_me; i++) { + for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) { + for (k = 0; k < adev->gfx.me.num_pipe_per_me; k++) { + if (!amdgpu_gfx_is_me_queue_enabled(adev, i, k, j)) + continue; + + r = gfx_v12_0_gfx_ring_init(adev, ring_id, + i, k, j); + if (r) + return r; + ring_id++; + } } } } - ring_id = 0; - /* set up the compute queues - allocate horizontally across pipes */ - for (i = 0; i < adev->gfx.mec.num_mec; ++i) { - for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) { - for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; k++) { - if (!amdgpu_gfx_is_mec_queue_enabled(adev, - 0, i, k, j)) - continue; + if (adev->gfx.num_compute_rings) { + ring_id = 0; + /* set up the compute queues - allocate horizontally across pipes */ + for (i = 0; i < adev->gfx.mec.num_mec; ++i) { + for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) { + for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; k++) { + if (!amdgpu_gfx_is_mec_queue_enabled(adev, +0, i, k, j)) + continue; - r = gfx_v12_0_compute_ring_init(adev, ring_id, - i, k, j); - if (r) - return r; + r = gfx_v12_0_compute_ring_init(adev, ring_id, + i, k, j); + if (r) + return r; - ring_id++; + ring_id++; + } }
Re: [PATCH 1/2] drm: Create an app info option for wedge events
On Wed, Mar 12, 2025 at 06:59:33PM -0300, André Almeida wrote: > Em 12/03/2025 07:06, Raag Jadav escreveu: > > On Tue, Mar 11, 2025 at 07:09:45PM +0200, Raag Jadav wrote: > > > On Mon, Mar 10, 2025 at 06:27:53PM -0300, André Almeida wrote: > > > > Em 01/03/2025 02:53, Raag Jadav escreveu: > > > > > On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote: > > > > > > Hi Raag, > > > > > > > > > > > > On 2/28/25 11:20, Raag Jadav wrote: > > > > > > > Cc: Lucas > > > > > > > > > > > > > > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote: > > > > > > > > When a device get wedged, it might be caused by a guilty > > > > > > > > application. > > > > > > > > For userspace, knowing which app was the cause can be useful > > > > > > > > for some > > > > > > > > situations, like for implementing a policy, logs or for giving > > > > > > > > a chance > > > > > > > > for the compositor to let the user know what app caused the > > > > > > > > problem. > > > > > > > > This is an optional argument, when `PID=-1` there's no > > > > > > > > information about > > > > > > > > the app caused the problem, or if any app was involved during > > > > > > > > the hang. > > > > > > > > > > > > > > > > Sometimes just the PID isn't enough giving that the app might > > > > > > > > be already > > > > > > > > dead by the time userspace will try to check what was this > > > > > > > > PID's name, > > > > > > > > so to make the life easier also notify what's the app's name in > > > > > > > > the user > > > > > > > > event. > > > > > > > > > > > > > > > > Signed-off-by: André Almeida > > > > > > > > [...] > > > > > > > > > > > > len = scnprintf(event_string, sizeof(event_string), > > > > > > > > "%s", "WEDGED="); > > > > > > > > @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device > > > > > > > > *dev, unsigned long method) > > > > > > > > drm_info(dev, "device wedged, %s\n", method == > > > > > > > > DRM_WEDGE_RECOVERY_NONE ? > > > > > > > > "but recovered through reset" : "needs > > > > > > > > recovery"); > > > > > > > > + if (info) { > > > > > > > > + snprintf(pid_string, sizeof(pid_string), > > > > > > > > "PID=%u", info->pid); > > > > > > > > + snprintf(comm_string, sizeof(comm_string), > > > > > > > > "APP=%s", info->comm); > > > > > > > > + } else { > > > > > > > > + snprintf(pid_string, sizeof(pid_string), "%s", > > > > > > > > "PID=-1"); > > > > > > > > + snprintf(comm_string, sizeof(comm_string), > > > > > > > > "%s", "APP=none"); > > > > > > > > + } > > > > > > > This is not much use for wedge cases that needs recovery, since > > > > > > > at that point > > > > > > > the userspace will need to clean house anyway. > > > > > > > > > > > > > > Which leaves us with only 'none' case and perhaps the need for > > > > > > > standardization > > > > > > > of "optional telemetry collection". > > > > > > > > > > > > > > Thoughts? > > > > > > > > > > > > I had the feeling that 'none' was already meant to be used for > > > > > > that. Do you > > > > > > think we should move to another naming? Given that we didn't reach > > > > > > the merge > > > > > > window yet we could potentially change that name without much > > > > > > damage. > > > > > > > > > > No, I meant thoughts on possible telemetry data that the drivers might > > > > > think is useful for userspace (along with PID) and can be presented in > > > > > a vendor agnostic manner (just like wedged event). > > > > > > > > I'm not if I agree that this will only be used for telemetry and for the > > > > `none` use case. 
As stated by Xaver, there's use case to know which app > > > > caused the device to get wedged (like switching to software rendering) > > > > and > > > > to display something for the user after the recovery is done (e.g. "The > > > > game > > > > stopped working and Plasma has reset"). > > > > > > Sure, but since this information is already available in coredump, I was > > > hoping to have something like a standardized DRM level coredump with both > > > vendor specific and agnostic sections, which the drivers can (and > > > hopefully > > > transition to) use in conjunction with wedged event to provide wider > > > telemetry and is useful for all wedge cases. > > > > This is more useful because, > > > > 1. It gives drivers an opportunity to present the telemetry that they are > > interested in without needing to add a new event string (like PID or > > APP) > > for their case. > > > > 2. When we consider wedging as a usecase, there's a lot more that goes > > into it than an application that might be behaving strangely. So a wider > > telemetry is what I would hope to look at in such a scenario. > > > > I agree that coredump is the way to go for telemetry, we already have the > name and PID of the guilty app there, along with more information about the > GPU state. But I don't think it sh
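On the consumer side (not part of this patch), a compositor or session daemon would typically see these key=value pairs as properties of the uevent on the DRM device; a rough libudev sketch, assuming the WEDGED/PID/APP keys proposed above actually land in that form:

#include <libudev.h>
#include <stdio.h>

/*
 * Called for each uevent received on a DRM device through a udev monitor
 * (monitor setup omitted).  Key names follow the patch; whether PID/APP are
 * present at all is exactly what this thread is debating.
 */
static void handle_drm_uevent(struct udev_device *dev)
{
	const char *wedged = udev_device_get_property_value(dev, "WEDGED");
	const char *pid = udev_device_get_property_value(dev, "PID");
	const char *app = udev_device_get_property_value(dev, "APP");

	if (!wedged)
		return;		/* not a wedge event */

	/* PID=-1 / APP=none mean no application could be blamed */
	printf("device wedged (%s), pid=%s, app=%s\n",
	       wedged, pid ? pid : "-1", app ? app : "none");
}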
[PATCH 05/11] drm/amdgpu/mes: update hqd masks when disable_kq is set
Make all resources available to user queues. Suggested-by: Sunil Khatri Reviewed-by: Sunil Khatri Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c index 5abc1ca0fee98..971bf01fe46a9 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c @@ -120,21 +120,21 @@ int amdgpu_mes_init(struct amdgpu_device *adev) * Set GFX pipe 0 queue 1-7 for MES scheduling * mask = 1110b */ - adev->mes.gfx_hqd_mask[i] = 0xFE; + adev->mes.gfx_hqd_mask[i] = adev->gfx.disable_kq ? 0xFF : 0xFE; else /* * GFX pipe 0 queue 0 is being used by Kernel queue. * Set GFX pipe 0 queue 1 for MES scheduling * mask = 10b */ - adev->mes.gfx_hqd_mask[i] = 0x2; + adev->mes.gfx_hqd_mask[i] = adev->gfx.disable_kq ? 0x3 : 0x2; } for (i = 0; i < AMDGPU_MES_MAX_COMPUTE_PIPES; i++) { /* use only 1st MEC pipes */ if (i >= adev->gfx.mec.num_pipe_per_mec) continue; - adev->mes.compute_hqd_mask[i] = 0xc; + adev->mes.compute_hqd_mask[i] = adev->gfx.disable_kq ? 0xF : 0xC; } for (i = 0; i < AMDGPU_MES_MAX_SDMA_PIPES; i++) { -- 2.48.1
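For readers skimming the masks: bit n selects queue n of the pipe, and the comments above note that GFX pipe 0 queue 0 is the one used by the kernel queue. Expanding the values this patch changes:

    GFX, 8-queue pipe (GFX12):  0xFE = 11111110b (queues 1-7 to MES)  ->  0xFF = 11111111b (all 8 queues)
    GFX, single-queue case:     0x2  = 00000010b (queue 1 only)       ->  0x3  = 00000011b (queues 0-1)
    Compute:                    0xC  = 00001100b (queues 2-3)         ->  0xF  = 00001111b (queues 0-3)

i.e. with disable_kq the queues normally reserved for kernel submission are handed to the MES as well.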
[PATCH 02/11] drm/amdgpu: add ring flag for no user submissions
This would be set by IPs which only accept submissions from the kernel, not userspace, such as when kernel queues are disabled. Don't expose the rings to userspace and reject any submissions in the CS IOCTL. Reviewed-by: Sunil Khatri Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 30 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 +- 3 files changed, 25 insertions(+), 11 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c index 5df21529b3b13..5cc18034b75df 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c @@ -349,6 +349,10 @@ static int amdgpu_cs_p2_ib(struct amdgpu_cs_parser *p, ring = amdgpu_job_ring(job); ib = &job->ibs[job->num_ibs++]; + /* submissions to kernel queus are disabled */ + if (ring->no_user_submission) + return -EINVAL; + /* MM engine doesn't support user fences */ if (p->uf_bo && ring->funcs->no_user_fence) return -EINVAL; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index cd6eb7a3bc58a..3b7dfd56ccd0e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c @@ -408,7 +408,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, case AMDGPU_HW_IP_GFX: type = AMD_IP_BLOCK_TYPE_GFX; for (i = 0; i < adev->gfx.num_gfx_rings; i++) - if (adev->gfx.gfx_ring[i].sched.ready) + if (adev->gfx.gfx_ring[i].sched.ready && + !adev->gfx.gfx_ring[i].no_user_submission) ++num_rings; ib_start_alignment = 32; ib_size_alignment = 32; @@ -416,7 +417,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, case AMDGPU_HW_IP_COMPUTE: type = AMD_IP_BLOCK_TYPE_GFX; for (i = 0; i < adev->gfx.num_compute_rings; i++) - if (adev->gfx.compute_ring[i].sched.ready) + if (adev->gfx.compute_ring[i].sched.ready && + !adev->gfx.compute_ring[i].no_user_submission) ++num_rings; ib_start_alignment = 32; ib_size_alignment = 32; @@ -424,7 +426,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, case AMDGPU_HW_IP_DMA: type = AMD_IP_BLOCK_TYPE_SDMA; for (i = 0; i < adev->sdma.num_instances; i++) - if (adev->sdma.instance[i].ring.sched.ready) + if (adev->sdma.instance[i].ring.sched.ready && + !adev->gfx.gfx_ring[i].no_user_submission) ++num_rings; ib_start_alignment = 256; ib_size_alignment = 4; @@ -435,7 +438,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, if (adev->uvd.harvest_config & (1 << i)) continue; - if (adev->uvd.inst[i].ring.sched.ready) + if (adev->uvd.inst[i].ring.sched.ready && + !adev->uvd.inst[i].ring.no_user_submission) ++num_rings; } ib_start_alignment = 256; @@ -444,7 +448,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, case AMDGPU_HW_IP_VCE: type = AMD_IP_BLOCK_TYPE_VCE; for (i = 0; i < adev->vce.num_rings; i++) - if (adev->vce.ring[i].sched.ready) + if (adev->vce.ring[i].sched.ready && + !adev->vce.ring[i].no_user_submission) ++num_rings; ib_start_alignment = 256; ib_size_alignment = 4; @@ -456,7 +461,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, continue; for (j = 0; j < adev->uvd.num_enc_rings; j++) - if (adev->uvd.inst[i].ring_enc[j].sched.ready) + if (adev->uvd.inst[i].ring_enc[j].sched.ready && + !adev->uvd.inst[i].ring_enc[j].no_user_submission) ++num_rings; } ib_start_alignment = 256; @@ -468,7 +474,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, if (adev->vcn.harvest_config & (1 << i)) continue; - if (adev->vcn.inst[i].ring_dec.sched.ready) + if 
(adev->vcn.inst[i].ring_dec.sched.ready && + !adev->vcn.inst[i].ring_dec.no_user_submission)
[PATCH 06/11] drm/amdgpu/mes: make more vmids available when disable_kq=1
If we don't have kernel queues, the vmids can be used by the MES for user queues. Reviewed-by: Sunil Khatri Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 2 +- drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +- drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c index 971bf01fe46a9..a536a78342a09 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c @@ -106,7 +106,7 @@ int amdgpu_mes_init(struct amdgpu_device *adev) adev->mes.total_max_queue = AMDGPU_FENCE_MES_QUEUE_ID_MASK; adev->mes.vmid_mask_mmhub = 0xff00; - adev->mes.vmid_mask_gfxhub = 0xff00; + adev->mes.vmid_mask_gfxhub = adev->gfx.disable_kq ? 0xfffe : 0xff00; for (i = 0; i < AMDGPU_MES_MAX_GFX_PIPES; i++) { /* use only 1st ME pipe */ diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c index 95d894a231fcf..19a5f196829f3 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c @@ -900,7 +900,7 @@ static int gmc_v10_0_sw_init(struct amdgpu_ip_block *ip_block) * amdgpu graphics/compute will use VMIDs 1-7 * amdkfd will use VMIDs 8-15 */ - adev->vm_manager.first_kfd_vmid = 8; + adev->vm_manager.first_kfd_vmid = adev->gfx.disable_kq ? 1 : 8; amdgpu_vm_manager_init(adev); diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c index bf8d01da88154..a2f6c9f4ebf2f 100644 --- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c @@ -838,7 +838,7 @@ static int gmc_v12_0_sw_init(struct amdgpu_ip_block *ip_block) * amdgpu graphics/compute will use VMIDs 1-7 * amdkfd will use VMIDs 8-15 */ - adev->vm_manager.first_kfd_vmid = 8; + adev->vm_manager.first_kfd_vmid = adev->gfx.disable_kq ? 1 : 8; amdgpu_vm_manager_init(adev); -- 2.48.1
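Spelling out the two values in the hunk above (bit n corresponds to VMID n; bit 0, the kernel's reserved VMID 0, stays clear in both cases):

    0xff00 = 1111111100000000b  ->  MES may hand out VMIDs 8-15, VMIDs 1-7 stay with amdgpu kernel queues
    0xfffe = 1111111111111110b  ->  MES may hand out VMIDs 1-15 once kernel queues are disabled

which is also why first_kfd_vmid drops from 8 to 1 in gmc_v10_0 and gmc_v12_0 in the same patch.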
[PATCH 10/11] drm/amdgpu/sdma6: add support for disable_kq
When the parameter is set, disable user submissions to kernel queues. Reviewed-by: Sunil Khatri Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c index 3aa4fec4d9e4a..bcc72737f8084 100644 --- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c @@ -1304,6 +1304,9 @@ static int sdma_v6_0_early_init(struct amdgpu_ip_block *ip_block) struct amdgpu_device *adev = ip_block->adev; int r; + if (amdgpu_disable_kq == 1) + adev->sdma.no_user_submission = true; + r = amdgpu_sdma_init_microcode(adev, 0, true); if (r) return r; @@ -1338,6 +1341,7 @@ static int sdma_v6_0_sw_init(struct amdgpu_ip_block *ip_block) ring->ring_obj = NULL; ring->use_doorbell = true; ring->me = i; + ring->no_user_submission = adev->sdma.no_user_submission; DRM_DEBUG("SDMA %d use_doorbell being set to: [%s]\n", i, ring->use_doorbell?"true":"false"); -- 2.48.1
[PATCH 07/11] drm/amdgpu/gfx11: add support for disable_kq
Plumb in support for disabling kernel queues in GFX11. We have to bring up a GFX queue briefly in order to initialize the clear state. After that we can disable it. v2: use ring counts per Felix' suggestion v3: fix stream fault handler, enable EOP interrupts v4: fix MEC interrupt offset (Sunil) Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 191 ++--- 1 file changed, 136 insertions(+), 55 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c index 95eefd9a40d28..fde8464cbd3b3 100644 --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c @@ -1145,6 +1145,10 @@ static int gfx_v11_0_gfx_ring_init(struct amdgpu_device *adev, int ring_id, ring->ring_obj = NULL; ring->use_doorbell = true; + if (adev->gfx.disable_kq) { + ring->no_scheduler = true; + ring->no_user_submission = true; + } if (!ring_id) ring->doorbell_index = adev->doorbell_index.gfx_ring0 << 1; @@ -1577,7 +1581,7 @@ static void gfx_v11_0_alloc_ip_dump(struct amdgpu_device *adev) static int gfx_v11_0_sw_init(struct amdgpu_ip_block *ip_block) { - int i, j, k, r, ring_id = 0; + int i, j, k, r, ring_id; int xcc_id = 0; struct amdgpu_device *adev = ip_block->adev; @@ -1710,37 +1714,42 @@ static int gfx_v11_0_sw_init(struct amdgpu_ip_block *ip_block) return r; } - /* set up the gfx ring */ - for (i = 0; i < adev->gfx.me.num_me; i++) { - for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) { - for (k = 0; k < adev->gfx.me.num_pipe_per_me; k++) { - if (!amdgpu_gfx_is_me_queue_enabled(adev, i, k, j)) - continue; - - r = gfx_v11_0_gfx_ring_init(adev, ring_id, - i, k, j); - if (r) - return r; - ring_id++; + if (adev->gfx.num_gfx_rings) { + ring_id = 0; + /* set up the gfx ring */ + for (i = 0; i < adev->gfx.me.num_me; i++) { + for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) { + for (k = 0; k < adev->gfx.me.num_pipe_per_me; k++) { + if (!amdgpu_gfx_is_me_queue_enabled(adev, i, k, j)) + continue; + + r = gfx_v11_0_gfx_ring_init(adev, ring_id, + i, k, j); + if (r) + return r; + ring_id++; + } } } } - ring_id = 0; - /* set up the compute queues - allocate horizontally across pipes */ - for (i = 0; i < adev->gfx.mec.num_mec; ++i) { - for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) { - for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; k++) { - if (!amdgpu_gfx_is_mec_queue_enabled(adev, 0, i, -k, j)) - continue; + if (adev->gfx.num_compute_rings) { + ring_id = 0; + /* set up the compute queues - allocate horizontally across pipes */ + for (i = 0; i < adev->gfx.mec.num_mec; ++i) { + for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) { + for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; k++) { + if (!amdgpu_gfx_is_mec_queue_enabled(adev, 0, i, +k, j)) + continue; - r = gfx_v11_0_compute_ring_init(adev, ring_id, - i, k, j); - if (r) - return r; + r = gfx_v11_0_compute_ring_init(adev, ring_id, + i, k, j); + if (r) + return r; - ring_id++; + ring_id++; + } } } } @@ -4578,11 +4587,23 @@ static int gfx_v11_0_cp_resume(struct amdgpu_device *adev)
[PATCH 11/11] drm/amdgpu/sdma7: add support for disable_kq
When the parameter is set, disable user submissions to kernel queues. Reviewed-by: Sunil Khatri Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c index 92a79296708ae..40d45f738c0a8 100644 --- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c @@ -1316,6 +1316,9 @@ static int sdma_v7_0_early_init(struct amdgpu_ip_block *ip_block) struct amdgpu_device *adev = ip_block->adev; int r; + if (amdgpu_disable_kq == 1) + adev->sdma.no_user_submission = true; + r = amdgpu_sdma_init_microcode(adev, 0, true); if (r) { DRM_ERROR("Failed to init sdma firmware!\n"); @@ -1351,6 +1354,7 @@ static int sdma_v7_0_sw_init(struct amdgpu_ip_block *ip_block) ring->ring_obj = NULL; ring->use_doorbell = true; ring->me = i; + ring->no_user_submission = adev->sdma.no_user_submission; DRM_DEBUG("SDMA %d use_doorbell being set to: [%s]\n", i, ring->use_doorbell?"true":"false"); -- 2.48.1
[PATCH 03/11] drm/amdgpu/gfx: add generic handling for disable_kq
Add proper checks for disable_kq functionality in gfx helper functions. Add special logic for families that require the clear state setup. v2: use ring count as per Felix suggestion v3: fix num_gfx_rings handling in amdgpu_gfx_graphics_queue_acquire() Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 8 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 2 ++ 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c index 984e6ff6e4632..a08243dd0798e 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c @@ -258,8 +258,9 @@ void amdgpu_gfx_graphics_queue_acquire(struct amdgpu_device *adev) } /* update the number of active graphics rings */ - adev->gfx.num_gfx_rings = - bitmap_weight(adev->gfx.me.queue_bitmap, AMDGPU_MAX_GFX_QUEUES); + if (adev->gfx.num_gfx_rings) + adev->gfx.num_gfx_rings = + bitmap_weight(adev->gfx.me.queue_bitmap, AMDGPU_MAX_GFX_QUEUES); } static int amdgpu_gfx_kiq_acquire(struct amdgpu_device *adev, @@ -1544,6 +1545,9 @@ static ssize_t amdgpu_gfx_set_run_cleaner_shader(struct device *dev, if (adev->in_suspend && !adev->in_runpm) return -EPERM; + if (adev->gfx.disable_kq) + return -ENOTSUPP; + ret = kstrtol(buf, 0, &value); if (ret) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h index ddf4533614bac..8fa68a4ac34f1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h @@ -483,6 +483,8 @@ struct amdgpu_gfx { atomic_ttotal_submission_cnt; struct delayed_work idle_work; + + booldisable_kq; }; struct amdgpu_gfx_ras_reg_entry { -- 2.48.1
[PATCH 09/11] drm/amdgpu/sdma: add flag for tracking disable_kq
For SDMA, we still need kernel queues for paging so they need to be initialized, but we do not want to accept submissions from userspace when disable_kq is set. Reviewed-by: Sunil Khatri Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h index 9651693200655..edc856e10337a 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h @@ -129,6 +129,7 @@ struct amdgpu_sdma { /* track guilty state of GFX and PAGE queues */ bool gfx_guilty; bool page_guilty; + bool no_user_submission; }; /* -- 2.48.1
[PATCH V5 00/11] Add disable kernel queue support
To better evaluate user queues, add a module parameter to disable kernel queues. With this set kernel queues are disabled and only user queues are available. This frees up hardware resources for use in user queues which would otherwise be used by kernel queues and provides a way to validate user queues without the presence of kernel queues. v2: use num_gfx_rings and num_compute_rings per Felix suggestion v3: include num_gfx_rings fix in amdgpu_gfx.c v4: additional fixes v5: MEC EOP interrupt handling fix (Sunil) Alex Deucher (11): drm/amdgpu: add parameter to disable kernel queues drm/amdgpu: add ring flag for no user submissions drm/amdgpu/gfx: add generic handling for disable_kq drm/amdgpu/mes: centralize gfx_hqd mask management drm/amdgpu/mes: update hqd masks when disable_kq is set drm/amdgpu/mes: make more vmids available when disable_kq=1 drm/amdgpu/gfx11: add support for disable_kq drm/amdgpu/gfx12: add support for disable_kq drm/amdgpu/sdma: add flag for tracking disable_kq drm/amdgpu/sdma6: add support for disable_kq drm/amdgpu/sdma7: add support for disable_kq drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 + drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 9 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 8 +- drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 2 + drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 30 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 26 ++- drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h | 1 + drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 191 --- drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 183 +++--- drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 2 +- drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c | 2 +- drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 16 +- drivers/gpu/drm/amd/amdgpu/mes_v12_0.c | 15 +- drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 4 + drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 4 + 17 files changed, 345 insertions(+), 155 deletions(-) -- 2.48.1
[PATCH 01/11] drm/amdgpu: add parameter to disable kernel queues
On chips that support user queues, setting this option will disable kernel queues to be used to validate user queues without kernel queues. Reviewed-by: Prike Liang Reviewed-by: Sunil Khatri Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 9 + 2 files changed, 10 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h index 87062c1adcdf7..45437a8f29d3b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h @@ -270,6 +270,7 @@ extern int amdgpu_user_partt_mode; extern int amdgpu_agp; extern int amdgpu_wbrf; +extern int amdgpu_disable_kq; #define AMDGPU_VM_MAX_NUM_CTX 4096 #define AMDGPU_SG_THRESHOLD(256*1024*1024) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index ba869fa99..f50a25fb60376 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -238,6 +238,7 @@ int amdgpu_agp = -1; /* auto */ int amdgpu_wbrf = -1; int amdgpu_damage_clips = -1; /* auto */ int amdgpu_umsch_mm_fwlog; +int amdgpu_disable_kq = -1; DECLARE_DYNDBG_CLASSMAP(drm_debug_classes, DD_CLASS_TYPE_DISJOINT_BITS, 0, "DRM_UT_CORE", @@ -1084,6 +1085,14 @@ MODULE_PARM_DESC(wbrf, "Enable Wifi RFI interference mitigation (0 = disabled, 1 = enabled, -1 = auto(default)"); module_param_named(wbrf, amdgpu_wbrf, int, 0444); +/** + * DOC: disable_kq (int) + * Disable kernel queues on systems that support user queues. + * (0 = kernel queues enabled, 1 = kernel queues disabled, -1 = auto (default setting)) + */ +MODULE_PARM_DESC(disable_kq, "Disable kernel queues (-1 = auto (default), 0 = enable KQ, 1 = disable KQ)"); +module_param_named(disable_kq, amdgpu_disable_kq, int, 0444); + /* These devices are not supported by amdgpu. * They are supported by the mach64, r128, radeon drivers */ -- 2.48.1
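A quick usage note (illustrative, not part of the patch): the parameter is registered read-only (0444), so it is evaluated when the driver loads, e.g.

    amdgpu.disable_kq=1            (kernel command line)
    modprobe amdgpu disable_kq=1   (manual module load)

with the default of -1 meaning auto, i.e. the driver decides.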
[PATCH 04/11] drm/amdgpu/mes: centralize gfx_hqd mask management
Move it to amdgpu_mes to align with the compute and sdma hqd masks. No functional change. v2: rebase on new changes Reviewed-by: Sunil Khatri Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 22 ++ drivers/gpu/drm/amd/amdgpu/mes_v11_0.c | 16 +++- drivers/gpu/drm/amd/amdgpu/mes_v12_0.c | 15 +++ 3 files changed, 28 insertions(+), 25 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c index 6f5e272d7ded3..5abc1ca0fee98 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c @@ -108,6 +108,28 @@ int amdgpu_mes_init(struct amdgpu_device *adev) adev->mes.vmid_mask_mmhub = 0xff00; adev->mes.vmid_mask_gfxhub = 0xff00; + for (i = 0; i < AMDGPU_MES_MAX_GFX_PIPES; i++) { + /* use only 1st ME pipe */ + if (i >= adev->gfx.me.num_pipe_per_me) + continue; + if (amdgpu_ip_version(adev, GC_HWIP, 0) >= + IP_VERSION(12, 0, 0)) + /* +* GFX V12 has only one GFX pipe, but 8 queues in it. +* GFX pipe 0 queue 0 is being used by Kernel queue. +* Set GFX pipe 0 queue 1-7 for MES scheduling +* mask = 1110b +*/ + adev->mes.gfx_hqd_mask[i] = 0xFE; + else + /* +* GFX pipe 0 queue 0 is being used by Kernel queue. +* Set GFX pipe 0 queue 1 for MES scheduling +* mask = 10b +*/ + adev->mes.gfx_hqd_mask[i] = 0x2; + } + for (i = 0; i < AMDGPU_MES_MAX_COMPUTE_PIPES; i++) { /* use only 1st MEC pipes */ if (i >= adev->gfx.mec.num_pipe_per_mec) diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c index a569d09a1a748..39b45d8b5f049 100644 --- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c +++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c @@ -669,18 +669,6 @@ static int mes_v11_0_misc_op(struct amdgpu_mes *mes, offsetof(union MESAPI__MISC, api_status)); } -static void mes_v11_0_set_gfx_hqd_mask(union MESAPI_SET_HW_RESOURCES *pkt) -{ - /* -* GFX pipe 0 queue 0 is being used by Kernel queue. -* Set GFX pipe 0 queue 1 for MES scheduling -* mask = 10b -* GFX pipe 1 can't be used for MES due to HW limitation. -*/ - pkt->gfx_hqd_mask[0] = 0x2; - pkt->gfx_hqd_mask[1] = 0; -} - static int mes_v11_0_set_hw_resources(struct amdgpu_mes *mes) { int i; @@ -705,7 +693,9 @@ static int mes_v11_0_set_hw_resources(struct amdgpu_mes *mes) mes_set_hw_res_pkt.compute_hqd_mask[i] = mes->compute_hqd_mask[i]; - mes_v11_0_set_gfx_hqd_mask(&mes_set_hw_res_pkt); + for (i = 0; i < MAX_GFX_PIPES; i++) + mes_set_hw_res_pkt.gfx_hqd_mask[i] = + mes->gfx_hqd_mask[i]; for (i = 0; i < MAX_SDMA_PIPES; i++) mes_set_hw_res_pkt.sdma_hqd_mask[i] = mes->sdma_hqd_mask[i]; diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c index 96336652d14c5..519f054bec60d 100644 --- a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c +++ b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c @@ -694,17 +694,6 @@ static int mes_v12_0_set_hw_resources_1(struct amdgpu_mes *mes, int pipe) offsetof(union MESAPI_SET_HW_RESOURCES_1, api_status)); } -static void mes_v12_0_set_gfx_hqd_mask(union MESAPI_SET_HW_RESOURCES *pkt) -{ - /* -* GFX V12 has only one GFX pipe, but 8 queues in it. -* GFX pipe 0 queue 0 is being used by Kernel queue. 
-* Set GFX pipe 0 queue 1-7 for MES scheduling -* mask = 1110b -*/ - pkt->gfx_hqd_mask[0] = 0xFE; -} - static int mes_v12_0_set_hw_resources(struct amdgpu_mes *mes, int pipe) { int i; @@ -727,7 +716,9 @@ static int mes_v12_0_set_hw_resources(struct amdgpu_mes *mes, int pipe) mes_set_hw_res_pkt.compute_hqd_mask[i] = mes->compute_hqd_mask[i]; - mes_v12_0_set_gfx_hqd_mask(&mes_set_hw_res_pkt); + for (i = 0; i < MAX_GFX_PIPES; i++) + mes_set_hw_res_pkt.gfx_hqd_mask[i] = + mes->gfx_hqd_mask[i]; for (i = 0; i < MAX_SDMA_PIPES; i++) mes_set_hw_res_pkt.sdma_hqd_mask[i] = -- 2.48.1
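As a quick reference, the two mask values used above decode as follows (editorial note, consistent with the comments in the hunk):

/*
 * Per-pipe gfx_hqd_mask: bit i set means HQD slot i of that pipe is
 * available for MES scheduling.
 *   0x02 == 0b00000010 -> only queue 1; queue 0 stays with the kernel queue
 *   0xFE == 0b11111110 -> queues 1..7; queue 0 stays with the kernel queue
 */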
Re: [PATCH] drm/amd/display: avoid NPD when ASIC does not support DMUB
On 03/13, Thadeu Lima de Souza Cascardo wrote: > On Wed, Feb 05, 2025 at 10:06:38AM -0300, Thadeu Lima de Souza Cascardo wrote: > > ctx->dmub_srv will de NULL if the ASIC does not support DMUB, which is > > tested in dm_dmub_sw_init. > > > > However, it will be dereferenced in dmub_hw_lock_mgr_cmd if > > should_use_dmub_lock returns true. > > > > This has been the case since dmub support has been added for PSR1. > > This bug has landed on stable trees. Any chance for a review here? > > Thanks. > Cascardo. > > > > > Fix this by checking for dmub_srv in should_use_dmub_lock. > > > > [ 37.440832] BUG: kernel NULL pointer dereference, address: > > 0058 > > [ 37.447808] #PF: supervisor read access in kernel mode > > [ 37.452959] #PF: error_code(0x) - not-present page > > [ 37.458112] PGD 0 P4D 0 > > [ 37.460662] Oops: Oops: [#1] PREEMPT SMP NOPTI > > [ 37.465553] CPU: 2 UID: 1000 PID: 1745 Comm: DrmThread Not tainted > > 6.14.0-rc1-3-gd62e938120f0 #23 99720e1cb1e0fc4773b8513150932a07de3c6e88 > > [ 37.478324] Hardware name: Google Morphius/Morphius, BIOS > > Google_Morphius.13434.858.0 10/26/2023 > > [ 37.487103] RIP: 0010:dmub_hw_lock_mgr_cmd+0x77/0xb0 > > [ 37.492074] Code: 44 24 0e 00 00 00 00 48 c7 04 24 45 00 00 0c 40 88 74 > > 24 0d 0f b6 02 88 44 24 0c 8b 01 89 44 24 08 85 f6 75 05 c6 44 24 0e 01 > > <48> 8b 7f 58 48 89 e6 ba 01 00 00 00 e8 08 3c 2a 00 65 48 8b 04 5 > > [ 37.510822] RSP: 0018:969442853300 EFLAGS: 00010202 > > [ 37.516052] RAX: RBX: 92db0300 RCX: > > 969442853358 > > [ 37.523185] RDX: 969442853368 RSI: 0001 RDI: > > > > [ 37.530322] RBP: 0001 R08: 04a7 R09: > > 04a5 > > [ 37.537453] R10: 0476 R11: 0062 R12: > > 92db0ade8000 > > [ 37.544589] R13: 92da01180ae0 R14: 92da011802a8 R15: > > 92db0300 > > [ 37.551725] FS: 784a9cdfc6c0() GS:92db2af0() > > knlGS: > > [ 37.559814] CS: 0010 DS: ES: CR0: 80050033 > > [ 37.565562] CR2: 0058 CR3: 000112b1c000 CR4: > > 003506f0 > > [ 37.572697] Call Trace: > > [ 37.575152] > > [ 37.577258] ? __die_body+0x66/0xb0 > > [ 37.580756] ? page_fault_oops+0x3e7/0x4a0 > > [ 37.584861] ? exc_page_fault+0x3e/0xe0 > > [ 37.588706] ? exc_page_fault+0x5c/0xe0 > > [ 37.592550] ? asm_exc_page_fault+0x22/0x30 > > [ 37.596742] ? dmub_hw_lock_mgr_cmd+0x77/0xb0 > > [ 37.601107] dcn10_cursor_lock+0x1e1/0x240 > > [ 37.605211] program_cursor_attributes+0x81/0x190 > > [ 37.609923] commit_planes_for_stream+0x998/0x1ef0 > > [ 37.614722] update_planes_and_stream_v2+0x41e/0x5c0 > > [ 37.619703] dc_update_planes_and_stream+0x78/0x140 > > [ 37.624588] amdgpu_dm_atomic_commit_tail+0x4362/0x49f0 > > [ 37.629832] ? srso_return_thunk+0x5/0x5f > > [ 37.633847] ? mark_held_locks+0x6d/0xd0 > > [ 37.637774] ? _raw_spin_unlock_irq+0x24/0x50 > > [ 37.642135] ? srso_return_thunk+0x5/0x5f > > [ 37.646148] ? lockdep_hardirqs_on+0x95/0x150 > > [ 37.650510] ? srso_return_thunk+0x5/0x5f > > [ 37.654522] ? _raw_spin_unlock_irq+0x2f/0x50 > > [ 37.658883] ? srso_return_thunk+0x5/0x5f > > [ 37.662897] ? wait_for_common+0x186/0x1c0 > > [ 37.666998] ? srso_return_thunk+0x5/0x5f > > [ 37.671009] ? drm_crtc_next_vblank_start+0xc3/0x170 > > [ 37.675983] commit_tail+0xf5/0x1c0 > > [ 37.679478] drm_atomic_helper_commit+0x2a2/0x2b0 > > [ 37.684186] drm_atomic_commit+0xd6/0x100 > > [ 37.688199] ? __cfi___drm_printfn_info+0x10/0x10 > > [ 37.692911] drm_atomic_helper_update_plane+0xe5/0x130 > > [ 37.698054] drm_mode_cursor_common+0x501/0x670 > > [ 37.702600] ? 
__cfi_drm_mode_cursor_ioctl+0x10/0x10 > > [ 37.707572] drm_mode_cursor_ioctl+0x48/0x70 > > [ 37.711851] drm_ioctl_kernel+0xf2/0x150 > > [ 37.715781] drm_ioctl+0x363/0x590 > > [ 37.719189] ? __cfi_drm_mode_cursor_ioctl+0x10/0x10 > > [ 37.724165] amdgpu_drm_ioctl+0x41/0x80 > > [ 37.728013] __se_sys_ioctl+0x7f/0xd0 > > [ 37.731685] do_syscall_64+0x87/0x100 > > [ 37.735355] ? vma_end_read+0x12/0xe0 > > [ 37.739024] ? srso_return_thunk+0x5/0x5f > > [ 37.743041] ? find_held_lock+0x47/0xf0 > > [ 37.746884] ? vma_end_read+0x12/0xe0 > > [ 37.750552] ? srso_return_thunk+0x5/0x5f > > [ 37.754565] ? lock_release+0x1c4/0x2e0 > > [ 37.758406] ? vma_end_read+0x12/0xe0 > > [ 37.762079] ? exc_page_fault+0x84/0xe0 > > [ 37.765921] ? srso_return_thunk+0x5/0x5f > > [ 37.769938] ? lockdep_hardirqs_on+0x95/0x150 > > [ 37.774303] ? srso_return_thunk+0x5/0x5f > > [ 37.778317] ? exc_page_fault+0x84/0xe0 > > [ 37.782163] entry_SYSCALL_64_after_hwframe+0x55/0x5d > > [ 37.787218] RIP: 0033:0x784aa5ec3059 > > [ 37.790803] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 > > b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 4
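The fix described above (the patch itself is not quoted in this digest) amounts to an early bail-out in should_use_dmub_lock(); a minimal sketch, assuming the upstream signature:

bool should_use_dmub_lock(struct dc_link *link)
{
	/*
	 * ASIC does not support DMUB: never take the DMUB-based lock path,
	 * so dmub_hw_lock_mgr_cmd() cannot be reached with a NULL
	 * ctx->dmub_srv.
	 */
	if (!link->ctx->dmub_srv)
		return false;

	/* existing PSR / Panel Replay checks follow unchanged; default: */
	return false;
}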
Re: [PATCH] drm/amd/display: avoid NPD when ASIC does not support DMUB
Applied. Thanks. Alex On Thu, Mar 13, 2025 at 11:17 AM Leo Li wrote: > > > > On 2025-03-13 07:29, Thadeu Lima de Souza Cascardo wrote: > > On Wed, Feb 05, 2025 at 10:06:38AM -0300, Thadeu Lima de Souza Cascardo > > wrote: > >> ctx->dmub_srv will de NULL if the ASIC does not support DMUB, which is > >> tested in dm_dmub_sw_init. > >> > >> However, it will be dereferenced in dmub_hw_lock_mgr_cmd if > >> should_use_dmub_lock returns true. > >> > >> This has been the case since dmub support has been added for PSR1. > > > > This bug has landed on stable trees. Any chance for a review here? > > > > Thanks. > > Cascardo. > > Thanks for the ping and fix! > > Reviewed-by: Leo Li > > > > >> > >> Fix this by checking for dmub_srv in should_use_dmub_lock. > >> > >> [ 37.440832] BUG: kernel NULL pointer dereference, address: > >> 0058 > >> [ 37.447808] #PF: supervisor read access in kernel mode > >> [ 37.452959] #PF: error_code(0x) - not-present page > >> [ 37.458112] PGD 0 P4D 0 > >> [ 37.460662] Oops: Oops: [#1] PREEMPT SMP NOPTI > >> [ 37.465553] CPU: 2 UID: 1000 PID: 1745 Comm: DrmThread Not tainted > >> 6.14.0-rc1-3-gd62e938120f0 #23 99720e1cb1e0fc4773b8513150932a07de3c6e88 > >> [ 37.478324] Hardware name: Google Morphius/Morphius, BIOS > >> Google_Morphius.13434.858.0 10/26/2023 > >> [ 37.487103] RIP: 0010:dmub_hw_lock_mgr_cmd+0x77/0xb0 > >> [ 37.492074] Code: 44 24 0e 00 00 00 00 48 c7 04 24 45 00 00 0c 40 88 74 > >> 24 0d 0f b6 02 88 44 24 0c 8b 01 89 44 24 08 85 f6 75 05 c6 44 24 0e 01 > >> <48> 8b 7f 58 48 89 e6 ba 01 00 00 00 e8 08 3c 2a 00 65 48 8b 04 5 > >> [ 37.510822] RSP: 0018:969442853300 EFLAGS: 00010202 > >> [ 37.516052] RAX: RBX: 92db0300 RCX: > >> 969442853358 > >> [ 37.523185] RDX: 969442853368 RSI: 0001 RDI: > >> > >> [ 37.530322] RBP: 0001 R08: 04a7 R09: > >> 04a5 > >> [ 37.537453] R10: 0476 R11: 0062 R12: > >> 92db0ade8000 > >> [ 37.544589] R13: 92da01180ae0 R14: 92da011802a8 R15: > >> 92db0300 > >> [ 37.551725] FS: 784a9cdfc6c0() GS:92db2af0() > >> knlGS: > >> [ 37.559814] CS: 0010 DS: ES: CR0: 80050033 > >> [ 37.565562] CR2: 0058 CR3: 000112b1c000 CR4: > >> 003506f0 > >> [ 37.572697] Call Trace: > >> [ 37.575152] > >> [ 37.577258] ? __die_body+0x66/0xb0 > >> [ 37.580756] ? page_fault_oops+0x3e7/0x4a0 > >> [ 37.584861] ? exc_page_fault+0x3e/0xe0 > >> [ 37.588706] ? exc_page_fault+0x5c/0xe0 > >> [ 37.592550] ? asm_exc_page_fault+0x22/0x30 > >> [ 37.596742] ? dmub_hw_lock_mgr_cmd+0x77/0xb0 > >> [ 37.601107] dcn10_cursor_lock+0x1e1/0x240 > >> [ 37.605211] program_cursor_attributes+0x81/0x190 > >> [ 37.609923] commit_planes_for_stream+0x998/0x1ef0 > >> [ 37.614722] update_planes_and_stream_v2+0x41e/0x5c0 > >> [ 37.619703] dc_update_planes_and_stream+0x78/0x140 > >> [ 37.624588] amdgpu_dm_atomic_commit_tail+0x4362/0x49f0 > >> [ 37.629832] ? srso_return_thunk+0x5/0x5f > >> [ 37.633847] ? mark_held_locks+0x6d/0xd0 > >> [ 37.637774] ? _raw_spin_unlock_irq+0x24/0x50 > >> [ 37.642135] ? srso_return_thunk+0x5/0x5f > >> [ 37.646148] ? lockdep_hardirqs_on+0x95/0x150 > >> [ 37.650510] ? srso_return_thunk+0x5/0x5f > >> [ 37.654522] ? _raw_spin_unlock_irq+0x2f/0x50 > >> [ 37.658883] ? srso_return_thunk+0x5/0x5f > >> [ 37.662897] ? wait_for_common+0x186/0x1c0 > >> [ 37.666998] ? srso_return_thunk+0x5/0x5f > >> [ 37.671009] ? drm_crtc_next_vblank_start+0xc3/0x170 > >> [ 37.675983] commit_tail+0xf5/0x1c0 > >> [ 37.679478] drm_atomic_helper_commit+0x2a2/0x2b0 > >> [ 37.684186] drm_atomic_commit+0xd6/0x100 > >> [ 37.688199] ? 
__cfi___drm_printfn_info+0x10/0x10 > >> [ 37.692911] drm_atomic_helper_update_plane+0xe5/0x130 > >> [ 37.698054] drm_mode_cursor_common+0x501/0x670 > >> [ 37.702600] ? __cfi_drm_mode_cursor_ioctl+0x10/0x10 > >> [ 37.707572] drm_mode_cursor_ioctl+0x48/0x70 > >> [ 37.711851] drm_ioctl_kernel+0xf2/0x150 > >> [ 37.715781] drm_ioctl+0x363/0x590 > >> [ 37.719189] ? __cfi_drm_mode_cursor_ioctl+0x10/0x10 > >> [ 37.724165] amdgpu_drm_ioctl+0x41/0x80 > >> [ 37.728013] __se_sys_ioctl+0x7f/0xd0 > >> [ 37.731685] do_syscall_64+0x87/0x100 > >> [ 37.735355] ? vma_end_read+0x12/0xe0 > >> [ 37.739024] ? srso_return_thunk+0x5/0x5f > >> [ 37.743041] ? find_held_lock+0x47/0xf0 > >> [ 37.746884] ? vma_end_read+0x12/0xe0 > >> [ 37.750552] ? srso_return_thunk+0x5/0x5f > >> [ 37.754565] ? lock_release+0x1c4/0x2e0 > >> [ 37.758406] ? vma_end_read+0x12/0xe0 > >> [ 37.762079] ? exc_page_fault+0x84/0xe0 > >> [ 37.765921] ? srso_return_thunk+0x5/0x5f > >> [ 37.769938] ? lockdep_hardirqs_on+0x95/0x150 > >> [ 37.774303] ? srso_return_thunk+0x5/0x5f > >> [ 37.778317] ? exc_p
Re: [RFC PATCH 2/7] drm/amd/display: start using drm_edid helpers to parse EDID caps
On 3/8/25 07:26, Melissa Wen wrote: Groundwork that allocates a temporary drm_edid from raw edid to take advantage of DRM common-code helpers instead of driver-specific code. Signed-off-by: Melissa Wen --- .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c index 2cd35392e2da..7edc23267ee7 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c @@ -108,6 +108,7 @@ enum dc_edid_status dm_helpers_parse_edid_caps( struct drm_connector *connector = &aconnector->base; struct drm_device *dev = connector->dev; struct edid *edid_buf = edid ? (struct edid *) edid->raw_edid : NULL; + struct drm_edid *drm_edid; This declaration generates the below build warning: drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_helpers.c: In function ‘dm_helpers_parse_edid_caps’: drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_helpers.c:116:18: error: assignment discards ‘const’ qualifier from pointer target type [-Werror=discarded-qualifiers] 116 | drm_edid = drm_edid_alloc(edid_buf, EDID_LENGTH * (edid_buf->extensions + 1)); | ^ cc1: all warnings being treated as errors It can be fixed as the following change - struct drm_edid *drm_edid; + const struct drm_edid *drm_edid; struct cea_sad *sads; int sad_count = -1; int sadb_count = -1; @@ -116,10 +117,13 @@ enum dc_edid_status dm_helpers_parse_edid_caps( enum dc_edid_status result = EDID_OK; + if (!edid_caps || !edid) return EDID_BAD_INPUT; - if (!drm_edid_is_valid(edid_buf)) + drm_edid = drm_edid_alloc(edid_buf, EDID_LENGTH * (edid_buf->extensions + 1)); drm_edid_alloc returns "const struct drm_edid *", so drm_edid needs to be const as well. + + if (!drm_edid_valid(drm_edid)) result = EDID_BAD_CHECKSUM; edid_caps->manufacturer_id = (uint16_t) edid_buf->mfg_id[0] | @@ -139,8 +143,10 @@ enum dc_edid_status dm_helpers_parse_edid_caps( apply_edid_quirks(dev, edid_buf, edid_caps); sad_count = drm_edid_to_sad((struct edid *) edid->raw_edid, &sads); - if (sad_count <= 0) + if (sad_count <= 0) { + drm_edid_free(drm_edid); return result; + } edid_caps->audio_mode_count = min(sad_count, DC_MAX_AUDIO_DESC_COUNT); for (i = 0; i < edid_caps->audio_mode_count; ++i) { @@ -166,6 +172,7 @@ enum dc_edid_status dm_helpers_parse_edid_caps( kfree(sads); kfree(sadb); + drm_edid_free(drm_edid); return result; }
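For reference, the allocate/validate/free pattern under discussion looks roughly like this once the const qualifier is fixed. A sketch around the upstream drm_edid helpers, with the AMD-specific parsing elided:

static enum dc_edid_status example_parse_caps(const struct edid *raw)
{
	const struct drm_edid *drm_edid;
	enum dc_edid_status result = EDID_OK;

	drm_edid = drm_edid_alloc(raw, EDID_LENGTH * (raw->extensions + 1));
	if (!drm_edid_valid(drm_edid))
		result = EDID_BAD_CHECKSUM;

	/* ... manufacturer id, SAD/SADB parsing, etc. ... */

	drm_edid_free(drm_edid);	/* every exit path must free it */
	return result;
}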
Re: commit 7ffb791423c7 breaks steam game
On 3/14/25 05:12, Bert Karwatzki wrote: > Am Donnerstag, dem 13.03.2025 um 22:47 +1100 schrieb Balbir Singh: >> >> >> Anyway, I think the nokaslr result is interesting, it seems like with nokaslr >> even the older kernels have problems with the game >> >> Could you confirm if with nokaslr >> > Now I've tested kernel 6.8.12 with nokaslr > >> 1. Only one single game stellaris is not working? >> 2. The entire laptop does not work? >> 3. Laptop works and other games work? Just one game is not working as > expected? > > > Stellaris is showing the input lag and the entire graphical user interface > shows > the same input lag as long as stellaris is running. > Civilization 6 shows the same input lag as stellaris, probably even worse. > Magic the Gathering: Arena (with wine) works normally. > Valheim also works normally. > Crusader Kings 2 works normally > Rogue Heroes: Ruins of Tasos (a Zelda lookalike) works normally. > Baldur's Gate I & II and Icewind Dale work normally. > > Also the input lag is only in the GUI, if I switch to a text console (ctrl + > alt > + Fn), input works normally even while the affected games are running. > > Games aside everything else (e.g. compiling kernels) seems to work with > nokaslr. > Would it be fair to assume that anything Xorg/Wayland is working fine when the game is not running, even with nokaslr? +amd-gfx@lists.freedesktop.org to see if there are known issues with nokaslr and the games you mentioned. Balbir Singh PS: I came across an interesting link https://www.alex-ionescu.com/behind-windows-x64s-44-bit-memory-addressing-limit/ I think SLIST_HEADER is used by wine as well for user space and I am not sure if in this situation the game is hitting this scenario, but surprisingly the other games are not. This is assuming the game uses wine. I am not sure it's related, but the 44 bits caught my attention.
Re: [PATCH 07/11] drm/amdgpu/gfx11: add support for disable_kq
On 03/13, Alex Deucher wrote: > Plumb in support for disabling kernel queues in > GFX11. We have to bring up a GFX queue briefly in > order to initialize the clear state. After that > we can disable it. > > v2: use ring counts per Felix' suggestion > v3: fix stream fault handler, enable EOP interrupts > v4: fix MEC interrupt offset (Sunil) > > Signed-off-by: Alex Deucher > --- > drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 191 ++--- > 1 file changed, 136 insertions(+), 55 deletions(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > index 95eefd9a40d28..fde8464cbd3b3 100644 > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > @@ -1145,6 +1145,10 @@ static int gfx_v11_0_gfx_ring_init(struct > amdgpu_device *adev, int ring_id, > > ring->ring_obj = NULL; > ring->use_doorbell = true; > + if (adev->gfx.disable_kq) { > + ring->no_scheduler = true; Hi Alex, Just a question about this no_scheduler part. Set no_scheduler to true, means that all of the queues of GFX11 will not be preempted, right? I suppose you have to do it because you want to initialize the clear state? Thanks > + ring->no_user_submission = true; > + } > > if (!ring_id) > ring->doorbell_index = adev->doorbell_index.gfx_ring0 << 1; > @@ -1577,7 +1581,7 @@ static void gfx_v11_0_alloc_ip_dump(struct > amdgpu_device *adev) > > static int gfx_v11_0_sw_init(struct amdgpu_ip_block *ip_block) > { > - int i, j, k, r, ring_id = 0; > + int i, j, k, r, ring_id; > int xcc_id = 0; > struct amdgpu_device *adev = ip_block->adev; > > @@ -1710,37 +1714,42 @@ static int gfx_v11_0_sw_init(struct amdgpu_ip_block > *ip_block) > return r; > } > > - /* set up the gfx ring */ > - for (i = 0; i < adev->gfx.me.num_me; i++) { > - for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) { > - for (k = 0; k < adev->gfx.me.num_pipe_per_me; k++) { > - if (!amdgpu_gfx_is_me_queue_enabled(adev, i, k, > j)) > - continue; > - > - r = gfx_v11_0_gfx_ring_init(adev, ring_id, > - i, k, j); > - if (r) > - return r; > - ring_id++; > + if (adev->gfx.num_gfx_rings) { > + ring_id = 0; > + /* set up the gfx ring */ > + for (i = 0; i < adev->gfx.me.num_me; i++) { > + for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) { > + for (k = 0; k < adev->gfx.me.num_pipe_per_me; > k++) { > + if > (!amdgpu_gfx_is_me_queue_enabled(adev, i, k, j)) > + continue; > + > + r = gfx_v11_0_gfx_ring_init(adev, > ring_id, > + i, k, j); > + if (r) > + return r; > + ring_id++; > + } > } > } > } > > - ring_id = 0; > - /* set up the compute queues - allocate horizontally across pipes */ > - for (i = 0; i < adev->gfx.mec.num_mec; ++i) { > - for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) { > - for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; k++) { > - if (!amdgpu_gfx_is_mec_queue_enabled(adev, 0, i, > - k, j)) > - continue; > + if (adev->gfx.num_compute_rings) { > + ring_id = 0; > + /* set up the compute queues - allocate horizontally across > pipes */ > + for (i = 0; i < adev->gfx.mec.num_mec; ++i) { > + for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) { > + for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; > k++) { > + if > (!amdgpu_gfx_is_mec_queue_enabled(adev, 0, i, > + k, > j)) > + continue; > > - r = gfx_v11_0_compute_ring_init(adev, ring_id, > - i, k, j); > - if (r) > - return r; > + r = gfx_v11_0_compute_ring_init(adev, > ring_id, > + i, k, > j); > +
Re: [PATCH 11/11] drm/amdgpu/sdma7: add support for disable_kq
On 03/13, Alex Deucher wrote: > When the parameter is set, disable user submissions > to kernel queues. > > Reviewed-by: Sunil Khatri > Signed-off-by: Alex Deucher > --- > drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c > b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c > index 92a79296708ae..40d45f738c0a8 100644 > --- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c > +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c > @@ -1316,6 +1316,9 @@ static int sdma_v7_0_early_init(struct amdgpu_ip_block > *ip_block) > struct amdgpu_device *adev = ip_block->adev; > int r; > > + if (amdgpu_disable_kq == 1) > + adev->sdma.no_user_submission = true; > + > r = amdgpu_sdma_init_microcode(adev, 0, true); > if (r) { > DRM_ERROR("Failed to init sdma firmware!\n"); > @@ -1351,6 +1354,7 @@ static int sdma_v7_0_sw_init(struct amdgpu_ip_block > *ip_block) > ring->ring_obj = NULL; > ring->use_doorbell = true; > ring->me = i; > + ring->no_user_submission = adev->sdma.no_user_submission; > > DRM_DEBUG("SDMA %d use_doorbell being set to: [%s]\n", i, > ring->use_doorbell?"true":"false"); > -- > 2.48.1 > Hi Alex, I think patch 9-11 could be a squashed in a single one. Thanks -- Rodrigo Siqueira
Re: [RFC PATCH 0/7] drm/amd/display: more DRM edid common-code to the display driver
The series look fine to me, except one small error in patch 2. I can send this series to promotion tests once the error is addressed. Let me also check others for comments. Hi Harry and Leo, Do you have other concerns before I sent this series to promotion tests? On 3/8/25 07:26, Melissa Wen wrote: Hi, I've been working on a new version of [1] that ports the AMD display driver to use the common `drm_edid` code instead of open and raw edid handling. The part of the previous series that didn't make the cut was to replace the open coded edid parser with `drm_edid_product_id` and `eld` data. However, when addressing feedback I ran into a design issue in the latest version because I was trying to not add any Linux-specific code to the DC code, specifically, DC link detection. In short, we have a few paths in the DM code that allocate a temporary `drm_edid`, go to the DC, and back to the DM to handle `drm_edid` data. In the last version, I was storing this temporary `drm_edid` in the aconnector, but it was erroneously overriding a still in use `drm_edid`. In this new version I allocate a temporary `drm_edid` in the DM parser from raw edid data stored in `dc_edid`, which was actually extracted from another `drm_edid` in the previous DM caller. This is a workaround to bypass DC boundaries and parse edid capabilities, but I think we can do better if we can find a clean way to pass the `drm_edid` through this kind of DM -> DC -> DM paths. I checked working on top of Thomas' work[2] that replaces `dc_edid` by `drm_edid` and adds this DRM struct and its helpers inside DC. The resulted changes work stable and as expected[3], but I believe that adding linux-specific code to DC is not desirable. Siqueira and I have discussed other alternatives, such as updating the DC code to match `drm_edid` structs or checking how well the change in this series could work with `drm_edid` as a private obj[4], however we would like to know AMD team's opinion before making this big effort (and probably noisy change). The main goal here is removing `drm_edid_raw` calls and duplicated code to handle edid in DC/link_detection that can take advantage of DRM common-code instead. Please, let me know your thoughts. Melissa [1] https://lore.kernel.org/amd-gfx/20240918213845.158293-1-mario.limoncie...@amd.com/ [2] https://lore.kernel.org/amd-gfx/20241112-amdgpu-drm_edid-v2-0-1399dc0f0...@weissschuh.net/ [3] https://gitlab.freedesktop.org/mwen/linux-amd/-/commits/drm_edid_product_id_v4 [4] https://gitlab.freedesktop.org/mwen/linux-amd/-/commits/drm_edid_priv Melissa Wen (7): drm/amd/display: make sure drm_edid stored in aconnector doesn't leak drm/amd/display: start using drm_edid helpers to parse EDID caps drm/amd/display: use drm_edid_product_id for parsing EDID product info drm/amd/display: parse display name from drm_eld drm/amd/display: get panel id with drm_edid helper drm/amd/display: get SAD from drm_eld when parsing EDID caps drm/amd/display: get SADB from drm_eld when parsing EDID caps .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 + .../amd/display/amdgpu_dm/amdgpu_dm_helpers.c | 85 +-- 2 files changed, 42 insertions(+), 45 deletions(-)
Re: [PATCH] drm/amdgpu: Add debug masks for HDCP LC FW testing
[AMD Official Use Only - AMD Internal Distribution Only] Reviewed-by: Aurabindo Pillai -- Regards, Jay From: Dominik Kaszewski Sent: Thursday, March 13, 2025 4:52 AM To: amd-gfx@lists.freedesktop.org Cc: Pillai, Aurabindo Subject: [PATCH] drm/amdgpu: Add debug masks for HDCP LC FW testing HDCP Locality Check is being moved to FW, add debug flags to control its behavior in existing hardware for validation purposes. Signed-off-by: Dominik Kaszewski --- drivers/gpu/drm/amd/include/amd_shared.h | 12 1 file changed, 12 insertions(+) diff --git a/drivers/gpu/drm/amd/include/amd_shared.h b/drivers/gpu/drm/amd/include/amd_shared.h index 485b713cfad0..4c95b885d1d0 100644 --- a/drivers/gpu/drm/amd/include/amd_shared.h +++ b/drivers/gpu/drm/amd/include/amd_shared.h @@ -358,6 +358,18 @@ enum DC_DEBUG_MASK { * @DC_DISABLE_CUSTOM_BRIGHTNESS_CURVE: If set, disable support for custom brightness curves */ DC_DISABLE_CUSTOM_BRIGHTNESS_CURVE = 0x4, + + /** +* @DC_HDCP_LC_FORCE_FW_ENABLE: If set, use HDCP Locality Check FW +* path regardless of reported HW capabilities. +*/ + DC_HDCP_LC_FORCE_FW_ENABLE = 0x8, + + /** +* @DC_HDCP_LC_ENABLE_SW_FALLBACK If set, upon HDCP Locality Check FW +* path failure, retry using legacy SW path. +*/ + DC_HDCP_LC_ENABLE_SW_FALLBACK = 0x10, }; enum amd_dpm_forced_level; -- 2.43.0
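For context, bits from enum DC_DEBUG_MASK are normally wired up through the existing DC debug mask module option (amdgpu_dc_debug_mask); an illustrative check follows. The patch itself only adds the enum values, and the local flag names below are invented for the sketch.

	bool force_fw_locality_check = false;	/* invented names, illustration only */
	bool allow_sw_fallback = false;

	if (amdgpu_dc_debug_mask & DC_HDCP_LC_FORCE_FW_ENABLE)
		force_fw_locality_check = true;
	if (amdgpu_dc_debug_mask & DC_HDCP_LC_ENABLE_SW_FALLBACK)
		allow_sw_fallback = true;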
[PATCH] drm/amdgpu: drop drm_firmware_drivers_only()
There are a number of systems and cloud providers out there that have nomodeset hardcoded in their kernel parameters to block nouveau for the nvidia driver. This prevents the amdgpu driver from loading. Unfortunately the end user cannot easily change this. The preferred way to block modules from loading is to use modprobe.blacklist=. That is what providers should be using to block specific drivers. Drop the check to allow the driver to load even when nomodeset is specified on the kernel command line. Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 3b34fdd105937..dd86661153582 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -3022,9 +3022,6 @@ static int __init amdgpu_init(void) { int r; - if (drm_firmware_drivers_only()) - return -EINVAL; - r = amdgpu_sync_init(); if (r) goto error_sync; -- 2.48.1
Re: [PATCH V5 00/11] Add disable kernel queue support
On 03/13, Alex Deucher wrote: > To better evaluate user queues, add a module parameter > to disable kernel queues. With this set kernel queues > are disabled and only user queues are available. This > frees up hardware resources for use in user queues which > would otherwise be used by kernel queues and provides > a way to validate user queues without the presence > of kernel queues. Hi Alex, I'm trying to understand how GFX and MES deal with different queues, and I used this patchset to guide me through that. In this sense, could you help me with the following points? FWIU, the GFX has what are called pipes, which in turn have hardware queues associated with them. For example, a GFX can have 2 pipes, and each pipe could have 2 hardware queues; or it could have 1 pipe and 8 queues. Is this correct? (for this next part, suppose 1 pipe 2 hardware queues) By default, one of the hardware queues is reserved for the Kernel Queue, and the user space could use the other. GFX has the MES block "connected" to all pipe queues, and MES is responsible for scheduling different ring buffers (in memory) in the pipe's hardware queue (effectively making the ring active). However, since the kernel queue is always present, MES only performs scheduling in one of the hardware queues. This scheduling occurs with the MES mapping and unmapping available Rings in memory to the hardware queue. Does the above description sound correct to you? How about the below diagram? Does it look correct to you? (I hope the diagram looks fine in your email client; if not, I can attach a picture of it.) [ASCII diagram: a GFX block with PIPE 0 exposing two hardware queues - hardware queue 0 bound to the Kernel Queue (no eviction, no MES scheduling); hardware queue 1 scheduled by MES, which maps and unmaps user-queue ring buffers from memory onto it.]
Re: [PATCH 07/11] drm/amdgpu/gfx11: add support for disable_kq
On Thu, Mar 13, 2025 at 6:08 PM Rodrigo Siqueira wrote: > > On 03/13, Alex Deucher wrote: > > Plumb in support for disabling kernel queues in > > GFX11. We have to bring up a GFX queue briefly in > > order to initialize the clear state. After that > > we can disable it. > > > > v2: use ring counts per Felix' suggestion > > v3: fix stream fault handler, enable EOP interrupts > > v4: fix MEC interrupt offset (Sunil) > > > > Signed-off-by: Alex Deucher > > --- > > drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 191 ++--- > > 1 file changed, 136 insertions(+), 55 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > > b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > > index 95eefd9a40d28..fde8464cbd3b3 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c > > @@ -1145,6 +1145,10 @@ static int gfx_v11_0_gfx_ring_init(struct > > amdgpu_device *adev, int ring_id, > > > > ring->ring_obj = NULL; > > ring->use_doorbell = true; > > + if (adev->gfx.disable_kq) { > > + ring->no_scheduler = true; > > Hi Alex, > > Just a question about this no_scheduler part. > > Set no_scheduler to true, means that all of the queues of GFX11 will not > be preempted, right? I suppose you have to do it because you want to > initialize the clear state? Not exactly. We just spin up a gfx queue long enough to submit the clear state setup and then we tear it down so its queue slot is available for user queues. So it's not actually a usable kernel queue at runtime. Setting the no_scheduler flag prevents a drm scheduler from being initialized for the queue. Alex > > Thanks > > > + ring->no_user_submission = true; > > + } > > > > if (!ring_id) > > ring->doorbell_index = adev->doorbell_index.gfx_ring0 << 1; > > @@ -1577,7 +1581,7 @@ static void gfx_v11_0_alloc_ip_dump(struct > > amdgpu_device *adev) > > > > static int gfx_v11_0_sw_init(struct amdgpu_ip_block *ip_block) > > { > > - int i, j, k, r, ring_id = 0; > > + int i, j, k, r, ring_id; > > int xcc_id = 0; > > struct amdgpu_device *adev = ip_block->adev; > > > > @@ -1710,37 +1714,42 @@ static int gfx_v11_0_sw_init(struct amdgpu_ip_block > > *ip_block) > > return r; > > } > > > > - /* set up the gfx ring */ > > - for (i = 0; i < adev->gfx.me.num_me; i++) { > > - for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) { > > - for (k = 0; k < adev->gfx.me.num_pipe_per_me; k++) { > > - if (!amdgpu_gfx_is_me_queue_enabled(adev, i, > > k, j)) > > - continue; > > - > > - r = gfx_v11_0_gfx_ring_init(adev, ring_id, > > - i, k, j); > > - if (r) > > - return r; > > - ring_id++; > > + if (adev->gfx.num_gfx_rings) { > > + ring_id = 0; > > + /* set up the gfx ring */ > > + for (i = 0; i < adev->gfx.me.num_me; i++) { > > + for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) > > { > > + for (k = 0; k < adev->gfx.me.num_pipe_per_me; > > k++) { > > + if > > (!amdgpu_gfx_is_me_queue_enabled(adev, i, k, j)) > > + continue; > > + > > + r = gfx_v11_0_gfx_ring_init(adev, > > ring_id, > > + i, k, j); > > + if (r) > > + return r; > > + ring_id++; > > + } > > } > > } > > } > > > > - ring_id = 0; > > - /* set up the compute queues - allocate horizontally across pipes */ > > - for (i = 0; i < adev->gfx.mec.num_mec; ++i) { > > - for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) { > > - for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; k++) { > > - if (!amdgpu_gfx_is_mec_queue_enabled(adev, 0, > > i, > > - k, j)) > > - continue; > > + if (adev->gfx.num_compute_rings) { > > + ring_id = 0; > > + /* set up the compute queues - allocate horizontally 
across > > pipes */ > > + for (i = 0; i < adev->gfx.mec.num_mec; ++i) { > > + for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; > > j++) { > > + for (k = 0; k < > > adev->gfx.mec.num_pipe_per_mec; k++) { > > + if > > (!amdgpu_gfx_is_mec_queue_enabled(adev, 0, i
Re: [PATCH 02/11] drm/amdgpu: add ring flag for no user submissions
On Thu, Mar 13, 2025 at 5:53 PM Rodrigo Siqueira wrote: > > On 03/13, Alex Deucher wrote: > > This would be set by IPs which only accept submissions > > from the kernel, not userspace, such as when kernel > > queues are disabled. Don't expose the rings to userspace > > and reject any submissions in the CS IOCTL. > > Just out of curiosity, is CS == Command Submission? Yes. > > > > > Reviewed-by: Sunil Khatri > > Signed-off-by: Alex Deucher > > --- > > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 4 > > drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 30 > > drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 2 +- > > 3 files changed, 25 insertions(+), 11 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c > > index 5df21529b3b13..5cc18034b75df 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c > > @@ -349,6 +349,10 @@ static int amdgpu_cs_p2_ib(struct amdgpu_cs_parser *p, > > ring = amdgpu_job_ring(job); > > ib = &job->ibs[job->num_ibs++]; > > > > + /* submissions to kernel queus are disabled */ > > /queus/queues/ will fix. > > > + if (ring->no_user_submission) > > + return -EINVAL; > > Since this will be set for ASICs that don't accept userspace > submissions, maybe -ENOTSUPP would be more accurate? I could go either way on that. When kernel queues are disabled, CS submissions to a kernel queue are not supported, but it's also an invalid parameter to the IOCTL. I don't have a strong opinion. Maybe -ENOTSUPP would differentiate it from just general invalid parameters. Alex > > Thanks > > > + > > /* MM engine doesn't support user fences */ > > if (p->uf_bo && ring->funcs->no_user_fence) > > return -EINVAL; > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > index cd6eb7a3bc58a..3b7dfd56ccd0e 100644 > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c > > @@ -408,7 +408,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, > > case AMDGPU_HW_IP_GFX: > > type = AMD_IP_BLOCK_TYPE_GFX; > > for (i = 0; i < adev->gfx.num_gfx_rings; i++) > > - if (adev->gfx.gfx_ring[i].sched.ready) > > + if (adev->gfx.gfx_ring[i].sched.ready && > > + !adev->gfx.gfx_ring[i].no_user_submission) > > ++num_rings; > > ib_start_alignment = 32; > > ib_size_alignment = 32; > > @@ -416,7 +417,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, > > case AMDGPU_HW_IP_COMPUTE: > > type = AMD_IP_BLOCK_TYPE_GFX; > > for (i = 0; i < adev->gfx.num_compute_rings; i++) > > - if (adev->gfx.compute_ring[i].sched.ready) > > + if (adev->gfx.compute_ring[i].sched.ready && > > + !adev->gfx.compute_ring[i].no_user_submission) > > ++num_rings; > > ib_start_alignment = 32; > > ib_size_alignment = 32; > > @@ -424,7 +426,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, > > case AMDGPU_HW_IP_DMA: > > type = AMD_IP_BLOCK_TYPE_SDMA; > > for (i = 0; i < adev->sdma.num_instances; i++) > > - if (adev->sdma.instance[i].ring.sched.ready) > > + if (adev->sdma.instance[i].ring.sched.ready && > > + !adev->gfx.gfx_ring[i].no_user_submission) > > ++num_rings; > > ib_start_alignment = 256; > > ib_size_alignment = 4; > > @@ -435,7 +438,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, > > if (adev->uvd.harvest_config & (1 << i)) > > continue; > > > > - if (adev->uvd.inst[i].ring.sched.ready) > > + if (adev->uvd.inst[i].ring.sched.ready && > > + !adev->uvd.inst[i].ring.no_user_submission) > > ++num_rings; > > } > > 
ib_start_alignment = 256; > > @@ -444,7 +448,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, > > case AMDGPU_HW_IP_VCE: > > type = AMD_IP_BLOCK_TYPE_VCE; > > for (i = 0; i < adev->vce.num_rings; i++) > > - if (adev->vce.ring[i].sched.ready) > > + if (adev->vce.ring[i].sched.ready && > > + !adev->vce.ring[i].no_user_submission) > > ++num_rings; > > ib_start_alignment = 256; > > ib_size_alignment = 4; > > @@ -456,7 +461,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev, > > continue
[PATCH 3/3] drm/amdgpu: don't free conflicting apertures for non-display devices
PCI_CLASS_ACCELERATOR_PROCESSING devices won't ever be the sysfb, so there is no need to free conflicting apertures. Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 1eff6252f66b4..689f1833d02b7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4401,10 +4401,17 @@ int amdgpu_device_init(struct amdgpu_device *adev, if (r) return r; - /* Get rid of things like offb */ - r = aperture_remove_conflicting_pci_devices(adev->pdev, amdgpu_kms_driver.name); - if (r) - return r; + /* +* No need to remove conflicting FBs for non-display class devices. +* This prevents the sysfb from being freed accidently. +*/ + if ((pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA || + (pdev->class >> 8) == PCI_CLASS_DISPLAY_OTHER) { + /* Get rid of things like offb */ + r = aperture_remove_conflicting_pci_devices(adev->pdev, amdgpu_kms_driver.name); + if (r) + return r; + } /* Enable TMZ based on IP_VERSION */ amdgpu_gmc_tmz_set(adev); -- 2.48.1
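A quick decode of the class check for reference (values from include/linux/pci_ids.h):

/*
 * pdev->class packs base class / sub-class / prog-if as 0xBBSSPP, so
 * (pdev->class >> 8) leaves the base+sub class:
 *   0x030000 >> 8 == 0x0300 == PCI_CLASS_DISPLAY_VGA             -> conflicting apertures freed
 *   0x038000 >> 8 == 0x0380 == PCI_CLASS_DISPLAY_OTHER           -> conflicting apertures freed
 *   0x120000 >> 8 == 0x1200 == PCI_CLASS_ACCELERATOR_PROCESSING  -> skipped, sysfb left alone
 */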
[PATCH 1/3] drm/amdgpu: drop drm_firmware_drivers_only()
There are a number of systems and cloud providers out there that have nomodeset hardcoded in their kernel parameters to block nouveau for the nvidia driver. This prevents the amdgpu driver from loading. Unfortunately the end user cannot easily change this. The preferred way to block modules from loading is to use modprobe.blacklist=. That is what providers should be using to block specific drivers. Drop the check to allow the driver to load even when nomodeset is specified on the kernel command line. Reviewed-by: Kent Russell Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index 3b34fdd105937..dd86661153582 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -3022,9 +3022,6 @@ static int __init amdgpu_init(void) { int r; - if (drm_firmware_drivers_only()) - return -EINVAL; - r = amdgpu_sync_init(); if (r) goto error_sync; -- 2.48.1
[PATCH 2/3] drm/amdgpu: adjust drm_firmware_drivers_only() handling
Move to probe so we can check the PCI device type and only apply the drm_firmware_drivers_only() check for PCI DISPLAY classes. Also add a module parameter to override the nomodeset kernel parameter as a workaround for platforms that have this hardcoded on their kernel command lines. Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c index dd86661153582..4e1a6a249bba5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c @@ -178,6 +178,7 @@ uint amdgpu_sdma_phase_quantum = 32; char *amdgpu_disable_cu; char *amdgpu_virtual_display; bool enforce_isolation; +int amdgpu_modeset = -1; /* Specifies the default granularity for SVM, used in buffer * migration and restoration of backing memory when handling @@ -1040,6 +1041,13 @@ module_param_named(user_partt_mode, amdgpu_user_partt_mode, uint, 0444); module_param(enforce_isolation, bool, 0444); MODULE_PARM_DESC(enforce_isolation, "enforce process isolation between graphics and compute . enforce_isolation = on"); +/** + * DOC: modeset (int) + * Override nomodeset (1 = override, -1 = auto). The default is -1 (auto). + */ +MODULE_PARM_DESC(modeset, "Override nomodeset (1 = enable, -1 = auto)"); +module_param_named(modeset, amdgpu_modeset, int, 0444); + /** * DOC: seamless (int) * Seamless boot will keep the image on the screen during the boot process. @@ -2270,6 +2278,12 @@ static int amdgpu_pci_probe(struct pci_dev *pdev, int ret, retry = 0, i; bool supports_atomic = false; + if ((pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA || + (pdev->class >> 8) == PCI_CLASS_DISPLAY_OTHER) { + if (drm_firmware_drivers_only() && amdgpu_modeset == -1) + return -EINVAL; + } + /* skip devices which are owned by radeon */ for (i = 0; i < ARRAY_SIZE(amdgpu_unsupported_pciidlist); i++) { if (amdgpu_unsupported_pciidlist[i] == pdev->device) -- 2.48.1
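The resulting probe-time behaviour, assuming the hunk applies as posted:

/*
 *   display-class device, nomodeset given, modeset=-1 (default) -> amdgpu_pci_probe() returns -EINVAL
 *   display-class device, nomodeset given, modeset=1            -> the check is overridden, probe continues
 *   non-display class (e.g. processing accelerator)             -> drm_firmware_drivers_only() not consulted
 */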
Re: [PATCH V5 00/11] Add disable kernel queue support
On Thu, Mar 13, 2025 at 6:21 PM Rodrigo Siqueira wrote: > > n 03/13, Alex Deucher wrote: > > To better evaluate user queues, add a module parameter > > to disable kernel queues. With this set kernel queues > > are disabled and only user queues are available. This > > frees up hardware resources for use in user queues which > > would otherwise be used by kernel queues and provides > > a way to validate user queues without the presence > > of kernel queues. > > Hi Alex, > > I'm trying to understand how GFX and MES deal with different queues, and > I used this patchset to guide me through that. In this sense, could you > help me with the following points? > > FWIU, the GFX has what are called pipes, which in turn have hardware > queues associated with them. For example, a GFX can have 2 pipes, and > each pipe could have 2 hardware queues; or it could have 1 pipe and 8 > queue. Is this correct? Right. For gfx, compute, and SDMA you have pipes (called instances on SDMA) and queues. A pipe can only execute one queue at a time. The pipe will switch between all of the mapped queues. You have storage in memory (called an MDQ -- Memory Queue Descriptor) which defines the state of the queue (GPU virtual addresses of the queue itself, save areas, doorbell, etc.). The queues that the pipe switches between are defined by HQDs (Hardware Queue Descriptors). These are basically register based memory for the queues that the pipe can switch between. The driver sets up an MQD for each queue that it creates. The MQDs are then handed to the MES firmware for mapping. The MES firmware can map a queue as a legacy queue (i.e. a kernel queue) or a user queue. The difference is that a legacy queue is statically mapped to a HQD and is never preempted. User queues are dynamically mapped to the HQDs by the MES firmware. If there are more MQDs than HQDs, the MES firmware will preempt other user queues to make sure each queue gets a time slice. > > (for this next part, suppose 1 pipe 2 hardware queues) > By default, one of the hardware queues is reserved for the Kernel Queue, > and the user space could use the other. GFX has the MES block "connected" > to all pipe queues, and MES is responsible for scheduling different ring > buffers (in memory) in the pipe's hardware queue (effectively making the > ring active). However, since the kernel queue is always present, MES > only performs scheduling in one of the hardware queues. This scheduling > occurs with the MES mapping and unmapping available Rings in memory to > the hardware queue. > > Does the above description sound correct to you? How about the below > diagram? Does it look correct to you? More or less. The MES handles all of the queues (kernel or user). The only real difference is that kernel queues are statically mapped to an HQD while user queues are dynamically scheduled in the available HQDs based on level of over-subscription. E.g., if you have hardware with 1 pipe and 2 HQDs you could have a kernel queue on 1 HQD and the MES would schedule all of the user queues on the remaining 1 HQD. If you don't enable any kernel queues, then you have 2 HQDs that the MES can use for scheduling user queues. > > (I hope the diagram looks fine in your email client; if not, I can > attach a picture of it.) 
> [Rodrigo's ASCII diagram quoted here; see the description in the previous message]
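A worked example of the scheduling model Alex describes, assuming one gfx pipe with two HQD slots (editorial, not from the thread):

/*
 *   kernel queues enabled: one HQD is statically ("legacy") mapped to the
 *   kernel queue and never preempted; MES time-slices every user-queue MQD
 *   through the single remaining HQD.
 *   disable_kq=1: no legacy mapping, so MES can rotate user queues through
 *   both HQDs, halving the oversubscription pressure on that pipe.
 */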
[PATCH] drm/amdgpu: Enable ACA by default for psp v13_0_6/v13_0_14
Enable ACA by default for psp v13_0_6/v13_0_14. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index 7cf8a3036828..cfec29835634 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c @@ -3785,9 +3785,11 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev) adev->ras_enabled = amdgpu_ras_enable == 0 ? 0 : adev->ras_hw_enabled & amdgpu_ras_mask; - /* aca is disabled by default except for psp v13_0_12 */ + /* aca is disabled by default except for psp v13_0_6/v13_0_12/v13_0_14 */ adev->aca.is_enabled = - (amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 12)); + (amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 6) || +amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 12) || +amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 14)); /* bad page feature is not applicable to specific app platform */ if (adev->gmc.is_app_apu && -- 2.34.1
RE: [PATCH] drm/amdgpu: Fix computation for remain size of CPER ring
[AMD Official Use Only - AMD Internal Distribution Only] Reviewed-by: Hawking Zhang Regards, Hawking -Original Message- From: Liu, Xiang(Dean) Sent: Thursday, March 13, 2025 11:28 To: amd-gfx@lists.freedesktop.org Cc: Zhang, Hawking ; Zhou1, Tao ; Liu, Xiang(Dean) Subject: [PATCH] drm/amdgpu: Fix computation for remain size of CPER ring The mistake of computation for remain size of CPER ring will cause unbreakable while cycle when CPER ring overflow. Signed-off-by: Xiang Liu --- drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 15 --- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c index 47fe8a04e26a..d4e90785ee33 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c @@ -452,10 +452,10 @@ static u32 amdgpu_cper_ring_get_ent_sz(struct amdgpu_ring *ring, u64 pos) return umin(rec_len, chunk); } -void amdgpu_cper_ring_write(struct amdgpu_ring *ring, - void *src, int count) +void amdgpu_cper_ring_write(struct amdgpu_ring *ring, void *src, int +count) { u64 pos, wptr_old, rptr = *ring->rptr_cpu_addr & ring->ptr_mask; + int rec_cnt_dw = count >> 2; u32 chunk, ent_sz; u8 *s = (u8 *)src; @@ -482,6 +482,9 @@ void amdgpu_cper_ring_write(struct amdgpu_ring *ring, s += chunk; } + if (ring->count_dw < rec_cnt_dw) + ring->count_dw = 0; + /* the buffer is overflow, adjust rptr */ if (((wptr_old < rptr) && (rptr <= ring->wptr)) || ((ring->wptr < wptr_old) && (wptr_old < rptr)) || @@ -498,12 +501,10 @@ void amdgpu_cper_ring_write(struct amdgpu_ring *ring, pos = rptr; } while (!amdgpu_cper_is_hdr(ring, rptr)); } - mutex_unlock(&ring->adev->cper.ring_lock); - if (ring->count_dw >= (count >> 2)) - ring->count_dw -= (count >> 2); - else - ring->count_dw = 0; + if (ring->count_dw >= rec_cnt_dw) + ring->count_dw -= rec_cnt_dw; + mutex_unlock(&ring->adev->cper.ring_lock); } static u64 amdgpu_cper_ring_get_rptr(struct amdgpu_ring *ring) -- 2.34.1
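A note on the units in the hunk above (editorial):

/*
 * `count` is in bytes while the ring bookkeeping (ring->count_dw) is in
 * dwords, hence rec_cnt_dw = count >> 2; e.g. a 64-byte CPER chunk
 * consumes 16 dwords.  The count_dw update now also happens before
 * ring_lock is dropped, so it stays consistent with the wptr/rptr
 * adjustments made under the same lock.
 */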
RE: [PATCH] drm/amdgpu: format old RAS eeprom data into V3 version
[AMD Official Use Only - AMD Internal Distribution Only] Reviewed-by: Hawking Zhang Regards, Hawking -Original Message- From: amd-gfx On Behalf Of Tao Zhou Sent: Wednesday, March 12, 2025 18:06 To: amd-gfx@lists.freedesktop.org Cc: Zhou1, Tao Subject: [PATCH] drm/amdgpu: format old RAS eeprom data into V3 version Clear old data and save it in V3 format. v2: only format eeprom data for new ASICs. Signed-off-by: Tao Zhou --- drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 7 + .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 26 ++- .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h| 1 + 3 files changed, 22 insertions(+), 12 deletions(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c index 837f33698b38..d3b9b4d9fb89 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c @@ -3465,6 +3465,13 @@ int amdgpu_ras_init_badpage_info(struct amdgpu_device *adev) adev, control->bad_channel_bitmap); con->update_channel_flag = false; } + + /* The format action is only applied to new ASICs */ + if (IP_VERSION_MAJ(amdgpu_ip_version(adev, UMC_HWIP, 0)) >= 12 && + control->tbl_hdr.version < RAS_TABLE_VER_V3) + if (!amdgpu_ras_eeprom_reset_table(control)) + if (amdgpu_ras_save_bad_pages(adev, NULL)) + dev_warn(adev->dev, "Failed to format RAS EEPROM data in V3 +version!\n"); } return ret; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c index 09a6f8bc1a5a..71dddb8983ee 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c @@ -413,9 +413,11 @@ static void amdgpu_ras_set_eeprom_table_version(struct amdgpu_ras_eeprom_control switch (amdgpu_ip_version(adev, UMC_HWIP, 0)) { case IP_VERSION(8, 10, 0): - case IP_VERSION(12, 0, 0): hdr->version = RAS_TABLE_VER_V2_1; return; + case IP_VERSION(12, 0, 0): + hdr->version = RAS_TABLE_VER_V3; + return; default: hdr->version = RAS_TABLE_VER_V1; return; @@ -443,7 +445,7 @@ int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control *control) hdr->header = RAS_TABLE_HDR_VAL; amdgpu_ras_set_eeprom_table_version(control); - if (hdr->version == RAS_TABLE_VER_V2_1) { + if (hdr->version >= RAS_TABLE_VER_V2_1) { hdr->first_rec_offset = RAS_RECORD_START_V2_1; hdr->tbl_size = RAS_TABLE_HEADER_SIZE + RAS_TABLE_V2_1_INFO_SIZE; @@ -461,7 +463,7 @@ int amdgpu_ras_eeprom_reset_table(struct amdgpu_ras_eeprom_control *control) } csum = __calc_hdr_byte_sum(control); - if (hdr->version == RAS_TABLE_VER_V2_1) + if (hdr->version >= RAS_TABLE_VER_V2_1) csum += __calc_ras_info_byte_sum(control); csum = -csum; hdr->checksum = csum; @@ -752,7 +754,7 @@ amdgpu_ras_eeprom_update_header(struct amdgpu_ras_eeprom_control *control) "Saved bad pages %d reaches threshold value %d\n", control->ras_num_bad_pages, ras->bad_page_cnt_threshold); control->tbl_hdr.header = RAS_TABLE_HDR_BAD; - if (control->tbl_hdr.version == RAS_TABLE_VER_V2_1) { + if (control->tbl_hdr.version >= RAS_TABLE_VER_V2_1) { control->tbl_rai.rma_status = GPU_RETIRED__ECC_REACH_THRESHOLD; control->tbl_rai.health_percent = 0; } @@ -765,7 +767,7 @@ amdgpu_ras_eeprom_update_header(struct amdgpu_ras_eeprom_control *control) amdgpu_dpm_send_rma_reason(adev); } - if (control->tbl_hdr.version == RAS_TABLE_VER_V2_1) + if (control->tbl_hdr.version >= RAS_TABLE_VER_V2_1) control->tbl_hdr.tbl_size = RAS_TABLE_HEADER_SIZE + RAS_TABLE_V2_1_INFO_SIZE + control->ras_num_recs * RAS_TABLE_RECORD_SIZE; @@ -805,7 +807,7 @@ amdgpu_ras_eeprom_update_header(struct 
amdgpu_ras_eeprom_control *control) * now calculate gpu health percent */ if (amdgpu_bad_page_threshold != 0 && - control->tbl_hdr.version == RAS_TABLE_VER_V2_1 && + control->tbl_hdr.version >= RAS_TABLE_VER_V2_1 && control->ras_num_bad_pages <= ras->bad_page_cnt_threshold) control->tbl_rai.health_percent = ((ras->bad_page_cnt_threshold - control->ras_num_bad_pages) * 100) / @@ -818,7 +820,7 @@ amdgpu_ras_eeprom_update_header(struct amdg
Re: [PATCH 00/63] Fix CONFIG_DRM_USE_DYNAMIC_DEBUG=y
Hello everyone, sorry for the late reply.

I have a cleaner version cooking now, with less inter-commit churn, by bringing more cleanups forward. I'll send a -v2 soon. (Let's forget all the meandering crap versions I sent.)

Louis, thanks for testing! I wrote the test script and submod.ko so the lib/* parts would stand by themselves. And this time, I left the old DECLARE_ macro, so DRM doesn't get a flag-day breakage :-) But for ease of testing, I'll keep the DRM parts in the series. Taking the 1st N commits is normal workflow?

On Fri, Feb 28, 2025 at 9:24 AM Louis Chauvet wrote:
>
>
>
> On 20/02/2025 10:45, Simona Vetter wrote:
> > On Thu, Feb 20, 2025 at 09:31:41AM +0100, Greg KH wrote:
> >> On Fri, Jan 24, 2025 at 11:45:14PM -0700, Jim Cromie wrote:
> >>> This series fixes dynamic-debug's support for DRM debug-categories.
> >>> Classmaps-v1 evaded full review, and got committed in 2 chunks:
> >>>
> >>>    b7b4eebdba7b..6ea3bf466ac6 # core dyndbg changes
> >>>    0406faf25fb1..ee7d633f2dfb # drm adoption
> >>>
> >>> DRM-CI found a regression during init with drm.debug=; the
> >>> static-keys under the drm-dbgs in drm.ko got enabled, but those in
> >>> drivers & helpers did not.
> >>>
> >>> Root Problem:
> >>>
> >>> DECLARE_DYNDBG_CLASSMAP violated a K&R rule "define once, refer
> >>> afterwards". Replace it with DYNDBG_CLASSMAP_DEFINE (invoked once in
> >>> drm-core) and DYNDBG_CLASSMAP_USE (invoked repeatedly, in drivers &
> >>> helpers).
> >>>
> >>> _DEFINE exports the classmap it creates (in drm.ko), other modules
> >>> _USE the classmap. The _USE adds a record ref'g the _DEFINEd (&
> >>> exported) classmap, in a 2nd __dyndbg_class_users section.
> >>>
> >>> So now at modprobe, dyndbg scans the new section after the 1st
> >>> __dyndbg_class_maps section, follows the linkage to the _DEFINEr
> >>> module, finds the (optional) kernel-param controlling the classmap,
> >>> examines its drm.debug=, and applies it to the module being
> >>> initialized.
> >>>
> >>> To recapitulate the multi-module problem w/o DRM involvement, add:
> >>>
> >>> A. tools/testing/selftests/dynamic_debug/*
> >>>
> >>> This alters pr_debugs in the test-modules, counts the results and
> >>> checks them against expectations. It uses this formula to test most
> >>> of the control grammar, including the new class keyword.
> >>>
> >>> B. test_dynamic_debug_submod.ko
> >>>
> >>> This alters the test-module to build both parent & _submod ko's, with
> >>> _DEFINE and _USE inside #if/#else blocks. This recaps DRM's 2-module
> >>> failure scenario, allowing A to exercise several cases.
> >>>
> >>> The #if/#else puts the 2 macro uses together for clarity, and gives
> >>> the 2 modules identical sets of debugs.
> >>>
> >>> Recent DRM-CI tests are here:
> >>>    https://patchwork.freedesktop.org/series/139147/
> >>>
> >>> Previous rev:
> >>>    https://lore.kernel.org/lkml/20240716185806.1572048-1-jim.cro...@gmail.com/
> >>>
> >>> Noteworthy Additions:
> >>>
> >>> 1- drop class "protection" special case, per JBaron's preference.
> >>>    only current use is marked BROKEN so nobody to affect.
> >>>    now framed as policy-choice:
> >>>      #define ddebug_client_module_protects_classes() false
> >>>    subsystems wanting protection can change this.
> >>>
> >>> 2- compile-time arg-tests in DYNDBG_CLASSMAP_DEFINE
> >>>    implement several required constraints, and fail obviously.
> >>>
> >>> 3- modprobe-time check of conflicting class-id reservations
> >>>    only affects 2+ classmaps users.
> >>>    compile-time solution not apparent.
> >>>
> >>> 4- dyndbg can now cause modprobe to fail.
> >>>    needed to catch 3.
> >>>    maybe some loose ends here on failure.
> >>>
> >>> 5- refactor & rename ddebug_attach_*module_classes
> >>>    reduce repetitive boilerplate on 2 types: maps, users.
> >>>    rework mostly brought forward in patchset to reduce churn.
> >>>    TBD: maybe squash more.
> >>>
> >>> Several recent trybot submissions (against drm-tip) have been passing
> >>> CI.BAT, and failing one or a few CI.IGT tests randomly; re-tests do not
> >>> reliably repeat the failures.
> >>>
> >>> It's also at github.com:jimc/linux.git
> >>>    dd-fix-9[st]-ontip & dd-fix-9-13
> >>>
> >>> I've been running it on my desktop w/o issues.
> >>>
> >>> The drivers/gpu/drm patches are RFC; I think there might be a single
> >>> place to call DRM_CLASSMAP_USE(drm_dedbug_classes) to replace the
> >>> sprinkling of _USEs in drivers and helpers. IIRC, I tried adding a
> >>> _DEFINE into drm_drv.c, that didn't do it, so I punted for now.
> >>>
> >>> I think the dyndbg core additions are ready for review and merging
> >>> into a (next-next) test/integration tree.
> >>
> >> So whose tree should this go through?
> >
> > I'm trying to get some drm folks to review/test this, but thus far not
> > much success :-/ I think it's good stuff, but I'm somewhat hesitant if no
>
> I tested the VKMS driver with this, and it works!
>
> Tested-by:
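To make the DEFINE/USE split concrete, here is a rough sketch of the two-module pattern the cover letter describes for test_dynamic_debug(_submod).ko. Only the macro names and the #if/#else layout come from the series; the classmap name, the argument list, and the class strings below are hypothetical, chosen purely for illustration.

/* One source file built twice: as the parent module and, with
 * TEST_DYNAMIC_DEBUG_SUBMOD defined, as the submodule. */
#ifdef TEST_DYNAMIC_DEBUG_SUBMOD
/* Submodule: only reference the classmap defined (and exported) by the parent. */
DYNDBG_CLASSMAP_USE(test_classes);
#else
/* Parent: define the classmap once; argument list is a hypothetical sketch,
 * not the exact signature from the series. */
DYNDBG_CLASSMAP_DEFINE(test_classes, 0, "D2_CORE", "D2_KMS");
#endif

Both modules then carry pr_debug()s in those classes, and a query such as "class D2_CORE +p" written to /proc/dynamic_debug/control (or set via the defining module's bound kernel-param) is meant to reach the debug sites in every module that _USEs the map — exactly the multi-module case the selftest exercises.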
[PATCH] drm/amdgpu: Add debug masks for HDCP LC FW testing
HDCP Locality Check is being moved to FW, add debug flags to control its behavior in existing hardware for validation purposes.

Signed-off-by: Dominik Kaszewski
---
 drivers/gpu/drm/amd/include/amd_shared.h | 12
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/include/amd_shared.h b/drivers/gpu/drm/amd/include/amd_shared.h
index 485b713cfad0..4c95b885d1d0 100644
--- a/drivers/gpu/drm/amd/include/amd_shared.h
+++ b/drivers/gpu/drm/amd/include/amd_shared.h
@@ -358,6 +358,18 @@ enum DC_DEBUG_MASK {
 	 * @DC_DISABLE_CUSTOM_BRIGHTNESS_CURVE: If set, disable support for custom brightness curves
 	 */
 	DC_DISABLE_CUSTOM_BRIGHTNESS_CURVE = 0x4,
+
+	/**
+	 * @DC_HDCP_LC_FORCE_FW_ENABLE: If set, use HDCP Locality Check FW
+	 * path regardless of reported HW capabilities.
+	 */
+	DC_HDCP_LC_FORCE_FW_ENABLE = 0x8,
+
+	/**
+	 * @DC_HDCP_LC_ENABLE_SW_FALLBACK: If set, upon HDCP Locality Check FW
+	 * path failure, retry using legacy SW path.
+	 */
+	DC_HDCP_LC_ENABLE_SW_FALLBACK = 0x10,
 };

 enum amd_dpm_forced_level;
--
2.43.0
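Since these are DC_DEBUG_MASK bits, a consumer would typically gate the two behaviors with plain bit tests. The sketch below is illustrative only: the function and parameter names are assumptions, the stand-in #defines simply mirror the values quoted in the patch, and the real plumbing of the mask (in amdgpu, presumably the dcdebugmask module parameter) is not shown.

#include <stdbool.h>

/* Stand-ins so this sketch compiles on its own; the real values come from
 * enum DC_DEBUG_MASK in amd_shared.h (see the patch above). */
#ifndef DC_HDCP_LC_FORCE_FW_ENABLE
#define DC_HDCP_LC_FORCE_FW_ENABLE    0x8
#define DC_HDCP_LC_ENABLE_SW_FALLBACK 0x10
#endif

/* Force the FW locality-check path even if HW does not report support. */
static bool hdcp_lc_use_fw_path(unsigned int debug_mask, bool hw_supports_fw_lc)
{
	return hw_supports_fw_lc || (debug_mask & DC_HDCP_LC_FORCE_FW_ENABLE);
}

/* Retry with the legacy SW path only when the fallback flag is set. */
static bool hdcp_lc_retry_in_sw(unsigned int debug_mask, bool fw_lc_failed)
{
	return fw_lc_failed && (debug_mask & DC_HDCP_LC_ENABLE_SW_FALLBACK);
}

On a validation system both flags would presumably be exercised by setting the corresponding bits of the DC debug mask at module load time, using the values as numbered in this patch.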
RE: [PATCH] drm/amdgpu: Enable ACA by default for psp v13_0_6/v13_0_14
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Hawking Zhang

Regards,
Hawking

-----Original Message-----
From: Liu, Xiang(Dean)
Sent: Thursday, March 13, 2025 15:17
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Zhou1, Tao ; Liu, Xiang(Dean)
Subject: [PATCH] drm/amdgpu: Enable ACA by default for psp v13_0_6/v13_0_14

Enable ACA by default for psp v13_0_6/v13_0_14.

Signed-off-by: Xiang Liu
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 7cf8a3036828..cfec29835634 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -3785,9 +3785,11 @@ static void amdgpu_ras_check_supported(struct amdgpu_device *adev)
 	adev->ras_enabled = amdgpu_ras_enable == 0 ? 0 :
 		adev->ras_hw_enabled & amdgpu_ras_mask;

-	/* aca is disabled by default except for psp v13_0_12 */
+	/* aca is disabled by default except for psp v13_0_6/v13_0_12/v13_0_14 */
 	adev->aca.is_enabled =
-		(amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 12));
+		(amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 6) ||
+		 amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 12) ||
+		 amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 14));

 	/* bad page feature is not applicable to specific app platform */
 	if (adev->gmc.is_app_apu &&
--
2.34.1
Re: [PATCH 1/2] drm: Create an app info option for wedge events
On 12/03/2025 07:06, Raag Jadav wrote:
On Tue, Mar 11, 2025 at 07:09:45PM +0200, Raag Jadav wrote:
On Mon, Mar 10, 2025 at 06:27:53PM -0300, André Almeida wrote:
On 01/03/2025 02:53, Raag Jadav wrote:
On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:

Hi Raag,

On 2/28/25 11:20, Raag Jadav wrote:

Cc: Lucas

On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:

When a device gets wedged, it might be caused by a guilty application. For userspace, knowing which app was the cause can be useful in some situations, like implementing a policy, logging, or giving the compositor a chance to let the user know which app caused the problem.

This is an optional argument: when `PID=-1` there's no information about which app caused the problem, or whether any app was involved during the hang. Sometimes just the PID isn't enough, given that the app might already be dead by the time userspace tries to look up that PID's name, so to make life easier also notify the app's name in the user event.

Signed-off-by: André Almeida

[...]

 	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
@@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
 	drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
 		 "but recovered through reset" : "needs recovery");
+	if (info) {
+		snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
+		snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
+	} else {
+		snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
+		snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
+	}

This is not much use for wedge cases that need recovery, since at that point the userspace will need to clean house anyway. Which leaves us with only the 'none' case and perhaps the need for standardization of "optional telemetry collection". Thoughts?

I had the feeling that 'none' was already meant to be used for that. Do you think we should move to another naming? Given that we didn't reach the merge window yet, we could potentially change that name without much damage.

No, I meant thoughts on possible telemetry data that the drivers might think is useful for userspace (along with PID) and can be presented in a vendor-agnostic manner (just like the wedged event).

I'm not sure I agree that this will only be used for telemetry and for the `none` use case. As stated by Xaver, there's a use case for knowing which app caused the device to get wedged (like switching to software rendering) and for displaying something for the user after the recovery is done (e.g. "The game stopped working and Plasma has reset").

Sure, but since this information is already available in coredump, I was hoping to have something like a standardized DRM-level coredump with both vendor-specific and agnostic sections, which the drivers can (and hopefully transition to) use in conjunction with the wedged event to provide wider telemetry that is useful for all wedge cases. This is more useful because:

1. It gives drivers an opportunity to present the telemetry that they are interested in without needing to add a new event string (like PID or APP) for their case.

2. When we consider wedging as a use case, there's a lot more that goes into it than an application that might be behaving strangely. So wider telemetry is what I would hope to look at in such a scenario.

I agree that coredump is the way to go for telemetry; we already have the name and PID of the guilty app there, along with more information about the GPU state.
But I don't think it should be consumed as uAPI. Even if we wire up some common DRM code for that, I don't think we can guarantee its stability the way we can for a uevent. Coredump can be disabled, and by default it is only accessible by root. So I think coredump is really good after the fact, if the user is willing to go ahead and report a bug somewhere. But for the goal of notifying the compositor, the same uevent that the compositor is already listening to will have everything it needs to deal with this reset.
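On the compositor side, consuming that event boils down to watching the drm subsystem for uevents and reading the WEDGED/PID/APP properties. A minimal libudev sketch follows; the property names are taken from this thread, while the overall structure, error handling, and policy are assumptions about how a consumer might wire it up (build with -ludev).

#include <libudev.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");
	struct pollfd pfd;

	/* Only care about events from the drm subsystem. */
	udev_monitor_filter_add_match_subsystem_devtype(mon, "drm", NULL);
	udev_monitor_enable_receiving(mon);
	pfd.fd = udev_monitor_get_fd(mon);
	pfd.events = POLLIN;

	for (;;) {
		struct udev_device *dev;
		const char *wedged, *pid, *app;

		if (poll(&pfd, 1, -1) <= 0)
			continue;
		dev = udev_monitor_receive_device(mon);
		if (!dev)
			continue;

		wedged = udev_device_get_property_value(dev, "WEDGED");
		if (wedged) {
			/* PID/APP are the optional keys discussed above. */
			pid = udev_device_get_property_value(dev, "PID");
			app = udev_device_get_property_value(dev, "APP");
			printf("device wedged: method=%s pid=%s app=%s\n",
			       wedged, pid ? pid : "-1", app ? app : "none");
			/* A real compositor would apply its policy here. */
		}
		udev_device_unref(dev);
	}

	/* Unreachable in this sketch. */
	udev_monitor_unref(mon);
	udev_unref(udev);
	return 0;
}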