RE: [PATCH] drm/amdgpu: drop drm_firmware_drivers_only()

2025-03-13 Thread Russell, Kent
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Kent Russell 



> -----Original Message-----
> From: amd-gfx  On Behalf Of Alex
> Deucher
> Sent: Thursday, March 13, 2025 5:05 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander 
> Subject: [PATCH] drm/amdgpu: drop drm_firmware_drivers_only()
>
> There are a number of systems and cloud providers out there
> that have nomodeset hardcoded in their kernel parameters
> to block nouveau for the nvidia driver.  This prevents the
> amdgpu driver from loading. Unfortunately the end user cannot
> easily change this.  The preferred way to block modules from
> loading is to use modprobe.blacklist=.  That is what
> providers should be using to block specific drivers.
>
> Drop the check to allow the driver to load even when nomodeset
> is specified on the kernel command line.
>
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 3b34fdd105937..dd86661153582 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -3022,9 +3022,6 @@ static int __init amdgpu_init(void)
>  {
>   int r;
>
> - if (drm_firmware_drivers_only())
> - return -EINVAL;
> -
>   r = amdgpu_sync_init();
>   if (r)
>   goto error_sync;
> --
> 2.48.1
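To illustrate the recommendation in the commit message above: rather than hardcoding nomodeset (which blocks all modesetting drivers, amdgpu included), a provider can blacklist only the driver it wants to keep out. A hedged sketch (nouveau and the GRUB variable are just the common example, not taken from the patch):

```shell
# Hypothetical /etc/default/grub fragment: block a single driver via
# modprobe.blacklist instead of disabling all modesetting drivers
# with nomodeset.
blocked=nouveau
cmdline="modprobe.blacklist=${blocked}"
# nomodeset would also have prevented amdgpu from loading; this is targeted.
echo "GRUB_CMDLINE_LINUX_DEFAULT=\"quiet ${cmdline}\""
```

After editing, the usual `update-grub`/`grub2-mkconfig` step regenerates the boot configuration.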



Re: [PATCH 5/8] drm/amdgpu: rework how the cleaner shader is emitted v3

2025-03-13 Thread SRINIVASAN SHANMUGAM



On 2/18/2025 9:43 PM, Christian König wrote:

Instead of emitting the cleaner shader for every job which has the
enforce_isolation flag set only emit it for the first submission from
every client.

v2: add missing NULL check
v3: fix another NULL pointer deref

Signed-off-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 27 --
  1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index c9c48b782ec1..5323dba2d688 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -643,6 +643,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
amdgpu_job *job,
bool need_pipe_sync)
  {
struct amdgpu_device *adev = ring->adev;
+   struct amdgpu_isolation *isolation = &adev->isolation[ring->xcp_id];
unsigned vmhub = ring->vm_hub;
struct amdgpu_vmid_mgr *id_mgr = &adev->vm_manager.id_mgr[vmhub];
struct amdgpu_vmid *id = &id_mgr->ids[job->vmid];
@@ -650,8 +651,9 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
amdgpu_job *job,
bool gds_switch_needed = ring->funcs->emit_gds_switch &&
job->gds_switch_needed;
bool vm_flush_needed = job->vm_needs_flush;
-   struct dma_fence *fence = NULL;
+   bool cleaner_shader_needed = false;
bool pasid_mapping_needed = false;
+   struct dma_fence *fence = NULL;
unsigned int patch;
int r;
  
@@ -674,8 +676,12 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job,

pasid_mapping_needed &= adev->gmc.gmc_funcs->emit_pasid_mapping &&
ring->funcs->emit_wreg;
  
+	cleaner_shader_needed = adev->gfx.enable_cleaner_shader &&

+   ring->funcs->emit_cleaner_shader && job->base.s_fence &&
+   &job->base.s_fence->scheduled == isolation->spearhead;
+
if (!vm_flush_needed && !gds_switch_needed && !need_pipe_sync &&
-   !(job->enforce_isolation && !job->vmid))
+   !cleaner_shader_needed)
return 0;
  
  	amdgpu_ring_ib_begin(ring);

@@ -686,9 +692,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
amdgpu_job *job,
if (need_pipe_sync)
amdgpu_ring_emit_pipeline_sync(ring);
  
-	if (adev->gfx.enable_cleaner_shader &&

-   ring->funcs->emit_cleaner_shader &&
-   job->enforce_isolation)
+   if (cleaner_shader_needed)
ring->funcs->emit_cleaner_shader(ring);
  
  	if (vm_flush_needed) {

@@ -710,7 +714,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
amdgpu_job *job,
job->oa_size);
}
  
-	if (vm_flush_needed || pasid_mapping_needed) {

+   if (vm_flush_needed || pasid_mapping_needed || cleaner_shader_needed) {
r = amdgpu_fence_emit(ring, &fence, NULL, 0);
if (r)
return r;
@@ -732,6 +736,17 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
amdgpu_job *job,
id->pasid_mapping = dma_fence_get(fence);
mutex_unlock(&id_mgr->lock);
}
+
+   /*
+* Make sure that all other submissions wait for the cleaner shader to
+* finish before we push them to the HW.
+*/
+   if (cleaner_shader_needed) {
+   mutex_lock(&adev->enforce_isolation_mutex);

Should we ensure spearhead is valid before putting it? e.g.:
if (isolation->spearhead)
    dma_fence_put(isolation->spearhead);

+   dma_fence_put(isolation->spearhead);
+   isolation->spearhead = dma_fence_get(fence);
+   mutex_unlock(&adev->enforce_isolation_mutex);
+   }
dma_fence_put(fence);
  
  	amdgpu_ring_patch_cond_exec(ring, patch);
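On the reviewer question above about checking isolation->spearhead before putting it: dma_fence_put() in the kernel is a no-op for a NULL pointer, so the unconditional put is safe. A toy sketch of that contract (toy_fence is illustrative only, not the real struct dma_fence):

```c
#include <assert.h>
#include <stddef.h>

/* Toy refcount type standing in for struct dma_fence; only the get/put
 * NULL-handling contract is mirrored here, not the real implementation. */
struct toy_fence { int refcount; };

/* Like dma_fence_get(): taking a reference on NULL just returns NULL. */
static struct toy_fence *toy_fence_get(struct toy_fence *f)
{
	if (f)
		f->refcount++;
	return f;
}

/* Like dma_fence_put(): dropping a NULL fence is a no-op, so callers do
 * not need an explicit NULL check before putting. */
static void toy_fence_put(struct toy_fence *f)
{
	if (!f)
		return;
	f->refcount--;
}
```

Under that contract, `dma_fence_put(isolation->spearhead)` is correct even on the first submission, when spearhead is still NULL.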


Re: [PATCH 1/8] drm/amdgpu: grab an additional reference on the gang fence v2

2025-03-13 Thread SRINIVASAN SHANMUGAM



On 3/7/2025 7:18 PM, Christian König wrote:

We keep the gang submission fence around in adev, make sure that it
stays alive.

v2: fix memory leak on retry

Signed-off-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 198d29faa754..337543ec615c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6889,18 +6889,26 @@ struct dma_fence *amdgpu_device_switch_gang(struct 
amdgpu_device *adev,
  {
struct dma_fence *old = NULL;
  
+	dma_fence_get(gang);

do {
dma_fence_put(old);
old = amdgpu_device_get_gang(adev);
if (old == gang)
break;
  
-		if (!dma_fence_is_signaled(old))

+   if (!dma_fence_is_signaled(old)) {


Here, should we also check that old is non-NULL? e.g.:

    // Check if the old fence isn't signaled
    if (old && !dma_fence_is_signaled(old)) {


+   dma_fence_put(gang);
return old;
+   }
  
  	} while (cmpxchg((struct dma_fence __force **)&adev->gang_submit,

 old, gang) != old);
  
+	/*

+* Drop it once for the exchanged reference in adev and once for the
+* thread local reference acquired in amdgpu_device_get_gang().
+*/
+   dma_fence_put(old);


if (old)
    dma_fence_put(old); // Release the old reference only if it is valid?




dma_fence_put(old);
return NULL;
  }
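The cmpxchg() loop in the quoted patch follows a standard lock-free publish pattern: read the slot, bail out if our value is already installed, otherwise try to exchange and retry on interference. A minimal user-space sketch with C11 atomics (illustrative only; the fence reference counting from the real code is omitted, and the return convention is simplified):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* One shared slot, standing in for adev->gang_submit. */
static _Atomic(void *) slot;

/* Install newp only if the slot still holds the value we last observed;
 * atomic_compare_exchange_weak() reloads `old` on failure, so the loop
 * naturally retries against the current occupant. */
static void *switch_slot(void *newp)
{
	void *old = atomic_load(&slot);

	do {
		if (old == newp)
			return newp;	/* already installed */
	} while (!atomic_compare_exchange_weak(&slot, &old, newp));

	return old;			/* previous occupant, now replaced */
}
```

The real function additionally refuses the switch while the old gang is unsignaled, which is why the extra get/put added by the patch is needed on the early-return path.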


Re: commit 7ffb791423c7 breaks steam game

2025-03-13 Thread Balbir Singh
On 3/14/25 09:22, Bert Karwatzki wrote:
> On Friday, 2025-03-14 at 08:54 +1100, Balbir Singh wrote:
>> On 3/14/25 05:12, Bert Karwatzki wrote:
>>> On Thursday, 2025-03-13 at 22:47 +1100, Balbir Singh wrote:


 Anyway, I think the nokaslr result is interesting, it seems like with 
 nokaslr
 even the older kernels have problems with the game

 Could you confirm if with nokaslr

>>> Now I've tested kernel 6.8.12 with nokaslr
>>>
 1. Only one single game stellaris is not working?
 2. The entire laptop does not work?
 3. Laptop works and other games work? Just one game is not working as
>>> expected?
>>>
>>>
>>> Stellaris is showing the input lag and the entire graphical user interface 
>>> shows
>>> the same input lag as long as stellaris is running.
>>> Civilization 6 shows the same input lag as stellaris, probably even worse.
>>> Magic the Gathering: Arena (with wine) works normally.
>>> Valheim also works normally.
>>> Crusader Kings 2 works normally
>>> Rogue Heroes: Ruins of Tasos (a Zelda lookalike) works normally.
>>> Baldur's Gate I & II and Icewind Dale work normally.
>>>
>>> Also the input lag is only in the GUI, if I switch to a text console (ctrl 
>>> + alt
>>> + Fn), input works normally even while the affected games are running.
>>>
>>> Games aside everything else (e.g. compiling kernels) seems to work with 
>>> nokaslr.
>>>
>>
>> Would it be fair to assume that anything Xorg/Wayland is working fine
>> when the game is not running, even with nokaslr?
>>
> Yes, Xorg (I'm normally using xfce4 as desktop) works fine. I also tested with
> gnome using Xwayland, here the buggy behaviour also exists, with the addition
> that mouse position is off, i.e. to click a button in the game you have to 
> click
> somewhat above it.

So the issue is narrowed down to just the games you've mentioned with 
nokaslr/patch?

> 
>> +amd-gfx@lists.freedesktop.org to see if there are known issues with
>> nokaslr and the games you mentioned.
>>
>>
>> Balbir Singh
>>
>> PS: I came across an interesting link
>> https://www.alex-ionescu.com/behind-windows-x64s-44-bit-memory-addressing-limit/
>>
>> I think SLIST_HEADER is used by wine as well for user space and I am not sure
>> if in this situation the game is hitting this scenario, but surprisingly the 
>> other
>> games are not. This is assuming the game uses wine. I am not sure it's 
>> related,
>> but the 44 bits caught my attention.
> 
> Stellaris is a native linux game (x86_64), the one game (MTGA) I tested with
> wine worked fine.
> 

Thanks for the update! I wonder if there are any game logs. If you can look for
any warnings/errors, it might be interesting to see where the difference is
coming from.

> By the way, the warning
> [ T8562] WARNING: CPU: 14 PID: 8562 at mm/slub.c:5028
> __kvmalloc_node_noprof+0x2fd/0x360
> that appeared in the dmesg I sent you is caused by the upgrade of mesa from
> 25.0.0 to 25.0.1. (I'm still bisecting ...)
> 

Thanks!
Balbir



Re: [PATCH 5/8] drm/amdgpu: rework how the cleaner shader is emitted v3

2025-03-13 Thread SRINIVASAN SHANMUGAM


On 3/7/2025 7:18 PM, Christian König wrote:

Instead of emitting the cleaner shader for every job which has the
enforce_isolation flag set only emit it for the first submission from
every client.

v2: add missing NULL check
v3: fix another NULL pointer deref

Signed-off-by: Christian König
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 27 --
  1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index ef4fe2df8398..dc10bea836db 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -643,6 +643,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
amdgpu_job *job,
bool need_pipe_sync)
  {
struct amdgpu_device *adev = ring->adev;
+   struct amdgpu_isolation *isolation = &adev->isolation[ring->xcp_id];
unsigned vmhub = ring->vm_hub;
struct amdgpu_vmid_mgr *id_mgr = &adev->vm_manager.id_mgr[vmhub];
struct amdgpu_vmid *id = &id_mgr->ids[job->vmid];
@@ -650,8 +651,9 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
amdgpu_job *job,
bool gds_switch_needed = ring->funcs->emit_gds_switch &&
job->gds_switch_needed;
bool vm_flush_needed = job->vm_needs_flush;
-   struct dma_fence *fence = NULL;
+   bool cleaner_shader_needed = false;
bool pasid_mapping_needed = false;
+   struct dma_fence *fence = NULL;
unsigned int patch;
int r;
  
@@ -674,8 +676,12 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct amdgpu_job *job,

pasid_mapping_needed &= adev->gmc.gmc_funcs->emit_pasid_mapping &&
ring->funcs->emit_wreg;
  
+	cleaner_shader_needed = adev->gfx.enable_cleaner_shader &&

+   ring->funcs->emit_cleaner_shader && job->base.s_fence &&
+   &job->base.s_fence->scheduled == isolation->spearhead;
+
if (!vm_flush_needed && !gds_switch_needed && !need_pipe_sync &&
-   !(job->enforce_isolation && !job->vmid))
+   !cleaner_shader_needed)
return 0;
  
  	amdgpu_ring_ib_begin(ring);

@@ -686,9 +692,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
amdgpu_job *job,
if (need_pipe_sync)
amdgpu_ring_emit_pipeline_sync(ring);
  
-	if (adev->gfx.enable_cleaner_shader &&

-   ring->funcs->emit_cleaner_shader &&
-   job->enforce_isolation)
+   if (cleaner_shader_needed)


Here, should we also check for ring->funcs->emit_cleaner_shader? e.g.:

if (cleaner_shader_needed && ring->funcs->emit_cleaner_shader)


ring->funcs->emit_cleaner_shader(ring);
  
  	if (vm_flush_needed) {

@@ -710,7 +714,7 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
amdgpu_job *job,
job->oa_size);
}
  
-	if (vm_flush_needed || pasid_mapping_needed) {

+   if (vm_flush_needed || pasid_mapping_needed || cleaner_shader_needed) {
r = amdgpu_fence_emit(ring, &fence, NULL, 0);
if (r)
return r;
@@ -732,6 +736,17 @@ int amdgpu_vm_flush(struct amdgpu_ring *ring, struct 
amdgpu_job *job,
id->pasid_mapping = dma_fence_get(fence);
mutex_unlock(&id_mgr->lock);
}
+
+   /*
+* Make sure that all other submissions wait for the cleaner shader to
+* finish before we push them to the HW.
+*/
+   if (cleaner_shader_needed) {
+   mutex_lock(&adev->enforce_isolation_mutex);
+   dma_fence_put(isolation->spearhead);
+   isolation->spearhead = dma_fence_get(fence);
+   mutex_unlock(&adev->enforce_isolation_mutex);
+   }
dma_fence_put(fence);
  
  	amdgpu_ring_patch_cond_exec(ring, patch);

Re: [PATCH] drm/amd/display: avoid NPD when ASIC does not support DMUB

2025-03-13 Thread Leo Li




On 2025-03-13 07:29, Thadeu Lima de Souza Cascardo wrote:

On Wed, Feb 05, 2025 at 10:06:38AM -0300, Thadeu Lima de Souza Cascardo wrote:

ctx->dmub_srv will be NULL if the ASIC does not support DMUB, which is
tested in dm_dmub_sw_init.

However, it will be dereferenced in dmub_hw_lock_mgr_cmd if
should_use_dmub_lock returns true.

This has been the case since dmub support has been added for PSR1.


This bug has landed on stable trees. Any chance for a review here?

Thanks.
Cascardo.


Thanks for the ping and fix!

Reviewed-by: Leo Li 





Fix this by checking for dmub_srv in should_use_dmub_lock.

[   37.440832] BUG: kernel NULL pointer dereference, address: 0058
[   37.447808] #PF: supervisor read access in kernel mode
[   37.452959] #PF: error_code(0x) - not-present page
[   37.458112] PGD 0 P4D 0
[   37.460662] Oops: Oops:  [#1] PREEMPT SMP NOPTI
[   37.465553] CPU: 2 UID: 1000 PID: 1745 Comm: DrmThread Not tainted 
6.14.0-rc1-3-gd62e938120f0 #23 99720e1cb1e0fc4773b8513150932a07de3c6e88
[   37.478324] Hardware name: Google Morphius/Morphius, BIOS 
Google_Morphius.13434.858.0 10/26/2023
[   37.487103] RIP: 0010:dmub_hw_lock_mgr_cmd+0x77/0xb0
[   37.492074] Code: 44 24 0e 00 00 00 00 48 c7 04 24 45 00 00 0c 40 88 74 24 0d 0f 
b6 02 88 44 24 0c 8b 01 89 44 24 08 85 f6 75 05 c6 44 24 0e 01 <48> 8b 7f 58 48 
89 e6 ba 01 00 00 00 e8 08 3c 2a 00 65 48 8b 04 5
[   37.510822] RSP: 0018:969442853300 EFLAGS: 00010202
[   37.516052] RAX:  RBX: 92db0300 RCX: 969442853358
[   37.523185] RDX: 969442853368 RSI: 0001 RDI: 
[   37.530322] RBP: 0001 R08: 04a7 R09: 04a5
[   37.537453] R10: 0476 R11: 0062 R12: 92db0ade8000
[   37.544589] R13: 92da01180ae0 R14: 92da011802a8 R15: 92db0300
[   37.551725] FS:  784a9cdfc6c0() GS:92db2af0() 
knlGS:
[   37.559814] CS:  0010 DS:  ES:  CR0: 80050033
[   37.565562] CR2: 0058 CR3: 000112b1c000 CR4: 003506f0
[   37.572697] Call Trace:
[   37.575152]  
[   37.577258]  ? __die_body+0x66/0xb0
[   37.580756]  ? page_fault_oops+0x3e7/0x4a0
[   37.584861]  ? exc_page_fault+0x3e/0xe0
[   37.588706]  ? exc_page_fault+0x5c/0xe0
[   37.592550]  ? asm_exc_page_fault+0x22/0x30
[   37.596742]  ? dmub_hw_lock_mgr_cmd+0x77/0xb0
[   37.601107]  dcn10_cursor_lock+0x1e1/0x240
[   37.605211]  program_cursor_attributes+0x81/0x190
[   37.609923]  commit_planes_for_stream+0x998/0x1ef0
[   37.614722]  update_planes_and_stream_v2+0x41e/0x5c0
[   37.619703]  dc_update_planes_and_stream+0x78/0x140
[   37.624588]  amdgpu_dm_atomic_commit_tail+0x4362/0x49f0
[   37.629832]  ? srso_return_thunk+0x5/0x5f
[   37.633847]  ? mark_held_locks+0x6d/0xd0
[   37.637774]  ? _raw_spin_unlock_irq+0x24/0x50
[   37.642135]  ? srso_return_thunk+0x5/0x5f
[   37.646148]  ? lockdep_hardirqs_on+0x95/0x150
[   37.650510]  ? srso_return_thunk+0x5/0x5f
[   37.654522]  ? _raw_spin_unlock_irq+0x2f/0x50
[   37.658883]  ? srso_return_thunk+0x5/0x5f
[   37.662897]  ? wait_for_common+0x186/0x1c0
[   37.666998]  ? srso_return_thunk+0x5/0x5f
[   37.671009]  ? drm_crtc_next_vblank_start+0xc3/0x170
[   37.675983]  commit_tail+0xf5/0x1c0
[   37.679478]  drm_atomic_helper_commit+0x2a2/0x2b0
[   37.684186]  drm_atomic_commit+0xd6/0x100
[   37.688199]  ? __cfi___drm_printfn_info+0x10/0x10
[   37.692911]  drm_atomic_helper_update_plane+0xe5/0x130
[   37.698054]  drm_mode_cursor_common+0x501/0x670
[   37.702600]  ? __cfi_drm_mode_cursor_ioctl+0x10/0x10
[   37.707572]  drm_mode_cursor_ioctl+0x48/0x70
[   37.711851]  drm_ioctl_kernel+0xf2/0x150
[   37.715781]  drm_ioctl+0x363/0x590
[   37.719189]  ? __cfi_drm_mode_cursor_ioctl+0x10/0x10
[   37.724165]  amdgpu_drm_ioctl+0x41/0x80
[   37.728013]  __se_sys_ioctl+0x7f/0xd0
[   37.731685]  do_syscall_64+0x87/0x100
[   37.735355]  ? vma_end_read+0x12/0xe0
[   37.739024]  ? srso_return_thunk+0x5/0x5f
[   37.743041]  ? find_held_lock+0x47/0xf0
[   37.746884]  ? vma_end_read+0x12/0xe0
[   37.750552]  ? srso_return_thunk+0x5/0x5f
[   37.754565]  ? lock_release+0x1c4/0x2e0
[   37.758406]  ? vma_end_read+0x12/0xe0
[   37.762079]  ? exc_page_fault+0x84/0xe0
[   37.765921]  ? srso_return_thunk+0x5/0x5f
[   37.769938]  ? lockdep_hardirqs_on+0x95/0x150
[   37.774303]  ? srso_return_thunk+0x5/0x5f
[   37.778317]  ? exc_page_fault+0x84/0xe0
[   37.782163]  entry_SYSCALL_64_after_hwframe+0x55/0x5d
[   37.787218] RIP: 0033:0x784aa5ec3059
[   37.790803] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 
00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 
f0 ff ff 77 1d 48 8b 45 c8 64 48 2b 04 25 28 00 0
[   37.809553] RSP: 002b:784a9cdf90e0 EFLAGS: 0246 ORIG_RAX: 
0010
[   37.817121] RAX: ffda RBX: 784a9cdf917c RCX: 784aa5ec3059
[   37.824256] RDX: 784a9cdf917c RSI: c01c64a3 RDI: 0
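The fix under review amounts to an early NULL guard in should_use_dmub_lock() so that dmub_hw_lock_mgr_cmd() is never reached with a NULL ctx->dmub_srv. A hedged stand-alone sketch (the struct layout and second parameter are invented for illustration; only the guard itself reflects the commit message):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented minimal context; the real dc context is far larger. */
struct toy_ctx { void *dmub_srv; };

/* Decide whether the DMUB HW lock manager may be used: on an ASIC
 * without DMUB support the service pointer is NULL, so answer "no"
 * before any other criteria are consulted. */
static bool should_use_dmub_lock(const struct toy_ctx *ctx, bool psr1_active)
{
	if (!ctx->dmub_srv)	/* ASIC without DMUB: never use the lock */
		return false;
	return psr1_active;	/* placeholder for the real criteria */
}
```

With the guard in place, the dereference of ctx->dmub_srv inside the lock manager path can no longer be reached on non-DMUB ASICs, which is exactly the Oops shown in the trace above.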

Re: [PATCH] drm/amd/display: avoid NPD when ASIC does not support DMUB

2025-03-13 Thread Thadeu Lima de Souza Cascardo
On Wed, Feb 05, 2025 at 10:06:38AM -0300, Thadeu Lima de Souza Cascardo wrote:
> ctx->dmub_srv will be NULL if the ASIC does not support DMUB, which is
> tested in dm_dmub_sw_init.
> 
> However, it will be dereferenced in dmub_hw_lock_mgr_cmd if
> should_use_dmub_lock returns true.
> 
> This has been the case since dmub support has been added for PSR1.

This bug has landed on stable trees. Any chance for a review here?

Thanks.
Cascardo.

> 
> Fix this by checking for dmub_srv in should_use_dmub_lock.
> 
> [   37.440832] BUG: kernel NULL pointer dereference, address: 0058
> [   37.447808] #PF: supervisor read access in kernel mode
> [   37.452959] #PF: error_code(0x) - not-present page
> [   37.458112] PGD 0 P4D 0
> [   37.460662] Oops: Oops:  [#1] PREEMPT SMP NOPTI
> [   37.465553] CPU: 2 UID: 1000 PID: 1745 Comm: DrmThread Not tainted 
> 6.14.0-rc1-3-gd62e938120f0 #23 99720e1cb1e0fc4773b8513150932a07de3c6e88
> [   37.478324] Hardware name: Google Morphius/Morphius, BIOS 
> Google_Morphius.13434.858.0 10/26/2023
> [   37.487103] RIP: 0010:dmub_hw_lock_mgr_cmd+0x77/0xb0
> [   37.492074] Code: 44 24 0e 00 00 00 00 48 c7 04 24 45 00 00 0c 40 88 74 24 
> 0d 0f b6 02 88 44 24 0c 8b 01 89 44 24 08 85 f6 75 05 c6 44 24 0e 01 <48> 8b 
> 7f 58 48 89 e6 ba 01 00 00 00 e8 08 3c 2a 00 65 48 8b 04 5
> [   37.510822] RSP: 0018:969442853300 EFLAGS: 00010202
> [   37.516052] RAX:  RBX: 92db0300 RCX: 
> 969442853358
> [   37.523185] RDX: 969442853368 RSI: 0001 RDI: 
> 
> [   37.530322] RBP: 0001 R08: 04a7 R09: 
> 04a5
> [   37.537453] R10: 0476 R11: 0062 R12: 
> 92db0ade8000
> [   37.544589] R13: 92da01180ae0 R14: 92da011802a8 R15: 
> 92db0300
> [   37.551725] FS:  784a9cdfc6c0() GS:92db2af0() 
> knlGS:
> [   37.559814] CS:  0010 DS:  ES:  CR0: 80050033
> [   37.565562] CR2: 0058 CR3: 000112b1c000 CR4: 
> 003506f0
> [   37.572697] Call Trace:
> [   37.575152]  
> [   37.577258]  ? __die_body+0x66/0xb0
> [   37.580756]  ? page_fault_oops+0x3e7/0x4a0
> [   37.584861]  ? exc_page_fault+0x3e/0xe0
> [   37.588706]  ? exc_page_fault+0x5c/0xe0
> [   37.592550]  ? asm_exc_page_fault+0x22/0x30
> [   37.596742]  ? dmub_hw_lock_mgr_cmd+0x77/0xb0
> [   37.601107]  dcn10_cursor_lock+0x1e1/0x240
> [   37.605211]  program_cursor_attributes+0x81/0x190
> [   37.609923]  commit_planes_for_stream+0x998/0x1ef0
> [   37.614722]  update_planes_and_stream_v2+0x41e/0x5c0
> [   37.619703]  dc_update_planes_and_stream+0x78/0x140
> [   37.624588]  amdgpu_dm_atomic_commit_tail+0x4362/0x49f0
> [   37.629832]  ? srso_return_thunk+0x5/0x5f
> [   37.633847]  ? mark_held_locks+0x6d/0xd0
> [   37.637774]  ? _raw_spin_unlock_irq+0x24/0x50
> [   37.642135]  ? srso_return_thunk+0x5/0x5f
> [   37.646148]  ? lockdep_hardirqs_on+0x95/0x150
> [   37.650510]  ? srso_return_thunk+0x5/0x5f
> [   37.654522]  ? _raw_spin_unlock_irq+0x2f/0x50
> [   37.658883]  ? srso_return_thunk+0x5/0x5f
> [   37.662897]  ? wait_for_common+0x186/0x1c0
> [   37.666998]  ? srso_return_thunk+0x5/0x5f
> [   37.671009]  ? drm_crtc_next_vblank_start+0xc3/0x170
> [   37.675983]  commit_tail+0xf5/0x1c0
> [   37.679478]  drm_atomic_helper_commit+0x2a2/0x2b0
> [   37.684186]  drm_atomic_commit+0xd6/0x100
> [   37.688199]  ? __cfi___drm_printfn_info+0x10/0x10
> [   37.692911]  drm_atomic_helper_update_plane+0xe5/0x130
> [   37.698054]  drm_mode_cursor_common+0x501/0x670
> [   37.702600]  ? __cfi_drm_mode_cursor_ioctl+0x10/0x10
> [   37.707572]  drm_mode_cursor_ioctl+0x48/0x70
> [   37.711851]  drm_ioctl_kernel+0xf2/0x150
> [   37.715781]  drm_ioctl+0x363/0x590
> [   37.719189]  ? __cfi_drm_mode_cursor_ioctl+0x10/0x10
> [   37.724165]  amdgpu_drm_ioctl+0x41/0x80
> [   37.728013]  __se_sys_ioctl+0x7f/0xd0
> [   37.731685]  do_syscall_64+0x87/0x100
> [   37.735355]  ? vma_end_read+0x12/0xe0
> [   37.739024]  ? srso_return_thunk+0x5/0x5f
> [   37.743041]  ? find_held_lock+0x47/0xf0
> [   37.746884]  ? vma_end_read+0x12/0xe0
> [   37.750552]  ? srso_return_thunk+0x5/0x5f
> [   37.754565]  ? lock_release+0x1c4/0x2e0
> [   37.758406]  ? vma_end_read+0x12/0xe0
> [   37.762079]  ? exc_page_fault+0x84/0xe0
> [   37.765921]  ? srso_return_thunk+0x5/0x5f
> [   37.769938]  ? lockdep_hardirqs_on+0x95/0x150
> [   37.774303]  ? srso_return_thunk+0x5/0x5f
> [   37.778317]  ? exc_page_fault+0x84/0xe0
> [   37.782163]  entry_SYSCALL_64_after_hwframe+0x55/0x5d
> [   37.787218] RIP: 0033:0x784aa5ec3059
> [   37.790803] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 
> 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <41> 89 
> c0 3d 00 f0 ff ff 77 1d 48 8b 45 c8 64 48 2b 04 25 28 00 0
> [   37.809553] RSP: 002b:784a9cdf90e0 EFLAGS: 0246 ORIG_RAX: 
> 0010
> [   37.817121] RAX: ffda RBX: 784a9cdf917c RCX: 

Re: [PATCH] drm/amdgpu/gfx12: correct cleanup of 'me' field with gfx_v12_0_me_fini()

2025-03-13 Thread Alex Deucher
Applied.  Thanks!

Alex

On Wed, Mar 12, 2025 at 6:09 AM Wentao Liang  wrote:
>
> In gfx_v12_0_cp_gfx_load_me_microcode_rs64(), gfx_v12_0_pfp_fini() is
> incorrectly used to free 'me' field of 'gfx', since gfx_v12_0_pfp_fini()
> can only release 'pfp' field of 'gfx'. The release function of 'me' field
> should be gfx_v12_0_me_fini().
>
> Fixes: 52cb80c12e8a ("drm/amdgpu: Add gfx v12_0 ip block support (v6)")
> Cc: sta...@vger.kernel.org # 6.11+
> Signed-off-by: Wentao Liang 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> index da327ab48a57..02bc2eddf0c0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
> @@ -2413,7 +2413,7 @@ static int 
> gfx_v12_0_cp_gfx_load_me_microcode_rs64(struct amdgpu_device *adev)
>   (void **)&adev->gfx.me.me_fw_data_ptr);
> if (r) {
> dev_err(adev->dev, "(%d) failed to create me data bo\n", r);
> -   gfx_v12_0_pfp_fini(adev);
> +   gfx_v12_0_me_fini(adev);
> return r;
> }
>
> --
> 2.42.0.windows.2
>


Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks

2025-03-13 Thread Alex Deucher
I think as long as the locking is correct, the src shouldn't matter.
You just need to stop the kernel queues (and save state) and evict the
user queues (since HWS is responsible for saving the MQDs of the
non-guilty user queues).  If KFD detected the hang (e.g., queue
eviction fails when called for memory pressure, etc.), we just need to
make sure that it's ok for the sdma reset routine to call evict queues
again even if it was already called (presumably it should exit early
since the queues are already evicted).  If KGD initiates the reset, it
will call into KFD to evict queues.  We just need to make sure the
evict queues function we call just evicts the queues and doesn't also
try and reset.

Alex
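Alex's requirement that a second call to the evict-queues function "should exit early since the queues are already evicted" is essentially idempotency; a toy sketch (all names invented, the real KFD eviction path is far more involved):

```c
#include <assert.h>
#include <stdbool.h>

/* Invented state: whether the user queues are currently evicted, and how
 * many times the actual eviction work has run. */
static bool queues_evicted;
static int evict_count;

/* Safe to call from both the KFD hang-detection path and the KGD-initiated
 * SDMA reset path: the second caller becomes a no-op. */
static int evict_queues(void)
{
	if (queues_evicted)	/* already evicted, exit early */
		return 0;
	queues_evicted = true;
	evict_count++;		/* the real eviction work happens once */
	return 0;
}
```

This keeps the reset routine free to call evict_queues() unconditionally, regardless of which side detected the hang first.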

On Wed, Mar 12, 2025 at 5:24 AM Zhang, Jesse(Jie)  wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>
>
>
>
>
> From: Koenig, Christian 
> Sent: Wednesday, March 12, 2025 4:39 PM
> To: Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander ; Kim, Jonathan 
> ; Zhu, Jiadong 
> Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by 
> removing dynamic callbacks
>
>
>
> On 2025-03-12 at 09:15, Zhang, Jesse(Jie) wrote:
>
> [SNIP]
>
> -
>
> + gfx_ring->funcs->stop_queue(adev, instance_id);
>
>
>
> Yeah that starts to look good. Question here is who is calling 
> amdgpu_sdma_reset_engine()?
>
>
>
> If this call comes from engine specific code we might not need the 
> start/stop_queue callbacks all together.
>
>
>
> Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and 
> start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2.
>
>
> Why would the KFD call this as well? Because it detects an issue with an SDMA
> user queue? If yes, I would rather suggest that the KFD calls the reset
> function of the paging queue.
>
> Since this reset function is specific to the SDMA HW generation anyway you 
> don't need those extra functions to abstract starting and stopping of the 
> queue for each HW generation.
>
> kfd can't call the reset function directly, unless we add a parameter src to
> distinguish kfd and kgd in the reset function, like this:
>
> int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, int src);
>
> As Alex said in another thread,
>
> We need to distinguish  kfd and kgd  in reset.
>
> If kfd triggers a reset, kgd must save healthy jobs and recover jobs after 
> reset.
>
> If kgd triggers a reset, kgd must abandon bad jobs after reset (and perhaps
> kfd needs to save its healthy jobs for recovery).
>
>
>
> If we can add a parameter, I am ok for that solution too.
>
>
>
> Additionally:
>
> For sdma6/7, when a queue reset fails, we may need to fall back to an engine 
> reset for an attempt.
>
>
>
> Thanks
>
> Jesse
>
>
> Regards,
> Christian.
>
>
>
>
>
>
> Thanks
>
> Jesse
>
>
>
> Regards,
>
> Christian.
>
>
>
>   /* Perform the SDMA reset for the specified instance */
>
>   ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id);
>
>   if (ret) {
>
> @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, 
> uint32_t instance_id, b
>
>   goto exit;
>
>   }
>
>
>
> - /* Invoke all registered post_reset callbacks */
>
> - list_for_each_entry(funcs, &adev->sdma.reset_callback_list, list) {
>
> - if (funcs->post_reset) {
>
> - ret = funcs->post_reset(adev, instance_id);
>
> - if (ret) {
>
> - dev_err(adev->dev,
>
> - "afterReset callback failed for instance %u: 
> %d\n",
>
> - instance_id, ret);
>
> - goto exit;
>
> - }
>
> - }
>
> - }
>
> + gfx_ring->funcs->start_queue(adev, instance_id);
>
>
>
>  exit:
>
>   /* Restart the scheduler's work queue for the GFX and page rings
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> index fd34dc138081..c1f7ccff9c4e 100644
>
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs 
> sdma_v4_4_2_ring_funcs = {
>
>   .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
>
>   .emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,
>
>   .reset = sdma_v4_4_2_reset_queue,
>
> + .stop_queue = sdma_v4_4_2_stop_queue,
>
> + .start_queue = sdma_v4_4_2_restore_queue,
>
>   .is_guilty = sdma_v4_4_2_ring_is_guilty,  };
>
>
>
>
>
>


Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks

2025-03-13 Thread Christian König
On 2025-03-12 at 10:23, Zhang, Jesse(Jie) wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>
>  
>
>  
>
> *From:*Koenig, Christian 
> *Sent:* Wednesday, March 12, 2025 4:39 PM
> *To:* Zhang, Jesse(Jie) ; amd-gfx@lists.freedesktop.org
> *Cc:* Deucher, Alexander ; Kim, Jonathan 
> ; Zhu, Jiadong 
> *Subject:* Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by 
> removing dynamic callbacks
>
>  
>
> Am 12.03.25 um 09:15 schrieb Zhang, Jesse(Jie):
>
> [SNIP]
>
> -
>
> + gfx_ring->funcs->stop_queue(adev, instance_id);
>
>  
>
> Yeah that starts to look good. Question here is who is calling 
> amdgpu_sdma_reset_engine()?
>
>  
>
> If this call comes from engine specific code we might not need the 
> start/stop_queue callbacks all together.
>
>  
>
>     Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and 
> start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2.
>
>
> Why would the KFD call this as well? Because it detects an issue with an SDMA
> user queue? If yes, I would rather suggest that the KFD calls the reset
> function of the paging queue.
>
> Since this reset function is specific to the SDMA HW generation anyway you 
> don't need those extra functions to abstract starting and stopping of the 
> queue for each HW generation.
>
> kfd can't call the reset function directly, unless we add a parameter src to
> distinguish kfd and kgd in the reset function, like this:
>
> int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, int src);
>
> As Alex said in another thread,
>
> We need to distinguish  kfd and kgd  in reset.
>
> If kfd triggers a reset, kgd must save healthy jobs and recover jobs after 
> reset.
>
> If kgd triggers a reset, kgd must abandon bad jobs after reset (and perhaps
> kfd needs to save its healthy jobs for recovery).
>

I don't think the source of the reset should be relevant to the reset procedure.

The source is basically just the first one to run into a timeout; that can be
either KFD or KGD.

But the cause of the timeout is not necessarily the one who signals that a
timeout happens.

So as far as I can see you should not have that as parameter either.

>  
>
> If we can add a parameter, I am ok for that solution too.
>
>  
>
> Additionally:
>
> For sdma6/7, when a queue reset fails, we may need to fall back to an engine 
> reset for an attempt.
>

Yeah, but that should be trivial.

Regards,
Christian.

>  
>
> Thanks
>
> Jesse
>
>
> Regards,
> Christian.
>
>
>  
>
>  
>
> Thanks
>
> Jesse
>
>  
>
> Regards,
>
> Christian.
>
>  
>
>   /* Perform the SDMA reset for the specified instance */
>
>   ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id);
>
>   if (ret) {
>
> @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct 
> amdgpu_device *adev, uint32_t instance_id, b
>
>   goto exit;
>
>   }
>
>  
>
> - /* Invoke all registered post_reset callbacks */
>
> - list_for_each_entry(funcs, &adev->sdma.reset_callback_list, 
> list) {
>
> - if (funcs->post_reset) {
>
> - ret = funcs->post_reset(adev, instance_id);
>
> - if (ret) {
>
> - dev_err(adev->dev,
>
> - "afterReset callback failed for 
> instance %u: %d\n",
>
> - instance_id, ret);
>
> - goto exit;
>
> - }
>
> - }
>
> - }
>
> + gfx_ring->funcs->start_queue(adev, instance_id);
>
>  
>
>  exit:
>
>   /* Restart the scheduler's work queue for the GFX and page rings
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> index fd34dc138081..c1f7ccff9c4e 100644
>
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
> @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs 
> sdma_v4_4_2_ring_funcs = {
>
>   .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
>
>   .emit_reg_write_reg_wait = 
> amdgpu_ring_emit_reg_write_reg_wait_helper,
>
>   .reset = sdma_v4_4_2_reset_queue,
>
> + .stop_queue = sdma_v4_4_2_stop_queue,
>
> + .start_queue = sdma_v4_4_2_restore_queue,
>
>   .is_guilty = sdma_v4_4_2_ring_is_guilty,  };
>
>  
>
>  
>
>  
>


[PATCH 08/11] drm/amdgpu/gfx12: add support for disable_kq

2025-03-13 Thread Alex Deucher
Plumb in support for disabling kernel queues.

v2: use ring counts per Felix' suggestion
v3: fix stream fault handler, enable EOP interrupts
v4: fix MEC interrupt offset (Sunil)

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c | 183 +
 1 file changed, 125 insertions(+), 58 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
index 34cf187e72d9f..a99507e4fdb27 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c
@@ -1421,11 +1421,13 @@ static int gfx_v12_0_sw_init(struct amdgpu_ip_block 
*ip_block)
break;
}
 
-   /* recalculate compute rings to use based on hardware configuration */
-   num_compute_rings = (adev->gfx.mec.num_pipe_per_mec *
-adev->gfx.mec.num_queue_per_pipe) / 2;
-   adev->gfx.num_compute_rings = min(adev->gfx.num_compute_rings,
- num_compute_rings);
+   if (adev->gfx.num_compute_rings) {
+   /* recalculate compute rings to use based on hardware configuration */
+   num_compute_rings = (adev->gfx.mec.num_pipe_per_mec *
+adev->gfx.mec.num_queue_per_pipe) / 2;
+   adev->gfx.num_compute_rings = min(adev->gfx.num_compute_rings,
+ num_compute_rings);
+   }
 
/* EOP Event */
r = amdgpu_irq_add_id(adev, SOC21_IH_CLIENTID_GRBM_CP,
@@ -1471,37 +1473,41 @@ static int gfx_v12_0_sw_init(struct amdgpu_ip_block 
*ip_block)
return r;
}
 
-   /* set up the gfx ring */
-   for (i = 0; i < adev->gfx.me.num_me; i++) {
-   for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) {
-   for (k = 0; k < adev->gfx.me.num_pipe_per_me; k++) {
-   if (!amdgpu_gfx_is_me_queue_enabled(adev, i, k, 
j))
-   continue;
-
-   r = gfx_v12_0_gfx_ring_init(adev, ring_id,
-   i, k, j);
-   if (r)
-   return r;
-   ring_id++;
+   if (adev->gfx.num_gfx_rings) {
+   /* set up the gfx ring */
+   for (i = 0; i < adev->gfx.me.num_me; i++) {
+   for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) {
+   for (k = 0; k < adev->gfx.me.num_pipe_per_me; 
k++) {
+   if 
(!amdgpu_gfx_is_me_queue_enabled(adev, i, k, j))
+   continue;
+
+   r = gfx_v12_0_gfx_ring_init(adev, 
ring_id,
+   i, k, j);
+   if (r)
+   return r;
+   ring_id++;
+   }
}
}
}
 
-   ring_id = 0;
-   /* set up the compute queues - allocate horizontally across pipes */
-   for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
-   for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) {
-   for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; k++) {
-   if (!amdgpu_gfx_is_mec_queue_enabled(adev,
-   0, i, k, j))
-   continue;
+   if (adev->gfx.num_compute_rings) {
+   ring_id = 0;
+   /* set up the compute queues - allocate horizontally across pipes */
+   for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
+   for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) {
+   for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; 
k++) {
+   if 
(!amdgpu_gfx_is_mec_queue_enabled(adev,
+0, 
i, k, j))
+   continue;
 
-   r = gfx_v12_0_compute_ring_init(adev, ring_id,
-   i, k, j);
-   if (r)
-   return r;
+   r = gfx_v12_0_compute_ring_init(adev, 
ring_id,
+   i, k, 
j);
+   if (r)
+   return r;
 
-   ring_id++;
+   ring_id++;
+   }
}

Re: [PATCH 1/2] drm: Create an app info option for wedge events

2025-03-13 Thread Raag Jadav
On Wed, Mar 12, 2025 at 06:59:33PM -0300, André Almeida wrote:
> Em 12/03/2025 07:06, Raag Jadav escreveu:
> > On Tue, Mar 11, 2025 at 07:09:45PM +0200, Raag Jadav wrote:
> > > On Mon, Mar 10, 2025 at 06:27:53PM -0300, André Almeida wrote:
> > > > Em 01/03/2025 02:53, Raag Jadav escreveu:
> > > > > On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:
> > > > > > Hi Raag,
> > > > > > 
> > > > > > On 2/28/25 11:20, Raag Jadav wrote:
> > > > > > > Cc: Lucas
> > > > > > > 
> > > > > > > On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:
> > > > > > > > When a device gets wedged, it might be caused by a guilty 
> > > > > > > > application.
> > > > > > > > For userspace, knowing which app was the cause can be useful 
> > > > > > > > for some
> > > > > > > > situations, like for implementing a policy, logs or for giving 
> > > > > > > > a chance
> > > > > > > > for the compositor to let the user know what app caused the 
> > > > > > > > problem.
> > > > > > > > This is an optional argument; when `PID=-1` there's no 
> > > > > > > > information about the app that caused the problem, or whether 
> > > > > > > > any app was involved during the hang.
> > > > > > > > 
> > > > > > > > Sometimes just the PID isn't enough, given that the app might 
> > > > > > > > already be dead by the time userspace tries to check the PID's 
> > > > > > > > name, so to make life easier also include the app's name in the 
> > > > > > > > user event.
> > > > > > > > 
> > > > > > > > Signed-off-by: André Almeida 
> > > > 
> > > > [...]
> > > > 
> > > > > > > > len = scnprintf(event_string, sizeof(event_string), 
> > > > > > > > "%s", "WEDGED=");
> > > > > > > > @@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device 
> > > > > > > > *dev, unsigned long method)
> > > > > > > > drm_info(dev, "device wedged, %s\n", method == 
> > > > > > > > DRM_WEDGE_RECOVERY_NONE ?
> > > > > > > >  "but recovered through reset" : "needs 
> > > > > > > > recovery");
> > > > > > > > +   if (info) {
> > > > > > > > +   snprintf(pid_string, sizeof(pid_string), 
> > > > > > > > "PID=%u", info->pid);
> > > > > > > > +   snprintf(comm_string, sizeof(comm_string), 
> > > > > > > > "APP=%s", info->comm);
> > > > > > > > +   } else {
> > > > > > > > +   snprintf(pid_string, sizeof(pid_string), "%s", 
> > > > > > > > "PID=-1");
> > > > > > > > +   snprintf(comm_string, sizeof(comm_string), 
> > > > > > > > "%s", "APP=none");
> > > > > > > > +   }
> > > > > > > This is not much use for wedge cases that need recovery, since 
> > > > > > > at that point
> > > > > > > the userspace will need to clean house anyway.
> > > > > > > 
> > > > > > > Which leaves us with only 'none' case and perhaps the need for 
> > > > > > > standardization
> > > > > > > of "optional telemetry collection".
> > > > > > > 
> > > > > > > Thoughts?
> > > > > > 
> > > > > > I had the feeling that 'none' was already meant to be used for 
> > > > > > that. Do you
> > > > > > think we should move to another naming? Given that we didn't reach 
> > > > > > the merge
> > > > > > window yet we could potentially change that name without much 
> > > > > > damage.
> > > > > 
> > > > > No, I meant thoughts on possible telemetry data that the drivers might
> > > > > think is useful for userspace (along with PID) and can be presented in
> > > > > a vendor agnostic manner (just like wedged event).
> > > > 
> > > > I'm not sure I agree that this will only be used for telemetry and for the
> > > > `none` use case. As stated by Xaver, there's use case to know which app
> > > > caused the device to get wedged (like switching to software rendering) 
> > > > and
> > > > to display something for the user after the recovery is done (e.g. "The 
> > > > game
> > > >  stopped working and Plasma has reset").
> > > 
> > > Sure, but since this information is already available in coredump, I was
> > > hoping to have something like a standardized DRM level coredump with both
> > > vendor specific and agnostic sections, which the drivers can (and 
> > > hopefully
> > > transition to) use in conjunction with wedged event to provide wider
> > > telemetry and is useful for all wedge cases.
> > 
> > This is more useful because,
> > 
> > 1. It gives drivers an opportunity to present the telemetry that they are
> > interested in without needing to add a new event string (like PID or 
> > APP)
> > for their case.
> > 
> > 2. When we consider wedging as a usecase, there's a lot more that goes
> > into it than an application that might be behaving strangely. So a wider
> > telemetry is what I would hope to look at in such a scenario.
> > 
> 
> I agree that coredump is the way to go for telemetry, we already have the
> name and PID of the guilty app there, along with more information about the
> GPU state. But I don't think it sh

[PATCH 05/11] drm/amdgpu/mes: update hqd masks when disable_kq is set

2025-03-13 Thread Alex Deucher
Make all resources available to user queues.

Suggested-by: Sunil Khatri 
Reviewed-by: Sunil Khatri 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 5abc1ca0fee98..971bf01fe46a9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -120,21 +120,21 @@ int amdgpu_mes_init(struct amdgpu_device *adev)
 * Set GFX pipe 0 queue 1-7 for MES scheduling
 * mask = 1111 1110b
 */
-   adev->mes.gfx_hqd_mask[i] = 0xFE;
+   adev->mes.gfx_hqd_mask[i] = adev->gfx.disable_kq ? 0xFF : 0xFE;
else
/*
 * GFX pipe 0 queue 0 is being used by Kernel queue.
 * Set GFX pipe 0 queue 1 for MES scheduling
 * mask = 10b
 */
-   adev->mes.gfx_hqd_mask[i] = 0x2;
+   adev->mes.gfx_hqd_mask[i] = adev->gfx.disable_kq ? 0x3 : 0x2;
}
 
for (i = 0; i < AMDGPU_MES_MAX_COMPUTE_PIPES; i++) {
/* use only 1st MEC pipes */
if (i >= adev->gfx.mec.num_pipe_per_mec)
continue;
-   adev->mes.compute_hqd_mask[i] = 0xc;
+   adev->mes.compute_hqd_mask[i] = adev->gfx.disable_kq ? 0xF : 0xC;
}
 
for (i = 0; i < AMDGPU_MES_MAX_SDMA_PIPES; i++) {
-- 
2.48.1
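As a side note, the hqd masks in the hunk above are bitmaps of hardware queue descriptors handed to the MES scheduler. A minimal sketch of the intent, with hypothetical helper names mirroring the ternaries in the patch (queue 0 of a pipe is normally reserved for the kernel queue; with disable_kq it is handed to MES as well):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers mirroring the mask logic in amdgpu_mes_init():
 * with disable_kq set, queue 0 is no longer reserved for the kernel
 * queue, so one extra bit appears in each mask. */
static uint32_t gfx_hqd_mask(bool disable_kq, bool is_gfx12)
{
	if (is_gfx12)				/* one GFX pipe with 8 queues */
		return disable_kq ? 0xFF : 0xFE;
	return disable_kq ? 0x3 : 0x2;		/* older parts: 2 usable queues */
}

static uint32_t compute_hqd_mask(bool disable_kq)
{
	return disable_kq ? 0xF : 0xC;
}

/* Number of queues MES may schedule = set bits in the mask */
static int mes_queue_count(uint32_t mask)
{
	return __builtin_popcount(mask);
}
```

For example, `mes_queue_count(gfx_hqd_mask(true, true))` is 8, one more than the 7 queues available when the kernel queue holds queue 0.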



[PATCH 02/11] drm/amdgpu: add ring flag for no user submissions

2025-03-13 Thread Alex Deucher
This would be set by IPs which only accept submissions
from the kernel, not userspace, such as when kernel
queues are disabled. Don't expose the rings to userspace
and reject any submissions in the CS IOCTL.

Reviewed-by: Sunil Khatri
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c   |  4 
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  | 30 
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  2 +-
 3 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 5df21529b3b13..5cc18034b75df 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -349,6 +349,10 @@ static int amdgpu_cs_p2_ib(struct amdgpu_cs_parser *p,
ring = amdgpu_job_ring(job);
ib = &job->ibs[job->num_ibs++];
 
+   /* submissions to kernel queues are disabled */
+   if (ring->no_user_submission)
+   return -EINVAL;
+
/* MM engine doesn't support user fences */
if (p->uf_bo && ring->funcs->no_user_fence)
return -EINVAL;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index cd6eb7a3bc58a..3b7dfd56ccd0e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -408,7 +408,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
case AMDGPU_HW_IP_GFX:
type = AMD_IP_BLOCK_TYPE_GFX;
for (i = 0; i < adev->gfx.num_gfx_rings; i++)
-   if (adev->gfx.gfx_ring[i].sched.ready)
+   if (adev->gfx.gfx_ring[i].sched.ready &&
+   !adev->gfx.gfx_ring[i].no_user_submission)
++num_rings;
ib_start_alignment = 32;
ib_size_alignment = 32;
@@ -416,7 +417,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
case AMDGPU_HW_IP_COMPUTE:
type = AMD_IP_BLOCK_TYPE_GFX;
for (i = 0; i < adev->gfx.num_compute_rings; i++)
-   if (adev->gfx.compute_ring[i].sched.ready)
+   if (adev->gfx.compute_ring[i].sched.ready &&
+   !adev->gfx.compute_ring[i].no_user_submission)
++num_rings;
ib_start_alignment = 32;
ib_size_alignment = 32;
@@ -424,7 +426,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
case AMDGPU_HW_IP_DMA:
type = AMD_IP_BLOCK_TYPE_SDMA;
for (i = 0; i < adev->sdma.num_instances; i++)
-   if (adev->sdma.instance[i].ring.sched.ready)
+   if (adev->sdma.instance[i].ring.sched.ready &&
+   !adev->sdma.instance[i].ring.no_user_submission)
++num_rings;
ib_start_alignment = 256;
ib_size_alignment = 4;
@@ -435,7 +438,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
if (adev->uvd.harvest_config & (1 << i))
continue;
 
-   if (adev->uvd.inst[i].ring.sched.ready)
+   if (adev->uvd.inst[i].ring.sched.ready &&
+   !adev->uvd.inst[i].ring.no_user_submission)
++num_rings;
}
ib_start_alignment = 256;
@@ -444,7 +448,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
case AMDGPU_HW_IP_VCE:
type = AMD_IP_BLOCK_TYPE_VCE;
for (i = 0; i < adev->vce.num_rings; i++)
-   if (adev->vce.ring[i].sched.ready)
+   if (adev->vce.ring[i].sched.ready &&
+   !adev->vce.ring[i].no_user_submission)
++num_rings;
ib_start_alignment = 256;
ib_size_alignment = 4;
@@ -456,7 +461,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
continue;
 
for (j = 0; j < adev->uvd.num_enc_rings; j++)
-   if (adev->uvd.inst[i].ring_enc[j].sched.ready)
+   if (adev->uvd.inst[i].ring_enc[j].sched.ready &&
+   
!adev->uvd.inst[i].ring_enc[j].no_user_submission)
++num_rings;
}
ib_start_alignment = 256;
@@ -468,7 +474,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
if (adev->vcn.harvest_config & (1 << i))
continue;
 
-   if (adev->vcn.inst[i].ring_dec.sched.ready)
+   if (adev->vcn.inst[i].ring_dec.sched.ready &&
+   !adev->vcn.inst[i].ring_dec.no_user_submission)
   

[PATCH 06/11] drm/amdgpu/mes: make more vmids available when disable_kq=1

2025-03-13 Thread Alex Deucher
If we don't have kernel queues, the vmids can be used by
the MES for user queues.

Reviewed-by: Sunil Khatri 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 971bf01fe46a9..a536a78342a09 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -106,7 +106,7 @@ int amdgpu_mes_init(struct amdgpu_device *adev)
 
adev->mes.total_max_queue = AMDGPU_FENCE_MES_QUEUE_ID_MASK;
adev->mes.vmid_mask_mmhub = 0xffffff00;
-   adev->mes.vmid_mask_gfxhub = 0xffffff00;
+   adev->mes.vmid_mask_gfxhub = adev->gfx.disable_kq ? 0xfffffffe : 0xffffff00;
 
for (i = 0; i < AMDGPU_MES_MAX_GFX_PIPES; i++) {
/* use only 1st ME pipe */
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 95d894a231fcf..19a5f196829f3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -900,7 +900,7 @@ static int gmc_v10_0_sw_init(struct amdgpu_ip_block 
*ip_block)
 * amdgpu graphics/compute will use VMIDs 1-7
 * amdkfd will use VMIDs 8-15
 */
-   adev->vm_manager.first_kfd_vmid = 8;
+   adev->vm_manager.first_kfd_vmid = adev->gfx.disable_kq ? 1 : 8;
 
amdgpu_vm_manager_init(adev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
index bf8d01da88154..a2f6c9f4ebf2f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c
@@ -838,7 +838,7 @@ static int gmc_v12_0_sw_init(struct amdgpu_ip_block 
*ip_block)
 * amdgpu graphics/compute will use VMIDs 1-7
 * amdkfd will use VMIDs 8-15
 */
-   adev->vm_manager.first_kfd_vmid = 8;
+   adev->vm_manager.first_kfd_vmid = adev->gfx.disable_kq ? 1 : 8;
 
amdgpu_vm_manager_init(adev);
 
-- 
2.48.1
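For intuition, the gfxhub mask above is a bitmap of VMIDs that MES may hand out: 0xffffff00 exposes VMIDs 8-31 (the range otherwise given to amdkfd), while 0xfffffffe exposes VMIDs 1-31, with VMID 0 staying reserved for the kernel. A small sketch with hypothetical helper names:

```c
#include <stdint.h>

/* Hypothetical helpers over the gfxhub VMID masks from the patch. */
static int mes_vmid_count(uint32_t mask)
{
	return __builtin_popcount(mask);	/* how many VMIDs MES may use */
}

static int first_mes_vmid(uint32_t mask)
{
	return __builtin_ctz(mask);		/* lowest set bit = first usable VMID */
}
```

With disable_kq=1 this yields 31 usable VMIDs starting at 1 instead of 24 starting at 8, matching the `first_kfd_vmid` change from 8 to 1 in the same patch.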



[PATCH 10/11] drm/amdgpu/sdma6: add support for disable_kq

2025-03-13 Thread Alex Deucher
When the parameter is set, disable user submissions
to kernel queues.

Reviewed-by: Sunil Khatri 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c 
b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
index 3aa4fec4d9e4a..bcc72737f8084 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c
@@ -1304,6 +1304,9 @@ static int sdma_v6_0_early_init(struct amdgpu_ip_block 
*ip_block)
struct amdgpu_device *adev = ip_block->adev;
int r;
 
+   if (amdgpu_disable_kq == 1)
+   adev->sdma.no_user_submission = true;
+
r = amdgpu_sdma_init_microcode(adev, 0, true);
if (r)
return r;
@@ -1338,6 +1341,7 @@ static int sdma_v6_0_sw_init(struct amdgpu_ip_block 
*ip_block)
ring->ring_obj = NULL;
ring->use_doorbell = true;
ring->me = i;
+   ring->no_user_submission = adev->sdma.no_user_submission;
 
DRM_DEBUG("SDMA %d use_doorbell being set to: [%s]\n", i,
ring->use_doorbell?"true":"false");
-- 
2.48.1



[PATCH 07/11] drm/amdgpu/gfx11: add support for disable_kq

2025-03-13 Thread Alex Deucher
Plumb in support for disabling kernel queues in
GFX11.  We have to bring up a GFX queue briefly in
order to initialize the clear state.  After that
we can disable it.

v2: use ring counts per Felix' suggestion
v3: fix stream fault handler, enable EOP interrupts
v4: fix MEC interrupt offset (Sunil)

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 191 ++---
 1 file changed, 136 insertions(+), 55 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
index 95eefd9a40d28..fde8464cbd3b3 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
@@ -1145,6 +1145,10 @@ static int gfx_v11_0_gfx_ring_init(struct amdgpu_device 
*adev, int ring_id,
 
ring->ring_obj = NULL;
ring->use_doorbell = true;
+   if (adev->gfx.disable_kq) {
+   ring->no_scheduler = true;
+   ring->no_user_submission = true;
+   }
 
if (!ring_id)
ring->doorbell_index = adev->doorbell_index.gfx_ring0 << 1;
@@ -1577,7 +1581,7 @@ static void gfx_v11_0_alloc_ip_dump(struct amdgpu_device 
*adev)
 
 static int gfx_v11_0_sw_init(struct amdgpu_ip_block *ip_block)
 {
-   int i, j, k, r, ring_id = 0;
+   int i, j, k, r, ring_id;
int xcc_id = 0;
struct amdgpu_device *adev = ip_block->adev;
 
@@ -1710,37 +1714,42 @@ static int gfx_v11_0_sw_init(struct amdgpu_ip_block 
*ip_block)
return r;
}
 
-   /* set up the gfx ring */
-   for (i = 0; i < adev->gfx.me.num_me; i++) {
-   for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) {
-   for (k = 0; k < adev->gfx.me.num_pipe_per_me; k++) {
-   if (!amdgpu_gfx_is_me_queue_enabled(adev, i, k, 
j))
-   continue;
-
-   r = gfx_v11_0_gfx_ring_init(adev, ring_id,
-   i, k, j);
-   if (r)
-   return r;
-   ring_id++;
+   if (adev->gfx.num_gfx_rings) {
+   ring_id = 0;
+   /* set up the gfx ring */
+   for (i = 0; i < adev->gfx.me.num_me; i++) {
+   for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) {
+   for (k = 0; k < adev->gfx.me.num_pipe_per_me; 
k++) {
+   if 
(!amdgpu_gfx_is_me_queue_enabled(adev, i, k, j))
+   continue;
+
+   r = gfx_v11_0_gfx_ring_init(adev, 
ring_id,
+   i, k, j);
+   if (r)
+   return r;
+   ring_id++;
+   }
}
}
}
 
-   ring_id = 0;
-   /* set up the compute queues - allocate horizontally across pipes */
-   for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
-   for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) {
-   for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; k++) {
-   if (!amdgpu_gfx_is_mec_queue_enabled(adev, 0, i,
-k, j))
-   continue;
+   if (adev->gfx.num_compute_rings) {
+   ring_id = 0;
+   /* set up the compute queues - allocate horizontally across pipes */
+   for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
+   for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) {
+   for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; 
k++) {
+   if 
(!amdgpu_gfx_is_mec_queue_enabled(adev, 0, i,
+k, 
j))
+   continue;
 
-   r = gfx_v11_0_compute_ring_init(adev, ring_id,
-   i, k, j);
-   if (r)
-   return r;
+   r = gfx_v11_0_compute_ring_init(adev, 
ring_id,
+   i, k, 
j);
+   if (r)
+   return r;
 
-   ring_id++;
+   ring_id++;
+   }
}
}
}
@@ -4578,11 +4587,23 @@ static int gfx_v11_0_cp_resume(struct amdgpu_device 
*adev)
   

[PATCH 11/11] drm/amdgpu/sdma7: add support for disable_kq

2025-03-13 Thread Alex Deucher
When the parameter is set, disable user submissions
to kernel queues.

Reviewed-by: Sunil Khatri 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c 
b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
index 92a79296708ae..40d45f738c0a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
@@ -1316,6 +1316,9 @@ static int sdma_v7_0_early_init(struct amdgpu_ip_block 
*ip_block)
struct amdgpu_device *adev = ip_block->adev;
int r;
 
+   if (amdgpu_disable_kq == 1)
+   adev->sdma.no_user_submission = true;
+
r = amdgpu_sdma_init_microcode(adev, 0, true);
if (r) {
DRM_ERROR("Failed to init sdma firmware!\n");
@@ -1351,6 +1354,7 @@ static int sdma_v7_0_sw_init(struct amdgpu_ip_block 
*ip_block)
ring->ring_obj = NULL;
ring->use_doorbell = true;
ring->me = i;
+   ring->no_user_submission = adev->sdma.no_user_submission;
 
DRM_DEBUG("SDMA %d use_doorbell being set to: [%s]\n", i,
ring->use_doorbell?"true":"false");
-- 
2.48.1



[PATCH 03/11] drm/amdgpu/gfx: add generic handling for disable_kq

2025-03-13 Thread Alex Deucher
Add proper checks for disable_kq functionality in
gfx helper functions.  Add special logic for families
that require the clear state setup.

v2: use ring count as per Felix suggestion
v3: fix num_gfx_rings handling in amdgpu_gfx_graphics_queue_acquire()

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c | 8 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h | 2 ++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 984e6ff6e4632..a08243dd0798e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -258,8 +258,9 @@ void amdgpu_gfx_graphics_queue_acquire(struct amdgpu_device 
*adev)
}
 
/* update the number of active graphics rings */
-   adev->gfx.num_gfx_rings =
-   bitmap_weight(adev->gfx.me.queue_bitmap, AMDGPU_MAX_GFX_QUEUES);
+   if (adev->gfx.num_gfx_rings)
+   adev->gfx.num_gfx_rings =
+   bitmap_weight(adev->gfx.me.queue_bitmap, AMDGPU_MAX_GFX_QUEUES);
 }
 
 static int amdgpu_gfx_kiq_acquire(struct amdgpu_device *adev,
@@ -1544,6 +1545,9 @@ static ssize_t amdgpu_gfx_set_run_cleaner_shader(struct 
device *dev,
if (adev->in_suspend && !adev->in_runpm)
return -EPERM;
 
+   if (adev->gfx.disable_kq)
+   return -ENOTSUPP;
+
ret = kstrtol(buf, 0, &value);
 
if (ret)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
index ddf4533614bac..8fa68a4ac34f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h
@@ -483,6 +483,8 @@ struct amdgpu_gfx {
 
atomic_ttotal_submission_cnt;
struct delayed_work idle_work;
+
+   booldisable_kq;
 };
 
 struct amdgpu_gfx_ras_reg_entry {
-- 
2.48.1
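The `num_gfx_rings` guard added above matters because the ring count doubles as the disable flag: with disable_kq the count starts at zero and must stay zero, whereas an unconditional recompute from the queue bitmap would bring the rings back. A sketch of the invariant, using a hypothetical stand-in for the guarded recompute:

```c
#include <stdint.h>

/* Hypothetical model of the guarded recompute in
 * amdgpu_gfx_graphics_queue_acquire(): a zero ring count means kernel
 * queues are disabled and must not be resurrected from the bitmap. */
static unsigned int recompute_gfx_rings(unsigned int cur_rings,
					uint32_t queue_bitmap)
{
	if (!cur_rings)
		return 0;			/* disable_kq: keep zero */
	return __builtin_popcount(queue_bitmap);
}
```

This is also why the v3 note calls out the `num_gfx_rings` handling specifically: without the `if`, the bitmap weight would silently re-enable graphics rings on disable_kq systems.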



[PATCH 09/11] drm/amdgpu/sdma: add flag for tracking disable_kq

2025-03-13 Thread Alex Deucher
For SDMA, we still need kernel queues for paging so
they need to be initialized, but we do not want to
accept submissions from userspace when disable_kq
is set.

Reviewed-by: Sunil Khatri 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
index 9651693200655..edc856e10337a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h
@@ -129,6 +129,7 @@ struct amdgpu_sdma {
/* track guilty state of GFX and PAGE queues */
bool gfx_guilty;
bool page_guilty;
+   bool no_user_submission;
 };
 
 /*
-- 
2.48.1



[PATCH V5 00/11] Add disable kernel queue support

2025-03-13 Thread Alex Deucher
To better evaluate user queues, add a module parameter
to disable kernel queues.  With this set kernel queues
are disabled and only user queues are available.  This
frees up hardware resources for use in user queues which
would otherwise be used by kernel queues and provides
a way to validate user queues without the presence
of kernel queues.

v2: use num_gfx_rings and num_compute_rings per
Felix suggestion
v3: include num_gfx_rings fix in amdgpu_gfx.c
v4: additional fixes
v5: MEC EOP interrupt handling fix (Sunil)

Alex Deucher (11):
  drm/amdgpu: add parameter to disable kernel queues
  drm/amdgpu: add ring flag for no user submissions
  drm/amdgpu/gfx: add generic handling for disable_kq
  drm/amdgpu/mes: centralize gfx_hqd mask management
  drm/amdgpu/mes: update hqd masks when disable_kq is set
  drm/amdgpu/mes: make more vmids available when disable_kq=1
  drm/amdgpu/gfx11: add support for disable_kq
  drm/amdgpu/gfx12: add support for disable_kq
  drm/amdgpu/sdma: add flag for tracking disable_kq
  drm/amdgpu/sdma6: add support for disable_kq
  drm/amdgpu/sdma7: add support for disable_kq

 drivers/gpu/drm/amd/amdgpu/amdgpu.h  |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c   |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |   9 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c  |   8 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.h  |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  |  30 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c  |  26 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_sdma.h |   1 +
 drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c   | 191 ---
 drivers/gpu/drm/amd/amdgpu/gfx_v12_0.c   | 183 +++---
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c   |   2 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v12_0.c   |   2 +-
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c   |  16 +-
 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c   |  15 +-
 drivers/gpu/drm/amd/amdgpu/sdma_v6_0.c   |   4 +
 drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c   |   4 +
 17 files changed, 345 insertions(+), 155 deletions(-)

-- 
2.48.1



[PATCH 01/11] drm/amdgpu: add parameter to disable kernel queues

2025-03-13 Thread Alex Deucher
On chips that support user queues, setting this option
disables kernel queues so that user queues can be
validated without them.

Reviewed-by: Prike Liang 
Reviewed-by: Sunil Khatri 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 9 +
 2 files changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 87062c1adcdf7..45437a8f29d3b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -270,6 +270,7 @@ extern int amdgpu_user_partt_mode;
 extern int amdgpu_agp;
 
 extern int amdgpu_wbrf;
+extern int amdgpu_disable_kq;
 
 #define AMDGPU_VM_MAX_NUM_CTX  4096
 #define AMDGPU_SG_THRESHOLD(256*1024*1024)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index ba869fa99..f50a25fb60376 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -238,6 +238,7 @@ int amdgpu_agp = -1; /* auto */
 int amdgpu_wbrf = -1;
 int amdgpu_damage_clips = -1; /* auto */
 int amdgpu_umsch_mm_fwlog;
+int amdgpu_disable_kq = -1;
 
 DECLARE_DYNDBG_CLASSMAP(drm_debug_classes, DD_CLASS_TYPE_DISJOINT_BITS, 0,
"DRM_UT_CORE",
@@ -1084,6 +1085,14 @@ MODULE_PARM_DESC(wbrf,
"Enable Wifi RFI interference mitigation (0 = disabled, 1 = enabled, -1 
= auto(default)");
 module_param_named(wbrf, amdgpu_wbrf, int, 0444);
 
+/**
+ * DOC: disable_kq (int)
+ * Disable kernel queues on systems that support user queues.
+ * (0 = kernel queues enabled, 1 = kernel queues disabled, -1 = auto (default 
setting))
+ */
+MODULE_PARM_DESC(disable_kq, "Disable kernel queues (-1 = auto (default), 0 = 
enable KQ, 1 = disable KQ)");
+module_param_named(disable_kq, amdgpu_disable_kq, int, 0444);
+
 /* These devices are not supported by amdgpu.
  * They are supported by the mach64, r128, radeon drivers
  */
-- 
2.48.1
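As a usage sketch (assuming the parameter lands with the name shown in the patch), the option could be set either through a modprobe configuration file or on the kernel command line; both forms below are illustrative config fragments, not taken from the patch itself:

```shell
# /etc/modprobe.d/amdgpu.conf — disable kernel queues on
# user-queue-capable parts (1 = disable KQ, 0 = enable, -1 = auto)
options amdgpu disable_kq=1

# or equivalently, appended to the kernel command line:
#   amdgpu.disable_kq=1
```

Since the parameter is registered with mode 0444, the effective value can be read back from /sys/module/amdgpu/parameters/disable_kq but not changed at runtime.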



[PATCH 04/11] drm/amdgpu/mes: centralize gfx_hqd mask management

2025-03-13 Thread Alex Deucher
Move it to amdgpu_mes to align with the compute and
sdma hqd masks. No functional change.

v2: rebase on new changes

Reviewed-by: Sunil Khatri
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 22 ++
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  | 16 +++-
 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c  | 15 +++
 3 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 6f5e272d7ded3..5abc1ca0fee98 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -108,6 +108,28 @@ int amdgpu_mes_init(struct amdgpu_device *adev)
adev->mes.vmid_mask_mmhub = 0xffffff00;
adev->mes.vmid_mask_gfxhub = 0xffffff00;
 
+   for (i = 0; i < AMDGPU_MES_MAX_GFX_PIPES; i++) {
+   /* use only 1st ME pipe */
+   if (i >= adev->gfx.me.num_pipe_per_me)
+   continue;
+   if (amdgpu_ip_version(adev, GC_HWIP, 0) >=
+   IP_VERSION(12, 0, 0))
+   /*
+* GFX V12 has only one GFX pipe, but 8 queues in it.
+* GFX pipe 0 queue 0 is being used by Kernel queue.
+* Set GFX pipe 0 queue 1-7 for MES scheduling
+* mask = 1111 1110b
+*/
+   adev->mes.gfx_hqd_mask[i] = 0xFE;
+   else
+   /*
+* GFX pipe 0 queue 0 is being used by Kernel queue.
+* Set GFX pipe 0 queue 1 for MES scheduling
+* mask = 10b
+*/
+   adev->mes.gfx_hqd_mask[i] = 0x2;
+   }
+
for (i = 0; i < AMDGPU_MES_MAX_COMPUTE_PIPES; i++) {
/* use only 1st MEC pipes */
if (i >= adev->gfx.mec.num_pipe_per_mec)
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index a569d09a1a748..39b45d8b5f049 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -669,18 +669,6 @@ static int mes_v11_0_misc_op(struct amdgpu_mes *mes,
offsetof(union MESAPI__MISC, api_status));
 }
 
-static void mes_v11_0_set_gfx_hqd_mask(union MESAPI_SET_HW_RESOURCES *pkt)
-{
-   /*
-* GFX pipe 0 queue 0 is being used by Kernel queue.
-* Set GFX pipe 0 queue 1 for MES scheduling
-* mask = 10b
-* GFX pipe 1 can't be used for MES due to HW limitation.
-*/
-   pkt->gfx_hqd_mask[0] = 0x2;
-   pkt->gfx_hqd_mask[1] = 0;
-}
-
 static int mes_v11_0_set_hw_resources(struct amdgpu_mes *mes)
 {
int i;
@@ -705,7 +693,9 @@ static int mes_v11_0_set_hw_resources(struct amdgpu_mes 
*mes)
mes_set_hw_res_pkt.compute_hqd_mask[i] =
mes->compute_hqd_mask[i];
 
-   mes_v11_0_set_gfx_hqd_mask(&mes_set_hw_res_pkt);
+   for (i = 0; i < MAX_GFX_PIPES; i++)
+   mes_set_hw_res_pkt.gfx_hqd_mask[i] =
+   mes->gfx_hqd_mask[i];
 
for (i = 0; i < MAX_SDMA_PIPES; i++)
mes_set_hw_res_pkt.sdma_hqd_mask[i] = mes->sdma_hqd_mask[i];
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c 
b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
index 96336652d14c5..519f054bec60d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
@@ -694,17 +694,6 @@ static int mes_v12_0_set_hw_resources_1(struct amdgpu_mes 
*mes, int pipe)
offsetof(union MESAPI_SET_HW_RESOURCES_1, api_status));
 }
 
-static void mes_v12_0_set_gfx_hqd_mask(union MESAPI_SET_HW_RESOURCES *pkt)
-{
-   /*
-* GFX V12 has only one GFX pipe, but 8 queues in it.
-* GFX pipe 0 queue 0 is being used by Kernel queue.
-* Set GFX pipe 0 queue 1-7 for MES scheduling
-* mask = 1111 1110b
-*/
-   pkt->gfx_hqd_mask[0] = 0xFE;
-}
-
 static int mes_v12_0_set_hw_resources(struct amdgpu_mes *mes, int pipe)
 {
int i;
@@ -727,7 +716,9 @@ static int mes_v12_0_set_hw_resources(struct amdgpu_mes 
*mes, int pipe)
mes_set_hw_res_pkt.compute_hqd_mask[i] =
mes->compute_hqd_mask[i];
 
-   mes_v12_0_set_gfx_hqd_mask(&mes_set_hw_res_pkt);
+   for (i = 0; i < MAX_GFX_PIPES; i++)
+   mes_set_hw_res_pkt.gfx_hqd_mask[i] =
+   mes->gfx_hqd_mask[i];
 
for (i = 0; i < MAX_SDMA_PIPES; i++)
mes_set_hw_res_pkt.sdma_hqd_mask[i] =
-- 
2.48.1



Re: [PATCH] drm/amd/display: avoid NPD when ASIC does not support DMUB

2025-03-13 Thread Rodrigo Siqueira
On 03/13, Thadeu Lima de Souza Cascardo wrote:
> On Wed, Feb 05, 2025 at 10:06:38AM -0300, Thadeu Lima de Souza Cascardo wrote:
> > ctx->dmub_srv will be NULL if the ASIC does not support DMUB, which is
> > tested in dm_dmub_sw_init.
> > 
> > However, it will be dereferenced in dmub_hw_lock_mgr_cmd if
> > should_use_dmub_lock returns true.
> > 
> > This has been the case since dmub support has been added for PSR1.
> 
> This bug has landed on stable trees. Any chance for a review here?
> 
> Thanks.
> Cascardo.
> 
> > 
> > Fix this by checking for dmub_srv in should_use_dmub_lock.
> > 
> > [   37.440832] BUG: kernel NULL pointer dereference, address: 
> > 0058
> > [   37.447808] #PF: supervisor read access in kernel mode
> > [   37.452959] #PF: error_code(0x) - not-present page
> > [   37.458112] PGD 0 P4D 0
> > [   37.460662] Oops: Oops:  [#1] PREEMPT SMP NOPTI
> > [   37.465553] CPU: 2 UID: 1000 PID: 1745 Comm: DrmThread Not tainted 
> > 6.14.0-rc1-3-gd62e938120f0 #23 99720e1cb1e0fc4773b8513150932a07de3c6e88
> > [   37.478324] Hardware name: Google Morphius/Morphius, BIOS 
> > Google_Morphius.13434.858.0 10/26/2023
> > [   37.487103] RIP: 0010:dmub_hw_lock_mgr_cmd+0x77/0xb0
> > [   37.492074] Code: 44 24 0e 00 00 00 00 48 c7 04 24 45 00 00 0c 40 88 74 
> > 24 0d 0f b6 02 88 44 24 0c 8b 01 89 44 24 08 85 f6 75 05 c6 44 24 0e 01 
> > <48> 8b 7f 58 48 89 e6 ba 01 00 00 00 e8 08 3c 2a 00 65 48 8b 04 5
> > [   37.510822] RSP: 0018:969442853300 EFLAGS: 00010202
> > [   37.516052] RAX:  RBX: 92db0300 RCX: 
> > 969442853358
> > [   37.523185] RDX: 969442853368 RSI: 0001 RDI: 
> > 
> > [   37.530322] RBP: 0001 R08: 04a7 R09: 
> > 04a5
> > [   37.537453] R10: 0476 R11: 0062 R12: 
> > 92db0ade8000
> > [   37.544589] R13: 92da01180ae0 R14: 92da011802a8 R15: 
> > 92db0300
> > [   37.551725] FS:  784a9cdfc6c0() GS:92db2af0() 
> > knlGS:
> > [   37.559814] CS:  0010 DS:  ES:  CR0: 80050033
> > [   37.565562] CR2: 0058 CR3: 000112b1c000 CR4: 
> > 003506f0
> > [   37.572697] Call Trace:
> > [   37.575152]  
> > [   37.577258]  ? __die_body+0x66/0xb0
> > [   37.580756]  ? page_fault_oops+0x3e7/0x4a0
> > [   37.584861]  ? exc_page_fault+0x3e/0xe0
> > [   37.588706]  ? exc_page_fault+0x5c/0xe0
> > [   37.592550]  ? asm_exc_page_fault+0x22/0x30
> > [   37.596742]  ? dmub_hw_lock_mgr_cmd+0x77/0xb0
> > [   37.601107]  dcn10_cursor_lock+0x1e1/0x240
> > [   37.605211]  program_cursor_attributes+0x81/0x190
> > [   37.609923]  commit_planes_for_stream+0x998/0x1ef0
> > [   37.614722]  update_planes_and_stream_v2+0x41e/0x5c0
> > [   37.619703]  dc_update_planes_and_stream+0x78/0x140
> > [   37.624588]  amdgpu_dm_atomic_commit_tail+0x4362/0x49f0
> > [   37.629832]  ? srso_return_thunk+0x5/0x5f
> > [   37.633847]  ? mark_held_locks+0x6d/0xd0
> > [   37.637774]  ? _raw_spin_unlock_irq+0x24/0x50
> > [   37.642135]  ? srso_return_thunk+0x5/0x5f
> > [   37.646148]  ? lockdep_hardirqs_on+0x95/0x150
> > [   37.650510]  ? srso_return_thunk+0x5/0x5f
> > [   37.654522]  ? _raw_spin_unlock_irq+0x2f/0x50
> > [   37.658883]  ? srso_return_thunk+0x5/0x5f
> > [   37.662897]  ? wait_for_common+0x186/0x1c0
> > [   37.666998]  ? srso_return_thunk+0x5/0x5f
> > [   37.671009]  ? drm_crtc_next_vblank_start+0xc3/0x170
> > [   37.675983]  commit_tail+0xf5/0x1c0
> > [   37.679478]  drm_atomic_helper_commit+0x2a2/0x2b0
> > [   37.684186]  drm_atomic_commit+0xd6/0x100
> > [   37.688199]  ? __cfi___drm_printfn_info+0x10/0x10
> > [   37.692911]  drm_atomic_helper_update_plane+0xe5/0x130
> > [   37.698054]  drm_mode_cursor_common+0x501/0x670
> > [   37.702600]  ? __cfi_drm_mode_cursor_ioctl+0x10/0x10
> > [   37.707572]  drm_mode_cursor_ioctl+0x48/0x70
> > [   37.711851]  drm_ioctl_kernel+0xf2/0x150
> > [   37.715781]  drm_ioctl+0x363/0x590
> > [   37.719189]  ? __cfi_drm_mode_cursor_ioctl+0x10/0x10
> > [   37.724165]  amdgpu_drm_ioctl+0x41/0x80
> > [   37.728013]  __se_sys_ioctl+0x7f/0xd0
> > [   37.731685]  do_syscall_64+0x87/0x100
> > [   37.735355]  ? vma_end_read+0x12/0xe0
> > [   37.739024]  ? srso_return_thunk+0x5/0x5f
> > [   37.743041]  ? find_held_lock+0x47/0xf0
> > [   37.746884]  ? vma_end_read+0x12/0xe0
> > [   37.750552]  ? srso_return_thunk+0x5/0x5f
> > [   37.754565]  ? lock_release+0x1c4/0x2e0
> > [   37.758406]  ? vma_end_read+0x12/0xe0
> > [   37.762079]  ? exc_page_fault+0x84/0xe0
> > [   37.765921]  ? srso_return_thunk+0x5/0x5f
> > [   37.769938]  ? lockdep_hardirqs_on+0x95/0x150
> > [   37.774303]  ? srso_return_thunk+0x5/0x5f
> > [   37.778317]  ? exc_page_fault+0x84/0xe0
> > [   37.782163]  entry_SYSCALL_64_after_hwframe+0x55/0x5d
> > [   37.787218] RIP: 0033:0x784aa5ec3059
> > [   37.790803] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 
> > b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 4

Re: [PATCH] drm/amd/display: avoid NPD when ASIC does not support DMUB

2025-03-13 Thread Alex Deucher
Applied.  Thanks.

Alex

On Thu, Mar 13, 2025 at 11:17 AM Leo Li  wrote:
>
>
>
> On 2025-03-13 07:29, Thadeu Lima de Souza Cascardo wrote:
> > On Wed, Feb 05, 2025 at 10:06:38AM -0300, Thadeu Lima de Souza Cascardo 
> > wrote:
> >> ctx->dmub_srv will be NULL if the ASIC does not support DMUB, which is
> >> tested in dm_dmub_sw_init.
> >>
> >> However, it will be dereferenced in dmub_hw_lock_mgr_cmd if
> >> should_use_dmub_lock returns true.
> >>
> >> This has been the case since dmub support has been added for PSR1.
> >
> > This bug has landed on stable trees. Any chance for a review here?
> >
> > Thanks.
> > Cascardo.
>
> Thanks for the ping and fix!
>
> Reviewed-by: Leo Li 
>
> >
> >>
> >> Fix this by checking for dmub_srv in should_use_dmub_lock.
> >>
> >> [   37.440832] BUG: kernel NULL pointer dereference, address: 
> >> 0058
> >> [   37.447808] #PF: supervisor read access in kernel mode
> >> [   37.452959] #PF: error_code(0x) - not-present page
> >> [   37.458112] PGD 0 P4D 0
> >> [   37.460662] Oops: Oops:  [#1] PREEMPT SMP NOPTI
> >> [   37.465553] CPU: 2 UID: 1000 PID: 1745 Comm: DrmThread Not tainted 
> >> 6.14.0-rc1-3-gd62e938120f0 #23 99720e1cb1e0fc4773b8513150932a07de3c6e88
> >> [   37.478324] Hardware name: Google Morphius/Morphius, BIOS 
> >> Google_Morphius.13434.858.0 10/26/2023
> >> [   37.487103] RIP: 0010:dmub_hw_lock_mgr_cmd+0x77/0xb0
> >> [   37.492074] Code: 44 24 0e 00 00 00 00 48 c7 04 24 45 00 00 0c 40 88 74 
> >> 24 0d 0f b6 02 88 44 24 0c 8b 01 89 44 24 08 85 f6 75 05 c6 44 24 0e 01 
> >> <48> 8b 7f 58 48 89 e6 ba 01 00 00 00 e8 08 3c 2a 00 65 48 8b 04 5
> >> [   37.510822] RSP: 0018:969442853300 EFLAGS: 00010202
> >> [   37.516052] RAX:  RBX: 92db0300 RCX: 
> >> 969442853358
> >> [   37.523185] RDX: 969442853368 RSI: 0001 RDI: 
> >> 
> >> [   37.530322] RBP: 0001 R08: 04a7 R09: 
> >> 04a5
> >> [   37.537453] R10: 0476 R11: 0062 R12: 
> >> 92db0ade8000
> >> [   37.544589] R13: 92da01180ae0 R14: 92da011802a8 R15: 
> >> 92db0300
> >> [   37.551725] FS:  784a9cdfc6c0() GS:92db2af0() 
> >> knlGS:
> >> [   37.559814] CS:  0010 DS:  ES:  CR0: 80050033
> >> [   37.565562] CR2: 0058 CR3: 000112b1c000 CR4: 
> >> 003506f0
> >> [   37.572697] Call Trace:
> >> [   37.575152]  
> >> [   37.577258]  ? __die_body+0x66/0xb0
> >> [   37.580756]  ? page_fault_oops+0x3e7/0x4a0
> >> [   37.584861]  ? exc_page_fault+0x3e/0xe0
> >> [   37.588706]  ? exc_page_fault+0x5c/0xe0
> >> [   37.592550]  ? asm_exc_page_fault+0x22/0x30
> >> [   37.596742]  ? dmub_hw_lock_mgr_cmd+0x77/0xb0
> >> [   37.601107]  dcn10_cursor_lock+0x1e1/0x240
> >> [   37.605211]  program_cursor_attributes+0x81/0x190
> >> [   37.609923]  commit_planes_for_stream+0x998/0x1ef0
> >> [   37.614722]  update_planes_and_stream_v2+0x41e/0x5c0
> >> [   37.619703]  dc_update_planes_and_stream+0x78/0x140
> >> [   37.624588]  amdgpu_dm_atomic_commit_tail+0x4362/0x49f0
> >> [   37.629832]  ? srso_return_thunk+0x5/0x5f
> >> [   37.633847]  ? mark_held_locks+0x6d/0xd0
> >> [   37.637774]  ? _raw_spin_unlock_irq+0x24/0x50
> >> [   37.642135]  ? srso_return_thunk+0x5/0x5f
> >> [   37.646148]  ? lockdep_hardirqs_on+0x95/0x150
> >> [   37.650510]  ? srso_return_thunk+0x5/0x5f
> >> [   37.654522]  ? _raw_spin_unlock_irq+0x2f/0x50
> >> [   37.658883]  ? srso_return_thunk+0x5/0x5f
> >> [   37.662897]  ? wait_for_common+0x186/0x1c0
> >> [   37.666998]  ? srso_return_thunk+0x5/0x5f
> >> [   37.671009]  ? drm_crtc_next_vblank_start+0xc3/0x170
> >> [   37.675983]  commit_tail+0xf5/0x1c0
> >> [   37.679478]  drm_atomic_helper_commit+0x2a2/0x2b0
> >> [   37.684186]  drm_atomic_commit+0xd6/0x100
> >> [   37.688199]  ? __cfi___drm_printfn_info+0x10/0x10
> >> [   37.692911]  drm_atomic_helper_update_plane+0xe5/0x130
> >> [   37.698054]  drm_mode_cursor_common+0x501/0x670
> >> [   37.702600]  ? __cfi_drm_mode_cursor_ioctl+0x10/0x10
> >> [   37.707572]  drm_mode_cursor_ioctl+0x48/0x70
> >> [   37.711851]  drm_ioctl_kernel+0xf2/0x150
> >> [   37.715781]  drm_ioctl+0x363/0x590
> >> [   37.719189]  ? __cfi_drm_mode_cursor_ioctl+0x10/0x10
> >> [   37.724165]  amdgpu_drm_ioctl+0x41/0x80
> >> [   37.728013]  __se_sys_ioctl+0x7f/0xd0
> >> [   37.731685]  do_syscall_64+0x87/0x100
> >> [   37.735355]  ? vma_end_read+0x12/0xe0
> >> [   37.739024]  ? srso_return_thunk+0x5/0x5f
> >> [   37.743041]  ? find_held_lock+0x47/0xf0
> >> [   37.746884]  ? vma_end_read+0x12/0xe0
> >> [   37.750552]  ? srso_return_thunk+0x5/0x5f
> >> [   37.754565]  ? lock_release+0x1c4/0x2e0
> >> [   37.758406]  ? vma_end_read+0x12/0xe0
> >> [   37.762079]  ? exc_page_fault+0x84/0xe0
> >> [   37.765921]  ? srso_return_thunk+0x5/0x5f
> >> [   37.769938]  ? lockdep_hardirqs_on+0x95/0x150
> >> [   37.774303]  ? srso_return_thunk+0x5/0x5f
> >> [   37.778317]  ? exc_p

Re: [RFC PATCH 2/7] drm/amd/display: start using drm_edid helpers to parse EDID caps

2025-03-13 Thread Alex Hung




On 3/8/25 07:26, Melissa Wen wrote:

Groundwork that allocates a temporary drm_edid from raw edid to take
advantage of DRM common-code helpers instead of driver-specific code.

Signed-off-by: Melissa Wen 
---
  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c | 11 +--
  1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
index 2cd35392e2da..7edc23267ee7 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_helpers.c
@@ -108,6 +108,7 @@ enum dc_edid_status dm_helpers_parse_edid_caps(
struct drm_connector *connector = &aconnector->base;
struct drm_device *dev = connector->dev;
struct edid *edid_buf = edid ? (struct edid *) edid->raw_edid : NULL;
+   struct drm_edid *drm_edid;


This declaration generates the below build warning:

drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_helpers.c: In 
function ‘dm_helpers_parse_edid_caps’:
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm_helpers.c:116:18: 
error: assignment discards ‘const’ qualifier from pointer target type 
[-Werror=discarded-qualifiers]
  116 | drm_edid = drm_edid_alloc(edid_buf, EDID_LENGTH * 
(edid_buf->extensions + 1));

  |  ^
cc1: all warnings being treated as errors


It can be fixed as the following change

-   struct drm_edid *drm_edid;
+   const struct drm_edid *drm_edid;



struct cea_sad *sads;
int sad_count = -1;
int sadb_count = -1;
@@ -116,10 +117,13 @@ enum dc_edid_status dm_helpers_parse_edid_caps(
  
  	enum dc_edid_status result = EDID_OK;
  
+

if (!edid_caps || !edid)
return EDID_BAD_INPUT;
  
-	if (!drm_edid_is_valid(edid_buf))

+   drm_edid = drm_edid_alloc(edid_buf, EDID_LENGTH * (edid_buf->extensions 
+ 1));


drm_edid_alloc returns "const struct drm_edid *", so drm_edid needs to 
be const as well.




+
+   if (!drm_edid_valid(drm_edid))
result = EDID_BAD_CHECKSUM;
  
  	edid_caps->manufacturer_id = (uint16_t) edid_buf->mfg_id[0] |

@@ -139,8 +143,10 @@ enum dc_edid_status dm_helpers_parse_edid_caps(
apply_edid_quirks(dev, edid_buf, edid_caps);
  
  	sad_count = drm_edid_to_sad((struct edid *) edid->raw_edid, &sads);

-   if (sad_count <= 0)
+   if (sad_count <= 0) {
+   drm_edid_free(drm_edid);
return result;
+   }
  
  	edid_caps->audio_mode_count = min(sad_count, DC_MAX_AUDIO_DESC_COUNT);

for (i = 0; i < edid_caps->audio_mode_count; ++i) {
@@ -166,6 +172,7 @@ enum dc_edid_status dm_helpers_parse_edid_caps(
  
  	kfree(sads);

kfree(sadb);
+   drm_edid_free(drm_edid);
  
  	return result;

  }




Re: commit 7ffb791423c7 breaks steam game

2025-03-13 Thread Balbir Singh
On 3/14/25 05:12, Bert Karwatzki wrote:
> On Thursday, 2025-03-13 at 22:47 +1100, Balbir Singh wrote:
>>
>>
>> Anyway, I think the nokaslr result is interesting, it seems like with nokaslr
>> even the older kernels have problems with the game
>>
>> Could you confirm if with nokaslr
>>
> Now I've tested kernel 6.8.12 with nokaslr
> 
>> 1. Only one single game stellaris is not working?
>> 2. The entire laptop does not work?
>> 3. Laptop works and other games work? Just one game is not working as
> expected?
> 
> 
> Stellaris is showing the input lag and the entire graphical user interface 
> shows
> the same input lag as long as stellaris is running.
> Civilization 6 shows the same input lag as stellaris, probably even worse.
> Magic the Gathering: Arena (with wine) works normally.
> Valheim also works normally.
> Crusader Kings 2 works normally
> Rogue Heroes: Ruins of Tasos (a Zelda lookalike) works normally.
> Baldur's Gate I & II and Icewind Dale work normally.
> 
> Also the input lag is only in the GUI, if I switch to a text console (ctrl + 
> alt
> + Fn), input works normally even while the affected games are running.
> 
> Games aside everything else (e.g. compiling kernels) seems to work with 
> nokaslr.
> 

Would it be fair to assume that anything Xorg/Wayland is working fine
when the game is not running, even with nokaslr?

+amd-gfx@lists.freedesktop.org to see if there are known issues with
nokaslr and the games you mentioned.


Balbir Singh

PS: I came across an interesting link
https://www.alex-ionescu.com/behind-windows-x64s-44-bit-memory-addressing-limit/

I think SLIST_HEADER is used by wine as well for user space and I am not sure
if in this situation the game is hitting this scenario, but surprisingly the 
other
games are not. This is assuming the game uses wine. I am not sure it's related,
but the 44 bits caught my attention.


Re: [PATCH 07/11] drm/amdgpu/gfx11: add support for disable_kq

2025-03-13 Thread Rodrigo Siqueira
On 03/13, Alex Deucher wrote:
> Plumb in support for disabling kernel queues in
> GFX11.  We have to bring up a GFX queue briefly in
> order to initialize the clear state.  After that
> we can disable it.
> 
> v2: use ring counts per Felix' suggestion
> v3: fix stream fault handler, enable EOP interrupts
> v4: fix MEC interrupt offset (Sunil)
> 
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 191 ++---
>  1 file changed, 136 insertions(+), 55 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> index 95eefd9a40d28..fde8464cbd3b3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> @@ -1145,6 +1145,10 @@ static int gfx_v11_0_gfx_ring_init(struct 
> amdgpu_device *adev, int ring_id,
>  
>   ring->ring_obj = NULL;
>   ring->use_doorbell = true;
> + if (adev->gfx.disable_kq) {
> + ring->no_scheduler = true;

Hi Alex,

Just a question about this no_scheduler part.

Setting no_scheduler to true means that all of the queues of GFX11 will
not be preempted, right? I suppose you have to do it because you want to
initialize the clear state?

Thanks

> + ring->no_user_submission = true;
> + }
>  
>   if (!ring_id)
>   ring->doorbell_index = adev->doorbell_index.gfx_ring0 << 1;
> @@ -1577,7 +1581,7 @@ static void gfx_v11_0_alloc_ip_dump(struct 
> amdgpu_device *adev)
>  
>  static int gfx_v11_0_sw_init(struct amdgpu_ip_block *ip_block)
>  {
> - int i, j, k, r, ring_id = 0;
> + int i, j, k, r, ring_id;
>   int xcc_id = 0;
>   struct amdgpu_device *adev = ip_block->adev;
>  
> @@ -1710,37 +1714,42 @@ static int gfx_v11_0_sw_init(struct amdgpu_ip_block 
> *ip_block)
>   return r;
>   }
>  
> - /* set up the gfx ring */
> - for (i = 0; i < adev->gfx.me.num_me; i++) {
> - for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) {
> - for (k = 0; k < adev->gfx.me.num_pipe_per_me; k++) {
> - if (!amdgpu_gfx_is_me_queue_enabled(adev, i, k, 
> j))
> - continue;
> -
> - r = gfx_v11_0_gfx_ring_init(adev, ring_id,
> - i, k, j);
> - if (r)
> - return r;
> - ring_id++;
> + if (adev->gfx.num_gfx_rings) {
> + ring_id = 0;
> + /* set up the gfx ring */
> + for (i = 0; i < adev->gfx.me.num_me; i++) {
> + for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) {
> + for (k = 0; k < adev->gfx.me.num_pipe_per_me; 
> k++) {
> + if 
> (!amdgpu_gfx_is_me_queue_enabled(adev, i, k, j))
> + continue;
> +
> + r = gfx_v11_0_gfx_ring_init(adev, 
> ring_id,
> + i, k, j);
> + if (r)
> + return r;
> + ring_id++;
> + }
>   }
>   }
>   }
>  
> - ring_id = 0;
> - /* set up the compute queues - allocate horizontally across pipes */
> - for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
> - for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) {
> - for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; k++) {
> - if (!amdgpu_gfx_is_mec_queue_enabled(adev, 0, i,
> -  k, j))
> - continue;
> + if (adev->gfx.num_compute_rings) {
> + ring_id = 0;
> + /* set up the compute queues - allocate horizontally across 
> pipes */
> + for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
> + for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) {
> + for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; 
> k++) {
> + if 
> (!amdgpu_gfx_is_mec_queue_enabled(adev, 0, i,
> +  k, 
> j))
> + continue;
>  
> - r = gfx_v11_0_compute_ring_init(adev, ring_id,
> - i, k, j);
> - if (r)
> - return r;
> + r = gfx_v11_0_compute_ring_init(adev, 
> ring_id,
> + i, k, 
> j);
> +  

Re: [PATCH 11/11] drm/amdgpu/sdma7: add support for disable_kq

2025-03-13 Thread Rodrigo Siqueira
On 03/13, Alex Deucher wrote:
> When the parameter is set, disable user submissions
> to kernel queues.
> 
> Reviewed-by: Sunil Khatri 
> Signed-off-by: Alex Deucher 
> ---
>  drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c 
> b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> index 92a79296708ae..40d45f738c0a8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
> @@ -1316,6 +1316,9 @@ static int sdma_v7_0_early_init(struct amdgpu_ip_block 
> *ip_block)
>   struct amdgpu_device *adev = ip_block->adev;
>   int r;
>  
> + if (amdgpu_disable_kq == 1)
> + adev->sdma.no_user_submission = true;
> +
>   r = amdgpu_sdma_init_microcode(adev, 0, true);
>   if (r) {
>   DRM_ERROR("Failed to init sdma firmware!\n");
> @@ -1351,6 +1354,7 @@ static int sdma_v7_0_sw_init(struct amdgpu_ip_block 
> *ip_block)
>   ring->ring_obj = NULL;
>   ring->use_doorbell = true;
>   ring->me = i;
> + ring->no_user_submission = adev->sdma.no_user_submission;
>  
>   DRM_DEBUG("SDMA %d use_doorbell being set to: [%s]\n", i,
>   ring->use_doorbell?"true":"false");
> -- 
> 2.48.1
> 

Hi Alex,

I think patch 9-11 could be a squashed in a single one.

Thanks

-- 
Rodrigo Siqueira


Re: [RFC PATCH 0/7] drm/amd/display: more DRM edid common-code to the display driver

2025-03-13 Thread Alex Hung
The series looks fine to me, except one small error in patch 2. I can 
send this series to promotion tests once the error is addressed. Let me 
also check others for comments.


Hi Harry and Leo,

Do you have other concerns before I send this series to promotion tests?


On 3/8/25 07:26, Melissa Wen wrote:

Hi,

I've been working on a new version of [1] that ports the AMD display
driver to use the common `drm_edid` code instead of open and raw edid
handling.

The part of the previous series that didn't make the cut was to replace
the open coded edid parser with `drm_edid_product_id` and `eld` data.
However, when addressing feedback I ran into a design issue in the
latest version because I was trying to not add any Linux-specific code
to the DC code, specifically, DC link detection. In short, we have a few
paths in the DM code that allocate a temporary `drm_edid`, go to the DC,
and back to the DM to handle `drm_edid` data. In the last version, I was
storing this temporary `drm_edid` in the aconnector, but it was
erroneously overriding a still in use `drm_edid`.

In this new version I allocate a temporary `drm_edid` in the DM parser
from raw edid data stored in `dc_edid`, which was actually extracted
from another `drm_edid` in the previous DM caller. This is a workaround
to bypass DC boundaries and parse edid capabilities, but I think we can
do better if we can find a clean way to pass the `drm_edid` through this
kind of DM -> DC -> DM paths.

I checked working on top of Thomas' work[2] that replaces `dc_edid` by
`drm_edid` and adds this DRM struct and its helpers inside DC. The
resulted changes work stable and as expected[3], but I believe that
adding linux-specific code to DC is not desirable.

Siqueira and I have discussed other alternatives, such as updating the
DC code to match `drm_edid` structs or checking how well the change in
this series could work with `drm_edid` as a private obj[4], however we
would like to know AMD team's opinion before making this big effort (and
probably noisy change). The main goal here is removing `drm_edid_raw`
calls and duplicated code to handle edid in DC/link_detection that can
take advantage of DRM common-code instead.

Please, let me know your thoughts.

Melissa

[1] 
https://lore.kernel.org/amd-gfx/20240918213845.158293-1-mario.limoncie...@amd.com/
[2] 
https://lore.kernel.org/amd-gfx/20241112-amdgpu-drm_edid-v2-0-1399dc0f0...@weissschuh.net/
[3] 
https://gitlab.freedesktop.org/mwen/linux-amd/-/commits/drm_edid_product_id_v4
[4] https://gitlab.freedesktop.org/mwen/linux-amd/-/commits/drm_edid_priv

Melissa Wen (7):
   drm/amd/display: make sure drm_edid stored in aconnector doesn't leak
   drm/amd/display: start using drm_edid helpers to parse EDID caps
   drm/amd/display: use drm_edid_product_id for parsing EDID product info
   drm/amd/display: parse display name from drm_eld
   drm/amd/display: get panel id with drm_edid helper
   drm/amd/display: get SAD from drm_eld when parsing EDID caps
   drm/amd/display: get SADB from drm_eld when parsing EDID caps

  .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +
  .../amd/display/amdgpu_dm/amdgpu_dm_helpers.c | 85 +--
  2 files changed, 42 insertions(+), 45 deletions(-)





Re: [PATCH] drm/amdgpu: Add debug masks for HDCP LC FW testing

2025-03-13 Thread Pillai, Aurabindo
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Aurabindo Pillai 


--

Regards,
Jay

From: Dominik Kaszewski 
Sent: Thursday, March 13, 2025 4:52 AM
To: amd-gfx@lists.freedesktop.org 
Cc: Pillai, Aurabindo 
Subject: [PATCH] drm/amdgpu: Add debug masks for HDCP LC FW testing

HDCP Locality Check is being moved to FW, add debug flags to control
its behavior in existing hardware for validation purposes.

Signed-off-by: Dominik Kaszewski 
---
 drivers/gpu/drm/amd/include/amd_shared.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/include/amd_shared.h 
b/drivers/gpu/drm/amd/include/amd_shared.h
index 485b713cfad0..4c95b885d1d0 100644
--- a/drivers/gpu/drm/amd/include/amd_shared.h
+++ b/drivers/gpu/drm/amd/include/amd_shared.h
@@ -358,6 +358,18 @@ enum DC_DEBUG_MASK {
  * @DC_DISABLE_CUSTOM_BRIGHTNESS_CURVE: If set, disable support for 
custom brightness curves
  */
 DC_DISABLE_CUSTOM_BRIGHTNESS_CURVE = 0x4,
+
+   /**
+* @DC_HDCP_LC_FORCE_FW_ENABLE: If set, use HDCP Locality Check FW
+* path regardless of reported HW capabilities.
+*/
+   DC_HDCP_LC_FORCE_FW_ENABLE = 0x8,
+
+   /**
+* @DC_HDCP_LC_ENABLE_SW_FALLBACK: If set, upon HDCP Locality Check FW
+* path failure, retry using legacy SW path.
+*/
+   DC_HDCP_LC_ENABLE_SW_FALLBACK = 0x10,
 };

 enum amd_dpm_forced_level;
--
2.43.0



[PATCH] drm/amdgpu: drop drm_firmware_drivers_only()

2025-03-13 Thread Alex Deucher
There are a number of systems and cloud providers out there
that have nomodeset hardcoded in their kernel parameters
to block nouveau for the nvidia driver.  This prevents the
amdgpu driver from loading. Unfortunately the end user cannot
easily change this.  The preferred way to block modules from
loading is to use modprobe.blacklist=.  That is what
providers should be using to block specific drivers.

Drop the check to allow the driver to load even when nomodeset
is specified on the kernel command line.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 3b34fdd105937..dd86661153582 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -3022,9 +3022,6 @@ static int __init amdgpu_init(void)
 {
int r;
 
-   if (drm_firmware_drivers_only())
-   return -EINVAL;
-
r = amdgpu_sync_init();
if (r)
goto error_sync;
-- 
2.48.1



Re: [PATCH V5 00/11] Add disable kernel queue support

2025-03-13 Thread Rodrigo Siqueira
On 03/13, Alex Deucher wrote:
> To better evaluate user queues, add a module parameter
> to disable kernel queues.  With this set kernel queues
> are disabled and only user queues are available.  This
> frees up hardware resources for use in user queues which
> would otherwise be used by kernel queues and provides
> a way to validate user queues without the presence
> of kernel queues.

Hi Alex,

I'm trying to understand how GFX and MES deal with different queues, and
I used this patchset to guide me through that. In this sense, could you
help me with the following points?

FWIU, the GFX has what are called pipes, which in turn have hardware
queues associated with them. For example, a GFX can have 2 pipes, and
each pipe could have 2 hardware queues; or it could have 1 pipe and 8
queues. Is this correct?

(for this next part, suppose 1 pipe 2 hardware queues)
By default, one of the hardware queues is reserved for the Kernel Queue,
and the user space could use the other. GFX has the MES block "connected"
to all pipe queues, and MES is responsible for scheduling different ring
buffers (in memory) in the pipe's hardware queue (effectively making the
ring active). However, since the kernel queue is always present, MES
only performs scheduling in one of the hardware queues. This scheduling
occurs with the MES mapping and unmapping available Rings in memory to
the hardware queue.

Does the above description sound correct to you?  How about the below
diagram? Does it look correct to you?

(I hope the diagram looks fine in your email client; if not, I can
attach a picture of it.)

+----------------------------------------------------------------+
|                              GFX                               |
|                                                                |
|         +----------------------+                               |
|         | (Hardware Queue 0)   |   (Hw Queue 0)                |
|         | Kernel Queue         |--- No MES Scheduling          |
|         | (No eviction)        |                               |
| PIPE 0  +----------------------+                               |
|         | (Hardware Queue 1)   |   (Hw Queue 1)                |
|         | User Queue           |<---+  MES Schedules           |
|         +----------------------+    |  (Un/Map Ring)           |
+-------------------------------------|--------------------------+
                                      |
 +------------------------------------v---+
 |                 MEMORY                 |

[PATCH 04/11] drm/amdgpu/mes: centralize gfx_hqd mask management

2025-03-13 Thread Alex Deucher
Move it to amdgpu_mes to align with the compute and
sdma hqd masks. No functional change.

v2: rebase on new changes

Reviewed-by: Sunil Khatri
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c | 22 ++
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c  | 16 +++-
 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c  | 15 +++
 3 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
index 6f5e272d7ded3..5abc1ca0fee98 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mes.c
@@ -108,6 +108,28 @@ int amdgpu_mes_init(struct amdgpu_device *adev)
adev->mes.vmid_mask_mmhub = 0xff00;
adev->mes.vmid_mask_gfxhub = 0xff00;
 
+   for (i = 0; i < AMDGPU_MES_MAX_GFX_PIPES; i++) {
+   /* use only 1st ME pipe */
+   if (i >= adev->gfx.me.num_pipe_per_me)
+   continue;
+   if (amdgpu_ip_version(adev, GC_HWIP, 0) >=
+   IP_VERSION(12, 0, 0))
+   /*
+* GFX V12 has only one GFX pipe, but 8 queues in it.
+* GFX pipe 0 queue 0 is being used by Kernel queue.
+* Set GFX pipe 0 queue 1-7 for MES scheduling
+* mask =  1110b
+*/
+   adev->mes.gfx_hqd_mask[i] = 0xFE;
+   else
+   /*
+* GFX pipe 0 queue 0 is being used by Kernel queue.
+* Set GFX pipe 0 queue 1 for MES scheduling
+* mask = 10b
+*/
+   adev->mes.gfx_hqd_mask[i] = 0x2;
+   }
+
for (i = 0; i < AMDGPU_MES_MAX_COMPUTE_PIPES; i++) {
/* use only 1st MEC pipes */
if (i >= adev->gfx.mec.num_pipe_per_mec)
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c 
b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index a569d09a1a748..39b45d8b5f049 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -669,18 +669,6 @@ static int mes_v11_0_misc_op(struct amdgpu_mes *mes,
offsetof(union MESAPI__MISC, api_status));
 }
 
-static void mes_v11_0_set_gfx_hqd_mask(union MESAPI_SET_HW_RESOURCES *pkt)
-{
-   /*
-* GFX pipe 0 queue 0 is being used by Kernel queue.
-* Set GFX pipe 0 queue 1 for MES scheduling
-* mask = 10b
-* GFX pipe 1 can't be used for MES due to HW limitation.
-*/
-   pkt->gfx_hqd_mask[0] = 0x2;
-   pkt->gfx_hqd_mask[1] = 0;
-}
-
 static int mes_v11_0_set_hw_resources(struct amdgpu_mes *mes)
 {
int i;
@@ -705,7 +693,9 @@ static int mes_v11_0_set_hw_resources(struct amdgpu_mes 
*mes)
mes_set_hw_res_pkt.compute_hqd_mask[i] =
mes->compute_hqd_mask[i];
 
-   mes_v11_0_set_gfx_hqd_mask(&mes_set_hw_res_pkt);
+   for (i = 0; i < MAX_GFX_PIPES; i++)
+   mes_set_hw_res_pkt.gfx_hqd_mask[i] =
+   mes->gfx_hqd_mask[i];
 
for (i = 0; i < MAX_SDMA_PIPES; i++)
mes_set_hw_res_pkt.sdma_hqd_mask[i] = mes->sdma_hqd_mask[i];
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c 
b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
index 96336652d14c5..519f054bec60d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
@@ -694,17 +694,6 @@ static int mes_v12_0_set_hw_resources_1(struct amdgpu_mes 
*mes, int pipe)
offsetof(union MESAPI_SET_HW_RESOURCES_1, api_status));
 }
 
-static void mes_v12_0_set_gfx_hqd_mask(union MESAPI_SET_HW_RESOURCES *pkt)
-{
-   /*
-* GFX V12 has only one GFX pipe, but 8 queues in it.
-* GFX pipe 0 queue 0 is being used by Kernel queue.
-* Set GFX pipe 0 queue 1-7 for MES scheduling
-* mask =  1110b
-*/
-   pkt->gfx_hqd_mask[0] = 0xFE;
-}
-
 static int mes_v12_0_set_hw_resources(struct amdgpu_mes *mes, int pipe)
 {
int i;
@@ -727,7 +716,9 @@ static int mes_v12_0_set_hw_resources(struct amdgpu_mes 
*mes, int pipe)
mes_set_hw_res_pkt.compute_hqd_mask[i] =
mes->compute_hqd_mask[i];
 
-   mes_v12_0_set_gfx_hqd_mask(&mes_set_hw_res_pkt);
+   for (i = 0; i < MAX_GFX_PIPES; i++)
+   mes_set_hw_res_pkt.gfx_hqd_mask[i] =
+   mes->gfx_hqd_mask[i];
 
for (i = 0; i < MAX_SDMA_PIPES; i++)
mes_set_hw_res_pkt.sdma_hqd_mask[i] =
-- 
2.48.1



Re: [PATCH 07/11] drm/amdgpu/gfx11: add support for disable_kq

2025-03-13 Thread Alex Deucher
On Thu, Mar 13, 2025 at 6:08 PM Rodrigo Siqueira  wrote:
>
> On 03/13, Alex Deucher wrote:
> > Plumb in support for disabling kernel queues in
> > GFX11.  We have to bring up a GFX queue briefly in
> > order to initialize the clear state.  After that
> > we can disable it.
> >
> > v2: use ring counts per Felix' suggestion
> > v3: fix stream fault handler, enable EOP interrupts
> > v4: fix MEC interrupt offset (Sunil)
> >
> > Signed-off-by: Alex Deucher 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c | 191 ++---
> >  1 file changed, 136 insertions(+), 55 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c 
> > b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> > index 95eefd9a40d28..fde8464cbd3b3 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v11_0.c
> > @@ -1145,6 +1145,10 @@ static int gfx_v11_0_gfx_ring_init(struct 
> > amdgpu_device *adev, int ring_id,
> >
> >   ring->ring_obj = NULL;
> >   ring->use_doorbell = true;
> > + if (adev->gfx.disable_kq) {
> > + ring->no_scheduler = true;
>
> Hi Alex,
>
> Just a question about this no_scheduler part.
>
> Set no_scheduler to true, means that all of the queues of GFX11 will not
> be preempted, right? I suppose you have to do it because you want to
> initialize the clear state?

Not exactly.  We just spin up a gfx queue long enough to submit the
clear state setup and then we tear it down so its queue slot is
available for user queues.  So it's not actually a usable kernel queue
at runtime.  Setting the no_scheduler flag prevents a drm scheduler
from being initialized for the queue.
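
[Editorial sketch] A minimal model of that behavior (invented names, not the actual amdgpu init path): the only effect of the flag modeled here is that scheduler setup is skipped for the ring.

```c
#include <assert.h>
#include <stdbool.h>

struct ring {
	bool no_scheduler; /* set when the queue is only used briefly at init */
	bool sched_ready;  /* whether a drm scheduler was attached */
};

/* Sketch of the decision described above: a ring flagged no_scheduler
 * never gets a scheduler instance attached at init time. */
static void ring_init_sched(struct ring *r)
{
	if (r->no_scheduler)
		return;            /* used only for clear-state setup, then torn down */
	r->sched_ready = true; /* normal rings get a scheduler */
}
```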

Alex

>
> Thanks
>
> > + ring->no_user_submission = true;
> > + }
> >
> >   if (!ring_id)
> >   ring->doorbell_index = adev->doorbell_index.gfx_ring0 << 1;
> > @@ -1577,7 +1581,7 @@ static void gfx_v11_0_alloc_ip_dump(struct 
> > amdgpu_device *adev)
> >
> >  static int gfx_v11_0_sw_init(struct amdgpu_ip_block *ip_block)
> >  {
> > - int i, j, k, r, ring_id = 0;
> > + int i, j, k, r, ring_id;
> >   int xcc_id = 0;
> >   struct amdgpu_device *adev = ip_block->adev;
> >
> > @@ -1710,37 +1714,42 @@ static int gfx_v11_0_sw_init(struct amdgpu_ip_block 
> > *ip_block)
> >   return r;
> >   }
> >
> > - /* set up the gfx ring */
> > - for (i = 0; i < adev->gfx.me.num_me; i++) {
> > - for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) {
> > - for (k = 0; k < adev->gfx.me.num_pipe_per_me; k++) {
> > - if (!amdgpu_gfx_is_me_queue_enabled(adev, i, 
> > k, j))
> > - continue;
> > -
> > - r = gfx_v11_0_gfx_ring_init(adev, ring_id,
> > - i, k, j);
> > - if (r)
> > - return r;
> > - ring_id++;
> > + if (adev->gfx.num_gfx_rings) {
> > + ring_id = 0;
> > + /* set up the gfx ring */
> > + for (i = 0; i < adev->gfx.me.num_me; i++) {
> > + for (j = 0; j < adev->gfx.me.num_queue_per_pipe; j++) 
> > {
> > + for (k = 0; k < adev->gfx.me.num_pipe_per_me; 
> > k++) {
> > + if 
> > (!amdgpu_gfx_is_me_queue_enabled(adev, i, k, j))
> > + continue;
> > +
> > + r = gfx_v11_0_gfx_ring_init(adev, 
> > ring_id,
> > + i, k, j);
> > + if (r)
> > + return r;
> > + ring_id++;
> > + }
> >   }
> >   }
> >   }
> >
> > - ring_id = 0;
> > - /* set up the compute queues - allocate horizontally across pipes */
> > - for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
> > - for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; j++) {
> > - for (k = 0; k < adev->gfx.mec.num_pipe_per_mec; k++) {
> > - if (!amdgpu_gfx_is_mec_queue_enabled(adev, 0, 
> > i,
> > -  k, j))
> > - continue;
> > + if (adev->gfx.num_compute_rings) {
> > + ring_id = 0;
> > + /* set up the compute queues - allocate horizontally across 
> > pipes */
> > + for (i = 0; i < adev->gfx.mec.num_mec; ++i) {
> > + for (j = 0; j < adev->gfx.mec.num_queue_per_pipe; 
> > j++) {
> > + for (k = 0; k < 
> > adev->gfx.mec.num_pipe_per_mec; k++) {
> > + if 
> > (!amdgpu_gfx_is_mec_queue_enabled(adev, 0, i

Re: [PATCH 02/11] drm/amdgpu: add ring flag for no user submissions

2025-03-13 Thread Alex Deucher
On Thu, Mar 13, 2025 at 5:53 PM Rodrigo Siqueira  wrote:
>
> On 03/13, Alex Deucher wrote:
> > This would be set by IPs which only accept submissions
> > from the kernel, not userspace, such as when kernel
> > queues are disabled. Don't expose the rings to userspace
> > and reject any submissions in the CS IOCTL.
>
> Just out of curiosity, is CS == Command Submission?

Yes.

>
> >
> > Reviewed-by: Sunil Khatri
> > Signed-off-by: Alex Deucher 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c   |  4 
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  | 30 
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  2 +-
> >  3 files changed, 25 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > index 5df21529b3b13..5cc18034b75df 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> > @@ -349,6 +349,10 @@ static int amdgpu_cs_p2_ib(struct amdgpu_cs_parser *p,
> >   ring = amdgpu_job_ring(job);
> >   ib = &job->ibs[job->num_ibs++];
> >
> > + /* submissions to kernel queus are disabled */
>
> /queus/queues/

will fix.

>
> > + if (ring->no_user_submission)
> > + return -EINVAL;
>
> Since this will be set for ASICs that don't accept userspace
> submissions, maybe -ENOTSUPP would be more accurate?

I could go either way on that.  When kernel queues are disabled, CS
submissions to a kernel queue are not supported, but it's also an
invalid parameter to the IOCTL.  I don't have a strong opinion.  Maybe
-ENOTSUPP would differentiate it from just general invalid parameters.

Alex

>
> Thanks
>
> > +
> >   /* MM engine doesn't support user fences */
> >   if (p->uf_bo && ring->funcs->no_user_fence)
> >   return -EINVAL;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > index cd6eb7a3bc58a..3b7dfd56ccd0e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> > @@ -408,7 +408,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
> >   case AMDGPU_HW_IP_GFX:
> >   type = AMD_IP_BLOCK_TYPE_GFX;
> >   for (i = 0; i < adev->gfx.num_gfx_rings; i++)
> > - if (adev->gfx.gfx_ring[i].sched.ready)
> > + if (adev->gfx.gfx_ring[i].sched.ready &&
> > + !adev->gfx.gfx_ring[i].no_user_submission)
> >   ++num_rings;
> >   ib_start_alignment = 32;
> >   ib_size_alignment = 32;
> > @@ -416,7 +417,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
> >   case AMDGPU_HW_IP_COMPUTE:
> >   type = AMD_IP_BLOCK_TYPE_GFX;
> >   for (i = 0; i < adev->gfx.num_compute_rings; i++)
> > - if (adev->gfx.compute_ring[i].sched.ready)
> > + if (adev->gfx.compute_ring[i].sched.ready &&
> > + !adev->gfx.compute_ring[i].no_user_submission)
> >   ++num_rings;
> >   ib_start_alignment = 32;
> >   ib_size_alignment = 32;
> > @@ -424,7 +426,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
> >   case AMDGPU_HW_IP_DMA:
> >   type = AMD_IP_BLOCK_TYPE_SDMA;
> >   for (i = 0; i < adev->sdma.num_instances; i++)
> > - if (adev->sdma.instance[i].ring.sched.ready)
> > + if (adev->sdma.instance[i].ring.sched.ready &&
> > + !adev->gfx.gfx_ring[i].no_user_submission)
> >   ++num_rings;
> >   ib_start_alignment = 256;
> >   ib_size_alignment = 4;
> > @@ -435,7 +438,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
> >   if (adev->uvd.harvest_config & (1 << i))
> >   continue;
> >
> > - if (adev->uvd.inst[i].ring.sched.ready)
> > + if (adev->uvd.inst[i].ring.sched.ready &&
> > + !adev->uvd.inst[i].ring.no_user_submission)
> >   ++num_rings;
> >   }
> >   ib_start_alignment = 256;
> > @@ -444,7 +448,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
> >   case AMDGPU_HW_IP_VCE:
> >   type = AMD_IP_BLOCK_TYPE_VCE;
> >   for (i = 0; i < adev->vce.num_rings; i++)
> > - if (adev->vce.ring[i].sched.ready)
> > + if (adev->vce.ring[i].sched.ready &&
> > + !adev->vce.ring[i].no_user_submission)
> >   ++num_rings;
> >   ib_start_alignment = 256;
> >   ib_size_alignment = 4;
> > @@ -456,7 +461,8 @@ static int amdgpu_hw_ip_info(struct amdgpu_device *adev,
> >   continue

[PATCH 3/3] drm/amdgpu: don't free conflicting apertures for non-display devices

2025-03-13 Thread Alex Deucher
PCI_CLASS_ACCELERATOR_PROCESSING devices won't ever be
the sysfb, so there is no need to free conflicting
apertures.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 1eff6252f66b4..689f1833d02b7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4401,10 +4401,17 @@ int amdgpu_device_init(struct amdgpu_device *adev,
if (r)
return r;
 
-   /* Get rid of things like offb */
-   r = aperture_remove_conflicting_pci_devices(adev->pdev, 
amdgpu_kms_driver.name);
-   if (r)
-   return r;
+   /*
+* No need to remove conflicting FBs for non-display class devices.
+* This prevents the sysfb from being freed accidentally.
+*/
+   if ((pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA ||
+   (pdev->class >> 8) == PCI_CLASS_DISPLAY_OTHER) {
+   /* Get rid of things like offb */
+   r = aperture_remove_conflicting_pci_devices(adev->pdev, 
amdgpu_kms_driver.name);
+   if (r)
+   return r;
+   }
 
/* Enable TMZ based on IP_VERSION */
amdgpu_gmc_tmz_set(adev);
-- 
2.48.1



[PATCH 1/3] drm/amdgpu: drop drm_firmware_drivers_only()

2025-03-13 Thread Alex Deucher
There are a number of systems and cloud providers out there
that have nomodeset hardcoded in their kernel parameters
to block nouveau for the nvidia driver.  This prevents the
amdgpu driver from loading. Unfortunately the end user cannot
easily change this.  The preferred way to block modules from
loading is to use modprobe.blacklist=.  That is what
providers should be using to block specific drivers.

Drop the check to allow the driver to load even when nomodeset
is specified on the kernel command line.
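
[Editorial sketch] A hedged example of what a provider's kernel configuration could carry instead of the blanket nomodeset; "nouveau" is only an illustrative driver name:

```shell
# Kernel command line fragment (e.g. in GRUB_CMDLINE_LINUX), blocking
# only the specific unwanted driver rather than all modesetting drivers:
#   modprobe.blacklist=nouveau
# Or persistently via modprobe configuration:
echo "blacklist nouveau" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
```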

Reviewed-by: Kent Russell 
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 3b34fdd105937..dd86661153582 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -3022,9 +3022,6 @@ static int __init amdgpu_init(void)
 {
int r;
 
-   if (drm_firmware_drivers_only())
-   return -EINVAL;
-
r = amdgpu_sync_init();
if (r)
goto error_sync;
-- 
2.48.1



[PATCH 2/3] drm/amdgpu: adjust drm_firmware_drivers_only() handling

2025-03-13 Thread Alex Deucher
Move to probe so we can check the PCI device type and
only apply the drm_firmware_drivers_only() check for
PCI DISPLAY classes.  Also add a module parameter to
override the nomodeset kernel parameter as a workaround
for platforms that have this hardcoded on their kernel
command lines.

Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index dd86661153582..4e1a6a249bba5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -178,6 +178,7 @@ uint amdgpu_sdma_phase_quantum = 32;
 char *amdgpu_disable_cu;
 char *amdgpu_virtual_display;
 bool enforce_isolation;
+int amdgpu_modeset = -1;
 
 /* Specifies the default granularity for SVM, used in buffer
  * migration and restoration of backing memory when handling
@@ -1040,6 +1041,13 @@ module_param_named(user_partt_mode, 
amdgpu_user_partt_mode, uint, 0444);
 module_param(enforce_isolation, bool, 0444);
 MODULE_PARM_DESC(enforce_isolation, "enforce process isolation between 
graphics and compute . enforce_isolation = on");
 
+/**
+ * DOC: modeset (int)
+ * Override nomodeset (1 = override, -1 = auto). The default is -1 (auto).
+ */
+MODULE_PARM_DESC(modeset, "Override nomodeset (1 = enable, -1 = auto)");
+module_param_named(modeset, amdgpu_modeset, int, 0444);
+
 /**
  * DOC: seamless (int)
  * Seamless boot will keep the image on the screen during the boot process.
@@ -2270,6 +2278,12 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
int ret, retry = 0, i;
bool supports_atomic = false;
 
+   if ((pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA ||
+   (pdev->class >> 8) == PCI_CLASS_DISPLAY_OTHER) {
+   if (drm_firmware_drivers_only() && amdgpu_modeset == -1)
+   return -EINVAL;
+   }
+
/* skip devices which are owned by radeon */
for (i = 0; i < ARRAY_SIZE(amdgpu_unsupported_pciidlist); i++) {
if (amdgpu_unsupported_pciidlist[i] == pdev->device)
-- 
2.48.1



Re: [PATCH V5 00/11] Add disable kernel queue support

2025-03-13 Thread Alex Deucher
On Thu, Mar 13, 2025 at 6:21 PM Rodrigo Siqueira  wrote:
>
> > On 03/13, Alex Deucher wrote:
> > To better evaluate user queues, add a module parameter
> > to disable kernel queues.  With this set kernel queues
> > are disabled and only user queues are available.  This
> > frees up hardware resources for use in user queues which
> > would otherwise be used by kernel queues and provides
> > a way to validate user queues without the presence
> > of kernel queues.
>
> Hi Alex,
>
> I'm trying to understand how GFX and MES deal with different queues, and
> I used this patchset to guide me through that. In this sense, could you
> help me with the following points?
>
> FWIU, the GFX has what are called pipes, which in turn have hardware
> queues associated with them. For example, a GFX can have 2 pipes, and
> each pipe could have 2 hardware queues; or it could have 1 pipe and 8
> queue. Is this correct?

Right.  For gfx, compute, and SDMA you have pipes (called instances on
SDMA) and queues.  A pipe can only execute one queue at a time.  The
pipe will switch between all of the mapped queues.  You have storage
in memory (called an MQD -- Memory Queue Descriptor) which defines the
state of the queue (GPU virtual addresses of the queue itself, save
areas, doorbell, etc.).  The queues that the pipe switches between are
defined by HQDs (Hardware Queue Descriptors).  These are basically
register based memory for the queues that the pipe can switch between.
The driver sets up an MQD for each queue that it creates.  The MQDs
are then handed to the MES firmware for mapping.  The MES firmware can
map a queue as a legacy queue (i.e. a kernel queue) or a user queue.
The difference is that a legacy queue is statically mapped to a HQD
and is never preempted.  User queues are dynamically mapped to the
HQDs by the MES firmware.  If there are more MQDs than HQDs, the MES
firmware will preempt other user queues to make sure each queue gets a
time slice.
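
[Editorial sketch] A toy model of the mapping described above (all identifiers invented for illustration, not driver code): legacy/kernel queues pin an HQD permanently, while user-queue MQDs are time-sliced through whatever HQDs remain.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_HQDS 2 /* register-backed Hardware Queue Descriptors per pipe */

struct mqd { int id; int legacy; }; /* Memory Queue Descriptor: state in memory */
struct hqd { struct mqd *mapped; }; /* slot the pipe actually executes from */

/* Legacy (kernel) queues: statically mapped to a free HQD, never preempted. */
static int map_legacy(struct hqd *hqds, struct mqd *q)
{
	int i;

	for (i = 0; i < NUM_HQDS; i++) {
		if (!hqds[i].mapped) {
			hqds[i].mapped = q;
			return i;   /* pinned to this slot from now on */
		}
	}
	return -1;                  /* no HQD left */
}

/* User queues: dynamically rotated through the non-legacy HQDs each time
 * slice, which is how oversubscription (more MQDs than HQDs) is handled. */
static void slice_user_queues(struct hqd *hqds, struct mqd *users, int n, int tick)
{
	int i;

	for (i = 0; i < NUM_HQDS; i++) {
		if (hqds[i].mapped && hqds[i].mapped->legacy)
			continue;                      /* never evict a legacy mapping */
		hqds[i].mapped = &users[tick % n]; /* preempt/map a user MQD */
	}
}
```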

>
> (for this next part, suppose 1 pipe 2 hardware queues)
> By default, one of the hardware queues is reserved for the Kernel Queue,
> and the user space could use the other. GFX has the MES block "connected"
> to all pipe queues, and MES is responsible for scheduling different ring
> buffers (in memory) in the pipe's hardware queue (effectively making the
> ring active). However, since the kernel queue is always present, MES
> only performs scheduling in one of the hardware queues. This scheduling
> occurs with the MES mapping and unmapping available Rings in memory to
> the hardware queue.
>
> Does the above description sound correct to you?  How about the below
> diagram? Does it look correct to you?

More or less.  The MES handles all of the queues (kernel or user).
The only real difference is that kernel queues are statically mapped
to an HQD while user queues are dynamically scheduled in the available
HQDs based on level of over-subscription.  E.g., if you have hardware
with 1 pipe and 2 HQDs you could have a kernel queue on 1 HQD and the
MES would schedule all of the user queues on the remaining 1 HQD.  If
you don't enable any kernel queues, then you have 2 HQDs that the MES
can use for scheduling user queues.

>
> (I hope the diagram looks fine in your email client; if not, I can
> attach a picture of it.)
>
> +--------------------------------------------------------------+
> |                             GFX                              |
> |                                                              |
> |         +--------------------+  (Hw Queue 0)                 |
> |         | (Hardware Queue 0) |---> Kernel Queue (No eviction)|
> | PIPE 0  |--------------------|       +--- No MES Scheduling  |
> |         | (Hardware Queue 1) |--+                            |
> |         +--------------------+  |  (Hw Queue 1)              |
> |                                 +---> MES Schedules          |
> +---

[PATCH] drm/amdgpu: Enable ACA by default for psp v13_0_6/v13_0_14

2025-03-13 Thread Xiang Liu
Enable ACA by default for psp v13_0_6/v13_0_14.

Signed-off-by: Xiang Liu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 7cf8a3036828..cfec29835634 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -3785,9 +3785,11 @@ static void amdgpu_ras_check_supported(struct 
amdgpu_device *adev)
adev->ras_enabled = amdgpu_ras_enable == 0 ? 0 :
adev->ras_hw_enabled & amdgpu_ras_mask;
 
-   /* aca is disabled by default except for psp v13_0_12 */
+   /* aca is disabled by default except for psp v13_0_6/v13_0_12/v13_0_14 
*/
adev->aca.is_enabled =
-   (amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 12));
+   (amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 6) ||
+amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 12) 
||
+amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 14));
 
/* bad page feature is not applicable to specific app platform */
if (adev->gmc.is_app_apu &&
-- 
2.34.1



RE: [PATCH] drm/amdgpu: Fix computation for remain size of CPER ring

2025-03-13 Thread Zhang, Hawking
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: Liu, Xiang(Dean) 
Sent: Thursday, March 13, 2025 11:28
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Zhou1, Tao ; 
Liu, Xiang(Dean) 
Subject: [PATCH] drm/amdgpu: Fix computation for remain size of CPER ring

The miscomputed remaining size of the CPER ring causes an unbreakable while
loop when the CPER ring overflows.
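
[Editorial sketch] The bookkeeping change can be modeled in isolation (simplified from the patch): clamp the remaining-DW counter to zero on overflow before subtracting, so it can never be driven negative and wedge the loop scanning for a record header.

```c
#include <assert.h>

/* Simplified model of the remaining-space counter handling in the fix:
 * when the record is larger than the space still accounted for (overflow),
 * clamp the counter to zero instead of letting it go negative. */
static int consume_count_dw(int count_dw, int rec_cnt_dw)
{
	if (count_dw < rec_cnt_dw)
		return 0;                 /* overflow case: clamp */
	return count_dw - rec_cnt_dw; /* normal case: subtract */
}
```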

Signed-off-by: Xiang Liu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c
index 47fe8a04e26a..d4e90785ee33 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cper.c
@@ -452,10 +452,10 @@ static u32 amdgpu_cper_ring_get_ent_sz(struct amdgpu_ring 
*ring, u64 pos)
return umin(rec_len, chunk);
 }

-void amdgpu_cper_ring_write(struct amdgpu_ring *ring,
- void *src, int count)
+void amdgpu_cper_ring_write(struct amdgpu_ring *ring, void *src, int
+count)
 {
u64 pos, wptr_old, rptr = *ring->rptr_cpu_addr & ring->ptr_mask;
+   int rec_cnt_dw = count >> 2;
u32 chunk, ent_sz;
u8 *s = (u8 *)src;

@@ -482,6 +482,9 @@ void amdgpu_cper_ring_write(struct amdgpu_ring *ring,
s += chunk;
}

+   if (ring->count_dw < rec_cnt_dw)
+   ring->count_dw = 0;
+
/* the buffer is overflow, adjust rptr */
if (((wptr_old < rptr) && (rptr <= ring->wptr)) ||
((ring->wptr < wptr_old) && (wptr_old < rptr)) || @@ -498,12 
+501,10 @@ void amdgpu_cper_ring_write(struct amdgpu_ring *ring,
pos = rptr;
} while (!amdgpu_cper_is_hdr(ring, rptr));
}
-   mutex_unlock(&ring->adev->cper.ring_lock);

-   if (ring->count_dw >= (count >> 2))
-   ring->count_dw -= (count >> 2);
-   else
-   ring->count_dw = 0;
+   if (ring->count_dw >= rec_cnt_dw)
+   ring->count_dw -= rec_cnt_dw;
+   mutex_unlock(&ring->adev->cper.ring_lock);
 }

 static u64 amdgpu_cper_ring_get_rptr(struct amdgpu_ring *ring)
--
2.34.1



RE: [PATCH] drm/amdgpu: format old RAS eeprom data into V3 version

2025-03-13 Thread Zhang, Hawking
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: amd-gfx  On Behalf Of Tao Zhou
Sent: Wednesday, March 12, 2025 18:06
To: amd-gfx@lists.freedesktop.org
Cc: Zhou1, Tao 
Subject: [PATCH] drm/amdgpu: format old RAS eeprom data into V3 version

Clear old data and save it in V3 format.

v2: only format eeprom data for new ASICs.

Signed-off-by: Tao Zhou 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c   |  7 +
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c| 26 ++-
 .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.h|  1 +
 3 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 837f33698b38..d3b9b4d9fb89 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -3465,6 +3465,13 @@ int amdgpu_ras_init_badpage_info(struct amdgpu_device 
*adev)
adev, control->bad_channel_bitmap);
con->update_channel_flag = false;
}
+
+   /* The format action is only applied to new ASICs */
+   if (IP_VERSION_MAJ(amdgpu_ip_version(adev, UMC_HWIP, 0)) >= 12 
&&
+   control->tbl_hdr.version < RAS_TABLE_VER_V3)
+   if (!amdgpu_ras_eeprom_reset_table(control))
+   if (amdgpu_ras_save_bad_pages(adev, NULL))
+   dev_warn(adev->dev, "Failed to format 
RAS EEPROM data in V3
+version!\n");
}

return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
index 09a6f8bc1a5a..71dddb8983ee 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c
@@ -413,9 +413,11 @@ static void amdgpu_ras_set_eeprom_table_version(struct 
amdgpu_ras_eeprom_control

switch (amdgpu_ip_version(adev, UMC_HWIP, 0)) {
case IP_VERSION(8, 10, 0):
-   case IP_VERSION(12, 0, 0):
hdr->version = RAS_TABLE_VER_V2_1;
return;
+   case IP_VERSION(12, 0, 0):
+   hdr->version = RAS_TABLE_VER_V3;
+   return;
default:
hdr->version = RAS_TABLE_VER_V1;
return;
@@ -443,7 +445,7 @@ int amdgpu_ras_eeprom_reset_table(struct 
amdgpu_ras_eeprom_control *control)
hdr->header = RAS_TABLE_HDR_VAL;
amdgpu_ras_set_eeprom_table_version(control);

-   if (hdr->version == RAS_TABLE_VER_V2_1) {
+   if (hdr->version >= RAS_TABLE_VER_V2_1) {
hdr->first_rec_offset = RAS_RECORD_START_V2_1;
hdr->tbl_size = RAS_TABLE_HEADER_SIZE +
RAS_TABLE_V2_1_INFO_SIZE;
@@ -461,7 +463,7 @@ int amdgpu_ras_eeprom_reset_table(struct 
amdgpu_ras_eeprom_control *control)
}

csum = __calc_hdr_byte_sum(control);
-   if (hdr->version == RAS_TABLE_VER_V2_1)
+   if (hdr->version >= RAS_TABLE_VER_V2_1)
csum += __calc_ras_info_byte_sum(control);
csum = -csum;
hdr->checksum = csum;
@@ -752,7 +754,7 @@ amdgpu_ras_eeprom_update_header(struct 
amdgpu_ras_eeprom_control *control)
"Saved bad pages %d reaches threshold value %d\n",
control->ras_num_bad_pages, 
ras->bad_page_cnt_threshold);
control->tbl_hdr.header = RAS_TABLE_HDR_BAD;
-   if (control->tbl_hdr.version == RAS_TABLE_VER_V2_1) {
+   if (control->tbl_hdr.version >= RAS_TABLE_VER_V2_1) {
control->tbl_rai.rma_status = 
GPU_RETIRED__ECC_REACH_THRESHOLD;
control->tbl_rai.health_percent = 0;
}
@@ -765,7 +767,7 @@ amdgpu_ras_eeprom_update_header(struct 
amdgpu_ras_eeprom_control *control)
amdgpu_dpm_send_rma_reason(adev);
}

-   if (control->tbl_hdr.version == RAS_TABLE_VER_V2_1)
+   if (control->tbl_hdr.version >= RAS_TABLE_VER_V2_1)
control->tbl_hdr.tbl_size = RAS_TABLE_HEADER_SIZE +
RAS_TABLE_V2_1_INFO_SIZE +
control->ras_num_recs * 
RAS_TABLE_RECORD_SIZE; @@ -805,7 +807,7 @@ 
amdgpu_ras_eeprom_update_header(struct amdgpu_ras_eeprom_control *control)
 * now calculate gpu health percent
 */
if (amdgpu_bad_page_threshold != 0 &&
-   control->tbl_hdr.version == RAS_TABLE_VER_V2_1 &&
+   control->tbl_hdr.version >= RAS_TABLE_VER_V2_1 &&
control->ras_num_bad_pages <= ras->bad_page_cnt_threshold)
control->tbl_rai.health_percent = ((ras->bad_page_cnt_threshold 
-
   control->ras_num_bad_pages) 
* 100) / @@ -818,7 +820,7 @@ amdgpu_ras_eeprom_update_header(struct 
amdg

Re: [PATCH 00/63] Fix CONFIG_DRM_USE_DYNAMIC_DEBUG=y

2025-03-13 Thread jim . cromie
hello everyone,

sorry for the late reply.  I have a cleaner version cooking now.
less inter-commit churn, by bringing more cleanups forward.

I'll send a -v2 soon.  (let's forget all the meandering crap versions I sent)

Louis, thanks for testing !

I wrote the test script and submod.ko so the lib/* parts would stand
by themselves.
And this time, I left the old DECLARE_ macro, so DRM doesn't get a
flag-day breakage :-)

But for ease of testing, I'll keep the DRM parts in the series.
Taking the 1st N commits is normal workflow?

On Fri, Feb 28, 2025 at 9:24 AM Louis Chauvet  wrote:
>
>
>
> Le 20/02/2025 à 10:45, Simona Vetter a écrit :
> > On Thu, Feb 20, 2025 at 09:31:41AM +0100, Greg KH wrote:
> >> On Fri, Jan 24, 2025 at 11:45:14PM -0700, Jim Cromie wrote:
> >>> This series fixes dynamic-debug's support for DRM debug-categories.
> >>> Classmaps-v1 evaded full review, and got committed in 2 chunks:
> >>>
> >>>b7b4eebdba7b..6ea3bf466ac6   # core dyndbg changes
> >>>0406faf25fb1..ee7d633f2dfb   # drm adoption
> >>>
> >>> DRM-CI found a regression during init with drm.debug=; the
> >>> static-keys under the drm-dbgs in drm.ko got enabled, but those in
> >>> drivers & helpers did not.
> >>>
> >>> Root Problem:
> >>>
> >>> DECLARE_DYNDBG_CLASSMAP violated a K&R rule "define once, refer
> >>> afterwards".  Replace it with DYNDBG_CLASSMAP_DEFINE (invoked once in
> >>> drm-core) and DYNDBG_CLASSMAP_USE (invoked repeatedly, in drivers &
> >>> helpers).
> >>>
> >>> _DEFINE exports the classmap it creates (in drm.ko), other modules
> >>> _USE the classmap.  The _USE adds a record ref'g the _DEFINEd (&
> >>> exported) classmap, in a 2nd __dyndbg_class_users section.
> >>>
> >>> So now at modprobe, dyndbg scans the new section after the 1st
> >>> __dyndbg_class_maps section, follows the linkage to the _DEFINEr
> >>> module, finds the (optional) kernel-param controlling the classmap,
> >>> examines its drm.debug=, and applies it to the module being
> >>> initialized.
> >>>
> >>> To recapitulate the multi-module problem wo DRM involvement, Add:
> >>>
> >>> A. tools/testing/selftests/dynamic_debug/*
> >>>
> >>> This alters pr_debugs in the test-modules, counts the results and
> >>> checks them against expectations.  It uses this formula to test most
> >>> of the control grammar, including the new class keyword.
> >>>
> >>> B. test_dynamic_debug_submod.ko
> >>>
> >>> This alters the test-module to build both parent & _submod ko's, with
> >>> _DEFINE and _USE inside #if/#else blocks.  This recap's DRM's 2 module
> >>> failure scenario, allowing A to exersize several cases.
> >>>
> >>> The #if/#else puts the 2 macro uses together for clarity, and gives
> >>> the 2 modules identical sets of debugs.
> >>>
> >>> Recent DRM-CI tests are here:
> >>>https://patchwork.freedesktop.org/series/139147/
> >>>
> >>> Previous rev:
> >>>
> >>> https://lore.kernel.org/lkml/20240716185806.1572048-1-jim.cro...@gmail.com/
> >>>
> >>> Noteworthy Additions:
> >>>
> >>> 1- drop class "protection" special case, per JBaron's preference.
> >>> only current use is marked BROKEN so nobody to affect.
> >>> now framed as policy-choice:
> >>> #define ddebug_client_module_protects_classes() false
> >>> subsystems wanting protection can change this.
> >>>
> >>> 2- compile-time arg-tests in DYNDBG_CLASSMAP_DEFINE
> >>> implement several required constraints, and fail obviously.
> >>>
> >>> 3- modprobe time check of conflicting class-id reservations
> >>> only affects 2+classmaps users.
> >>> compile-time solution not apparent.
> >>>
> >>> 4- dyndbg can now cause modprobe to fail.
> >>> needed to catch 3.
> >>> maybe some loose ends here on failure.
> >>>
> >>> 5- refactor & rename ddebug_attach_*module_classes
> >>> reduce repetetive boilerplate on 2 types: maps, users.
> >>> rework mostly brought forward in patchset to reduce churn
> >>> TBD: maybe squash more.
> >>>
> >>> Several recent trybot submissions (against drm-tip) have been passing
> >>> CI.BAT, and failing one or few CI.IGT tests randomly; re-tests do not
> >>> reliably repeat the failures.
> >>>
> >>> its also at github.com:jimc/linux.git
> >>>dd-fix-9[st]-ontip  &  dd-fix-9-13
> >>>
> >>> I've been running it on my desktop w/o issues.
> >>>
> >>> The drivers/gpu/drm patches are RFC, I think there might be a single
> >>> place to call DRM_CLASSMAP_USE(drm_debug_classes) to replace the
> >>> sprinkling of _USEs in drivers and helpers.  IIRC, I tried adding a
> >>> _DEFINE into drm_drv.c, that didn't do it, so I punted for now.
> >>>
> >>> I think the dyndbg core additions are ready for review and merging
> >>> into a (next-next) test/integration tree.
> >>
> >> So whose tree should this go through?
> >
> > I'm trying to get some drm folks to review/test this, but thus far not
> > much success :-/ I think it's good stuff, but I'm somewhat hesitant if no
>
> I tested the VKMS driver with this, and it works!
>
> Tested-by:

[PATCH] drm/amdgpu: Add debug masks for HDCP LC FW testing

2025-03-13 Thread Dominik Kaszewski
HDCP Locality Check is being moved to FW, add debug flags to control
its behavior in existing hardware for validation purposes.
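
[Editorial sketch] Assuming these bits end up consumed through amdgpu's existing dcdebugmask module parameter like the other DC_DEBUG_MASK entries (an assumption, not stated in the patch), validation could combine them as:

```shell
# Force the HDCP LC FW path (0x8) plus SW fallback on failure (0x10):
# 0x8 | 0x10 = 0x18.  Assumes the new bits are wired to dcdebugmask.
sudo modprobe amdgpu dcdebugmask=0x18
```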

Signed-off-by: Dominik Kaszewski 
---
 drivers/gpu/drm/amd/include/amd_shared.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/amd/include/amd_shared.h 
b/drivers/gpu/drm/amd/include/amd_shared.h
index 485b713cfad0..4c95b885d1d0 100644
--- a/drivers/gpu/drm/amd/include/amd_shared.h
+++ b/drivers/gpu/drm/amd/include/amd_shared.h
@@ -358,6 +358,18 @@ enum DC_DEBUG_MASK {
 * @DC_DISABLE_CUSTOM_BRIGHTNESS_CURVE: If set, disable support for 
custom brightness curves
 */
DC_DISABLE_CUSTOM_BRIGHTNESS_CURVE = 0x4,
+
+   /**
+* @DC_HDCP_LC_FORCE_FW_ENABLE: If set, use HDCP Locality Check FW
+* path regardless of reported HW capabilities.
+*/
+   DC_HDCP_LC_FORCE_FW_ENABLE = 0x8,
+
+   /**
+* @DC_HDCP_LC_ENABLE_SW_FALLBACK: If set, upon HDCP Locality Check FW
+* path failure, retry using legacy SW path.
+*/
+   DC_HDCP_LC_ENABLE_SW_FALLBACK = 0x10,
 };
 
 enum amd_dpm_forced_level;
-- 
2.43.0



RE: [PATCH] drm/amdgpu: Enable ACA by default for psp v13_0_6/v13_0_14

2025-03-13 Thread Zhang, Hawking
[AMD Official Use Only - AMD Internal Distribution Only]

Reviewed-by: Hawking Zhang 

Regards,
Hawking
-Original Message-
From: Liu, Xiang(Dean) 
Sent: Thursday, March 13, 2025 15:17
To: amd-gfx@lists.freedesktop.org
Cc: Zhang, Hawking ; Zhou1, Tao ; 
Liu, Xiang(Dean) 
Subject: [PATCH] drm/amdgpu: Enable ACA by default for psp v13_0_6/v13_0_14

Enable ACA by default for psp v13_0_6/v13_0_14.

Signed-off-by: Xiang Liu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 7cf8a3036828..cfec29835634 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -3785,9 +3785,11 @@ static void amdgpu_ras_check_supported(struct 
amdgpu_device *adev)
adev->ras_enabled = amdgpu_ras_enable == 0 ? 0 :
adev->ras_hw_enabled & amdgpu_ras_mask;

-   /* aca is disabled by default except for psp v13_0_12 */
+   /* aca is disabled by default except for psp v13_0_6/v13_0_12/v13_0_14 
*/
adev->aca.is_enabled =
-   (amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 12));
+   (amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 6) ||
+amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 12) 
||
+amdgpu_ip_version(adev, MP0_HWIP, 0) == IP_VERSION(13, 0, 14));

/* bad page feature is not applicable to specific app platform */
if (adev->gmc.is_app_apu &&
--
2.34.1



Re: [PATCH 1/2] drm: Create an app info option for wedge events

2025-03-13 Thread André Almeida

Em 12/03/2025 07:06, Raag Jadav escreveu:

On Tue, Mar 11, 2025 at 07:09:45PM +0200, Raag Jadav wrote:

On Mon, Mar 10, 2025 at 06:27:53PM -0300, André Almeida wrote:

Em 01/03/2025 02:53, Raag Jadav escreveu:

On Fri, Feb 28, 2025 at 06:54:12PM -0300, André Almeida wrote:

Hi Raag,

On 2/28/25 11:20, Raag Jadav wrote:

Cc: Lucas

On Fri, Feb 28, 2025 at 09:13:52AM -0300, André Almeida wrote:

When a device gets wedged, it might be caused by a guilty application.
For userspace, knowing which app was the cause can be useful in some
situations, like implementing a policy, logging, or giving the
compositor a chance to let the user know which app caused the problem.
This is an optional argument; when `PID=-1` there is no information
about which app caused the problem, or whether any app was involved
during the hang.

Sometimes just the PID isn't enough, given that the app might already
be dead by the time userspace tries to look up that PID's name, so to
make life easier also report the app's name in the user event.

Signed-off-by: André Almeida 


[...]


len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
@@ -562,6 +564,14 @@ int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
drm_info(dev, "device wedged, %s\n", method == DRM_WEDGE_RECOVERY_NONE ?
 "but recovered through reset" : "needs recovery");
+   if (info) {
+   snprintf(pid_string, sizeof(pid_string), "PID=%u", info->pid);
+   snprintf(comm_string, sizeof(comm_string), "APP=%s", info->comm);
+   } else {
+   snprintf(pid_string, sizeof(pid_string), "%s", "PID=-1");
+   snprintf(comm_string, sizeof(comm_string), "%s", "APP=none");
+   }

This is not much use for wedge cases that need recovery, since at that point
the userspace will need to clean house anyway.

Which leaves us with only 'none' case and perhaps the need for standardization
of "optional telemetry collection".

Thoughts?


I had the feeling that 'none' was already meant to be used for that. Do you
think we should move to a different name? Given that we haven't reached the
merge window yet, we could potentially change the name without much damage.


No, I meant thoughts on possible telemetry data that the drivers might
think is useful for userspace (along with PID) and can be presented in
a vendor agnostic manner (just like wedged event).


I'm not sure I agree that this will only be used for telemetry and for the
`none` use case. As stated by Xaver, there's a use case for knowing which app
caused the device to get wedged (like switching to software rendering) and
for displaying something to the user after the recovery is done (e.g. "The game
 stopped working and Plasma has reset").


Sure, but since this information is already available in coredump, I was
hoping to have something like a standardized DRM-level coredump with both
vendor-specific and agnostic sections, which drivers can use (and hopefully
transition to) in conjunction with the wedged event to provide wider
telemetry, and which is useful for all wedge cases.


This is more useful because,

1. It gives drivers an opportunity to present the telemetry that they are
interested in without needing to add a new event string (like PID or APP)
for their case.

2. When we consider wedging as a use case, there's a lot more that goes
into it than an application that might be behaving strangely. So wider
telemetry is what I would hope to look at in such a scenario.



I agree that coredump is the way to go for telemetry; we already have 
the name and PID of the guilty app there, along with more information 
about the GPU state. But I don't think it should be consumed like a 
uAPI. Even if we wire up some common DRM code for that, I don't think 
we can guarantee its stability the way we can for a uevent. coredump 
can be disabled, and by default it is only accessible by root.


So I think that coredump is really good after the fact, when the user 
is willing to go ahead and report a bug somewhere. But for the goal of 
notifying the compositor, the same uevent that the compositor is already 
listening to will have everything it needs to deal with this reset.