[PATCH] drm/amdgpu: add to custom amdgpu_drm_release drm_dev_enter/exit

2025-08-07 Thread vitaly.prosyak
From: Vitaly Prosyak User queues are disabled before GEM objects are released (protecting against user app crashes). No races with PCI hot-unplug (because drm_dev_enter prevents cleanup if device is being removed). Cc: Christian König Cc: Alex Deucher Signed-off-by: Vitaly Prosyak --- driver

[PATCH] drm/amdgpu: add to custom amdgpu_drm_release drm_dev_enter/exit

2025-08-07 Thread vitaly.prosyak
From: Vitaly Prosyak User queues are disabled before GEM objects are released (protecting against user app crashes). No races with PCI hot-unplug (because drm_dev_enter prevents cleanup if device is being removed). Signed-off-by: Vitaly Prosyak --- drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 5 +

[PATCH 2/2] drm/amdgpu: fix use-after-free in amdgpu_userq_suspend+0x51a/0x5a0

2025-07-02 Thread vitaly.prosyak
From: Vitaly Prosyak [ +0.20] BUG: KASAN: slab-use-after-free in amdgpu_userq_suspend+0x51a/0x5a0 [amdgpu] [ +0.000817] Read of size 8 at addr 88812eec8c58 by task amd_pci_unplug/1733 [ +0.27] CPU: 10 UID: 0 PID: 1733 Comm: amd_pci_unplug Tainted: GW 6.14.0+ #2

[PATCH 1/2] Revert "drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini+0x70c"

2025-07-02 Thread vitaly.prosyak
From: Vitaly Prosyak This reverts commit 0203ef5eb3b2a3a10dd31bac8fc2fa3b439cbb09. The original patch moved `amdgpu_userq_mgr_fini()` to the driver's `postclose` callback, which is called after `drm_gem_release()` in the DRM file cleanup sequence. If a user application crashes or aborts without

[PATCH] drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini+0x70c

2025-06-18 Thread vitaly.prosyak
From: Vitaly Prosyak The issue was reproduced on NV10 using IGT pci_unplug test. It is expected that `amdgpu_driver_postclose_kms()` is called prior to `amdgpu_drm_release()`. However, the bug is that `amdgpu_fpriv` was freed in `amdgpu_driver_postclose_kms()`, and then later accessed in `amdgp

[PATCH] drm/amdgpu: fix slab-use-after-free in amdgpu_userq_mgr_fini+0x70c

2025-06-18 Thread vitaly.prosyak
From: Vitaly Prosyak The issue was reproduced on NV10 using IGT pci_unplug test. It is expected that `amdgpu_driver_postclose_kms()` is called prior to `amdgpu_drm_release()`. However, the bug is that `amdgpu_fpriv` was freed in `amdgpu_driver_postclose_kms()`, and then later accessed in `amdgp

[PATCH] drm/amdgpu/gfx10: Update cleaner shader for GFX10.1.10

2025-05-06 Thread vitaly.prosyak
From: Vitaly Prosyak The cleaner shader is a piece of GPU code that is used to clear or initialize certain GPU resources, such as Local Data Share (LDS), Vector General Purpose Registers (VGPRs), and Scalar General Purpose Registers (SGPRs). Cc: Christian König Cc: Alex Deucher Signed-off-by:

[PATCH] drm/amdgpu: fix usage slab after free

2024-11-13 Thread vitaly.prosyak
From: Vitaly Prosyak [ +0.21] BUG: KASAN: slab-use-after-free in drm_sched_entity_flush+0x6cb/0x7a0 [gpu_sched] [ +0.27] Read of size 8 at addr 8881b8605f88 by task amd_pci_unplug/2147 [ +0.23] CPU: 6 PID: 2147 Comm: amd_pci_unplug Not tainted 6.10.0+ #1 [ +0.16] Hardwa

[PATCH] drm/amdgpu: WIP fix usage slab after free

2024-11-13 Thread vitaly.prosyak
From: Vitaly Prosyak [ +0.21] BUG: KASAN: slab-use-after-free in drm_sched_entity_flush+0x6cb/0x7a0 [gpu_sched] [ +0.27] Read of size 8 at addr 8881b8605f88 by task amd_pci_unplug/2147 [ +0.23] CPU: 6 PID: 2147 Comm: amd_pci_unplug Not tainted 6.10.0+ #1 [ +0.16] Hardwa

[PATCH] drm/amdgpu: WIP fix usage slab after free

2024-11-13 Thread vitaly.prosyak
From: Vitaly Prosyak [ +0.21] BUG: KASAN: slab-use-after-free in drm_sched_entity_flush+0x6cb/0x7a0 [gpu_sched] [ +0.27] Read of size 8 at addr 8881b8605f88 by task amd_pci_unplug/2147 [ +0.23] CPU: 6 PID: 2147 Comm: amd_pci_unplug Not tainted 6.10.0+ #1 [ +0.16] Hardwa

[PATCH] drm/sched: Add error code parameter to drm_sched_start

2024-07-25 Thread vitaly.prosyak
From: Vitaly Prosyak The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for each job with a NULL parent in the pending list, making it difficult to disti

[PATCH 1/2] drm/sched: Add error code parameter to drm_sched_start

2024-07-24 Thread vitaly.prosyak
From: Vitaly Prosyak The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for each job with a NULL parent in the pending list, making it difficult to disti

[PATCH] drm/sched: Add error code parameter to drm_sched_start

2024-07-24 Thread vitaly.prosyak
From: Vitaly Prosyak The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for each job with a NULL parent in the pending list, making it difficult to disti

[PATCH] drm/amdgpu: Add error parameter to amdgpu_fence_driver_force_completion

2024-07-24 Thread vitaly.prosyak
From: Vitaly Prosyak In the case of a queue reset, we need the ability to customize the error code from -ECANCELED to -ENODATA for scenarios where the queue reset is successful. It was decided to use -ECANCELED for GPU reset cases and -ENODATA for queue reset cases. This change introduces an erro

[PATCH] drm/amdkfd: fix NULL pointer dereference

2024-04-13 Thread vitaly.prosyak
From: Vitaly Prosyak [ +0.006038] BUG: kernel NULL pointer dereference, address: 0028 [ +0.006969] #PF: supervisor read access in kernel mode [ +0.005139] #PF: error_code(0x) - not-present page [ +0.005139] PGD 0 P4D 0 [ +0.002530] Oops: [#1] PREEMPT SMP NOPTI [ +0.0043

[PATCH] drm/sched: fix null-ptr-deref in init entity

2024-03-14 Thread vitaly.prosyak
From: Vitaly Prosyak The bug can be triggered by sending an amdgpu_cs_wait_ioctl to the AMDGPU DRM driver on any ASICs with valid context. The bug was reported by Joonkyo Jung. For example, the following code: static void Syzkaller2(int fd) { union drm_amdgpu_ctx arg1; un

[PATCH] drm/sched: fix null-ptr-deref in init entity

2024-03-13 Thread vitaly.prosyak
From: Vitaly Prosyak The bug can be triggered by sending an amdgpu_cs_wait_ioctl to the AMDGPU DRM driver on any ASICs with valid context. The bug was reported by Joonkyo Jung. For example, the following code: static void Syzkaller2(int fd) { union drm_amdgpu_ctx arg1; un

[PATCH] drm/scheduler: fix null-ptr-deref in init entity

2024-03-13 Thread vitaly.prosyak
From: Vitaly Prosyak The bug can be triggered by sending an amdgpu_cs_wait_ioctl to the AMDGPU DRM driver on any ASICs with valid context. The bug was reported by Joonkyo Jung. For example, the following code: static void Syzkaller2(int fd) { union drm_amdgpu_ctx arg1; un

[PATCH] drm/amdgpu: fix use-after-free bug

2024-03-07 Thread vitaly.prosyak
From: Vitaly Prosyak The bug can be triggered by sending a single amdgpu_gem_userptr_ioctl to the AMDGPU DRM driver on any ASICs with an invalid address and size. The bug was reported by Joonkyo Jung. For example, the following code: static void Syzkaller1(int fd) { struct drm_amdgpu_gem

[PATCH] drm/amdgpu: fix use-after-free bug

2024-03-07 Thread vitaly.prosyak
From: Vitaly Prosyak The bug can be triggered by sending a single amdgpu_gem_userptr_ioctl to the AMDGPU DRM driver on any ASICs with an invalid address and size. The bug was reported by Joonkyo Jung. For example, the following code: static void Syzkaller1(int fd) { struct drm_amdgpu_gem

[PATCH] drm/amdgpu: fix use-after-free bug

2024-03-06 Thread vitaly.prosyak
From: Vitaly Prosyak The bug can be triggered by sending a single amdgpu_gem_userptr_ioctl to the AMDGPU DRM driver on any ASICs with an invalid address and size. The bug was reported by Joonkyo Jung. For example, the following code: static void Syzkaller1(int fd) { struct drm_amdgpu_gem

[PATCH] drm/amdgpu: check flag ring->no_scheduler before usage

2024-01-20 Thread vitaly.prosyak
From: Vitaly Prosyak The issue started to appear after the following commit 11b3b9f461c5c4f700f6c8da202fcc2fd6418e1f (scheduler to variable number of run-queues). The scheduler flag ready (ring->sched.ready) could not be used to validate multiple scenarios, for example, check job is running

[PATCH] drm/amdgpu: fix software pci_unplug on some chips

2023-10-11 Thread vitaly.prosyak
From: Vitaly Prosyak When a software 'pci unplug' is executed using IGT, the sysfs directory entry is NULL for different RAS blocks such as hdp, umc, etc. Before calling 'sysfs_remove_file_from_group' and 'sysfs_remove_group', check that 'sd' is not NULL. [ +0.01] RIP: 0010:sysfs_remove_group+0

[PATCH] drm/sched: Check scheduler work queue before calling timeout handling

2023-05-10 Thread vitaly.prosyak
From: Vitaly Prosyak During an IGT GPU reset test we again see an oops despite commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling timeout handling). That commit uses the ready condition to decide whether to call drm_sched_fault, which unwinds the TDR and leads to a GPU reset. However it looks the ready con

[PATCH] drm/sched: Check scheduler work queue before calling timeout handling

2023-05-10 Thread vitaly.prosyak
From: Vitaly Prosyak During an IGT GPU reset test we again see an oops despite commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling timeout handling). That commit uses the ready condition to decide whether to call drm_sched_fault, which unwinds the TDR and leads to a GPU reset. However it looks the ready con

[PATCH] drm/sched: Check scheduler work queue before calling timeout handling

2023-05-09 Thread vitaly.prosyak
From: Vitaly Prosyak During an IGT GPU reset test we again see an oops despite commit 0c8c901aaaebc9bf8bf189ffc116e678f7a2dc16 (drm/sched: Check scheduler ready before calling timeout handling). That commit uses the ready condition to decide whether to call drm_sched_fault, which unwinds the TDR and leads to a GPU reset. Howeve

[PATCH 2/3] drm/amdgpu: always sending PSP messages LOAD_ASD and UNLOAD_TA

2023-01-28 Thread vitaly.prosyak
From: Vitaly Prosyak We allow sending PSP messages LOAD_ASD and UNLOAD_TA without acquiring a lock in drm_dev_enter during driver unload because we must call drm_dev_unplug at the beginning of the driver unload sequence. Added a WARNING if other PSP messages are sent without a lock. After this commit,

[PATCH 2/3] drm/amdgpu: always sending PSP messages LOAD_ASD and UNLOAD_TA

2023-01-26 Thread vitaly.prosyak
From: Vitaly Prosyak We allow sending PSP messages LOAD_ASD and UNLOAD_TA without acquiring a lock in drm_dev_enter during driver unload because we must call drm_dev_unplug at the beginning of the driver unload sequence. Added a WARNING if other PSP messages are sent without a lock. After this commit,

[PATCH 1/3] Revert "drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled"

2023-01-26 Thread vitaly.prosyak
From: Vitaly Prosyak This reverts commit fac53471d0ea9693d314aa2df08d62b2e7e3a0f8. The following change: move the drm_dev_unplug call after amdgpu_driver_unload_kms in amdgpu_pci_remove. The reason is the following: amdgpu_pci_remove calls drm_dev_unregister and it should be called first to ensur

[PATCH 2/3] drm/amdgpu: always sending PSP messages LOAD_ASD and UNLOAD_TA

2023-01-25 Thread vitaly.prosyak
From: Vitaly Prosyak We allow sending PSP messages LOAD_ASD and UNLOAD_TA without acquiring a lock in drm_dev_enter during driver unload because we must call drm_dev_unplug at the beginning of the driver unload sequence. Added a WARNING if other PSP messages are sent without a lock. After this commit,

[PATCH 2/3] drm/amdgpu: always sending PSP messages LOAD_ASD and UNLOAD_TA

2023-01-25 Thread vitaly.prosyak
From: Vitaly Prosyak We allow sending PSP messages LOAD_ASD and UNLOAD_TA without acquiring a lock in drm_dev_enter during driver unload because we must call drm_dev_unplug at the beginning of the driver unload sequence. Added a WARNING if other PSP messages are sent without a lock. After this commit,

[PATCH 3/3] drm/amdgpu: use pci_dev_is_disconnected

2023-01-25 Thread vitaly.prosyak
From: Vitaly Prosyak Added a condition for pci_dev_is_disconnected and kept drm_dev_is_unplugged to check whether we should unmap MMIO. Alex suggested pci_dev_is_disconnected; Christian suggested keeping drm_dev_is_unplugged. Signed-off-by: Vitaly Prosyak Reviewed-by: Alex Deucher

[PATCH 1/3] Revert "drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled"

2023-01-25 Thread vitaly.prosyak
From: Vitaly Prosyak This reverts commit fac53471d0ea9693d314aa2df08d62b2e7e3a0f8. The following change: move the drm_dev_unplug call after amdgpu_driver_unload_kms in amdgpu_pci_remove. The reason is the following: amdgpu_pci_remove calls drm_dev_unregister and it should be called first to ensur

[PATCH 2/2] drm/amdgpu: always sending PSP messages LOAD_ASD and UNLOAD_TA

2023-01-20 Thread vitaly.prosyak
From: Vitaly Prosyak We allow sending PSP messages LOAD_ASD and UNLOAD_TA without acquiring a lock in drm_dev_enter during driver unload because we must call drm_dev_unplug at the beginning of the driver unload sequence. Added a WARNING if other PSP messages are sent without a lock. After this commit,

[PATCH 1/2] Revert "drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled"

2023-01-20 Thread vitaly.prosyak
From: Vitaly Prosyak This reverts commit fac53471d0ea9693d314aa2df08d62b2e7e3a0f8. The following change: move the drm_dev_unplug call after amdgpu_driver_unload_kms in amdgpu_pci_remove. The reason is the following: amdgpu_pci_remove calls drm_dev_unregister and it should be called first to ensur

[PATCH] drm/amdgpu: revert part of a commit fac53471d0ea9

2023-01-13 Thread vitaly.prosyak
From: Vitaly Prosyak Revert the following change: move the drm_dev_unplug call after amdgpu_driver_unload_kms in amdgpu_pci_remove. The reason is the following: amdgpu_pci_remove calls drm_dev_unregister and it should be called first to ensure userspace can't access the device instance anymore. If we cal