On Wed, Mar 12, 2025 at 09:25:08AM +0100, Christian König wrote:
> Am 11.03.25 um 18:13 schrieb Raag Jadav:
>> On Mon, Mar 10, 2025 at 06:03:27PM -0400, Alex Deucher wrote:
>>> On Mon, Mar 10, 2025 at 5:54 PM André Almeida <andrealm...@igalia.com> wrote:
>>>> Em 01/03/2025 03:04, Raag Jadav escreveu:
>>>>> On Fri, Feb 28, 2025 at 06:49:43PM -0300, André Almeida wrote:
>>>>>> Hi Raag,
>>>>>>
>>>>>> On 2/28/25 11:58, Raag Jadav wrote:
>>>>>>> On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
>>>>>>>> To notify userspace about which app (if any) made the device get in a
>>>>>>>> wedge state, make use of drm_wedge_app_info parameter, filling it with
>>>>>>>> the app PID and name.
>>>>>>>>
>>>>>>>> Signed-off-by: André Almeida <andrealm...@igalia.com>
>>>>>>>> ---
>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
>>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  6 +++++-
>>>>>>>>   2 files changed, 22 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 00b9b87dafd8..e06adf6f34fd 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>          atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>>
>>>>>>>> -       if (!r)
>>>>>>>> -               drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
>>>>>>>> +       if (!r) {
>>>>>>>> +               struct drm_wedge_app_info aux, *info = NULL;
>>>>>>>> +
>>>>>>>> +               if (job) {
>>>>>>>> +                       struct amdgpu_task_info *ti;
>>>>>>>> +
>>>>>>>> +                       ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
>>>>>>>> +                       if (ti) {
>>>>>>>> +                               aux.pid = ti->pid;
>>>>>>>> +                               aux.comm = ti->process_name;
>>>>>>>> +                               info = &aux;
>>>>>>>> +                               amdgpu_vm_put_task_info(ti);
>>>>>>>> +                       }
>>>>>>>> +               }
>>>>>>>
>>>>>>> Is this guaranteed to be the guilty app and not some scheduled worker?
>>>>>>
>>>>>> This is how amdgpu decides which app is the guilty one earlier in the
>>>>>> code, as in the print:
>>>>>>
>>>>>> ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);
>>>>>>
>>>>>> "Process information: process %s pid %d thread %s pid %d\n"
>>>>>>
>>>>>> So I think it's consistent with what the driver thinks is the guilty
>>>>>> process.
>>>>>
>>>>> Sure, but with something like app_info we're kind of hinting to userspace
>>>>> that an application was _indeed_ involved with reset. Is that also
>>>>> guaranteed?
>>>>>
>>>>> Is it possible that an application needlessly suffers from a false
>>>>> positive scenario (reset due to other factors)?
>>>>
>>>> I asked Alex Deucher in IRC about that and yes, there's a chance that
>>>> this is a false positive. However, for the majority of cases this is the
>>>> right app that caused the hang. This is what amdgpu is doing for GL
>>>> robustness as well and devcoredump, so it's very consistent with how
>>>> amdgpu deals with this scenario even if the mechanism is still not
>>>> perfect.
>>>
>>> It's usually the guilty one, but it's not guaranteed. For example,
>>> say you have a ROCm user queue and a gfx job submitted to a kernel
>>> queue. The actual guilty job may be the ROCm user queue, but the
>>> driver may not detect that the ROCm queue was hung until some other
>>> event (e.g., memory pressure). However, the timer for the gfx job may
>>> timeout before that happens on the ROCm queue so in that case the gfx
>>> job would be incorrectly considered guilty.
>>
>> So it boils down to what are the chances of that happening and whether
>> it's significant enough to open the door for API abuse.
>
> We are also working on an enforce_isolation parameter for amdgpu which only
> allows one application at a time to use the hardware for the cost of
> performance.
>
> The problem is simply that when you don't allow multiple applications to
> use the HW at the same time you also don't get full utilization of the
> different HW blocks.
>
> It can also be that a crash only happens because two applications do
> something at the same time which is not supposed to happen at the same
> time. Those issues are then usually fixed by firmware updates, but are
> really really hard to debug.
>
> I don't see much potential for abuse here since you can't easily control
> from userspace when a lockup happens. And the AMD Linux team has made
> quite a bunch of requirements to the HW/FW engineers recently which should
> improve the situation on future HW generations.
>
> So I think we should probably just document that reliability of this
> information is driver and hardware specific and should be taken with a
> grain of salt and call it a day.

Sounds reasonable. The new event string will need its own chapter anyway.

Raag
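[Editor's note: to make the discussion above concrete, here is a minimal sketch of how a userspace consumer might parse the wedged uevent payload. The `WEDGED=`, `PID=` and `TASK=` key names are assumptions inferred from this patch series and may not match the final ABI; as Alex and Christian note above, the reported app is only usually the guilty one, so the sketch phrases it as a suspicion rather than a verdict.]

```python
# Hypothetical sketch: parse a DRM "device wedged" kobject uevent payload
# as it might arrive over a NETLINK_KOBJECT_UEVENT socket. Payloads are
# NUL-separated KEY=VALUE entries preceded by an "action@devpath" header.
# Key names (WEDGED=, PID=, TASK=) are assumptions, not a guaranteed ABI.

def parse_uevent(payload: bytes) -> dict:
    """Split a NUL-separated uevent payload into a key/value dict."""
    fields = {}
    for entry in payload.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            fields[key.decode()] = value.decode()
    return fields

def describe_wedge(fields: dict) -> str:
    """Render a human-readable summary, treating app info as a hint only."""
    recovery = fields.get("WEDGED", "unknown")
    pid = fields.get("PID")
    task = fields.get("TASK")
    if pid and task:
        # Per the thread: the reported app is *usually* the guilty one,
        # but false positives are possible, so phrase it as a suspicion.
        return f"GPU wedged (recovery={recovery}), suspected app {task} (pid {pid})"
    return f"GPU wedged (recovery={recovery}), no app information"

sample = (b"change@/devices/virtual/drm/card0\0ACTION=change"
          b"\0WEDGED=none\0PID=1234\0TASK=vkcube")
print(describe_wedge(parse_uevent(sample)))
# -> GPU wedged (recovery=none), suspected app vkcube (pid 1234)
```

In line with Christian's suggestion, a real consumer would additionally read the documentation caveat that this information is driver and hardware specific, and avoid taking punitive action (e.g. killing the reported PID) on this hint alone.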