On Wed, Mar 12, 2025 at 09:25:08AM +0100, Christian König wrote:
> On 11.03.25 at 18:13, Raag Jadav wrote:
>> On Mon, Mar 10, 2025 at 06:03:27PM -0400, Alex Deucher wrote:
>>> On Mon, Mar 10, 2025 at 5:54 PM André Almeida <andrealm...@igalia.com> wrote:
>>>> On 01/03/2025 03:04, Raag Jadav wrote:
>>>>> On Fri, Feb 28, 2025 at 06:49:43PM -0300, André Almeida wrote:
>>>>>> Hi Raag,
>>>>>>
>>>>>> On 2/28/25 11:58, Raag Jadav wrote:
>>>>>>> On Fri, Feb 28, 2025 at 09:13:53AM -0300, André Almeida wrote:
>>>>>>>> To notify userspace about which app (if any) made the device get into a
>>>>>>>> wedged state, make use of the drm_wedge_app_info parameter, filling it
>>>>>>>> with the app PID and name.
>>>>>>>>
>>>>>>>> Signed-off-by: André Almeida <andrealm...@igalia.com>
>>>>>>>> ---
>>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +++++++++++++++++--
>>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  6 +++++-
>>>>>>>>    2 files changed, 22 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 00b9b87dafd8..e06adf6f34fd 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -6123,8 +6123,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>            atomic_set(&adev->reset_domain->reset_res, r);
>>>>>>>> -  if (!r)
>>>>>>>> -          drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE, NULL);
>>>>>>>> +  if (!r) {
>>>>>>>> +          struct drm_wedge_app_info aux, *info = NULL;
>>>>>>>> +
>>>>>>>> +          if (job) {
>>>>>>>> +                  struct amdgpu_task_info *ti;
>>>>>>>> +
>>>>>>>> +                  ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
>>>>>>>> +                  if (ti) {
>>>>>>>> +                          aux.pid = ti->pid;
>>>>>>>> +                          aux.comm = ti->process_name;
>>>>>>>> +                          info = &aux;
>>>>>>>> +                          amdgpu_vm_put_task_info(ti);
>>>>>>>> +                  }
>>>>>>>> +          }
>>>>>>> Is this guaranteed to be the guilty app and not some scheduled worker?
>>>>>> This is how amdgpu decides which app is the guilty one earlier in the
>>>>>> code, as in the print:
>>>>>>
>>>>>>      ti = amdgpu_vm_get_task_info_pasid(ring->adev, job->pasid);
>>>>>>
>>>>>>      "Process information: process %s pid %d thread %s pid %d\n"
>>>>>>
>>>>>> So I think it's consistent with what the driver thinks is the guilty
>>>>>> process.
>>>>> Sure, but with something like app_info we're kind of hinting to userspace
>>>>> that an application was _indeed_ involved with reset. Is that also 
>>>>> guaranteed?
>>>>>
>>>>> Is it possible that an application needlessly suffers from a false positive
>>>>> (a reset due to other factors)?
>>>>>
>>>> I asked Alex Deucher on IRC about that and yes, there's a chance that
>>>> this is a false positive. However, in the majority of cases it is the
>>>> right app, the one that caused the hang. This is also what amdgpu does
>>>> for GL robustness and for devcoredump, so it's consistent with how
>>>> amdgpu deals with this scenario even if the mechanism is still not perfect.
>>> It's usually the guilty one, but it's not guaranteed.  For example,
>>> say you have a ROCm user queue and a gfx job submitted to a kernel
>>> queue.  The actual guilty job may be the ROCm user queue, but the
>>> driver may not detect that the ROCm queue was hung until some other
>>> event (e.g., memory pressure).  However, the timer for the gfx job may
>>> time out before that happens on the ROCm queue, so in that case the gfx
>>> job would be incorrectly considered guilty.
>> So it boils down to the chances of that happening and whether they're
>> significant enough to open the door to API abuse.
> 
> We are also working on an enforce_isolation parameter for amdgpu which only 
> allows one application at a time to use the hardware, at the cost of 
> performance.
> 
> The problem is simply that when you don't allow multiple applications to use 
> the HW at the same time you also don't get full utilization of the different 
> HW blocks.
> 
> It can also be that a crash only happens because two applications do 
> something at the same time which is not supposed to happen concurrently. 
> Those issues are then usually fixed by firmware updates, but are really 
> hard to debug.
> 
> I don't see much potential for abuse here since you can't easily control from 
> userspace when a lockup happens. And the AMD Linux team has recently raised 
> quite a few requirements with the HW/FW engineers, which should improve the 
> situation on future HW generations.
> 
> So I think we should probably just document that the reliability of this 
> information is driver and hardware specific and should be taken with a grain 
> of salt, and call it a day.

Sounds reasonable. The new event string will need its own chapter anyway.
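
For context, below is a rough sketch of what I'd expect a userspace consumer
to do with the event once it is documented. This is only an illustration on my
side: the key names (WEDGED=, PID=, TASK=) are placeholders I made up and would
need to match whatever the documentation ends up defining.

/*
 * Minimal sketch of a userspace listener for the wedged uevent.
 * The key names checked below are illustrative assumptions, not a defined ABI.
 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>

int main(void)
{
	struct sockaddr_nl addr = {
		.nl_family = AF_NETLINK,
		.nl_groups = 1,		/* kernel uevent multicast group */
	};
	char buf[4096];
	int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);

	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("uevent socket");
		return 1;
	}

	for (;;) {
		ssize_t len = recv(fd, buf, sizeof(buf) - 1, 0);

		if (len <= 0)
			continue;
		buf[len] = '\0';

		/* uevents are NUL-separated KEY=value pairs */
		for (char *p = buf; p < buf + len; p += strlen(p) + 1) {
			if (!strncmp(p, "WEDGED=", 7) ||
			    !strncmp(p, "PID=", 4) ||
			    !strncmp(p, "TASK=", 5))
				printf("%s\n", p);
		}
	}
}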

Raag
