radeon: handle lockup in delayed work, v2

Christian König Mon, 4 Aug 2014 16:37:56 +0200

> It'a pain to deal with gpu reset.

Yeah, well that's nothing new.


> I've now tried other solutions but that would mean reverting to the old style 
> during gpu lockup recovery, and only running the delayed work when !lockup.
> But this meant that the timeout was useless to add. I think the cleanest is 
> keeping the v2 patch, because potentially any waiting code can be called 
> during lockup recovery.

The lockup code itself should never call any waiting code and V2 doesn't 
seem to handle a couple of cases correctly either.

How about moving the fence waiting out of the reset code?

Christian.

Am 04.08.2014 um 15:34 schrieb Maarten Lankhorst:
> Hey,
>
> op 04-08-14 13:57, Christian K?nig schreef:
>> Am 04.08.2014 um 10:55 schrieb Maarten Lankhorst:
>>> op 04-08-14 10:36, Christian K?nig schreef:
>>>> Hi Maarten,
>>>>
>>>> Sorry for the delay. I've got way to much todo recently.
>>>>
>>>> Am 01.08.2014 um 19:46 schrieb Maarten Lankhorst:
>>>>> On 01-08-14 18:35, Christian K?nig wrote:
>>>>>> Am 31.07.2014 um 17:33 schrieb Maarten Lankhorst:
>>>>>>> Signed-off-by: Maarten Lankhorst <maarten.lankhorst at canonical.com>
>>>>>>> ---
>>>>>>> V1 had a nasty bug breaking gpu lockup recovery. The fix is not
>>>>>>> allowing radeon_fence_driver_check_lockup to take exclusive_lock,
>>>>>>> and kill it during lockup recovery instead.
>>>>>> That looks like the delayed work starts running as soon as we submit a 
>>>>>> fence, and not when it's needed for waiting.
>>>>>>
>>>>>> Since it's a backup for failing IRQs I would rather put it into 
>>>>>> radeon_irq_kms.c and start/stop it when the IRQs are started/stoped.
>>>>> The delayed work is not just for failing irq's, it's also the handler 
>>>>> that's used to detect lockups, which is why I trigger after processing 
>>>>> fences, and reset the timer after processing.
>>>> The idea was turning the delayed work on and off when we turn the irq on 
>>>> and off as well, processing of the delayed work handler can still happen 
>>>> in radeon_fence.c
>>>>
>>>>> Specifically what happened was this scenario:
>>>>>
>>>>> - lock up occurs
>>>>> - write lock taken by gpu_reset
>>>>> - delayed work runs, tries to acquire read lock, blocks
>>>>> - gpu_reset tries to cancel delayed work synchronously
>>>>> - has to wait for delayed work to finish -> deadlock
>>>> Why do you want to disable the work item from the lockup handler in the 
>>>> first place?
>>>>
>>>> Just take the exclusive lock in the work item, when it concurrently runs 
>>>> with the lockup handler it will just block for the lockup handler to 
>>>> complete.
>>> With the delayed work radeon_fence_wait no longer handles unreliable 
>>> interrupts itself, so it has to run from the lockup handler.
>>> But an alternative solution could be adding a radeon_fence_wait_timeout, 
>>> ignore the timeout and check if fence is signaled on timeout.
>>> This would probably be a cleaner solution.
>> Yeah, agree. Manually specifying a timeout in the fence wait on lockup 
>> handling sounds like the best alternative to me.
> It'a pain to deal with gpu reset.
>
> I've now tried other solutions but that would mean reverting to the old style 
> during gpu lockup recovery, and only running the delayed work when !lockup.
> But this meant that the timeout was useless to add. I think the cleanest is 
> keeping the v2 patch, because potentially any waiting code can be called 
> during lockup recovery.
>
> ~Maarten
>

[PATCH 09/19] drm/radeon: handle lockup in delayed work, v2

Reply via email to