Re: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

Marek Olšák Wed, 02 Oct 2013 18:09:05 -0700

First of all, I can't complain about the reliability of the hardware
GPU reset. It's mostly the kernel driver that happens to run into a
deadlock at the same time.

Regarding the issue with fences, the problem is that the GPU reset
completes successfully according to dmesg, but X doesn't respond. I
can move the cursor on the screen, but I can't do anything else and
the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can
easily reproduce this, because it's the most common reason why a GPU
lockup leads to frozen X. The GPU actually recovers, but X is hung. I
can't tell whether the fences are just not signalled or whether there
is actually a real CPU deadlock I can't see.

This patch makes the problem go away and GPU resets are successful
(except for extreme cases, see below). With a small enough lockup
timeout, the lockups are just a minor annoyance and I thought I could
get through a piglit run just with a few tens or hundreds of GPU
resets...

A different type of deadlock showed up, though it needs a lot of
concurrently-running apps like piglit. What happened is that the
kernel driver was stuck/deadlocked in radeon_cs_ioctl presumably due
to a GPU hang while holding onto the exclusive lock, and another
thread wanting to do the GPU reset was unable to acquire the lock.

That said, I will use the patch locally, because it helps a lot. I got
a few lockups while writing this email and I'm glad I didn't have to
reboot.

Marek

On Wed, Oct 2, 2013 at 4:50 PM, Christian König <deathsim...@vodafone.de> wrote:
> Possible, but I would rather guess that this doesn't work because the IB test 
> runs into a deadlock situation and so the GPU reset never fully completes.
>
> Can you reproduce the problem?
>
> If you want to make GPU resets more reliable I would rather suggest to remove 
> the ring lock dependency.
> Then we should try to give all the fence wait functions a (reliable) timeout 
> and move reset handling a layer up into the ioctl functions. But for this you 
> need to rip out the old PM code first.
>
> Christian.
>
> Marek Olšák <mar...@gmail.com> schrieb:
>
>>I'm afraid signalling the fences with an IB test is not reliable.
>>
>>Marek
>>
>>On Wed, Oct 2, 2013 at 3:52 PM, Christian König <deathsim...@vodafone.de> 
>>wrote:
>>> NAK, after recovering from a lockup the first thing we do is signalling all 
>>> remaining fences with an IB test.
>>>
>>> If we don't recover we indeed signal all fences manually.
>>>
>>> Signalling all fences regardless of the outcome of the reset creates 
>>> problems with both types of partial resets.
>>>
>>> Christian.
>>>
>>> Marek Olšák <mar...@gmail.com> schrieb:
>>>
>>>>From: Marek Olšák <marek.ol...@amd.com>
>>>>
>>>>After a lockup, fences are not signalled sometimes, causing
>>>>the GEM_WAIT_IDLE ioctl to never return, which sometimes results
>>>>in an X server freeze.
>>>>
>>>>This fixes only one of many deadlocks which can occur during a lockup.
>>>>
>>>>Signed-off-by: Marek Olšák <marek.ol...@amd.com>
>>>>---
>>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++
>>>> 1 file changed, 5 insertions(+)
>>>>
>>>>diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
>>>>b/drivers/gpu/drm/radeon/radeon_device.c
>>>>index 841d0e0..7b97baa 100644
>>>>--- a/drivers/gpu/drm/radeon/radeon_device.c
>>>>+++ b/drivers/gpu/drm/radeon/radeon_device.c
>>>>@@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device *rdev)
>>>>       radeon_save_bios_scratch_regs(rdev);
>>>>       /* block TTM */
>>>>       resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
>>>>+
>>>>+      mutex_lock(&rdev->ring_lock);
>>>>+      radeon_fence_driver_force_completion(rdev);
>>>>+      mutex_unlock(&rdev->ring_lock);
>>>>+
>>>>       radeon_pm_suspend(rdev);
>>>>       radeon_suspend(rdev);
>>>>
>>>>--
>>>>1.8.1.2
>>>>
>>>>_______________________________________________
>>>>dri-devel mailing list
>>>>dri-devel@lists.freedesktop.org
>>>>http://lists.freedesktop.org/mailman/listinfo/dri-devel
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH] drm/radeon: signal all fences after lockup to avoid endless waiting in GEM_WAIT

Reply via email to