First of all, I can't complain about the reliability of the hardware GPU reset. It's mostly the kernel driver that happens to run into a deadlock at the same time.

Regarding the issue with fences, the problem is that the GPU reset completes successfully according to dmesg, but X doesn't respond. I can move the cursor on the screen, but I can't do anything else and the UI is frozen. gdb says that X is stuck in GEM_WAIT_IDLE. I can easily reproduce this, because it's the most common reason why a GPU lockup leads to a frozen X. The GPU actually recovers, but X is hung. I can't tell whether the fences are just not signalled or whether there is actually a real CPU deadlock I can't see.

This patch makes the problem go away and GPU resets are successful (except for extreme cases, see below). With a small enough lockup timeout, the lockups are just a minor annoyance and I thought I could get through a piglit run just with a few tens or hundreds of GPU resets...

A different type of deadlock showed up, though it needs a lot of concurrently-running apps like piglit. What happened is that the kernel driver was stuck/deadlocked in radeon_cs_ioctl, presumably due to a GPU hang, while holding onto the exclusive lock, and another thread wanting to do the GPU reset was unable to acquire the lock.

That said, I will use the patch locally, because it helps a lot. I got a few lockups while writing this email and I'm glad I didn't have to reboot.

Marek

On Wed, Oct 2, 2013 at 4:50 PM, Christian König <deathsim...@vodafone.de> wrote:
> Possible, but I would rather guess that this doesn't work because the IB test
> runs into a deadlock situation and so the GPU reset never fully completes.
>
> Can you reproduce the problem?
>
> If you want to make GPU resets more reliable I would rather suggest to remove
> the ring lock dependency.
> Then we should try to give all the fence wait functions a (reliable) timeout
> and move reset handling a layer up into the ioctl functions. But for this you
> need to rip out the old PM code first.
>
> Christian.
>
> Marek Olšák <mar...@gmail.com> wrote:
>
>>I'm afraid signalling the fences with an IB test is not reliable.
>>
>>Marek
>>
>>On Wed, Oct 2, 2013 at 3:52 PM, Christian König <deathsim...@vodafone.de>
>>wrote:
>>> NAK, after recovering from a lockup the first thing we do is signalling all
>>> remaining fences with an IB test.
>>>
>>> If we don't recover we indeed signal all fences manually.
>>>
>>> Signalling all fences regardless of the outcome of the reset creates
>>> problems with both types of partial resets.
>>>
>>> Christian.
>>>
>>> Marek Olšák <mar...@gmail.com> wrote:
>>>
>>>>From: Marek Olšák <marek.ol...@amd.com>
>>>>
>>>>After a lockup, fences are not signalled sometimes, causing
>>>>the GEM_WAIT_IDLE ioctl to never return, which sometimes results
>>>>in an X server freeze.
>>>>
>>>>This fixes only one of many deadlocks which can occur during a lockup.
>>>>
>>>>Signed-off-by: Marek Olšák <marek.ol...@amd.com>
>>>>---
>>>> drivers/gpu/drm/radeon/radeon_device.c | 5 +++++
>>>> 1 file changed, 5 insertions(+)
>>>>
>>>>diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
>>>>index 841d0e0..7b97baa 100644
>>>>--- a/drivers/gpu/drm/radeon/radeon_device.c
>>>>+++ b/drivers/gpu/drm/radeon/radeon_device.c
>>>>@@ -1552,6 +1552,11 @@ int radeon_gpu_reset(struct radeon_device *rdev)
>>>> 	radeon_save_bios_scratch_regs(rdev);
>>>> 	/* block TTM */
>>>> 	resched = ttm_bo_lock_delayed_workqueue(&rdev->mman.bdev);
>>>>+
>>>>+	mutex_lock(&rdev->ring_lock);
>>>>+	radeon_fence_driver_force_completion(rdev);
>>>>+	mutex_unlock(&rdev->ring_lock);
>>>>+
>>>> 	radeon_pm_suspend(rdev);
>>>> 	radeon_suspend(rdev);
>>>>
>>>>--
>>>>1.8.1.2
>>>>
>>>>_______________________________________________
>>>>dri-devel mailing list
>>>>dri-devel@lists.freedesktop.org
>>>>http://lists.freedesktop.org/mailman/listinfo/dri-devel
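
The thread turns on two ideas: forcing every outstanding fence to complete (what the patch does before the reset) and giving fence waits a reliable timeout (what Christian suggests). The following is only a user-space toy, not radeon code, meant to show how the two interact. Every name in it (toy_fence_driver, toy_fence_emit, toy_fence_wait_timeout, toy_fence_force_completion) is hypothetical. It assumes fences are plain increasing sequence numbers and uses a polling loop where a kernel driver would sleep on a wait queue.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical user-space stand-in for a per-ring fence driver. */
struct toy_fence_driver {
	_Atomic uint64_t emitted;   /* last sequence number handed out */
	_Atomic uint64_t signaled;  /* last sequence number known complete */
};

/* Hand out the next fence sequence number. */
uint64_t toy_fence_emit(struct toy_fence_driver *drv)
{
	return atomic_fetch_add(&drv->emitted, 1) + 1;
}

/* A fence is signaled once the completed counter has reached it. */
bool toy_fence_signaled(struct toy_fence_driver *drv, uint64_t seq)
{
	return atomic_load(&drv->signaled) >= seq;
}

/* Monotonic clock in milliseconds, unaffected by wall-clock changes. */
static int64_t toy_now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Bounded wait: returns true if the fence signaled, false on timeout.
 * The caller always gets control back, so it can decide to trigger a
 * reset instead of blocking forever.
 */
bool toy_fence_wait_timeout(struct toy_fence_driver *drv, uint64_t seq,
			    int timeout_ms)
{
	const struct timespec poll = { .tv_sec = 0, .tv_nsec = 1000000 };
	int64_t deadline = toy_now_ms() + timeout_ms;

	while (!toy_fence_signaled(drv, seq)) {
		if (toy_now_ms() >= deadline)
			return false;
		nanosleep(&poll, NULL); /* sleep about 1 ms between polls */
	}
	return true;
}

/*
 * Forced completion: mark everything emitted so far as done, so that
 * no waiter stays blocked on work the hung GPU will never finish.
 */
void toy_fence_force_completion(struct toy_fence_driver *drv)
{
	atomic_store(&drv->signaled, atomic_load(&drv->emitted));
}
```

In this toy, a waiter on a fence that never completes returns false after the timeout rather than hanging, and any wait issued after toy_fence_force_completion() returns true immediately. It says nothing about the partial-reset concerns raised in the NAK, which depend on real driver state the sketch does not model.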