Am 31.01.2016 um 19:50 schrieb Matthew Dawson: > On Sunday, January 24, 2016 10:49:00 AM EST Christian König wrote: >> Am 24.01.2016 um 07:18 schrieb Matthew Dawson: >>> When the radeon driver resets a gpu, it attempts to test whether all the >>> rings can successfully handle an IB. If these rings fail to respond, the >>> process will wait forever. Another gpu reset can't happen at this point, >>> as the current reset holds a lock required to do so. Instead, make all >>> the IB tests run with a timeout, so the system can attempt to recover >>> in this case. >>> >>> While this doesn't fix the underlying issue with card resets failing, it >>> gives the system a higher chance of recovering. These timeouts have been >>> confirmed to help both a Tathi and Hawaii card recover after a gpu reset. >>> >>> I haven't been able to test this on other cards, but everything compiles. >>> I wasn't sure what to use for a timeout value, so for now I've used >>> radeon_lockup_timeout. When that is 0, the timeout functionality in this >>> patch is also disabled. >>> >>> This also adds a new function, radeon_fence_wait_timeout, that behaves >>> like >>> radeon_fence_wait except allowing a different timeout value to be passed >>> to >>> radeon_fence_wait_seq_timeout. >>> >>> Signed-off-by: Matthew Dawson <matthew at mjdsystems.ca> >> Really good idea, but please don't use radeon_lockup_timeout for this >> cause the 10 seconds default of that is way to long. Instead just use a >> fixed, let's say 100ms timeout defined in radeon.h. > I originally tried 100ms, but I found that to be too short (my system would > just lockup completely (no network even)). I found 1s is long enough, and > isn't so long to be annoying. Would that be ok?
Yeah, 1s works as well. But that 100ms shouldn't be enough sounds like another bug to me. > >> Also don't define our own radeon_fence_wait_timeout, just use >> fence_wait_timeout(&fence->base, RADEON_IB_TEST_TIMEOUT) for this. > I switched everything over to this, however the ib test always times out after > a gpu reset (on startup, everything is normal). I'm not sure why > fence_wait_timeout isn't being signaled while radeon_fence_wait_seq_timeout > is. I'm suspicious that either radeon_fence_enable_signaling is not doing > something necessary to get the fence completion to actual signal > radeon_fence_default_wait, or because we hold the exclusive lock during a > reset radeon_fence_check_lockup never runs and thus never triggers the fence > or enables some irq lines. > > Is it worth digging through the interactions here to get dma-buf fences to > work correctly during a lockup, or would it be better to just keep > radeon_fence_wait_timeout? If option 1 is preferred, I'm happy to learn but I > need some help learning how it is supposed to work. In this case feel free to add the radeon_fence_wait_timeout. It's probably a good idea to get rid of the exclusive lock sooner or later in radeon as well, but that really is a long term project. If you're really bored I can give you tons of input where to start with that. Regards, Christian. > >> Apart from that the idea looks really good to me. >> >> Regards, >> Christian. >> > Thanks, > > Matthew > >>> --- >>> >>> drivers/gpu/drm/radeon/cik_sdma.c | 3 ++- >>> drivers/gpu/drm/radeon/r100.c | 3 ++- >>> drivers/gpu/drm/radeon/r600.c | 3 ++- >>> drivers/gpu/drm/radeon/r600_dma.c | 3 ++- >>> drivers/gpu/drm/radeon/radeon.h | 1 + >>> drivers/gpu/drm/radeon/radeon_fence.c | 25 ++++++++++++++++++++++--- >>> drivers/gpu/drm/radeon/radeon_vce.c | 3 ++- >>> drivers/gpu/drm/radeon/uvd_v1_0.c | 3 ++- >>> 8 files changed, 35 insertions(+), 9 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/radeon/cik_sdma.c >>> b/drivers/gpu/drm/radeon/cik_sdma.c index d16f2ee..014f5ca 100644 >>> --- a/drivers/gpu/drm/radeon/cik_sdma.c >>> +++ b/drivers/gpu/drm/radeon/cik_sdma.c >>> @@ -737,7 +737,8 @@ int cik_sdma_ib_test(struct radeon_device *rdev, >>> struct radeon_ring *ring)> >>> DRM_ERROR("radeon: failed to schedule ib (%d).\n", r); >>> return r; >>> >>> } >>> >>> - r = radeon_fence_wait(ib.fence, false); >>> + r = radeon_fence_wait_timeout(ib.fence, false, msecs_to_jiffies( >>> + (radeon_lockup_timeout) ? radeon_lockup_timeout : >>> MAX_SCHEDULE_TIMEOUT));> >>> if (r) { >>> >>> DRM_ERROR("radeon: fence wait failed (%d).\n", r); >>> return r; >>> >>> diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c >>> index 5eae0a8..5386c09 100644 >>> --- a/drivers/gpu/drm/radeon/r100.c >>> +++ b/drivers/gpu/drm/radeon/r100.c >>> @@ -3732,7 +3732,8 @@ int r100_ib_test(struct radeon_device *rdev, struct >>> radeon_ring *ring)> >>> DRM_ERROR("radeon: failed to schedule ib (%d).\n", r); >>> goto free_ib; >>> >>> } >>> >>> - r = radeon_fence_wait(ib.fence, false); >>> + r = radeon_fence_wait_timeout(ib.fence, false, msecs_to_jiffies( >>> + (radeon_lockup_timeout) ? radeon_lockup_timeout : >>> MAX_SCHEDULE_TIMEOUT));> >>> if (r) { >>> >>> DRM_ERROR("radeon: fence wait failed (%d).\n", r); >>> goto free_ib; >>> >>> diff --git a/drivers/gpu/drm/radeon/r600.c b/drivers/gpu/drm/radeon/r600.c >>> index cc2fdf0..8fbe5d7 100644 >>> --- a/drivers/gpu/drm/radeon/r600.c >>> +++ b/drivers/gpu/drm/radeon/r600.c >>> @@ -3381,7 +3381,8 @@ int r600_ib_test(struct radeon_device *rdev, struct >>> radeon_ring *ring)> >>> DRM_ERROR("radeon: failed to schedule ib (%d).\n", r); >>> goto free_ib; >>> >>> } >>> >>> - r = radeon_fence_wait(ib.fence, false); >>> + r = radeon_fence_wait_timeout(ib.fence, false, msecs_to_jiffies( >>> + (radeon_lockup_timeout) ? radeon_lockup_timeout : >>> MAX_SCHEDULE_TIMEOUT));> >>> if (r) { >>> >>> DRM_ERROR("radeon: fence wait failed (%d).\n", r); >>> goto free_ib; >>> >>> diff --git a/drivers/gpu/drm/radeon/r600_dma.c >>> b/drivers/gpu/drm/radeon/r600_dma.c index d2dd29a..3ff614d 100644 >>> --- a/drivers/gpu/drm/radeon/r600_dma.c >>> +++ b/drivers/gpu/drm/radeon/r600_dma.c >>> @@ -368,7 +368,8 @@ int r600_dma_ib_test(struct radeon_device *rdev, >>> struct radeon_ring *ring)> >>> DRM_ERROR("radeon: failed to schedule ib (%d).\n", r); >>> return r; >>> >>> } >>> >>> - r = radeon_fence_wait(ib.fence, false); >>> + r = radeon_fence_wait_timeout(ib.fence, false, msecs_to_jiffies( >>> + (radeon_lockup_timeout) ? radeon_lockup_timeout : >>> MAX_SCHEDULE_TIMEOUT));> >>> if (r) { >>> >>> DRM_ERROR("radeon: fence wait failed (%d).\n", r); >>> return r; >>> >>> diff --git a/drivers/gpu/drm/radeon/radeon.h >>> b/drivers/gpu/drm/radeon/radeon.h index 5ae6db9..0b698b6 100644 >>> --- a/drivers/gpu/drm/radeon/radeon.h >>> +++ b/drivers/gpu/drm/radeon/radeon.h >>> @@ -382,6 +382,7 @@ void radeon_fence_driver_force_completion(struct >>> radeon_device *rdev, int ring);> >>> int radeon_fence_emit(struct radeon_device *rdev, struct radeon_fence >>> **fence, int ring); void radeon_fence_process(struct radeon_device >>> *rdev, int ring); >>> bool radeon_fence_signaled(struct radeon_fence *fence); >>> >>> +int radeon_fence_wait_timeout(struct radeon_fence *fence, bool >>> interruptible, long timeout);> >>> int radeon_fence_wait(struct radeon_fence *fence, bool interruptible); >>> int radeon_fence_wait_next(struct radeon_device *rdev, int ring); >>> int radeon_fence_wait_empty(struct radeon_device *rdev, int ring); >>> >>> diff --git a/drivers/gpu/drm/radeon/radeon_fence.c >>> b/drivers/gpu/drm/radeon/radeon_fence.c index 05815c4..9fec805 100644 >>> --- a/drivers/gpu/drm/radeon/radeon_fence.c >>> +++ b/drivers/gpu/drm/radeon/radeon_fence.c >>> @@ -527,7 +527,7 @@ static long radeon_fence_wait_seq_timeout(struct >>> radeon_device *rdev,> >>> } >>> >>> /** >>> >>> - * radeon_fence_wait - wait for a fence to signal >>> + * radeon_fence_wait_timeout - wait for a fence to signal with timeout >>> >>> * >>> * @fence: radeon fence object >>> * @intr: use interruptible sleep >>> >>> @@ -535,9 +535,10 @@ static long radeon_fence_wait_seq_timeout(struct >>> radeon_device *rdev,> >>> * Wait for the requested fence to signal (all asics). >>> * @intr selects whether to use interruptable (true) or >>> non-interruptable >>> * (false) sleep when waiting for the fence. >>> >>> + * @timeout: maximum time to wait, or MAX_SCHEDULE_TIMEOUT for infinite >>> wait> >>> * Returns 0 if the fence has passed, error for all other cases. >>> */ >>> >>> -int radeon_fence_wait(struct radeon_fence *fence, bool intr) >>> +int radeon_fence_wait_timeout(struct radeon_fence *fence, bool intr, long >>> timeout)> >>> { >>> >>> uint64_t seq[RADEON_NUM_RINGS] = {}; >>> long r; >>> >>> @@ -552,9 +553,11 @@ int radeon_fence_wait(struct radeon_fence *fence, >>> bool intr)> >>> return fence_wait(&fence->base, intr); >>> >>> seq[fence->ring] = fence->seq; >>> >>> - r = radeon_fence_wait_seq_timeout(fence->rdev, seq, intr, >>> MAX_SCHEDULE_TIMEOUT); + r = radeon_fence_wait_seq_timeout(fence->rdev, >>> seq, intr, timeout);> >>> if (r < 0) { >>> >>> return r; >>> >>> + } else if (r == 0) { >>> + return -ETIMEDOUT; >>> >>> } >>> >>> r = fence_signal(&fence->base); >>> >>> @@ -564,6 +567,22 @@ int radeon_fence_wait(struct radeon_fence *fence, >>> bool intr)> >>> } >>> >>> /** >>> >>> + * radeon_fence_wait - wait for a fence to signal >>> + * >>> + * @fence: radeon fence object >>> + * @intr: use interruptible sleep >>> + * >>> + * Wait for the requested fence to signal (all asics). >>> + * @intr selects whether to use interruptable (true) or non-interruptable >>> + * (false) sleep when waiting for the fence. >>> + * Returns 0 if the fence has passed, error for all other cases. >>> + */ >>> +int radeon_fence_wait(struct radeon_fence *fence, bool intr) >>> +{ >>> + return radeon_fence_wait_timeout(fence, intr, MAX_SCHEDULE_TIMEOUT); >>> +} >>> + >>> +/** >>> >>> * radeon_fence_wait_any - wait for a fence to signal on any ring >>> * >>> * @rdev: radeon device pointer >>> >>> diff --git a/drivers/gpu/drm/radeon/radeon_vce.c >>> b/drivers/gpu/drm/radeon/radeon_vce.c index 7eb1ae7..7d80dad 100644 >>> --- a/drivers/gpu/drm/radeon/radeon_vce.c >>> +++ b/drivers/gpu/drm/radeon/radeon_vce.c >>> @@ -810,7 +810,8 @@ int radeon_vce_ib_test(struct radeon_device *rdev, >>> struct radeon_ring *ring)> >>> goto error; >>> >>> } >>> >>> - r = radeon_fence_wait(fence, false); >>> + r = radeon_fence_wait_timeout(fence, false, msecs_to_jiffies( >>> + (radeon_lockup_timeout) ? radeon_lockup_timeout : >>> MAX_SCHEDULE_TIMEOUT));> >>> if (r) { >>> >>> DRM_ERROR("radeon: fence wait failed (%d).\n", r); >>> >>> } else { >>> >>> diff --git a/drivers/gpu/drm/radeon/uvd_v1_0.c >>> b/drivers/gpu/drm/radeon/uvd_v1_0.c index c6b1cbc..35caa89 100644 >>> --- a/drivers/gpu/drm/radeon/uvd_v1_0.c >>> +++ b/drivers/gpu/drm/radeon/uvd_v1_0.c >>> @@ -522,7 +522,8 @@ int uvd_v1_0_ib_test(struct radeon_device *rdev, >>> struct radeon_ring *ring)> >>> goto error; >>> >>> } >>> >>> - r = radeon_fence_wait(fence, false); >>> + r = radeon_fence_wait_timeout(fence, false, msecs_to_jiffies( >>> + (radeon_lockup_timeout) ? radeon_lockup_timeout : >>> MAX_SCHEDULE_TIMEOUT));> >>> if (r) { >>> >>> DRM_ERROR("radeon: fence wait failed (%d).\n", r); >>> goto error; >