op 15-05-14 17:48, Christian K?nig schreef: > Am 15.05.2014 16:18, schrieb Maarten Lankhorst: >> op 15-05-14 15:19, Christian K?nig schreef: >>> Am 15.05.2014 15:04, schrieb Maarten Lankhorst: >>>> op 15-05-14 11:42, Christian K?nig schreef: >>>>> Am 15.05.2014 11:38, schrieb Maarten Lankhorst: >>>>>> op 15-05-14 11:21, Christian K?nig schreef: >>>>>>> Am 15.05.2014 03:06, schrieb Maarten Lankhorst: >>>>>>>> op 14-05-14 17:29, Christian K?nig schreef: >>>>>>>>>> + /* did fence get signaled after we enabled the sw irq? */ >>>>>>>>>> + if >>>>>>>>>> (atomic64_read(&fence->rdev->fence_drv[fence->ring].last_seq) >= >>>>>>>>>> fence->seq) { >>>>>>>>>> + radeon_irq_kms_sw_irq_put(fence->rdev, fence->ring); >>>>>>>>>> + return false; >>>>>>>>>> + } >>>>>>>>>> + >>>>>>>>>> + fence->fence_wake.flags = 0; >>>>>>>>>> + fence->fence_wake.private = NULL; >>>>>>>>>> + fence->fence_wake.func = radeon_fence_check_signaled; >>>>>>>>>> + __add_wait_queue(&fence->rdev->fence_queue, &fence->fence_wake); >>>>>>>>>> + fence_get(f); >>>>>>>>> That looks like a race condition to me. The fence needs to be added >>>>>>>>> to the wait queue before the check, not after. >>>>>>>>> >>>>>>>>> Apart from that the whole approach looks like a really bad idea to >>>>>>>>> me. How for example is lockup detection supposed to happen with this? >>>>>>>> It's not a race condition because fence_queue.lock is held when this >>>>>>>> function is called. >>>>>>> Ah, I see. That's also the reason why you moved the wake_up_all out of >>>>>>> the processing function. >>>>>> Correct. :-) >>>>>>>> Lockup's a bit of a weird problem, the changes wouldn't allow core ttm >>>>>>>> code to handle the lockup any more, >>>>>>>> but any driver specific wait code would still handle this. I did this >>>>>>>> by design, because in future patches the wait >>>>>>>> function may be called from outside of the radeon driver. The official >>>>>>>> wait function takes a timeout parameter, >>>>>>>> so lockups wouldn't be fatal if the timeout is set to something like >>>>>>>> 30*HZ for example, it would still return >>>>>>>> and report that the function timed out. >>>>>>> Timeouts help with the detection of the lockup, but not at all with the >>>>>>> handling of them. >>>>>>> >>>>>>> What we essentially need is a wait callback into the driver that is >>>>>>> called in non atomic context without any locks held. >>>>>>> >>>>>>> This way we can block for the fence to become signaled with a timeout >>>>>>> and can then also initiate the reset handling if necessary. >>>>>>> >>>>>>> The way you designed the interface now means that the driver never gets >>>>>>> a chance to wait for the hardware to become idle and so never has the >>>>>>> opportunity to the reset the whole thing. >>>>>> You could set up a hangcheck timer like intel does, and end up with a >>>>>> reliable hangcheck detection that doesn't depend on cpu waits. :-) Or >>>>>> override the default wait function and restore the old behavior. >>>>> >>>>> Overriding the default wait function sounds better, please implement it >>>>> this way. >>>>> >>>>> Thanks, >>>>> Christian. >>>> >>>> Does this modification look sane? >>> Adding the timeout is on my todo list for quite some time as well, so this >>> part makes sense. >>> >>>> +static long __radeon_fence_wait(struct fence *f, bool intr, long timeout) >>>> +{ >>>> + struct radeon_fence *fence = to_radeon_fence(f); >>>> + u64 target_seq[RADEON_NUM_RINGS] = {}; >>>> + >>>> + target_seq[fence->ring] = fence->seq; >>>> + return radeon_fence_wait_seq_timeout(fence->rdev, target_seq, intr, >>>> timeout); >>>> +} >>> When this call is comming from outside the radeon driver you need to lock >>> rdev->exclusive_lock here to make sure not to interfere with a possible >>> reset. >> Ah thanks, I'll add that. >> >>>> .get_timeline_name = radeon_fence_get_timeline_name, >>>> .enable_signaling = radeon_fence_enable_signaling, >>>> .signaled = __radeon_fence_signaled, >>> Do we still need those callback when we implemented the wait callback? >> .get_timeline_name is used for debugging (trace events). >> .signaled is the non-blocking call to check if the fence is signaled or not. >> .enable_signaling is used for adding callbacks upon fence completion, the >> default 'fence_default_wait' uses it, so >> when it works no separate implementation is needed unless you want to do >> more than just waiting. >> It's also used when fence_add_callback is called. i915 can be patched to use >> it. ;-) > > I just meant enable_signaling, the other ones are fine with me. The problem > with enable_signaling is that it's called with a spin lock held, so we can't > sleep. > > While resetting the GPU could be moved out into a timer the problem here is > that I can't lock rdev->exclusive_lock in such situations. > > This means when i915 would call into radeon to enable signaling for a fence > we can't make sure that there is not GPU reset running on another CPU. And > touching the IRQ registers while a reset is going on is a really good recipe > to lockup the whole system. If you increase the irq counter on all rings before doing a gpu reset, adjust the state and call sw_irq_put when done this race could never happen. Or am I missing something?
~Maarten