On 03.07.2012 16:09, Jerome Glisse wrote:
> On Tue, Jul 3, 2012 at 5:26 AM, Christian König <deathsimple at vodafone.de> wrote:
>> [SNIP]
>> Yeah, but I thought of fixing those oops as the second step. I see the
>> fact that suspend/resume is unpinning all the TTM memory and then
>> pinning it again as a bug that needs to be fixed. Alternatively we
>> could split the suspend/resume calls into suspend/disable and
>> resume/enable calls, where we only call disable/enable in the
>> gpu_reset path rather than doing a complete suspend/resume (not
>> really sure about that).
> Fixing oops is not the second step, it is the first step. I am not
> saying that the suspend/resume as it happens right now is a good
> thing, I am saying it is what it is and we have to deal with it until
> we do something better. There is no excuse not to fix an oops with a
> simple 16 line patch.

Completely agree.
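To illustrate the suspend/disable vs. resume/enable split I mentioned
above, something like this is what I have in mind (a rough, untested
sketch; the radeon_asic_disable()/radeon_asic_enable() names are made
up and the surrounding code is simplified):

	/* Hypothetical reset path that only quiesces the hardware
	 * instead of doing a full suspend/resume, which unpins and
	 * re-pins all of the TTM memory. */
	int radeon_gpu_reset(struct radeon_device *rdev)
	{
		int r;

		radeon_asic_disable(rdev);	/* stop rings, mask irqs */

		r = radeon_asic_reset(rdev);
		if (r)
			return r;

		/* the auxiliary data (SA manager memory, blit shader bo,
		 * VM page tables) stays pinned in place the whole time */
		radeon_asic_enable(rdev);	/* restart rings, unmask irqs */
		return 0;
	}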
>> Also a GPU reset isn't such a rare event. Currently it just occurs
>> when userspace is doing something bad, for example submitting an
>> invalid shader or stuff like that. But with VM and IO protection
>> coming into the driver we are going to need a GPU reset whenever
>> there is a protection fault, and with that it is really desirable to
>> reset the hardware in a way where we can say: this IB was faulty,
>> skip over it and resume with whatever is after it on the ring.
> There is a mechanism to handle that properly from the irq handler
> that AMD needs to release; the pagefault thing could be transparent
> and should only need a few lines in the irq handler (I think I did a
> patch for that and sent it to AMD for review, but I am wondering if I
> wasn't lacking some doc).

I also searched the AMD internal docs for a good description of how the
lightweight recovery is supposed to work, but haven't found anything
clear so far. My expectation was that there should be something like an
"abort current IB" command you can issue by writing a register, but
that doesn't seem to be the case.

>> And to do that we need to keep the auxiliary data like sub allocator
>> memory, blitting shader bo, and especially VM page tables at the same
>> place in hardware memory.
> I agree that we need a lightweight reset, but we need to keep the
> heavy reset as it is right now; if you want to do a lightweight
> reset, do it as a new function. I always intended to have two reset
> paths (hint: the gpu soft reset name vs what is called hard reset but
> not yet released). I even made patches for that a long time ago but
> never got them cleared by AMD review.

My idea was to pass in some extra information, so asic_reset more
clearly knows what to do. An explicit distinction between a soft and a
hard reset also seems like a possible solution, but sounds like a bit
of code duplication.

>>> I stress it: we need at the very least a mutex to protect gpu
>>> reset, and I will stand on that position because there is no other
>>> way around it. Using a rw lock has the bonus of allowing proper
>>> handling of gpu reset failure, and that is what the patch I sent to
>>> Linus' fix tree is about; so to make drm-next merge properly while
>>> preserving proper behavior on gpu reset failure, the rw semaphore
>>> is the best option.
>> Oh well, I'm not arguing that we don't need a lock here. I'm just
>> questioning putting it at the ioctl level (e.g. the driver entry
>> from userspace); that wasn't such a good idea with the cs_mutex and
>> doesn't seem like a good idea now. Instead we should place the
>> locking between the ioctl/ttm and the actual hardware submission,
>> and that brings it pretty close (if not identical) to the ring
>> mutex.
>>
>> Cheers,
>> Christian.
> No, locking at the ioctl level makes sense. Please show me figures
> that it hurts performance; I did a quick sysprof and I couldn't see
> it impacting anything. No matter how much you hate this, it is the
> best solution: it avoids each ioctl doing useless things in case of
> gpu lockup, and it touches a lot less code than moving a lock down
> the call path would. So again, this is the best solution for the
> heavy reset, and I am not saying that a soft reset would need to take
> this lock or that we can't improve the way it's done. All I am saying
> is that the ring lock is the wrong solution for the heavy reset; it
> should be ok for the lightweight reset.

I don't have any performance concerns, it just doesn't seem to be the
right place to me.
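Just so we are talking about the same thing, this is roughly how I read
the rw semaphore approach (sketch only; I'm assuming the semaphore sits
in radeon_device as something like rdev->exclusive_lock, and
radeon_cs_submit() is just a stand-in for the real CS path):

	#include <linux/rwsem.h>

	/* every ioctl enters the device as a reader, so normal
	 * operation runs fully concurrently */
	int radeon_cs_ioctl(struct drm_device *dev, void *data,
			    struct drm_file *filp)
	{
		struct radeon_device *rdev = dev->dev_private;
		int r;

		down_read(&rdev->exclusive_lock);
		r = radeon_cs_submit(rdev, data); /* stand-in for the CS path */
		up_read(&rdev->exclusive_lock);
		return r;
	}

	/* the heavy reset takes it as a writer: it waits for all
	 * in-flight ioctls and blocks new ones until the reset is done */
	int radeon_gpu_reset(struct radeon_device *rdev)
	{
		int r;

		down_write(&rdev->exclusive_lock);
		r = radeon_asic_reset(rdev); /* suspend/reset/resume dance */
		up_write(&rdev->exclusive_lock);
		return r;
	}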
So are you really sure that the ttm_bo_delayed_workqueue, pageflips or
callbacks to radeon_bo_move can't hit us here? IIRC that always was the
big concern with the cs_mutex not being held in all cases.

Anyway, if you think it will work and fixes the crash problem at hand,
then I'm ok with committing it.

Christian.
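PS: This is the kind of path I'm worried about, assuming the semaphore
from the sketch above. A worker never enters through an ioctl, so it
would have to take the lock itself (purely illustrative;
radeon_bo_cleanup() and the bo_cleanup_work member are made up):

	/* without the reader lock the worker can call back into
	 * radeon_bo_move and friends while radeon_gpu_reset() holds
	 * the hardware exclusively */
	static void radeon_bo_cleanup_work(struct work_struct *work)
	{
		struct radeon_device *rdev =
			container_of(work, struct radeon_device, bo_cleanup_work);

		down_read(&rdev->exclusive_lock);
		radeon_bo_cleanup(rdev); /* stand-in for the delayed delete */
		up_read(&rdev->exclusive_lock);
	}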