On 2021-08-13 6:23 a.m., Lazar, Lijo wrote: > > > On 8/12/2021 10:24 PM, Michel Dänzer wrote: >> On 2021-08-12 1:33 p.m., Lazar, Lijo wrote: >>> On 8/12/2021 1:41 PM, Michel Dänzer wrote: >>>> On 2021-08-12 7:55 a.m., Koenig, Christian wrote: >>>>> Hi James, >>>>> >>>>> Evan seems to have understood how this all works together. >>>>> >>>>> See while any begin/end use critical section is active the work should >>>>> not be active. >>>>> >>>>> When you handle only one ring you can just call cancel in begin use and >>>>> schedule in end use. But when you have more than one ring you need a lock >>>>> or counter to prevent concurrent work items to be started. >>>>> >>>>> Michelle's idea to use mod_delayed_work is a bad one because it assumes >>>>> that the delayed work is still running. >>>> >>>> It merely assumes that the work may already have been scheduled before. >>>> >>>> Admittedly, I missed the cancel_delayed_work_sync calls for patch 2. While >>>> I think it can still have some effect when there's a single work item for >>>> multiple rings, as described by James, it's probably negligible, since >>>> presumably the time intervals between ring_begin_use and ring_end_use are >>>> normally much shorter than a second. >>>> >>>> So, while patch 2 is at worst a no-op (since mod_delayed_work is the same >>>> as schedule_delayed_work if the work hasn't been scheduled yet), I'm fine >>>> with dropping it. >>>> >>>> >>>>> Something similar applies to the first patch I think, >>>> >>>> There are no cancel work calls in that case, so the commit log is accurate >>>> TTBOMK. >>> >>> Curious - >>> >>> For patch 1, does it make a difference if any delayed work scheduled is >>> cancelled in the else part before proceeding? >>> >>> } else if (!enable && adev->gfx.gfx_off_state) { >>> cancel_delayed_work(); >> >> I tried the patch below. >> >> While this does seem to fix the problem as well, I see a potential issue: >> >> 1. amdgpu_gfx_off_ctrl locks adev->gfx.gfx_off_mutex >> 2. amdgpu_device_delay_enable_gfx_off runs, blocks in mutex_lock >> 3. amdgpu_gfx_off_ctrl calls cancel_delayed_work_sync >> >> I'm afraid this would deadlock? (CONFIG_PROVE_LOCKING doesn't complain >> though) > > Should use the cancel_delayed_work instead of the _sync version.
The thing is, it's not clear to me from cancel_delayed_work's description that it's guaranteed not to wait for amdgpu_device_delay_enable_gfx_off to finish if it's already running. If that's not guaranteed, it's prone to the same deadlock. > As you mentioned - at best work is not scheduled yet and cancelled > successfully, or at worst it's waiting for the mutex. In the worst case, if > amdgpu_device_delay_enable_gfx_off gets the mutex after amdgpu_gfx_off_ctrl > unlocks it, there is an extra check as below. > > if (!adev->gfx.gfx_off_state && !adev->gfx.gfx_off_req_count) > > The count wouldn't be 0 and hence it won't enable GFXOFF. I'm not sure, but it might also be possible for amdgpu_device_delay_enable_gfx_off to get the mutex only after amdgpu_gfx_off_ctrl was called again and set adev->gfx.gfx_off_req_count back to 0. >> Maybe it's possible to fix it with cancel_delayed_work_sync somehow, but I'm >> not sure how offhand. (With cancel_delayed_work instead, I'm worried >> amdgpu_device_delay_enable_gfx_off might still enable GFXOFF in the HW >> immediately after amdgpu_gfx_off_ctrl unlocks the mutex. Then again, that >> might happen with mod_delayed_work as well...) > > As mentioned earlier, cancel_delayed_work won't cause this issue. > > In the mod_delayed_ patch, mod_ version is called only when req_count is 0. > While that is a good thing, it keeps alive one more contender for the mutex. Not sure what you mean. It leaves the possibility of amdgpu_device_delay_enable_gfx_off running just after amdgpu_gfx_off_ctrl tried to postpone it. As discussed above, something similar might be possible with cancel_delayed_work as well. > The cancel_ version eliminates that contender if happens to be called at the > right time (more likely if there are multiple requests to disable gfxoff). On > the other hand, don't know how costly it is to call cancel_ every time on the > else part (or maybe call only once when count increments to 1?). Sure, why not, though I doubt it matters much — I expect adev->gfx.gfx_off_req_count transitioning between 0 <-> 1 to be the most common case by far. I sent out a v2 patch which should address all these issues. -- Earthling Michel Dänzer | https://redhat.com Libre software enthusiast | Mesa and X developer