On 18-10-19, 06:55, Sudeep Holla wrote: > On Thu, Oct 17, 2019 at 11:26:54PM +0200, Rafael J. Wysocki wrote: > > On Thu, Oct 17, 2019 at 9:36 PM Rafael J. Wysocki <raf...@kernel.org> wrote: > > > > > > On Thu, Oct 17, 2019 at 6:35 PM Sudeep Holla <sudeep.ho...@arm.com> wrote: > > > > > > > > dev_pm_qos_remove_request ends calling {max,min}_freq_req QoS notifiers > > > > which schedule policy update work. It may end up racing with the freeing > > > > the policy and unregistering the driver. > > > > > > > > One possible race is as below where the cpufreq_driver is unregistered > > > > but the scheduled work gets executed at later stage when cpufreq_driver > > > > is NULL(i.e. after freeing the policy and driver) > > > > > > > > Unable to handle kernel NULL pointer dereference at virtual address > > > > 0000001c > > > > pgd = (ptrval) > > > > [0000001c] *pgd=80000080204003, *pmd=00000000 > > > > Internal error: Oops: 206 [#1] SMP THUMB2 > > > > Modules linked in: > > > > CPU: 0 PID: 34 Comm: kworker/0:1 Not tainted > > > > 5.4.0-rc3-00006-g67f5a8081a4b #86 > > > > Hardware name: ARM-Versatile Express > > > > Workqueue: events handle_update > > > > PC is at cpufreq_set_policy+0x58/0x228 > > > > LR is at dev_pm_qos_read_value+0x77/0xac > > > > Control: 70c5387d Table: 80203000 DAC: fffffffd > > > > Process kworker/0:1 (pid: 34, stack limit = 0x(ptrval)) > > > > (cpufreq_set_policy) from > > > > (refresh_frequency_limits.part.24+0x37/0x48) > > > > (refresh_frequency_limits.part.24) from > > > > (handle_update+0x2f/0x38) > > > > (handle_update) from (process_one_work+0x16d/0x3cc) > > > > (process_one_work) from (worker_thread+0xff/0x414) > > > > (worker_thread) from (kthread+0xff/0x100) > > > > (kthread) from (ret_from_fork+0x11/0x28) > > > > > > > > Cc: "Rafael J. Wysocki" <r...@rjwysocki.net> > > > > Cc: Viresh Kumar <viresh.ku...@linaro.org> > > > > Signed-off-by: Sudeep Holla <sudeep.ho...@arm.com> > > > > --- > > > > drivers/cpufreq/cpufreq.c | 3 +++ > > > > 1 file changed, 3 insertions(+) > > > > > > > > Hi Rafael, Viresh, > > > > > > > > This fixed the boot issue I reported[1] on TC2 with bL switcher enabled. > > > > I have based this patch on -rc3 and not on top of your patches. This > > > > only fixes the boot issue but I hit the other crashes while continuously > > > > switching on and off the bL switcher that register/unregister the driver > > > > Your patch series fixes them. I can based this on top of those if you > > > > prefer. > > > > > > > > Regards, > > > > Sudeep > > > > > > > > [1] https://lore.kernel.org/linux-pm/20191015155735.GA29105@bogus/ > > > > > > > > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c > > > > index c52d6fa32aac..b703c29a84be 100644 > > > > --- a/drivers/cpufreq/cpufreq.c > > > > +++ b/drivers/cpufreq/cpufreq.c > > > > @@ -1278,6 +1278,9 @@ static void cpufreq_policy_free(struct > > > > cpufreq_policy *policy) > > > > } > > > > > > > > dev_pm_qos_remove_request(policy->min_freq_req); > > > > + /* flush the pending policy->update work before freeing the > > > > policy */ > > > > + if (work_pending(&policy->update)) > > > > > > Isn't this racy? > > > > > > It still may be running if the pending bit is clear and we still need > > > to wait for it then, don't we? > > > > > > Why don't you do an unconditional flush_work() here? > > > > You may as well do a cancel_work_sync() here, because whether or not > > the last update of the policy happens before it goes away is a matter > > of timing in any case > > In fact that's the first thing I tried to fix the issue I was seeing. > But I then thought it would be better to complete the update as the PM > QoS were getting updated back to DEFAULT values for the device. Even > this works. > > What is your preference ? flush_work or cancel_work_sync ? I will > update accordingly. I may need to do some more testing with > cancel_work_sync as I just checked that quickly to confirm the race.
As I said in the other email, this work didn't come as a result of removal of the qos request from cpufreq core and so must have come from other thermal or similar events. In that case maybe doing flush_work() is better before we remove the cpufreq driver. Though Rafael's timing related comment makes sense as well, but now that we have received the work before policy is removed, I will rather complete the work and quit. -- viresh