On Fri 2021-03-19 16:00:36, Wang Qing wrote:
> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu
> is updated, while an unbound worker_pool running on that CPU still uses
> wq_watchdog_touched to determine whether it has locked up. The check may
> therefore misfire.

In other words, unbound workqueues are not aware of the more common
touch_softlockup_watchdog() because it updates only wq_watchdog_touched_cpu
for the affected CPU. As a result, the workqueue watchdog might report
a lockup in an unbound workqueue even though it is blocked by code that
is known to be slow.
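Just to illustrate the problem, a simplified sketch of the code before
this patch, as I read it (taken from kernel/watchdog.c and
kernel/workqueue.c; unrelated details omitted):

notrace void touch_softlockup_watchdog(void)
{
        touch_softlockup_watchdog_sched();
        /* touches only the per-CPU timestamp of the current CPU */
        wq_watchdog_touch(raw_smp_processor_id());
}

notrace void wq_watchdog_touch(int cpu)
{
        if (cpu >= 0)
                per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
        else
                /* reached only via wq_watchdog_touch(-1) */
                wq_watchdog_touched = jiffies;
}

wq_watchdog_timer_fn() compares unbound pools (pool->cpu < 0) only
against wq_watchdog_touched, so the per-CPU touch above is invisible
to them.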
> My suggestion is to update both when touch_softlockup_watchdog() is
> called: use wq_watchdog_touched_cpu to check bound worker_pools, and
> wq_watchdog_touched to check unbound ones.
>
> Signed-off-by: Wang Qing <wangq...@vivo.com>
> ---
>  kernel/watchdog.c  |  5 +++--
>  kernel/workqueue.c | 17 ++++++-----------
>  2 files changed, 9 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 7110906..107bc38
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -278,9 +278,10 @@ void touch_all_softlockup_watchdogs(void)
>  	 * update as well, the only side effect might be a cycle delay for
>  	 * the softlockup check.
>  	 */
> -	for_each_cpu(cpu, &watchdog_allowed_mask)
> +	for_each_cpu(cpu, &watchdog_allowed_mask) {
>  		per_cpu(watchdog_touch_ts, cpu) = SOFTLOCKUP_RESET;
> -	wq_watchdog_touch(-1);
> +		wq_watchdog_touch(cpu);

Note that wq_watchdog_touch(cpu) now always updates wq_watchdog_touched
as well. This loop will therefore write the same jiffies value into the
same variable once per CPU.

> +	}
>  }
>
>  void touch_softlockup_watchdog_sync(void)
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 0d150da..be08295
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -5787,22 +5787,17 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
>  			continue;
>
>  		/* get the latest of pool and touched timestamps */
> +		if (pool->cpu >= 0)
> +			touched = READ_ONCE(per_cpu(wq_watchdog_touched_cpu, pool->cpu));
> +		else
> +			touched = READ_ONCE(wq_watchdog_touched);
>  		pool_ts = READ_ONCE(pool->watchdog_ts);
> -		touched = READ_ONCE(wq_watchdog_touched);
>
>  		if (time_after(pool_ts, touched))
>  			ts = pool_ts;
>  		else
>  			ts = touched;
>
> -		if (pool->cpu >= 0) {
> -			unsigned long cpu_touched =
> -				READ_ONCE(per_cpu(wq_watchdog_touched_cpu,
> -						  pool->cpu));
> -			if (time_after(cpu_touched, ts))
> -				ts = cpu_touched;
> -		}
> -
>  		/* did we stall? */
>  		if (time_after(jiffies, ts + thresh)) {
>  			lockup_detected = true;
> @@ -5826,8 +5821,8 @@ notrace void wq_watchdog_touch(int cpu)
>  {
>  	if (cpu >= 0)
>  		per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
> -	else
> -		wq_watchdog_touched = jiffies;
> +
> +	wq_watchdog_touched = jiffies;
>  }
>
>  static void wq_watchdog_set_thresh(unsigned long thresh)

This last hunk alone is enough to fix the problem: wq_watchdog_touched
will then get updated also from the cpu-specific
touch_softlockup_watchdog().

The full patch simplifies the logic of wq_watchdog_timer_fn(), but it
adds unnecessary assignments into touch_all_softlockup_watchdogs().

I do not have a strong opinion on which solution is better. I slightly
prefer to keep only this last hunk.

Best Regards,
Petr
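P.S. For reference, a sketch of what wq_watchdog_touch() would look
like with only that last hunk applied (body taken from the quoted diff;
the rest of the file would stay unchanged):

notrace void wq_watchdog_touch(int cpu)
{
        /* cpu-specific touch: keep updating the per-CPU timestamp */
        if (cpu >= 0)
                per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;

        /* also refresh the global timestamp used for unbound pools */
        wq_watchdog_touched = jiffies;
}

The side effect is that a touch on any single CPU now also defers the
unbound-pool check, but that can only delay a report by a cycle; it
cannot cause a false positive.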