On Wed, Mar 31, 2021 at 9:31 AM Peter Zijlstra <pet...@infradead.org> wrote: > > On Wed, Mar 31, 2021 at 08:11:38AM +0200, Dmitry Vyukov wrote: > > On Wed, Mar 31, 2021 at 12:26 AM syzbot > > <syzbot+1a33233ccd8201ec2...@syzkaller.appspotmail.com> wrote: > > > > > > Hello, > > > > > > syzbot found the following issue on: > > > > > > HEAD commit: db24726b Merge tag 'integrity-v5.12-fix' of > > > git://git.kern.. > > > git tree: upstream > > > console output: https://syzkaller.appspot.com/x/log.txt?x=16c16b7cd00000 > > > kernel config: https://syzkaller.appspot.com/x/.config?x=daeff30c2474a60f > > > dashboard link: > > > https://syzkaller.appspot.com/bug?extid=1a33233ccd8201ec2322 > > > > > > Unfortunately, I don't have any reproducer for this issue yet. > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the > > > commit: > > > Reported-by: syzbot+1a33233ccd8201ec2...@syzkaller.appspotmail.com > > > > I think this is a LOCKDEP issue. +LOCKDEP maintainers. > > > > Another bug happened on another thread ("WARNING: possible circular > > locking dependency detected"). Lockdep disabled lock tracking > > ("debug_locks = 0" in the report), which probably made it miss > > rcu_unlock somewhere, but it did not turn off reporting yet and > > produced the false positive first. > > > > I think if LOCKDEP disables lock tracking, it must also disable > > reporting of issues that require lock tracking. That would avoid false > > positives. > > Still early and brain hasn't really booted yet, but features that > require lock tracking are supposed to check debug_locks. > > And afaict debug_lockdep_rcu_enabled(), which is called by > RCU_LOCKDEP_WARN(), which is called by rcu_sleep_check() does just that.
Right... yet it somehow happens. Looking at a dozen of reports, all with 2 concurrent lockdep splats and "debug_locks = 0" in the report, I am pretty sure there is some kind of race in lockdep. I see there are at least 2 places where lockdep can falsely assume rcu lock is held: https://elixir.bootlin.com/linux/v5.12-rc5/source/kernel/locking/lockdep.c#L5543 https://elixir.bootlin.com/linux/v5.12-rc5/source/kernel/rcu/update.c#L105 both to "avoid false positives", but for "Illegal context switch in RCU-bh read-side critical section" it can actually lead to false positives, right? Is there something else that turns off tracking before setting debug_locks=0? Perhaps we get into that window where tracking is disabled, but debug_locks is not reset yet? lockdep_enabled() returns false if lockdep_recursion var is set: https://elixir.bootlin.com/linux/v5.12-rc5/source/kernel/locking/lockdep.c#L87 but lockdep_lock() sets it _before_ taking the lock: https://elixir.bootlin.com/linux/v5.12-rc5/source/kernel/locking/lockdep.c#L111 Is it possible that lockdep_recursion is set, then the task is rescheduled and another task sees wrong value for lockdep_recursion? Shouldn't lockdep_recursion be set _after_ arch_spin_unlock(&__lock)? Though, I assume lockdep_lock() is called frequently and not only on reports, so if my reasoning would be true, it would produce false positives all the time, not necessary on concurrent reports... this does not agree with the observed failure mode...