"Paul E. McKenney" <paul...@kernel.org> writes:
> On Sun, Apr 10, 2022 at 09:33:43PM +1000, Michael Ellerman wrote:
>> Zhouyi Zhou <zhouzho...@gmail.com> writes:
>> > On Fri, Apr 8, 2022 at 10:07 PM Paul E. McKenney <paul...@kernel.org>
>> > wrote:
>> >> On Fri, Apr 08, 2022 at 06:02:19PM +0800, Zhouyi Zhou wrote:
>> >> > On Fri, Apr 8, 2022 at 3:23 PM Michael Ellerman <m...@ellerman.id.au>
>> >> > wrote:
>> ...
>> >> > > I haven't seen it in my testing. But using Miguel's config I can
>> >> > > reproduce it seemingly on every boot.
>> >> > >
>> >> > > For me it bisects to:
>> >> > >
>> >> > > 35de589cb879 ("powerpc/time: improve decrementer clockevent
>> >> > > processing")
>> >> > >
>> >> > > Which seems plausible.
>> >> > I also bisect to 35de589cb879 ("powerpc/time: improve decrementer
>> >> > clockevent processing")
>> ...
>> >>
>> >> > > Reverting that on mainline makes the bug go away.
>> >>
>> >> > I also revert that on the mainline, and am currently doing a pressure
>> >> > test (by repeatedly invoking qemu and checking the console.log) on PPC
>> >> > VM in Oregon State University.
>> > After 306 rounds of stress test on mainline without triggering the bug
>> > (last for 4 hours and 27 minutes), I think the bug is indeed caused by
>> > 35de589cb879 ("powerpc/time: improve decrementer clockevent
>> > processing") and stop the test for now.
>>
>> Thanks for testing, that's pretty conclusive.
>>
>> I'm not inclined to actually revert it yet.
>>
>> We need to understand if there's actually a bug in the patch, or if it's
>> just exposing some existing bug/bad behavior we have. The fact that it
>> only appears with CONFIG_HIGH_RES_TIMERS=n is suspicious.
>>
>> Do we have some code that inadvertently relies on something enabled by
>> HIGH_RES_TIMERS=y, or do we have a bug that is hidden by HIGH_RES_TIMERS=y ?
>
> For whatever it is worth, moderate rcutorture runs to completion without
> errors with CONFIG_HIGH_RES_TIMERS=n on 64-bit x86.

Thanks for testing that, I don't have any big x86 machines to test on :)

> Also for whatever it is worth, I don't know of anything other than
> microcontrollers or the larger IoT devices that would want their kernels
> built with CONFIG_HIGH_RES_TIMERS=n. Which might be a failure of
> imagination on my part, but so it goes.

Yeah I agree, like I said before I wasn't even aware you could turn it
off.

So I think we'll definitely add a select HIGH_RES_TIMERS in future, but
first I need to work out why we are seeing stalls with it disabled.

cheers