On Fri 2026-06-26 13:32:38, Bradley Morgan wrote:
> On June 26, 2026 1:17:13 PM GMT+01:00, Bradley Morgan <[email protected]>
> wrote:
> >On June 26, 2026 1:14:14 PM GMT+01:00, Petr Mladek <[email protected]>
> >wrote:
> >>On Fri 2026-06-26 12:23:50, Petr Mladek wrote:
> >>> On Thu 2026-06-25 15:25:58, Bradley Morgan wrote:
> >>> But it all becomes very hairy. We have several levels:
> >>>
> >>> + watchdog-all_bt-specific option, e.g. sysctl_hardlockup_all_cpu_backtrace
> >>>
> >>> + watchdog-specific si_info preferences, e.g. hardlockup_si_mask
> >>>
> >>> + panic-specific si_info: panic_print
> >>>
> >>> + universal fallback for any layer: kernel_si_info
> >>>
> >>> Now, we try to check all these variables back and forth to
> >>> trigger all backtraces or to avoid triggering them.
> >>> And it clearly does not work well and the code is more and more
> >>> hairy.
> >>>
> >>> I think about another approach. The word "waterfall" comes to my mind.
> >>> Instead of checking all the settings back and forth, let's process
> >>> each setting one by one and just remember what has been done and
> >>> skip this in the next level.
> >>>
> >>> All the si_info actions seem to dump a global system state.
> >>> So, it would make sense to remember the state in a global variable
> >>> even when it might be modified by more CPUs in parallel.
> >>>
> Hmm.. new idea
>
> kernel/dump_filter.c ?
>
> What this file could do is to handle a generic lockup state machine
> so any subsystem can log what it already dumped?
>
> I know it may bloat, but it's better than cramming fixes in.

I am not sure what exactly you would like to achieve, but it sounds
a bit scary ;-)

Anyway, we definitely should not synchronize the watchdog reports
against each other. They run in incompatible contexts
(task vs. interrupt vs. NMI). Also, we should not add any locking
because the watchdogs usually print something when the system is
already in enough trouble.

Also, I think that it is not worth preventing duplicate backtraces
or reports from a single CPU. IMHO, it is not a big problem
in practice.

So, we are down to the large reports, like backtraces from all CPUs,
timers, locks, ... which are handled by sys_info(). So, I think
that it should be enough to handle this inside the sys_info() API.
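
For illustration only, the bookkeeping inside sys_info() could be
modeled in userspace roughly like this. The bit values,
sys_info_done_mask, sys_info_reset(), and the return value of
sys_info() are made up for this sketch; the real sys_info() prints
the reports instead of returning bits. A single atomic OR claims the
pending reports, so no lock is needed:

```c
/* Userspace model only; this is not existing kernel code. */
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical bit values, chosen just for this model. */
#define SYS_INFO_ALL_BT		(1UL << 0)
#define SYS_INFO_TIMERS		(1UL << 1)
#define SYS_INFO_LOCKS		(1UL << 2)
#define SYS_INFO_MODEL_ALL	\
	(SYS_INFO_ALL_BT | SYS_INFO_TIMERS | SYS_INFO_LOCKS)

/* Global record of the reports which have already been printed. */
static atomic_ulong sys_info_done_mask;

/* Remember that the reports in @mask were printed by the caller. */
static void sys_info_done(unsigned long mask)
{
	atomic_fetch_or(&sys_info_done_mask, mask);
}

/*
 * Model of one waterfall level: claim the bits in @mask which nobody
 * has claimed yet and return them. The real function would print
 * the matching reports instead of returning the bits.
 */
static unsigned long sys_info(unsigned long mask)
{
	unsigned long old;

	assert((mask & ~SYS_INFO_MODEL_ALL) == 0);

	old = atomic_fetch_or(&sys_info_done_mask, mask);
	return mask & ~old;
}

/* Hypothetical helper: forget everything before the next report. */
static void sys_info_reset(void)
{
	atomic_store(&sys_info_done_mask, 0);
}
```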

I do not want to say that my proposal was the best solution.
I am sure that there are better ones. But we need to consider
the gain vs. complexity.

Honestly, I am already a bit scared by the complexity which
the sys_info() API added. And it is hard to imagine that
adding another API would make it easier. But I might be wrong.

Instead, it might make sense to integrate the conflicting
subsystem-specific calls under the sys_info() API.

I mean that, for example, watchdog_hardlockup_check() would not
call trigger_allbutcpu_cpu_backtrace() directly. Instead, it
would call it via the sys_info() API so that sys_info()
could keep track of it. Something like:

void sys_info_allbutcpu_bt(int cpu)
{
	trigger_allbutcpu_cpu_backtrace(cpu);

	/*
	 * The caller likely printed backtrace of the given @cpu
	 * on its own. Prevent duplicate backtraces from all
	 * CPUs with potential next sys_info() call.
	 */
	sys_info_done(SYS_INFO_ALL_BT);
}

But I am not sure if it is really easier to follow
than calling sys_info_done() from the watchdog code.

Some watchdogs try to optimize the output and print backtraces
only from CPUs which are relevant for the given lockup.
We should keep the logic for selecting the set of CPUs
in the watchdog code. We just need to find an elegant way to
make sys_info() aware of it, or at least of the more massive
reports.

Anyway, I would prefer to keep it simple until we see some problems
in practice.

Best Regards,
Petr