On June 26, 2026 3:26:11 PM GMT+01:00, Petr Mladek <[email protected]> wrote:
>On Fri 2026-06-26 13:32:38, Bradley Morgan wrote:
>> On June 26, 2026 1:17:13 PM GMT+01:00, Bradley Morgan <[email protected]> wrote:
>> >On June 26, 2026 1:14:14 PM GMT+01:00, Petr Mladek <[email protected]> wrote:
>> >>On Fri 2026-06-26 12:23:50, Petr Mladek wrote:
>> >>> On Thu 2026-06-25 15:25:58, Bradley Morgan wrote:
>> >>> But it all becomes very hairy. We have several levels:
>> >>> 
>> >>>    + watchdog-all_bt-specific option, e.g. sysctl_hardlockup_all_cpu_backtrace
>> >>> 
>> >>>    + watchdog-specific si_info preferences, e.g. hardlockup_si_mask
>> >>> 
>> >>>    + panic-specific si_info: panic_print
>> >>> 
>> >>>    + universal fallback for any layer: kernel_si_info
>> >>> 
>> >>> Now, we try to check all these variables back and forth to
>> >>> trigger all backtraces or to avoid triggering them.
>> >>> And it clearly does not work well and the code is more and more
>> >>> hairy.
>> >>> 
>> >>> I am thinking about another approach. The word "waterfall" comes to my mind.
>> >>> Instead of checking all the settings back and forth, let's process
>> >>> each setting one by one and just remember what has been done and
>> >>> skip this in the next level.
>> >>> 
>> >>> All the si_info actions seem to dump a global system state.
>> >>> So, it would make sense to remember the state in a global variable
>> >>> even when it might be modified by several CPUs in parallel.
>> >>> 
>> Hmm... new idea:
>> 
>> kernel/dump_filter.c ?
>> 
>> This file could handle a generic lockup state machine, so that any
>> subsystem can record what it has already dumped?
>> 
>> I know it may add bloat, but it's better than cramming fixes in.
>
>I am not sure what exactly you would like to achieve but it sounds
>a bit scary ;-)
>
>Anyway, we should not synchronize the watchdog reports against
>each other, definitely. They are running in non-compatible contexts
>(task vs interrupt vs NMI). Also we should not add any locking
>because they usually print something when the system is already
>in serious trouble.
>
>Also I think that it is not worth preventing duplicated backtraces
>or reports from a single CPU. IMHO, it is not a big problem
>in practice.
>
>So, we are down to large reports, like backtraces from all CPUs,
>timers, locks, ... which are handled by sys_info(). So, I think
>that it should be enough to handle this inside the sys_info() API.
>
>I do not want to say that my proposal was the best solution.
>I am sure that there are better ones. But we need to consider
>the gain vs. complexity.
>
>Honestly, I am already a bit scared by the complexity which
>the sys_info() API added. And it is hard to imagine that
>adding another API would make it easier. But I might be wrong.
>
>Instead, it might make sense to integrate the conflicting
>subsystem-specific calls under the sys_info() API.
>I mean that, for example, watchdog_hardlockup_check() would not
>call trigger_allbutcpu_cpu_backtrace() directly; instead,
>it would call it via the sys_info() API so that sys_info()
>could keep track of it. Something like:
>
>void sys_info_allbutcpu_bt(int cpu)
>{
>       trigger_allbutcpu_cpu_backtrace(cpu);
>       /*
>        * The caller likely printed backtrace of the given @cpu
>        * on its own. Prevent duplicate backtraces from all
>        * CPUs with potential next sys_info() call.
>        */
>       sys_info_done(SYS_INFO_ALL_BT);
>}
>
>But I am not sure if it is really easier to follow
>than calling sys_info_done() from the watchdog code.
>
>Some watchdogs try to optimize the output and print backtraces
>only from CPUs which are relevant for the given lockup.
>We should keep the logic for selecting the set of CPUs
>in the watchdog code. We just need to solve how to elegantly
>make sys_info() aware of it, or at least of the more massive
>reports.
>
>Anyway, I would prefer to keep it simple until we see some problems
>in practice.
>
>Best Regards,
>Petr
>
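For illustration, here is a minimal user-space sketch of the "waterfall" idea from the quoted discussion. Each level asks only for reports that no earlier level has produced, and a global mask remembers what has already been dumped. It assumes C11 <stdatomic.h> in place of the kernel's own atomic helpers. The SYS_INFO_* bit values, dump_one() and lockup_waterfall() are hypothetical. Only the names sys_info() and sys_info_done() come from the thread, and the bodies here are not the kernel's.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical report bits; the real kernel SYS_INFO_* flags may differ. */
#define SYS_INFO_TIMERS (1UL << 0)
#define SYS_INFO_LOCKS  (1UL << 1)
#define SYS_INFO_ALL_BT (1UL << 2)

/* Global record of what was already dumped; several CPUs may update it. */
static atomic_ulong sys_info_dumped;

/* Reports actually produced, kept only so the sketch can be checked. */
static unsigned long sys_info_performed;

static void dump_one(unsigned long bit)
{
	/* Stand-in for the real dump (backtraces, timers, locks, ...). */
	printf("dumping report 0x%lx\n", bit);
	sys_info_performed |= bit;
}

/* Record @mask as dumped, e.g. after a watchdog printed it on its own. */
static void sys_info_done(unsigned long mask)
{
	atomic_fetch_or(&sys_info_dumped, mask);
}

/* Dump only the requested reports that no earlier level has dumped. */
static void sys_info(unsigned long mask)
{
	/* fetch_or marks the bits and returns the previous state in one step. */
	unsigned long todo = mask & ~atomic_fetch_or(&sys_info_dumped, mask);

	for (unsigned long bit = 1; todo; bit <<= 1) {
		if (todo & bit) {
			dump_one(bit);
			todo &= ~bit;
		}
	}
}

/* Walk the levels in order: watchdog option, watchdog mask, panic mask. */
static void lockup_waterfall(void)
{
	/* Level 1: the watchdog already triggered all-CPU backtraces. */
	sys_info_done(SYS_INFO_ALL_BT);
	/* Level 2: watchdog-specific mask; backtraces are skipped. */
	sys_info(SYS_INFO_ALL_BT | SYS_INFO_LOCKS);
	/* Level 3: panic-specific mask; only the timers are still missing. */
	sys_info(SYS_INFO_ALL_BT | SYS_INFO_LOCKS | SYS_INFO_TIMERS);
}
```

Because atomic_fetch_or() sets the bits and returns the previous value in one step, two CPUs that request the same report in parallel would not both dump it. No lock is taken, which fits the constraint that these paths may run in task, interrupt or NMI context.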


I understand it's scary to create a new file in the first place.

But I was a bit vague about what I wanted, and I'm sorry.

So, the reason I'd suggest a new file is that if a subsystem were to
bypass sys_info() to log a lockup, its dump would completely miss
the filter and could duplicate a report that was already printed.

My file would act as a generic lockless state machine that any
subsystem can update, regardless of how it dumps its logs.
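A rough sketch of what such a lockless state machine in the proposed kernel/dump_filter.c might look like. It is written as user-space C11 so it is easy to test. Every name here (enum dump_kind, dump_filter_claim(), dump_filter_finish(), some_subsystem_report()) is hypothetical. Each report kind moves from IDLE to IN_PROGRESS to DONE, and a compare-and-swap decides which caller gets to dump. Because of that, a subsystem can use it even when it does not go through sys_info().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical report kinds a subsystem might want to dump only once. */
enum dump_kind { DUMP_ALL_BT, DUMP_TIMERS, DUMP_LOCKS, DUMP_NR_KINDS };

/* Per-report state: each kind moves IDLE -> IN_PROGRESS -> DONE. */
enum dump_state { DUMP_IDLE, DUMP_IN_PROGRESS, DUMP_DONE };

/* Zero-initialized, so every kind starts in DUMP_IDLE. */
static _Atomic int dump_states[DUMP_NR_KINDS];

/*
 * Try to become the one caller that produces @kind.
 * Returns true only for the caller that wins the IDLE -> IN_PROGRESS
 * transition; everyone else should skip the dump. No locks, no waiting.
 */
static bool dump_filter_claim(enum dump_kind kind)
{
	int expected = DUMP_IDLE;

	return atomic_compare_exchange_strong(&dump_states[kind], &expected,
					      DUMP_IN_PROGRESS);
}

/* Mark @kind as fully dumped once the winner has finished printing. */
static void dump_filter_finish(enum dump_kind kind)
{
	atomic_store(&dump_states[kind], DUMP_DONE);
}

/* Counts real dumps; only the claim winner touches it. */
static int all_bt_dumps;

/* Example caller that bypasses sys_info() but still honours the filter. */
static void some_subsystem_report(void)
{
	if (!dump_filter_claim(DUMP_ALL_BT))
		return;	/* someone else already dumped it or is dumping it */
	all_bt_dumps++;	/* stand-in for trigger_all_cpu_backtrace() */
	dump_filter_finish(DUMP_ALL_BT);
}
```

One trade-off of this design is that a caller that loses the claim skips the dump without waiting. If the winner is itself stuck, the report may never appear. A real version would also need some way to reset the states between unrelated lockups.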

If you have any questions, feel absolutely free to ask! :)

Discussion is a way to make everyone happy!

Thanks!
