On Wed, Dec 31, 2025 at 5:40 PM Yafang Shao <[email protected]> wrote:
>
> We maintain a vmcore analysis script on each server that automatically
> parses /var/crash/XXXX/vmcore-dmesg.txt to categorize vmcores. This helps
> us save considerable effort by avoiding analysis of known bugs.
>
> For vmcores triggered by a driver bug, the system calls print_modules() to
> list the loaded modules. However, print_modules() does not output module
> version information. Across a large fleet of servers, there are often many
> different module versions running simultaneously, and we need to know which
> driver version caused a given vmcore.
>
> Currently, the only reliable way to obtain the module version associated
> with a vmcore is to analyze the /var/crash/XXXX/vmcore file itself, an
> operation that is resource-intensive. Therefore, we propose printing the
> driver version directly in the log, which is far more efficient.
>
> The motivation behind this change is that the external NVIDIA driver
> [0] frequently causes kernel panics across our server fleet.
> While we continuously upgrade to newer NVIDIA driver versions,
> upgrading the entire fleet is time-consuming. Therefore, we need to
> identify which driver version is responsible for each panic.
>
> In-tree modules are tied to the specific kernel version already, so
> printing their versions is redundant. However, for external drivers (like
> proprietary networking or GPU stacks), the version is the single most
> critical piece of metadata for triage. Therefore, to avoid bloating the
> information about loaded modules, we only print the version for external
> modules.
>
> - Before this patch
>
>   Modules linked in: mlx5_core(O) nvidia(PO) nvme_core
>
> - After this patch
>
>   Modules linked in: mlx5_core-5.8-2.0.3(O) nvidia-535.274.02(PO) nvme_core
>                               ^^^^^^^^^^          ^^^^^^^^^^^
>
>   Note: nvme_core is an in-tree module [1], so its version isn't printed.
>
> Link: https://github.com/NVIDIA/open-gpu-kernel-modules/tags [0]
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/core.c?h=v6.19-rc3#n5448 [1]
> Suggested-by: Petr Pavlu <[email protected]>
> Reviewed-by: Aaron Tomlin <[email protected]>
> Signed-off-by: Yafang Shao <[email protected]>
> ---
>  kernel/module/main.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> ---
> v1->v2:
> - print it for external module only (Petr, Aaron)
> - add comment for it (Aaron)
>
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 710ee30b3bea..16263ce23e92 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -3901,7 +3901,11 @@ void print_modules(void)
>  	list_for_each_entry_rcu(mod, &modules, list) {
>  		if (mod->state == MODULE_STATE_UNFORMED)
>  			continue;
> -		pr_cont(" %s%s", mod->name, module_flags(mod, buf, true));
> +		pr_cont(" %s", mod->name);
> +		/* Only append version for out-of-tree modules */
> +		if (mod->version && test_bit(TAINT_OOT_MODULE, &mod->taints))
> +			pr_cont("-%s", mod->version);
> +		pr_cont("%s", module_flags(mod, buf, true));
>  	}
>
>  	print_unloaded_tainted_modules();
> --
> 2.43.5
>

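For anyone who wants to check the resulting format without building a
kernel, here is a minimal userspace sketch of the per-module formatting
logic in the hunk above. It is not kernel code: struct mod_info and
format_module() are hypothetical names made up for this illustration, and
the out_of_tree and flags fields merely stand in for the TAINT_OOT_MODULE
test and the module_flags() output.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for the struct module fields used here. */
struct mod_info {
	const char *name;
	const char *version;	/* NULL if the module declares no version */
	bool out_of_tree;	/* stands in for the TAINT_OOT_MODULE test */
	const char *flags;	/* stands in for module_flags(), e.g. "(PO)" */
};

/*
 * Format one entry as " name[-version]flags". The version is appended
 * only when it exists and the module is out-of-tree, mirroring the
 * condition in the patch. Returns what snprintf() returns.
 */
static int format_module(char *buf, size_t len, const struct mod_info *m)
{
	if (m->version && m->out_of_tree)
		return snprintf(buf, len, " %s-%s%s",
				m->name, m->version, m->flags);
	return snprintf(buf, len, " %s%s", m->name, m->flags);
}
```

Under these assumptions, an entry { "nvidia", "535.274.02", true, "(PO)" }
is formatted as " nvidia-535.274.02(PO)", while an entry with out_of_tree
set to false keeps only its name and flags regardless of its version.
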
Just checking in on this patch. It looks like it hasn't been merged yet.
Is it good to go, or does it need something else?

-- 
Regards
Yafang
