We maintain a vmcore analysis script on each server that automatically
parses /var/crash/XXXX/vmcore-dmesg.txt to categorize vmcores. This helps
us save considerable effort by avoiding analysis of known bugs.

For vmcores triggered by a driver bug, the system calls print_modules() to
list the loaded modules. However, print_modules() does not output module
version information. Across a large fleet of servers, there are often many
different module versions running simultaneously, and we need to know which
driver version caused a given vmcore.

Currently, the only reliable way to obtain the module version associated
with a vmcore is to analyze the /var/crash/XXXX/vmcore file itself—an
operation that is resource-intensive. Therefore, we propose printing the
driver version directly in the log, which is far more efficient.

The motivation behind this change is that the external NVIDIA driver
[0] frequently causes kernel panics across our server fleet.
While we continuously upgrade to newer NVIDIA driver versions,
upgrading the entire fleet is time-consuming. Therefore, we need to
identify which driver version is responsible for each panic.

In-tree modules are tied to the specific kernel version already, so
printing their versions is redundant. However, for external drivers (like
proprietary networking or GPU stacks), the version is the single most
critical piece of metadata for triage. Therefore, to avoid bloating the
information about loaded modules, we only print the version for external
modules.

- Before this patch

  Modules linked in: mlx5_core(O) nvidia(PO) nvme_core

- After this patch

  Modules linked in: mlx5_core-5.8-2.0.3(O) nvidia-535.274.02(PO) nvme_core
                              ^^^^^^^^^^          ^^^^^^^^^^^

  Note: nvme_core is a in-tree module[1], so its version isn't printed.

Link: https://github.com/NVIDIA/open-gpu-kernel-modules/tags [0]
Link: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/core.c?h=v6.19-rc3#n5448
 [1]
Suggested-by: Petr Pavlu <[email protected]>
Reviewed-by: Aaron Tomlin <[email protected]>
Signed-off-by: Yafang Shao <[email protected]>
---
 kernel/module/main.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

---
v1->v2: 
- print it for external module only (Petr, Aaron)
- add comment for it (Aaron)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 710ee30b3bea..16263ce23e92 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -3901,7 +3901,11 @@ void print_modules(void)
        list_for_each_entry_rcu(mod, &modules, list) {
                if (mod->state == MODULE_STATE_UNFORMED)
                        continue;
-               pr_cont(" %s%s", mod->name, module_flags(mod, buf, true));
+               pr_cont(" %s", mod->name);
+               /* Only append version for out-of-tree modules */
+               if (mod->version && test_bit(TAINT_OOT_MODULE, &mod->taints))
+                       pr_cont("-%s", mod->version);
+               pr_cont("%s", module_flags(mod, buf, true));
        }
 
        print_unloaded_tainted_modules();
-- 
2.43.5


Reply via email to