On Wed, Dec 31, 2025 at 5:40 PM Yafang Shao <[email protected]> wrote:
>
> We maintain a vmcore analysis script on each server that automatically
> parses /var/crash/XXXX/vmcore-dmesg.txt to categorize vmcores. This helps
> us save considerable effort by avoiding analysis of known bugs.
>
> For vmcores triggered by a driver bug, the system calls print_modules() to
> list the loaded modules. However, print_modules() does not output module
> version information. Across a large fleet of servers, there are often many
> different module versions running simultaneously, and we need to know which
> driver version caused a given vmcore.
>
> Currently, the only reliable way to obtain the module version associated
> with a vmcore is to analyze the /var/crash/XXXX/vmcore file itself—an
> operation that is resource-intensive. Therefore, we propose printing the
> driver version directly in the log, which is far more efficient.
>
> The motivation behind this change is that the external NVIDIA driver
> [0] frequently causes kernel panics across our server fleet.
> While we continuously upgrade to newer NVIDIA driver versions,
> upgrading the entire fleet is time-consuming. Therefore, we need to
> identify which driver version is responsible for each panic.
>
> In-tree modules are tied to the specific kernel version already, so
> printing their versions is redundant. However, for external drivers (like
> proprietary networking or GPU stacks), the version is the single most
> critical piece of metadata for triage. Therefore, to avoid bloating the
> information about loaded modules, we only print the version for external
> modules.
>
> - Before this patch
>
>   Modules linked in: mlx5_core(O) nvidia(PO) nvme_core
>
> - After this patch
>
>   Modules linked in: mlx5_core-5.8-2.0.3(O) nvidia-535.274.02(PO) nvme_core
>                               ^^^^^^^^^^          ^^^^^^^^^^^
>
>   Note: nvme_core is an in-tree module[1], so its version isn't printed.
>
> Link: https://github.com/NVIDIA/open-gpu-kernel-modules/tags [0]
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/nvme/host/core.c?h=v6.19-rc3#n5448 [1]
> Suggested-by: Petr Pavlu <[email protected]>
> Reviewed-by: Aaron Tomlin <[email protected]>
> Signed-off-by: Yafang Shao <[email protected]>
> ---
>  kernel/module/main.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> ---
> v1->v2:
> - print it for external module only (Petr, Aaron)
> - add comment for it (Aaron)
>
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index 710ee30b3bea..16263ce23e92 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -3901,7 +3901,11 @@ void print_modules(void)
>         list_for_each_entry_rcu(mod, &modules, list) {
>                 if (mod->state == MODULE_STATE_UNFORMED)
>                         continue;
> -               pr_cont(" %s%s", mod->name, module_flags(mod, buf, true));
> +               pr_cont(" %s", mod->name);
> +               /* Only append version for out-of-tree modules */
> +               if (mod->version && test_bit(TAINT_OOT_MODULE, &mod->taints))
> +                       pr_cont("-%s", mod->version);
> +               pr_cont("%s", module_flags(mod, buf, true));
>         }
>
>         print_unloaded_tainted_modules();
> --
> 2.43.5
>

Just checking in on this patch. It looks like it hasn't been merged
yet. Is it good to go, or does it need something else?
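
In case it helps review, below is a rough userspace sketch of how a
triage tool could split entries in the new format. It is not part of the
patch. struct mod_entry and parse_module_token() are hypothetical names,
and the sketch assumes that module names contain no '-' (so the first '-'
before any '(' starts the version) and that version strings contain no
'(' or whitespace.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one entry of the "Modules linked in:" line. */
struct mod_entry {
	char name[64];
	char version[64];	/* empty when no version was printed */
	char flags[16];		/* flags without parentheses, e.g. "PO" */
};

/*
 * Split one token such as "nvidia-535.274.02(PO)" into name, version and
 * flags. Assumption: the module name itself contains no '-', so the first
 * '-' found before the '(' separates name and version. The search stops
 * at '(' so that a '-' inside the flags field is not mistaken for the
 * separator.
 *
 * Returns 0 on success, -1 on a malformed or oversized token.
 */
int parse_module_token(const char *tok, struct mod_entry *out)
{
	const char *paren = strchr(tok, '(');
	const char *end = paren ? paren : tok + strlen(tok);
	const char *dash = memchr(tok, '-', (size_t)(end - tok));
	const char *name_end = dash ? dash : end;
	size_t n;

	/* Zero everything so each copied field stays NUL-terminated. */
	memset(out, 0, sizeof(*out));

	n = (size_t)(name_end - tok);
	if (n == 0 || n >= sizeof(out->name))
		return -1;
	memcpy(out->name, tok, n);

	if (dash) {
		n = (size_t)(end - (dash + 1));
		if (n >= sizeof(out->version))
			return -1;
		memcpy(out->version, dash + 1, n);
	}

	if (paren) {
		const char *close = strchr(paren, ')');

		if (!close)
			return -1;
		n = (size_t)(close - (paren + 1));
		if (n >= sizeof(out->flags))
			return -1;
		memcpy(out->flags, paren + 1, n);
	}
	return 0;
}
```

Entries without a version, such as "nvme_core", come back with an empty
version field, so the same parser handles logs from kernels with and
without this patch.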

-- 
Regards
Yafang
