Hello,

Does anyone have an idea why "slurmd -C" crashes when it unloads gpu_nrt.so
with the latest NVIDIA drivers (570 and 575)? We checked: there is no crash in
CUDA at the moment, and gpu_nvml.so works fine. All NVML calls finish
successfully, and dlclose on gpu_nvml.so works fine. The crash does not depend
on whether real GPUs are present.

Steps to reproduce:

  1.  Install Ubuntu 24.04
  2.  wget https://download.schedmd.com/slurm/slurm-24.11.4.tar.bz2
  3.  tar fx ./slurm-24.11.4.tar.bz2
  4.  cd slurm-24.11.4
  5.  apt-get install cuda-12-8 hwloc libmunge-dev -y
  6.  ./configure
  7.  make && make install
  8.  Run "slurmd -C", or sometimes "slurmd -vvv -C" to get the crash.

Stack trace:
    #0  0x0000155555544b2a strlen (ld-linux-x86-64.so.2 + 0x28b2a)
    #1  0x000015555551fc08 __GI__dl_exception_create (ld-linux-x86-64.so.2 + 0x3c08)
    #2  0x000015555551d298 __GI__dl_signal_error (ld-linux-x86-64.so.2 + 0x1298)
    #3  0x000015555551e81d _dl_close (ld-linux-x86-64.so.2 + 0x281d)
    #4  0x000015555551d51c __GI__dl_catch_exception (ld-linux-x86-64.so.2 + 0x151c)
    #5  0x000015555551d669 _dl_catch_error (ld-linux-x86-64.so.2 + 0x1669)
    #6  0x0000155554e97c73 _dlerror_run (libc.so.6 + 0x97c73)
    #7  0x0000155554e979a6 __dlclose (libc.so.6 + 0x979a6)
    #8  0x0000155555388a25 gpu_plugin_fini (libslurmfull.so + 0x188a25)
    #9  0x000015555538f2ef gres_get_autodetected_gpus (libslurmfull.so + 0x18f2ef)
    #10 0x0000555555564828 _print_config (slurmd + 0x10828)
    #11 0x0000155554e2a1ca __libc_start_call_main (libc.so.6 + 0x2a1ca)
    #12 0x0000155554e2a28b __libc_start_main_impl (libc.so.6 + 0x2a28b)
    #13 0x000055555555fc75 _start (slurmd + 0xbc75)

I don't really think the problem is in gpu_nrt itself. It looks more like
memory corruption somewhere else, but I am not sure. The issue reproduces
consistently.
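
One way to narrow this down might be to load and unload the plugin outside
slurmd and see whether dlclose still fails. Below is a minimal sketch in C
using only the standard dlopen/dlclose/dlerror calls. The helper name
load_and_unload is hypothetical and not part of Slurm, and the plugin path
depends on the local install prefix. It is also an assumption that the plugin
can be loaded standalone at all: if it needs symbols from libslurmfull.so,
dlopen will simply fail here and the check is inconclusive.

```c
#include <dlfcn.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of Slurm): load a shared object and
 * immediately unload it again.
 * Returns 0 on success, 1 if dlopen fails, 2 if dlclose fails.
 */
int load_and_unload(const char *path)
{
    /* RTLD_NOW resolves all symbols up front, so load errors show early. */
    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen(%s) failed: %s\n", path, dlerror());
        return 1;
    }

    /* This is the call that appears in frame #7 of the stack trace. */
    if (dlclose(handle) != 0) {
        fprintf(stderr, "dlclose(%s) failed: %s\n", path, dlerror());
        return 2;
    }
    return 0;
}
```

Calling this from a small main() with the full path to gpu_nrt.so, and then
with gpu_nvml.so for comparison, would show whether the failure follows the
plugin or only appears inside the slurmd process. On older glibc versions the
program may need to be linked with -ldl.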

Best regards,

Taras
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
