On Mon, 2024-07-29 at 11:23:12 +0300, Slurm users wrote:
> Hi there all,
>
> We have a Dell server with 2 x Nvidia H100, running Slurm. After a server
> restart, Slurm fails unless we first run the nvidia-smi command. When we
> run nvidia-smi && systemctl restart slurmd && systemctl restart slurmctld,
> the Slurm queue starts. Do you have any idea what causes this error and
> what we can do about it?

Apparently the nvidia driver doesn't get loaded on reboot?
There are multiple ways to fix that: add the module to /etc/modules, or run
modprobe nvidia via a @reboot crontab entry (or even run nvidia-smi that way)...

Best,
Steffen

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
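
For illustration, the "run nvidia-smi at reboot" variant could look roughly like the script below. This is only a sketch, not something tested on the poster's machine: the device node /dev/nvidia0, the 60 second timeout, the helper name wait_for_path and the --run switch are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch: initialise the GPUs at boot, then restart slurmd once they are visible.
# Assumptions (not from the thread): the device node /dev/nvidia0, the 60 s
# timeout, and the --run switch are illustrative choices only.

# Poll once per second until the given path exists.
# Returns 0 as soon as it exists, 1 once the timeout (in seconds) is reached.
wait_for_path() {
    path=$1
    timeout=${2:-60}
    elapsed=0
    while [ ! -e "$path" ]; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Only act when called with --run, so the file can also be sourced safely.
if [ "${1:-}" = "--run" ]; then
    # Same manual step the original poster used after each reboot.
    nvidia-smi > /dev/null 2>&1
    # Restart slurmd only if the (assumed) device node has appeared.
    wait_for_path /dev/nvidia0 60 && systemctl restart slurmd
fi
```

Saved under a path of your choosing (say /usr/local/sbin/gpu-then-slurmd.sh, a hypothetical name), it could be started from root's crontab with a line such as "@reboot /usr/local/sbin/gpu-then-slurmd.sh --run". If the same host also runs the controller, a restart of slurmctld could be appended in the same way as in the original command line.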