On Mon, 2024-07-29 at 11:23:12 +0300, Slurm users wrote:
> Hi all,
> We have a Dell server with 2 x Nvidia H100 GPUs running Slurm. After a
> server restart, Slurm fails unless we first run the nvidia-smi command.
> Once we run
> nvidia-smi && systemctl restart slurmd && systemctl restart slurmctld
> the Slurm queue starts working. Do you have any idea what causes this
> error and what we can do about it?

Apparently the nvidia driver doesn't get loaded on reboot?
There are multiple ways to fix that: add the module to /etc/modules, or
run modprobe nvidia via a @reboot crontab entry (or even run nvidia-smi
that way)...
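
A rough sketch of what those options could look like, in case it helps.
I'm assuming a Debian-style system where /etc/modules is read at boot, a
cron that supports @reboot (Vixie cron, cronie), and that modprobe and
nvidia-smi live in /sbin and /usr/bin; check the paths on your machine.
Only one of the variants should be needed.

```text
# Variant 1: /etc/modules (one module name per line)
nvidia

# Variant 2: root's crontab (edit with "crontab -e" as root)
@reboot /sbin/modprobe nvidia

# Variant 3: root's crontab, running nvidia-smi instead
# (assumed path; nvidia-smi loads the driver as a side effect)
@reboot /usr/bin/nvidia-smi
```

Note that a @reboot job is not ordered against slurmd, so slurmd may
still come up before the driver is ready; in that case a
systemctl restart slurmd afterwards would still be necessary.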


Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
