[slurm-users] Re: Slurm fails before nvidia-smi command

2024-07-29 Thread Cutts, Tim via slurm-users
It sounds to me perhaps as though your systemd units are starting in the wrong order, or don’t have appropriate dependencies set in them? Tim -- Tim Cutts Scientific Computing Platform Lead AstraZeneca
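One way to act on the ordering suggestion above is a systemd drop-in that makes slurmd start after the GPU service. The sketch below only renders the drop-in text and the conventional path for it; it is not taken from the thread. The unit names `slurmd.service` and `nvidia-persistenced.service` and the file name `10-gpu-order.conf` are assumptions, so check them against `systemctl list-unit-files` on the node before using anything like this.

```python
from pathlib import Path

# Assumed unit names; verify them on the actual node.
SLURMD_UNIT = "slurmd.service"
GPU_UNIT = "nvidia-persistenced.service"


def render_dropin(after_unit: str = GPU_UNIT) -> str:
    """Return drop-in text that orders a unit after `after_unit`.

    After= controls start order only; Wants= additionally pulls the
    other unit in when this one is started.
    """
    return (
        "[Unit]\n"
        f"After={after_unit}\n"
        f"Wants={after_unit}\n"
    )


def dropin_path(unit: str = SLURMD_UNIT,
                root: Path = Path("/etc/systemd/system")) -> Path:
    """Conventional drop-in location; the file name is hypothetical."""
    return root / f"{unit}.d" / "10-gpu-order.conf"


if __name__ == "__main__":
    # Print the suggested file location and contents for review.
    print(f"# {dropin_path().as_posix()}")
    print(render_dropin(), end="")
```

If a file like this were installed, `systemctl daemon-reload` would be needed before the new ordering takes effect.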

[slurm-users] Re: Slurm fails before nvidia-smi command

2024-07-29 Thread Aziz Ogutlu via slurm-users
After adding the nvidia-persistenced service, Slurm no longer fails. Thanks for your help.
On 7/29/24 13:00, Sarlo, Jeffrey S wrote:
> nvidia-persistenced is something that gets installed by the nvidia driver.  Setting it to start at boot time helps with slurmd being able to find the GPUs when it tries

[slurm-users] Re: Slurm fails before nvidia-smi command

2024-07-29 Thread Sarlo, Jeffrey S via slurm-users
nvidia-persistenced is something that gets installed by the nvidia driver. Setting it to start at boot time helps with slurmd being able to find the GPUs when it tries to start. This is just one web page that has some information about it. https://download.nvidia.com/XFree86/Linux-x86_64/396.
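A possible explanation for the behaviour described here, not confirmed in the thread, is that the `/dev/nvidiaN` device nodes do not exist until something initialises the driver, so slurmd cannot find the GPUs it was configured with. The following is a minimal sketch for checking that before starting slurmd. The function name `missing_gpu_devices` is hypothetical, and the `/dev/nvidia0`, `/dev/nvidia1` naming is an assumption about how the devices appear on the node.

```python
from pathlib import Path


def missing_gpu_devices(expected_gpus: int, dev_dir: str = "/dev") -> list[str]:
    """Return the expected nvidiaN device paths that do not exist yet."""
    base = Path(dev_dir)
    return [
        str(base / f"nvidia{i}")
        for i in range(expected_gpus)
        if not (base / f"nvidia{i}").exists()
    ]


if __name__ == "__main__":
    # The original poster's server has 2 GPUs.
    missing = missing_gpu_devices(2)
    if missing:
        print("GPU device nodes not present yet:", ", ".join(missing))
    else:
        print("All expected GPU device nodes are present.")
```

Running this right after a reboot, before and after `nvidia-smi`, would show whether the device nodes are what changes.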

[slurm-users] Re: Slurm fails before nvidia-smi command

2024-07-29 Thread Steffen Grunewald via slurm-users
On Mon, 2024-07-29 at 11:23:12 +0300, Slurm users wrote:
> Hi there all,
>
> We have Dell server with 2 x Nvidia H100 and running slurm on it. After
> restart server if we do not write nvidia-smi command slurm fails. When we
> run nvidia-smi && systemctl restart slurmd && systemctl restart slurmct