Thanks again, Patryk, for your insights.
We have implemented many of the same things, but the socket
errors are still occurring regularly.
If we find a solution that works I will be sure to add it to this thread.
Many thanks
Jason
Jason Ellul
Head - Research Computing Facility
Office of Cance
Slurm User Group (SLUG) 2024 is set for September 12-13 at the
University of Oslo in Oslo, Norway.
Registration information, abstracts, and travel recommendations can be
found here: https://slug24.splashthat.com/
The last day to register with standard pricing ($900) is this Friday,
August 2nd. Aft
Can I ask if this replaces the work on "SUNK" that was previously announced?
(but never released as open-source on GitHub as was planned; looks like it is
only available on CoreWeave Cloud?)
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users
It sounds to me perhaps as though your systemd units are starting in the wrong
order, or don’t have appropriate dependencies set in them?
Tim
--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca
Find out more about R&D IT Data, Analytics & AI and how we can support you by
visiting our S
After adding the nvidia-persistenced service, Slurm no longer fails.
Thanks for your help.
On 7/29/24 13:00, Sarlo, Jeffrey S wrote:
nvidia-persistenced is something that gets installed by the nvidia
driver. Setting it to start at boot time helps with slurmd being able
to find the GPUs when it tries
nvidia-persistenced is something that gets installed by the nvidia driver.
Setting it to start at boot time helps with slurmd being able to find the GPUs
when it tries to start. This is just one web page that has some information
about it.
https://download.nvidia.com/XFree86/Linux-x86_64/396.
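For anyone finding this thread later, the two suggestions (start nvidia-persistenced at boot, and give slurmd an explicit ordering dependency on it) might be combined roughly as below. This is only a sketch under assumptions: the unit names slurmd.service and nvidia-persistenced.service, the drop-in file name nvidia.conf, and the helper function write_slurmd_dropin are illustrative and not taken from the thread, so check the names on your system with `systemctl list-unit-files` first.

```shell
#!/bin/sh
# Sketch: order slurmd after nvidia-persistenced using a systemd drop-in.
# Assumption: the units are named slurmd.service and
# nvidia-persistenced.service on this machine.

write_slurmd_dropin() {
    # $1: drop-in directory; defaults to the usual override location.
    dropin_dir="${1:-/etc/systemd/system/slurmd.service.d}"
    mkdir -p "$dropin_dir" || return 1
    # Wants= pulls the GPU persistence daemon in when slurmd starts;
    # After= makes slurmd wait until that unit has been started.
    cat > "$dropin_dir/nvidia.conf" <<'EOF'
[Unit]
Wants=nvidia-persistenced.service
After=nvidia-persistenced.service
EOF
}

# Typical use, as root:
#   write_slurmd_dropin
#   systemctl daemon-reload
#   systemctl enable --now nvidia-persistenced
#   systemctl restart slurmd
```

A drop-in is used rather than editing the packaged slurmd unit so that the change survives package upgrades. Whether ordering alone is enough depends on how quickly the GPU device nodes appear after the daemon starts, so treat this as a starting point.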
On Mon, 2024-07-29 at 11:23:12 +0300, Slurm users wrote:
> Hi there all,
>
> We have Dell server with 2 x Nvidia H100 and running slurm on it. After
> restart server if we do not write nvidia-smi command slurm fails. When we
> run nvidia-smi && systemctl restart slurmd && systemctl restart slurmct
Hi there all,
We have a Dell server with 2 x Nvidia H100 GPUs running Slurm. After the
server restarts, Slurm fails unless we first run the nvidia-smi command.
When we run nvidia-smi && systemctl restart slurmd && systemctl restart
slurmctld, the Slurm queue starts working. Do you have any idea about this error
Perhaps PlannedCPURAW?
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo