Hi,
I am using an old Slurm version (20.11.8), and we had to reboot our cluster
today for maintenance. I suspended all the jobs on it with the command
scontrol suspend list_job_ids, and all the jobs paused and were suspended.
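For reference, the suspend/resume was done roughly along these lines (a sketch, not the exact command history; the state filters may need adjusting):

# collect the running job IDs and suspend them before the reboot
squeue -h -t RUNNING -o "%i" | xargs -r -n1 scontrol suspend
# after the reboot, resume whatever is still suspended
squeue -h -t SUSPENDED -o "%i" | xargs -r -n1 scontrol resume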
However, when I tried to resume them after the reboot, scontrol resume did
*Fritz Ratnasamy*
Data Scientist
Information Technology
On Thu, Jun 6, 2024 at 2:11 PM Ratnasamy, Fritz via slurm-users <
slurm-users@lists.schedmd.com> wrote:
As admins on the cluster, we do not observe any issue on our newly added GPU
nodes. However, regular users are not seeing their jobs running on these GPU
nodes when running squeue -u (the jobs do, however, show with a RUNNING state
in sacct), and they are not able to ssh to these newly added nodes.
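For reference, the mismatch looks roughly like this (the user name below is a placeholder):

# what the user runs; the jobs on the new GPU nodes are missing from the output
squeue -u someuser
# accounting for the same user shows those jobs as RUNNING
sacct -u someuser -X -s running -o JobID,JobName,State,NodeList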
Hi,
What is the "official" process to remove nodes safely? I have drained the
nodes so jobs are completed and put them in down state after they are
completely drained.
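Concretely, the drain/down steps were roughly the following (the node names are placeholders):

# stop new jobs from being scheduled on the nodes
scontrol update NodeName=node[001-004] State=DRAIN Reason="decommission"
# once all running jobs have finished, take the nodes down
scontrol update NodeName=node[001-004] State=DOWN Reason="decommission"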
I edited the slurm.conf file to remove the nodes. After some time, I can
see that the nodes were removed from the partition with