Thank you for your help, Sam! The rest of the slurm.conf, excluding the
node and partition configuration from the earlier email, is below. I've also
included scontrol output for a 1-GPU job that runs successfully on node01.
Best,
Andrey
*Slurm.conf*
#
# See the slurm.conf man page for more information
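(The pasted file is trimmed in this excerpt. Purely as an illustration of the GPU-related lines that usually matter when GPU TRES are not showing up, and with node names, counts, and values as placeholders rather than Andrey's real settings, a minimal sketch might look like:)

GresTypes=gpu
AccountingStorageTRES=gres/gpu
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
# Node/partition lines were in the earlier email; placeholder shown here:
NodeName=node01 Gres=gpu:4 CPUs=48 RealMemory=192000 State=UNKNOWN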
...and I'm not sure what "AutoDetect=NVML" is supposed to do in the
gres.conf file. We've always used "nvidia-smi topo -m" to confirm that
we've got a single-root or dual-root node and have entered the correct info
in gres.conf to map connections to the CPU sockets, e.g.:
# 8-gpu A6000 nodes -
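(For what it's worth, AutoDetect=NVML just tells slurmd to query the NVIDIA driver via NVML and fill in the device files, core affinity, and links itself, so the explicit File=/Cores= lines become unnecessary. The snippet above is cut off; a hand-written mapping of the kind described, with device paths and core ranges that are purely illustrative for a hypothetical dual-socket, 8-GPU node, would look something like:)

# 8-GPU A6000 nodes, manual socket mapping (illustrative values)
NodeName=gpu[01-02] Name=gpu Type=a6000 File=/dev/nvidia[0-3] Cores=0-31
NodeName=gpu[01-02] Name=gpu Type=a6000 File=/dev/nvidia[4-7] Cores=32-63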
Well... you've got lots of weirdness, as the scontrol show job command
isn't listing any GPU TRES requests, and the scontrol show node command
isn't listing any configured GPU TRES resources.
If you send me your entire slurm.conf I'll have a quick look-over.
You also should be using cgroup.conf t
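(A minimal cgroup.conf of the kind usually paired with ProctrackType=proctrack/cgroup and TaskPlugin=task/cgroup; whether these exact settings suit this cluster is an assumption:)

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes    # ties GPU device access to the job's allocated GRES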
Thank you, Samuel,
The Slurm version is 20.02.6. I'm not entirely sure about the platform; the
RTX6000 nodes are about two years old, and the 3090 node is very recent.
Technically we have 4 nodes (hence the references to node04 in the info
below), but one of the nodes is down and out of the system at the moment. As you
Hello,
Are there any commands to display current cluster utilization in terms of
CPU cores and GPU/GRES?
I found the command below for CPU cores but could not find anything for
GPU/GRES:
$ sinfo -O cpusstate,partition --partition=testpart
CPUS(A/I/O/T) PARTITION
415/465/0/880 testpart
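(One option, assuming the Slurm version in use is recent enough to have the GresUsed output field for sinfo; scontrol's CfgTRES/AllocTRES lines give the same information per node. The node name below is a placeholder:)

$ sinfo -O NodeHost,Gres:25,GresUsed:25 --partition=testpart
$ scontrol show node <nodename> | grep -iE 'cfgtres|alloctres'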
Hi,
scancel the job, then set the nodes to a "down" state, like so: "scontrol update
nodename=<nodename> state=down reason=cg", and resume them afterwards.
However, if there are tasks stuck, then in most cases a reboot is needed to
bring the node back in a clean state.
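(Spelled out as a rough sketch, with the job ID and node name as placeholders:)

$ scancel 123456                                                  # the job stuck in CG
$ scontrol update nodename=node01 state=down reason="stuck CG job"
$ scontrol update nodename=node01 state=resume                    # bring the node back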
Best,
Florian
I could have sworn I had tested this before implementing it and it worked
as expected.
If I am dreaming that testing, is there a way of allowing preemption
across partitions?
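(Cross-partition preemption is normally driven by PreemptType=preempt/partition_prio plus per-partition PriorityTier: jobs in a higher-PriorityTier partition can preempt jobs from a lower-PriorityTier partition on shared nodes. A minimal sketch, borrowing the 'day'/'night' names from this thread and using otherwise placeholder values:)

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=day   Nodes=node[01-03] PriorityTier=100 Default=YES
PartitionName=night Nodes=node[01-03] PriorityTier=10  PreemptMode=REQUEUE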
On Fri, Aug 20, 2021 at 8:40 AM Brian Andrus wrote:
> IIRC, Preemption is determined by partition first, not node.
> >
> > ... of a cluster with 3 nodes, 4 GPUs each. One node has RTX3090s and the
> > other 2 have RTX6000s. Any job asking for 1 GPU in the submission script
> > will wait to run on the 3090 node, regardless of resource availability.
> > The same job requesting 2 or more GPUs will run on any node. I don't even
> > know where to begin troubleshooting this issue; entries for the 3 nodes
> > are effectively identical in slurm.conf. Any help would be appreciated.
> > (If helpful: this cluster is used for structural biology, with the
> > cryosparc and relion packages.)
> >
> > Thank you,
> > Andrey
> >
> Message: 5
> Date: Fri, 20 Aug 2021 10:31:40 +0200
> From: Durai Arasan
> To: Slurm User Community List
> Subject: [slurm-users] jobs stuck in "CG" state
>
> Hello!
>
> We have a huge number of jobs stuck in the CG state from a user who probably
> wrote code with bad I/O. "scancel" does not make them go away. Is there a
> way for admins to get rid of these jobs without draining and rebooting the
> nodes? I read somewhere that killing the respective slurmstepd process will
> do the job. Is this possible? Any other solutions? Also, are there any
> parameters in slurm.conf one can set to manage such situations better?
>
> Best,
> Durai
> MPI Tübingen
IIRC, Preemption is determined by partition first, not node.
Since your pending job is in the 'day' partition, it will not preempt
something in the 'night' partition (even if the node is in both).
Brian Andrus
On 8/19/2021 2:49 PM, Russell Jones wrote:
Hi all,
I could use some help to understand
Hello!
We have a huge number of jobs stuck in the CG state from a user who probably
wrote code with bad I/O. "scancel" does not make them go away. Is there a
way for admins to get rid of these jobs without draining and rebooting the
nodes? I read somewhere that killing the respective slurmstepd process
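(On the slurm.conf-parameters part of that question: the knobs usually pointed at for unkillable/CG-stuck steps are UnkillableStepTimeout and UnkillableStepProgram. Whether they help with a bad-I/O hang is an assumption, and the script path below is purely hypothetical:)

UnkillableStepTimeout=180                                 # seconds to wait after SIGKILL before declaring the step unkillable
UnkillableStepProgram=/usr/local/sbin/unkillable_alert.sh # hypothetical notification script run when that happens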