Re: [slurm-users] GPU jobs not running correctly

2021-08-20 Thread Andrey Malyutin
Thank you for your help, Sam! The rest of the slurm.conf, excluding the node and partition configuration from the earlier email, is below. I've also included scontrol output for a 1 GPU job that runs successfully on node01. Best, Andrey *Slurm.conf* # # See the slurm.conf man page for more infor
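For comparison, the GPU-relevant parts of a slurm.conf for a cluster like this usually come down to a handful of lines. This is a generic sketch with illustrative CPU/memory values and node names patterned on the thread, not Andrey's actual configuration:

    GresTypes=gpu
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory
    # 3 active nodes, 4 GPUs each (CPU and memory figures are placeholders)
    NodeName=node[01-03] CPUs=48 RealMemory=512000 Gres=gpu:4 State=UNKNOWN
    PartitionName=gpu Nodes=node[01-03] Default=YES MaxTime=INFINITE State=UP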

Re: [slurm-users] GPU jobs not running correctly

2021-08-20 Thread Fulcomer, Samuel
...and I'm not sure what "AutoDetect=NVML" is supposed to do in the gres.conf file. We've always used "nvidia-smi topo -m" to confirm that we've got a single-root or dual-root node and have entered the correct info in gres.conf to map connections to the CPU sockets, e.g.: # 8-gpu A6000 nodes -
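For reference, the two gres.conf styles being contrasted here look roughly like this; the node name, device files, and core ranges are illustrative, not taken from this cluster:

    # autodetection: slurmd queries NVML for the GPUs and their CPU affinity
    AutoDetect=nvml

    # explicit mapping for a dual-root node, cross-checked against "nvidia-smi topo -m"
    NodeName=gpu01 Name=gpu File=/dev/nvidia[0-3] Cores=0-23
    NodeName=gpu01 Name=gpu File=/dev/nvidia[4-7] Cores=24-47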

Re: [slurm-users] GPU jobs not running correctly

2021-08-20 Thread Fulcomer, Samuel
Well... you've got lots of weirdness, as the scontrol show job command isn't listing any GPU TRES requests, and the scontrol show node command isn't listing any configured GPU TRES resources. If you send me your entire slurm.conf I'll have a quick look-over. You also should be using cgroup.conf t
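As a rough sketch of the checks and the cgroup side being suggested here (node01 is a node from the thread, the job id is a placeholder, and the cgroup.conf lines are a generic example rather than Samuel's exact settings):

    # Does the node actually carry a GPU TRES? Look for gres/gpu in CfgTRES.
    scontrol show node node01 | grep -i tres
    # Does the job actually request a GPU? Look for a gpu entry in the TRES/gres fields.
    scontrol show job <jobid> | grep -iE 'tres|gres'

    # cgroup.conf (with ProctrackType=proctrack/cgroup and TaskPlugin=task/cgroup in slurm.conf),
    # so that jobs only see the GPUs they were allocated
    CgroupAutomount=yes
    ConstrainCores=yes
    ConstrainDevices=yes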

Re: [slurm-users] GPU jobs not running correctly

2021-08-20 Thread Andrey Malyutin
Thank you, Samuel. The Slurm version is 20.02.6. I'm not entirely sure about the platform; the RTX6000 nodes are about 2 years old, and the 3090 node is very recent. Technically we have 4 nodes (hence the references to node04 in the info below), but one of the nodes is down and out of the system at the moment. As you

[slurm-users] Instantaneous CPU core, GPU utilization report

2021-08-20 Thread Hemanta Sahu
Hello, Are there any commands to display instantaneous cluster utilization in terms of CPU cores and GPU/GRES? I found the following for CPU cores but could not find anything for GPU/GRES: $ sinfo -O cpusstate,partition --partition=testpart CPUS(A/I/O/T) PARTITION 415/465/0/880 testp
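For GRES, a commonly used equivalent is sinfo's gres/gresused output fields (a sketch; column widths are arbitrary, the partition name is the one from the question, and the gresused field requires a reasonably recent Slurm):

    sinfo -O "partition:12,nodelist:18,gres:25,gresused:25" --partition=testpart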

Re: [slurm-users] [External] jobs stuck in "CG" state

2021-08-20 Thread Florian Zillner
Hi, scancel the job, then set the nodes to a "down" state like so: "scontrol update nodename= state=down reason=cg", and resume them afterwards. However, if there are tasks stuck, then in most cases a reboot is needed to bring the node back in a clean state. Best, Florian
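Spelled out with placeholder job and node names, the sequence is roughly:

    scancel <jobid>
    scontrol update nodename=<node> state=down reason=cg
    scontrol update nodename=<node> state=resume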

Re: [slurm-users] Preemption not working for jobs in higher priority partition

2021-08-20 Thread Russell Jones
I could have sworn I had tested this before implementing it and it worked as expected. If I am dreaming that testing - is there a way of allowing preemption across partitions? On Fri, Aug 20, 2021 at 8:40 AM Brian Andrus wrote: > IIRC, Preemption is determined by partition first, not node.

Re: [slurm-users] PrivateData does not filter the billing info "scontrol show assoc_mgr flags=qos"

2021-08-20 Thread Hemanta Sahu
> ... of a cluster with 3 nodes, 4 GPUs each. One node has RTX3090s and the other 2 have RTX6000s. Any job asking for 1 GPU in the submission script will wait to run on the 3090 node, no matter resource availability. Same job requesting 2 or more GPUs will run on any node. I don't even know where to begin troubleshooting this issue; entries for the 3 nodes are effectively identical in slurm.conf. Any help would be appreciated. (If helpful - this cluster is used for structural biology, with cryosparc and relion packages.) Thank you, Andrey
> Message: 5 - Date: Fri, 20 Aug 2021 10:31:40 +0200 - From: Durai Arasan - Subject: [slurm-users] jobs stuck in "CG" state
> Hello! We have a huge number of jobs stuck in CG state from a user who probably wrote code with bad I/O. "scancel" does not make them go away. Is there a way for admins to get rid of these jobs without draining and rebooting the nodes? I read somewhere that killing the respective slurmstepd process will do the job. Is this possible? Any other solutions? Also, are there any parameters in slurm.conf one can set to manage such situations better? Best, Durai, MPI Tübingen
> End of slurm-users Digest, Vol 46, Issue 20

Re: [slurm-users] Preemption not working for jobs in higher priority partition

2021-08-20 Thread Brian Andrus
IIRC, Preemption is determined by partition first, not node. Since your pending job is in the 'day' partition, it will not preempt something in the 'night' partition (even if the node is in both). Brian Andrus On 8/19/2021 2:49 PM, Russell Jones wrote: Hi all, I could use some help to under
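For context, partition-priority preemption is normally wired up along these lines in slurm.conf. This is only a sketch: giving 'day' the higher tier is an assumption, and the node lists are placeholders.

    PreemptType=preempt/partition_prio
    PreemptMode=REQUEUE
    # jobs in the higher-PriorityTier partition may preempt jobs from the lower one
    # on nodes the two partitions share
    PartitionName=day   Nodes=<shared nodes> PriorityTier=10
    PartitionName=night Nodes=<shared nodes> PriorityTier=1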

[slurm-users] jobs stuck in "CG" state

2021-08-20 Thread Durai Arasan
Hello! We have a huge number of jobs stuck in CG state from a user who probably wrote code with bad I/O. "scancel" does not make them go away. Is there a way for admins to get rid of these jobs without draining and rebooting the nodes? I read somewhere that killing the respective slurmstepd proces
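If one does try the slurmstepd route (on the affected node, and at some risk), the step daemons can be identified per job roughly like this; the pid is a placeholder:

    # each step daemon appears as "slurmstepd: [<jobid>.<stepid>]" in the process table
    pgrep -af slurmstepd
    kill -9 <pid_of_stuck_step>   # last resort; the node may still need a reboot afterwards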