Thank you for your help, Sam! The rest of the slurm.conf (excluding the node and partition configuration from the earlier email) is below. I've also included scontrol output for a 1 GPU job that runs successfully on node01.
Best,
Andrey

*Slurm.conf*

#
# See the slurm.conf man page for more information.
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
SlurmdSpoolDir=/cm/local/apps/slurm/var/spool
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
#ProctrackType=proctrack/pgid
ProctrackType=proctrack/cgroup
#PluginDir=
CacheGroups=0
#FirstJobId=
ReturnToService=2
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/cgroup
#TrackWCKey=no
#TreeWidth=50
#TmpFs=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurmctld
SlurmdDebug=3
SlurmdLogFile=/var/log/slurmd
#JobCompType=jobcomp/filetxt
#JobCompLoc=/cm/local/apps/slurm/var/spool/job_comp.log
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherType=jobacct_gather/cgroup
#JobAcctGatherFrequency=30
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
# AccountingStorageLoc=slurm_acct_db
# AccountingStoragePass=SLURMDBD_USERPASS
# Scheduler
SchedulerType=sched/backfill
# Statesave
StateSaveLocation=/cm/shared/apps/slurm/var/cm/statesave/slurm
# Generic resources types
GresTypes=gpu
# Epilog/Prolog section
PrologSlurmctld=/cm/local/apps/cmd/scripts/prolog-prejob
Prolog=/cm/local/apps/cmd/scripts/prolog
Epilog=/cm/local/apps/cmd/scripts/epilog
# Power saving section (disabled)
# GPU related plugins
#SelectType=select/cons_tres
#SelectTypeParameters=CR_Core
#AccountingStorageTRES=gres/gpu
# END AUTOGENERATED SECTION -- DO NOT REMOVE

*Scontrol for working 1 GPU job on node01*

JobId=285 JobName=cryosparc_P2_J232
   UserId=cryosparc(1003) GroupId=cryosparc(1003) MCS_label=N/A
   Priority=4294901570 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:51 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2021-08-21T00:05:30 EligibleTime=2021-08-21T00:05:30
   AccrueTime=2021-08-21T00:05:30
   StartTime=2021-08-21T00:05:30 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-21T00:05:30
   Partition=CSLive AllocNode:Sid=headnode:108964
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node01 BatchHost=node01
   NumNodes=1 NumCPUs=64 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=64,node=1,billing=64
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=24000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/data/backups/takeda2/data/cryosparc_projects/P8/J232/queue_sub_script.sh
   WorkDir=/ssd/CryoSparc/cryosparc_master
   StdErr=/data/backups/takeda2/data/cryosparc_projects/P8/J232/job.log
   StdIn=/dev/null
   StdOut=/data/backups/takeda2/data/cryosparc_projects/P8/J232/job.log
   Power=
   TresPerNode=gpu:1
   MailUser=cryosparc MailType=NONE

*Cgroup*

# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=no
TaskAffinity=no
ConstrainCores=no
ConstrainRAMSpace=no
ConstrainSwapSpace=no
ConstrainDevices=no
ConstrainKmemSpace=yes
AllowedRamSpace=100.00
AllowedSwapSpace=0.00
MinKmemSpace=30
MaxKmemPercent=100.00
MaxRAMPercent=100.00
MaxSwapPercent=100.00
MinRAMSpace=30

On Fri, Aug 20, 2021 at 3:12 PM Fulcomer, Samuel <samuel_fulco...@brown.edu> wrote:

> ...and I'm not sure what "AutoDetect=NVML" is supposed to do in the
> gres.conf file. We've always used "nvidia-smi topo -m" to confirm that
> we've got a single-root or dual-root node and have entered the correct
> info in gres.conf to map connections to the CPU sockets, e.g.:
>
> # 8-gpu A6000 nodes - dual-root
> NodeName=gpu[1504-1506] Name=gpu Type=a6000 File=/dev/nvidia[0-3] CPUs=0-23
> NodeName=gpu[1504-1506] Name=gpu Type=a6000 File=/dev/nvidia[4-7] CPUs=24-47
>
> On Fri, Aug 20, 2021 at 6:01 PM Fulcomer, Samuel <samuel_fulco...@brown.edu> wrote:
>
>> Well... you've got lots of weirdness, as the scontrol show job command
>> isn't listing any GPU TRES requests, and the scontrol show node command
>> isn't listing any configured GPU TRES resources.
>>
>> If you send me your entire slurm.conf I'll have a quick look-over.
>>
>> You also should be using cgroup.conf to fence off the GPU devices so that
>> a job only sees the GPUs that it's been allocated. The lines in the batch
>> file to figure it out aren't necessary. I forgot to ask you about
>> cgroup.conf.
>>
>> regards,
>> Sam
>>
>> On Fri, Aug 20, 2021 at 5:46 PM Andrey Malyutin <malyuti...@gmail.com> wrote:
>>
>>> Thank you Samuel,
>>>
>>> Slurm version is 20.02.6. I'm not entirely sure about the platform;
>>> the RTX6000 nodes are about 2 years old, and the 3090 node is very recent.
>>> Technically we have 4 nodes (hence the references to node04 in the info
>>> below), but one of the nodes is down and out of the system at the moment.
>>> As you see, the job really wants to run on the downed node instead of
>>> going to node02 or node03.
>>>
>>> Thank you again,
>>> Andrey
>>>
>>> *scontrol info:*
>>>
>>> JobId=283 JobName=cryosparc_P2_J214
>>>    UserId=cryosparc(1003) GroupId=cryosparc(1003) MCS_label=N/A
>>>    Priority=4294901572 Nice=0 Account=(null) QOS=normal
>>>    JobState=PENDING Reason=ReqNodeNotAvail,_UnavailableNodes:node04 Dependency=(null)
>>>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>>>    RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
>>>    SubmitTime=2021-08-20T20:55:00 EligibleTime=2021-08-20T20:55:00
>>>    AccrueTime=2021-08-20T20:55:00
>>>    StartTime=Unknown EndTime=Unknown Deadline=N/A
>>>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-20T23:36:14
>>>    Partition=CSCluster AllocNode:Sid=headnode:108964
>>>    ReqNodeList=(null) ExcNodeList=(null)
>>>    NodeList=(null)
>>>    NumNodes=1 NumCPUs=4 NumTasks=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>>    TRES=cpu=4,mem=24000M,node=1,billing=4
>>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>>    MinCPUsNode=1 MinMemoryNode=24000M MinTmpDiskNode=0
>>>    Features=(null) DelayBoot=00:00:00
>>>    OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
>>>    Command=/data/backups/takeda2/data/cryosparc_projects/P8/J214/queue_sub_script.sh
>>>    WorkDir=/ssd/CryoSparc/cryosparc_master
>>>    StdErr=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>>    StdIn=/dev/null
>>>    StdOut=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>>    Power=
>>>    TresPerNode=gpu:1
>>>    MailUser=cryosparc MailType=NONE
>>>
>>> *Script:*
>>>
>>> #SBATCH --job-name cryosparc_P2_J214
>>> #SBATCH -n 4
>>> #SBATCH --gres=gpu:1
>>> #SBATCH -p CSCluster
>>> #SBATCH --mem=24000MB
>>> #SBATCH --output=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>> #SBATCH --error=/data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log
>>>
>>> available_devs=""
>>> for devidx in $(seq 0 15);
>>> do
>>>     if [[ -z $(nvidia-smi -i $devidx --query-compute-apps=pid --format=csv,noheader) ]] ; then
>>>         if [[ -z "$available_devs" ]] ; then
>>>             available_devs=$devidx
>>>         else
>>>             available_devs=$available_devs,$devidx
>>>         fi
>>>     fi
>>> done
>>> export CUDA_VISIBLE_DEVICES=$available_devs
>>>
>>> /ssd/CryoSparc/cryosparc_worker/bin/cryosparcw run --project P2 --job J214 --master_hostname headnode.cm.cluster --master_command_core_port 39002 > /data/backups/takeda2/data/cryosparc_projects/P8/J214/job.log 2>&1
>>>
>>> *Slurm.conf*
>>>
>>> # This section of this file was automatically generated by cmd. Do not edit manually!
>>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>>> # Server nodes
>>> SlurmctldHost=headnode
>>> AccountingStorageHost=master
>>> #############################################################################################
>>> # GPU Nodes
>>> #############################################################################################
>>> NodeName=node[02-04] Procs=64 CoresPerSocket=16 RealMemory=257024 Sockets=2 ThreadsPerCore=2 Feature=RTX6000 Gres=gpu:4
>>> NodeName=node01 Procs=64 CoresPerSocket=16 RealMemory=386048 Sockets=2 ThreadsPerCore=2 Feature=RTX3090 Gres=gpu:4
>>> #NodeName=node[05-08] Procs=8 Gres=gpu:4
>>> #
>>> #############################################################################################
>>> # Partitions
>>> #############################################################################################
>>> PartitionName=defq Default=YES MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=node[01-04]
>>> PartitionName=CSLive MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=node01
>>> PartitionName=CSCluster MinNodes=1 DefaultTime=UNLIMITED MaxTime=UNLIMITED AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 OverSubscribe=NO PreemptMode=OFF AllowAccounts=ALL AllowQos=ALL Nodes=node[02-04]
>>> ClusterName=slurm
>>>
>>> *Gres.conf*
>>>
>>> # This section of this file was automatically generated by cmd. Do not edit manually!
>>> # BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
>>> AutoDetect=NVML
>>> # END AUTOGENERATED SECTION -- DO NOT REMOVE
>>> #Name=gpu File=/dev/nvidia[0-3] Count=4
>>> #Name=mic Count=0
>>>
>>> *Sinfo:*
>>>
>>> PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
>>> defq*      up     infinite   1      down*  node04
>>> defq*      up     infinite   3      idle   node[01-03]
>>> CSLive     up     infinite   1      idle   node01
>>> CSCluster  up     infinite   1      down*  node04
>>> CSCluster  up     infinite   2      idle   node[02-03]
>>>
>>> *Node1:*
>>>
>>> NodeName=node01 Arch=x86_64 CoresPerSocket=16
>>>    CPUAlloc=0 CPUTot=64 CPULoad=0.04
>>>    AvailableFeatures=RTX3090
>>>    ActiveFeatures=RTX3090
>>>    Gres=gpu:4
>>>    NodeAddr=node01 NodeHostName=node01 Version=20.02.6
>>>    OS=Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020
>>>    RealMemory=386048 AllocMem=0 FreeMem=16665 Sockets=2 Boards=1
>>>    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>>    Partitions=defq,CSLive
>>>    BootTime=2021-08-04T13:59:08 SlurmdStartTime=2021-08-10T09:32:43
>>>    CfgTRES=cpu=64,mem=377G,billing=64
>>>    AllocTRES=
>>>    CapWatts=n/a
>>>    CurrentWatts=0 AveWatts=0
>>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>
>>> *Node2-3*
>>>
>>> NodeName=node02 Arch=x86_64 CoresPerSocket=16
>>>    CPUAlloc=0 CPUTot=64 CPULoad=0.48
>>>    AvailableFeatures=RTX6000
>>>    ActiveFeatures=RTX6000
>>>    Gres=gpu:4(S:0-1)
>>>    NodeAddr=node02 NodeHostName=node02 Version=20.02.6
>>>    OS=Linux 3.10.0-1160.11.1.el7.x86_64 #1 SMP Fri Dec 18 16:34:56 UTC 2020
>>>    RealMemory=257024 AllocMem=0 FreeMem=2259 Sockets=2 Boards=1
>>>    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>>    Partitions=defq,CSCluster
>>>    BootTime=2021-07-29T20:47:32 SlurmdStartTime=2021-08-10T09:32:55
>>>    CfgTRES=cpu=64,mem=251G,billing=64
>>>    AllocTRES=
>>>    CapWatts=n/a
>>>    CurrentWatts=0 AveWatts=0
>>>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>>>
>>> On Thu, Aug 19, 2021, 6:07 PM Fulcomer, Samuel <samuel_fulco...@brown.edu> wrote:
>>>
>>>> What SLURM version are you running?
>>>>
>>>> What are the #SLURM directives in the batch script? (or the sbatch arguments)
>>>>
>>>> When the single GPU jobs are pending, what's the output of 'scontrol show job JOBID'?
>>>>
>>>> What are the node definitions in slurm.conf, and the lines in gres.conf?
>>>>
>>>> Are the nodes all the same host platform (motherboard)?
>>>>
>>>> We have P100s, TitanVs, Titan RTXs, Quadro RTX 6000s, 3090s, V100s, DGX 1s,
>>>> A6000s, and A40s, with a mix of single and dual-root platforms, and haven't
>>>> seen this problem with SLURM 20.02.6 or earlier versions.
>>>>
>>>> On Thu, Aug 19, 2021 at 8:38 PM Andrey Malyutin <malyuti...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We are in the process of finishing up the setup of a cluster with 3 nodes,
>>>>> 4 GPUs each. One node has RTX3090s and the other 2 have RTX6000s. Any job
>>>>> asking for 1 GPU in the submission script will wait to run on the 3090 node,
>>>>> no matter resource availability. The same job requesting 2 or more GPUs will
>>>>> run on any node. I don't even know where to begin troubleshooting this issue;
>>>>> entries for the 3 nodes are effectively identical in slurm.conf. Any help
>>>>> would be appreciated. (If helpful - this cluster is used for structural
>>>>> biology, with the cryosparc and relion packages.)
>>>>>
>>>>> Thank you,
>>>>> Andrey
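For reference, the "GPU related plugins" lines in the slurm.conf posted above are commented out, which is consistent with Sam's observation that no gres/gpu TRES appear in the scontrol job or node output. Below is a minimal sketch of what those same lines look like when enabled; this is an illustration only (values copied from the commented lines, not a change confirmed as the fix in this thread), and slurmctld/slurmd would need a restart after editing slurm.conf:

# Sketch: GPU-aware scheduling, taken from the commented lines above
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
AccountingStorageTRES=gres/gpu
GresTypes=gpu        # already present in the posted slurm.conf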
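Similarly, a minimal sketch of the cgroup.conf change Sam describes for fencing off GPU devices (the posted cgroup.conf has ConstrainDevices=no). It assumes TaskPlugin=task/cgroup stays in place, as it already is in the posted slurm.conf, and should be read as an illustration rather than a verified configuration for this cluster:

# Sketch: restrict each job's device access in cgroup.conf
ConstrainDevices=yes   # a job then sees only the GPUs it was allocated,
                       # making the CUDA_VISIBLE_DEVICES loop in the
                       # cryosparc batch script unnecessary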