Hi Tina,

Thank you so much for looking at this.

slurm 18.08.8

nvidia-smi topo -m:

        GPU0   GPU1   GPU2   GPU3   mlx5_0  CPU Affinity
GPU0     X     NV2    NV2    NV2    NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU1    NV2     X     NV2    NV2    NODE    0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38,40-40,42-42
GPU2    NV2    NV2     X     NV2    SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
GPU3    NV2    NV2    NV2     X     SYS     1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23,25-25,27-27,29-29,31-31,33-33,35-35,37-37,39-39,41-41,43-43
mlx5_0  NODE   NODE   SYS    SYS     X
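Reading that topology: GPU0/GPU1 share the even-numbered logical CPUs (one socket) and GPU2/GPU3 the odd-numbered ones (the other socket). One caveat worth flagging: nvidia-smi prints the kernel's logical CPU numbers, while the core bindings in gres.conf are interpreted against Slurm's abstract core indices (on a 2 x 14-core node, 0-13 for the first socket and 14-27 for the second). Assuming that numbering (the socket split is inferred from the affinity masks above, not verified), the matrix would translate to something like:

NodeName=c0005 Name=gpu File=/dev/nvidia[0-1] Cores=0-13
NodeName=c0005 Name=gpu File=/dev/nvidia[2-3] Cores=14-27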
I have tried in the gres.conf (without success; only 2 GPU jobs run per node; no CPU jobs are currently running):

NodeName=c0005 Name=gpu File=/dev/nvidia0 CPUs=[0,2,4,6,8,10]
NodeName=c0005 Name=gpu File=/dev/nvidia1 CPUs=[0,2,4,6,8,10]
NodeName=c0005 Name=gpu File=/dev/nvidia2 CPUs=[1,3,5,7,11,13,15,17,29]
NodeName=c0005 Name=gpu File=/dev/nvidia3 CPUs=[1,3,5,7,11,13,15,17,29]

I also tried your suggestions of 0-13, 14-27, and a combination. I still only get 2 jobs to run on GPUs at a time. If I take off the "CPUs=", I do get 4 jobs running per node.

Jodie

On Aug 7, 2020, at 12:18 PM, Tina Friedrich <tina.friedr...@it.ox.ac.uk> wrote:

Hi Jodie,

what version of SLURM are you using? I'm pretty sure newer versions pick the topology up automatically (although I'm on 18.08, so I can't verify that).

Is what you're wanting to do - basically - to forcefully feed a 'wrong' gres.conf to make SLURM assume all GPUs are on one CPU? (I don't think I've ever tried that!)

I have no idea, unfortunately, which CPU SLURM assigns first - it will not (I don't think) assign cores on the non-GPU CPU first (other people, please correct me if I'm wrong!). My gres.conf files get written by my config management from the GPU topology; I don't think I've ever written one of them manually, and I've never tried to make them anything wrong, i.e. deliberately give a wrong mapping.

The gres.conf would probably need to look something like

Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0-13
Name=gpu Type=tesla File=/dev/nvidia1 CPUs=0-13
Name=gpu Type=tesla File=/dev/nvidia2 CPUs=0-13
Name=gpu Type=tesla File=/dev/nvidia3 CPUs=0-13

or maybe

Name=gpu Type=tesla File=/dev/nvidia0 CPUs=14-27
Name=gpu Type=tesla File=/dev/nvidia1 CPUs=14-27
Name=gpu Type=tesla File=/dev/nvidia2 CPUs=14-27
Name=gpu Type=tesla File=/dev/nvidia3 CPUs=14-27

to 'assign' all GPUs to the first 14 or the second 14 CPUs (your config makes me think there are two 14-core CPUs, so cores 0-13 would probably be CPU1, etc.?).

(What is the actual topology of the system, according to, say, 'nvidia-smi topo -m'?)

Tina

On 07/08/2020 16:31, Jodie H. Sprouse wrote:
> Tina,
> Thank you. Yes, jobs will run on all 4 GPUs if I submit with:
> --gres-flags=disable-binding
> Yet my goal is to have the GPUs bind to a CPU in order to allow a CPU-only job to never run on that particular CPU (having it bound to the GPU and always free for a GPU job) and give the CPU job the max CPUs minus the 4.
>
> * Hyperthreading is turned on.
> NodeName=c000[1-5] Gres=gpu:tesla:4 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=190000
>
> PartitionName=gpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G,gres/gpu=2.0"
> PartitionName=cpu Nodes=c000[1-5] Default=NO DefaultTime=1:00:00 MaxTime=168:00:00 State=UP OverSubscribe=NO TRESBillingWeights="CPU=.25,Mem=0.25G" MaxCPUsPerNode=48
>
> I have tried variations for gres.conf such as:
> NodeName=c0005 Name=gpu File=/dev/nvidia[0-1] CPUs=0,2
> NodeName=c0005 Name=gpu File=/dev/nvidia[2-3] CPUs=1,3
>
> as well as trying Cores= (rather than CPUs=), with no success.
>
> I've battled this all week; any suggestions would be greatly appreciated!
> Jodie
>
> On Aug 7, 2020, at 11:12 AM, Tina Friedrich <tina.friedr...@it.ox.ac.uk> wrote:
>
> Hello,
>
> This is something I've seen once on our systems, and it took me a while to figure out what was going on.
>
> The solution was that the system topology was such that all GPUs were connected to one CPU. There were no free cores on that particular CPU, so SLURM did not schedule any more jobs to the GPUs. I needed to disable binding in job submission to schedule to all of them.
>
> Not sure that applies in your situation (I don't know your system), but it's something to check?
>
> Tina
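For illustration, the binding workaround Tina describes is a one-line addition to the batch script; a minimal sketch (the partition name is a placeholder):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --gres-flags=disable-binding   # do not restrict the job to the cores gres.conf binds to its GPU
hostname
nvidia-smi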
> On 07/08/2020 15:42, Jodie H. Sprouse wrote:
>> Good morning.
>> I am having the same experience here. Wondering if you had a resolution?
>> Thank you.
>> Jodie
>>
>> On Jun 11, 2020, at 3:27 PM, Rhian Resnick <rresn...@fau.edu> wrote:
>>
>> We have several users submitting single-GPU jobs to our cluster. We expected the jobs to fill each node and fully utilize the available GPUs, but we instead find that only 2 out of the 4 GPUs in each node get allocated.
>>
>> If we request 2 GPUs in the job and start two jobs, both jobs will start on the same node, fully allocating the node. We are puzzled about what is going on, and any hints are welcome.
>>
>> Thanks for your help,
>>
>> Rhian
>>
>> *Example SBATCH Script*
>> #!/bin/bash
>> #SBATCH --job-name=test
>> #SBATCH --partition=longq7-mri
>> #SBATCH -N 1
>> #SBATCH -n 1
>> #SBATCH --gres=gpu:1
>> #SBATCH --mail-type=ALL
>> hostname
>> echo CUDA_VISIBLE_DEVICES $CUDA_VISIBLE_DEVICES
>>
>> set | grep SLURM
>> nvidia-smi
>> sleep 500
>>
>> *gres.conf*
>> #AutoDetect=nvml
>> Name=gpu Type=v100 File=/dev/nvidia0 Cores=0
>> Name=gpu Type=v100 File=/dev/nvidia1 Cores=1
>> Name=gpu Type=v100 File=/dev/nvidia2 Cores=2
>> Name=gpu Type=v100 File=/dev/nvidia3 Cores=3
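Worth noting about the gres.conf above: each GPU is pinned to a single core index, so GPU jobs can only bind to those few cores; if they are busy, or if they do not line up with the sockets the GPUs actually sit on in Slurm's abstract numbering, further GPU jobs will not start. A whole-socket mapping for the V100 nodes defined below (2 sockets x 16 cores, hence abstract cores 0-15 and 16-31) might look like the following; the GPU-to-socket split here is an assumption to be checked against nvidia-smi topo -m:

Name=gpu Type=v100 File=/dev/nvidia[0-1] Cores=0-15    # GPUs assumed on socket 0
Name=gpu Type=v100 File=/dev/nvidia[2-3] Cores=16-31   # GPUs assumed on socket 1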
>> *slurm.conf*
>> #
>> # Example slurm.conf file. Please run configurator.html
>> # (in doc/html) to build a configuration file customized
>> # for your environment.
>> #
>> # slurm.conf file generated by configurator.html.
>> # See the slurm.conf man page for more information.
>> #
>> ClusterName=cluster
>> ControlMachine=cluster-slurm1.example.com
>> ControlAddr=10.116.0.11
>> BackupController=cluster-slurm2.example.com
>> BackupAddr=10.116.0.17
>> #
>> SlurmUser=slurm
>> #SlurmdUser=root
>> SlurmctldPort=6817
>> SlurmdPort=6818
>> SchedulerPort=7321
>>
>> RebootProgram="/usr/sbin/reboot"
>>
>> AuthType=auth/munge
>> #JobCredentialPrivateKey=
>> #JobCredentialPublicCertificate=
>> StateSaveLocation=/var/spool/slurm/ctld
>> SlurmdSpoolDir=/var/spool/slurm/d
>> SwitchType=switch/none
>> MpiDefault=none
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmdPidFile=/var/run/slurmd.pid
>> ProctrackType=proctrack/pgid
>>
>> GresTypes=gpu,mps,bandwidth
>>
>> PrologFlags=x11
>> #PluginDir=
>> #FirstJobId=
>> #MaxJobCount=
>> #PlugStackConfig=
>> #PropagatePrioProcess=
>> #PropagateResourceLimits=
>> #PropagateResourceLimitsExcept=
>> #Prolog=
>> #Epilog=/etc/slurm/slurm.epilog.clean
>> #SrunProlog=
>> #SrunEpilog=
>> #TaskProlog=
>> #TaskEpilog=
>> #TaskPlugin=
>> #TrackWCKey=no
>> #TreeWidth=50
>> #TmpFS=
>> #UsePAM=
>> #
>> # TIMERS
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> InactiveLimit=0
>> MinJobAge=300
>> KillWait=30
>> Waittime=0
>> #
>> # SCHEDULING
>> SchedulerType=sched/backfill
>> #bf_interval=10
>> #SchedulerAuth=
>> #SelectType=select/linear
>> # Cores and memory are consumable
>> #SelectType=select/cons_res
>> #SelectTypeParameters=CR_Core_Memory
>> SchedulerParameters=bf_interval=10
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_Core
>>
>> FastSchedule=1
>> #PriorityType=priority/multifactor
>> #PriorityDecayHalfLife=14-0
>> #PriorityUsageResetPeriod=14-0
>> #PriorityWeightFairshare=100000
>> #PriorityWeightAge=1000
>> #PriorityWeightPartition=10000
>> #PriorityWeightJobSize=1000
>> #PriorityMaxAge=1-0
>> #
>> # LOGGING
>> SlurmctldDebug=3
>> SlurmctldLogFile=/var/log/slurmctld.log
>> SlurmdDebug=3
>> SlurmdLogFile=/var/log/slurmd.log
>> JobCompType=jobcomp/none
>> #JobCompLoc=
>> #
>> # ACCOUNTING
>> #JobAcctGatherType=jobacct_gather/linux
>> #JobAcctGatherFrequency=30
>> #
>> #AccountingStorageType=accounting_storage/slurmdbd
>> #AccountingStorageHost=
>> #AccountingStorageLoc=
>> #AccountingStoragePass=
>> #AccountingStorageUser=
>> #
>> # Default values
>> # DefMemPerNode=64000
>> # DefCpuPerGPU=4
>> # DefMemPerCPU=4000
>> # DefMemPerGPU=16000
>>
>> # OpenHPC default configuration
>> #TaskPlugin=task/affinity
>> TaskPlugin=task/affinity,task/cgroup
>> PropagateResourceLimitsExcept=MEMLOCK
>> TaskPluginParam=autobind=cores
>> #AccountingStorageType=accounting_storage/mysql
>> #StorageLoc=slurm_acct_db
>>
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageHost=cluster-slurmdbd1.example.com
>> #AccountingStorageType=accounting_storage/filetxt
>> Epilog=/etc/slurm/slurm.epilog.clean
>>
>> #PartitionName=normal Nodes=c[1-5] Default=YES MaxTime=24:00:00 State=UP
>> PartitionName=DEFAULT State=UP Default=NO AllowGroups=ALL Priority=10 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO Nodes=nodeamd[009-016],c[1-4],nodehtc[001-025]
>>
>> # Partitions
>>
>> # Group Limited Queues
>>
>> # OIT DEBUG QUEUE
>> PartitionName=debug Nodes=c[1-4] MaxTime=24:00:00 State=UP AllowGroups=oit-hpc-admin
>>
>> # RNA CHEM
>> PartitionName=longq7-rna MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=UNLIMITED Priority=200 Nodes=nodeamd[001-008],nodegpu[021-025] AllowGroups=gpu-rnachem
>>
>> # V100's
>> PartitionName=longq7-mri MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=nodenviv100[001-016] AllowGroups=gpu-mri
>>
>> # BIGDATA GRANT
>> PartitionName=longq-bigdata7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=200 Nodes=node[087-098],nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata
>>
>> PartitionName=gpu-bigdata7 Default=NO MinNodes=1 Priority=10 AllowAccounts=ALL Nodes=nodegpu001 AllowGroups=fau-bigdata,nsf-bigdata
>>
>> # CogNeuroLab
>> PartitionName=CogNeuroLab Default=NO MinNodes=1 MaxNodes=4 MaxTime=7-12:00:00 AllowGroups=cogneurolab Priority=200 State=UP Nodes=node[001-004]
>>
>> # Standard queues
>>
>> # OPEN TO ALL
>>
>> # Short Queue
>> PartitionName=shortq7 MinNodes=1 MaxNodes=30 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=100 Nodes=nodeamd[001-016],nodenviv100[001-015],nodegpu[001-025],node[001-100],nodehtc[001-025] Default=YES
>>
>> # Medium Queue
>> PartitionName=mediumq7 MinNodes=1 MaxNodes=30 DefaultTime=72:00:00 MaxTime=72:00:00 Priority=50 Nodes=nodeamd[009-016],node[004-100]
>>
>> # Long Queue
>> PartitionName=longq7 MinNodes=1 MaxNodes=30 DefaultTime=168:00:00 MaxTime=168:00:00 Priority=30 Nodes=nodeamd[009-016],node[004-100]
>>
>> # Interactive
>> PartitionName=interactive MinNodes=1 MaxNodes=4 DefaultTime=06:00:00 MaxTime=06:00:00 Priority=101 Nodes=node[001-100] Default=No Hidden=YES
>>
>> # Nodes
>>
>> # Test nodes (VMs)
>> NodeName=c[1-4] Cpus=4 Feature=virtual RealMemory=16000
>>
>> # AMD Nodes
>> NodeName=nodeamd[001-016] Procs=64 Boards=1 SocketsPerBoard=8 CoresPerSocket=8 ThreadsPerCore=1 Features=amd,epyc RealMemory=225436
>>
>> # V100 MRI
>> NodeName=nodenviv100[001-016] CPUs=64 Boards=1 SocketsPerBoard=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:v100:4 Feature=v100 RealMemory=192006
>>
>> # GPU nodes
>> NodeName=nodegpu001 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:k80:8 Feature=k80,intel RealMemory=64000
>> NodeName=nodegpu002 Procs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
>> NodeName=nodegpu[003-020] Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=gpu:gk1:8 Feature=gk1,intel RealMemory=128000
>> NodeName=nodegpu[021-025] Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Gres=gpu:4 Feature=exxact,intel RealMemory=128000
>>
>> # IvyBridge nodes
>> NodeName=node[001-021] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
>> # SandyBridge node
>> NodeName=node022 Procs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 Feature=intel,sandybridge RealMemory=64000
>> # IvyBridge
>> NodeName=node[023-050] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,ivybridge RealMemory=112750
>> # Haswell
>> NodeName=node[051-100] Procs=20 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=1 Feature=intel,haswell RealMemory=112750
>>
>> # Node health monitoring
>> HealthCheckProgram=/usr/sbin/nhc
>> HealthCheckInterval=300
>> ReturnToService=2
>>
>> # Fix for X11 issues
>> X11Parameters=use_raw_hostname
>>
>> Rhian Resnick
>> Associate Director Research Computing
>> Enterprise Systems
>> Office of Information Technology
>>
>> Florida Atlantic University
>> 777 Glades Road, CM22, Rm 173B
>> Boca Raton, FL 33431
>> Phone 561.297.2647
>> Fax 561.297.0222